OpenAI gpt-oss-120b (New)

Review performance benchmarks for the openai.gpt-oss-120b (OpenAI gpt-oss-120b) model hosted on one OAI_H100_X2 unit (two H100 GPUs) of a dedicated AI cluster in OCI Generative AI.

  • See details for the model and review the following sections:
    • Available regions for this model.
    • Dedicated AI cluster unit size for hosting this model.
  • Review the metrics.

Random Length

This scenario mimics text generation use cases where the size of the prompt and response are unknown ahead of time. Because the prompt and response lengths are unknown, we've used a stochastic approach in which both follow a normal distribution: the prompt length has a mean of 480 tokens with a standard deviation of 240 tokens, and the response length has a mean of 300 tokens with a standard deviation of 150 tokens.
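The following is a minimal sketch of how such a workload could be generated. The means and standard deviations come from the description above; sampling each request independently and clipping lengths to a minimum of 1 token are assumptions, since the exact sampling rules aren't documented here.

```python
import random

# Hypothetical sketch of the Random Length workload generator.
# Means/standard deviations come from the scenario description;
# independent per-request sampling and the 1-token floor are assumptions.
def sample_lengths():
    prompt_len = max(1, round(random.gauss(480, 240)))    # prompt: mean 480, std dev 240
    response_len = max(1, round(random.gauss(300, 150)))  # response: mean 300, std dev 150
    return prompt_len, response_len

if __name__ == "__main__":
    for _ in range(5):
        print(sample_lengths())
```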

| Concurrency | Time to First Token (TTFT) (seconds) | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPS) | Total Throughput (tokens/second) |
|---|---|---|---|---|---|---|
| 1 | 0.22 | 261.95 | 220.15 | 1.36 | 0.73 | 548.46 |
| 2 | 0.09 | 213.71 | 398.60 | 1.51 | 1.32 | 1,033.10 |
| 4 | 0.08 | 165.27 | 625.81 | 1.90 | 2.08 | 1,622.33 |
| 8 | 0.18 | 119.84 | 862.00 | 2.62 | 3.01 | 2,314.16 |
| 16 | 0.17 | 93.47 | 1,343.22 | 3.38 | 4.54 | 3,470.47 |
| 32 | 0.59 | 63.39 | 1,596.95 | 5.27 | 5.66 | 4,281.85 |
| 64 | 0.62 | 37.63 | 1,795.69 | 8.87 | 6.31 | 4,772.03 |
| 128 | 1.10 | 23.71 | 2,180.46 | 12.86 | 7.99 | 5,952.25 |
| 256 | 1.78 | 18.58 | 2,222.52 | 15.93 | 9.35 | 6,504.76 |

Chat

This scenario covers chat and dialog use cases where the prompt and response are short. The prompt and response lengths are each fixed at 100 tokens.

| Concurrency | Time to First Token (TTFT) (seconds) | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPS) | Total Throughput (tokens/second) |
|---|---|---|---|---|---|---|
| 1 | 0.07 | 261.23 | 222.83 | 0.45 | 2.23 | 442.72 |
| 2 | 0.13 | 223.12 | 346.88 | 0.57 | 3.47 | 689.14 |
| 4 | 0.14 | 185.16 | 583.72 | 0.68 | 5.84 | 1,159.68 |
| 8 | 0.15 | 150.14 | 948.79 | 0.81 | 9.49 | 1,884.99 |
| 16 | 0.17 | 131.49 | 1,598.00 | 0.92 | 15.98 | 3,175.39 |
| 32 | 0.75 | 99.64 | 1,711.51 | 1.79 | 17.12 | 3,399.46 |
| 64 | 0.87 | 81.13 | 2,627.13 | 2.10 | 26.27 | 5,219.70 |
| 128 | 1.89 | 54.58 | 2,976.36 | 3.78 | 29.76 | 5,911.72 |
| 256 | 2.07 | 31.58 | 3,852.55 | 5.37 | 38.53 | 7,653.63 |
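As a rough cross-check of how the columns relate (an interpretation of the metrics, not a documented formula), token-level throughput in this fixed-length scenario is close to request-level throughput multiplied by the 100-token response length:

```python
# Hedged sanity check: token-level throughput ~= request-level throughput (RPS)
# x response length. This relationship is inferred from the table, not documented.
RESPONSE_TOKENS = 100  # fixed response length in the Chat scenario

rows = [  # (concurrency, RPS, reported token-level throughput) from the table above
    (1, 2.23, 222.83),
    (16, 15.98, 1598.00),
    (256, 38.53, 3852.55),
]

for concurrency, rps, reported in rows:
    estimate = rps * RESPONSE_TOKENS
    print(f"concurrency {concurrency}: estimated {estimate:.0f} vs reported {reported:,.2f}")
```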

Generation Heavy

This scenario is for generation and model-response-heavy use cases, for example, a long job description generated from a short bullet list of items. In this case, the prompt length is fixed at 100 tokens and the response length is fixed at 1,000 tokens.

| Concurrency | Time to First Token (TTFT) (seconds) | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPS) | Total Throughput (tokens/second) |
|---|---|---|---|---|---|---|
| 1 | 0.07 | 261.81 | 256.51 | 3.89 | 0.26 | 281.77 |
| 2 | 0.14 | 224.09 | 434.47 | 4.60 | 0.43 | 477.39 |
| 4 | 0.14 | 182.54 | 710.36 | 5.62 | 0.71 | 780.48 |
| 8 | 0.15 | 144.99 | 1,129.12 | 7.04 | 1.13 | 1,240.52 |
| 16 | 0.27 | 124.21 | 1,908.32 | 8.31 | 1.91 | 2,096.68 |
| 32 | 0.60 | 101.42 | 3,023.03 | 10.45 | 3.02 | 3,321.46 |
| 64 | 0.84 | 81.18 | 4,740.88 | 13.15 | 4.74 | 5,208.51 |
| 128 | 1.28 | 62.05 | 7,107.26 | 17.38 | 7.11 | 7,808.01 |
| 256 | 1.60 | 42.80 | 9,691.73 | 24.98 | 9.69 | 10,647.89 |
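A similar hedged reading applies to latency: for a fixed 1,000-token response, request-level latency is approximately TTFT plus the time to generate the response at the token-level inference speed. Treated as an approximation inferred from the table rather than a documented formula, it reproduces these rows closely:

```python
# Approximate check: latency ~= TTFT + response_tokens / token-level inference speed.
# Inferred from the table, not a documented formula.
RESPONSE_TOKENS = 1000  # fixed response length in the Generation Heavy scenario

rows = [  # (concurrency, TTFT s, inference speed tokens/s, reported latency s)
    (1, 0.07, 261.81, 3.89),
    (8, 0.15, 144.99, 7.04),
    (64, 0.84, 81.18, 13.15),
]

for concurrency, ttft, speed, reported in rows:
    estimate = ttft + RESPONSE_TOKENS / speed
    print(f"concurrency {concurrency}: estimated {estimate:.2f} s vs reported {reported:.2f} s")
```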

RAG Scenario 1

The retrieval-augmented generation (RAG) scenario has a large input and a short response, such as summarization use cases. In this scenario, the input length is fixed at 2,000 tokens and the response length is fixed at 200 tokens.

| Concurrency | Time to First Token (TTFT) (seconds) | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPS) | Total Throughput (tokens/second) |
|---|---|---|---|---|---|---|
| 1 | 0.09 | 257.79 | 226.90 | 0.86 | 1.13 | 2,460.63 |
| 2 | 0.16 | 219.97 | 367.54 | 1.07 | 1.84 | 3,985.78 |
| 4 | 0.29 | 181.60 | 555.09 | 1.39 | 2.78 | 6,019.49 |
| 8 | 0.46 | 141.97 | 810.00 | 1.87 | 4.05 | 8,784.09 |
| 16 | 0.60 | 112.00 | 1,196.86 | 2.43 | 5.98 | 12,981.05 |
| 32 | 0.97 | 79.31 | 1,576.52 | 3.56 | 7.88 | 17,096.63 |
| 64 | 1.74 | 57.86 | 1,973.83 | 5.28 | 9.87 | 21,404.97 |
| 128 | 3.45 | 33.18 | 2,025.35 | 9.74 | 10.13 | 21,963.02 |
| 256 | 6.73 | 20.00 | 2,109.05 | 17.30 | 10.55 | 22,872.85 |
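Note that total throughput far exceeds token-level throughput in the RAG scenarios. A plausible explanation, offered here as an assumption rather than documented behavior, is that total throughput counts prompt tokens as well as generated tokens: at concurrency 1, 1.13 RPS × (2,000 + 200) tokens ≈ 2,486 tokens/second, close to the reported 2,460.63.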

RAG Scenario 2

The retrieval-augmented generation (RAG) scenario has a large input and a short response, such as summarization use cases. In this scenario, the input length is fixed at 7,800 tokens and the response length is fixed at 200 tokens.

| Concurrency | Time to First Token (TTFT) (seconds) | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPS) | Total Throughput (tokens/second) |
|---|---|---|---|---|---|---|
| 1 | 0.16 | 251.46 | 200.36 | 0.95 | 1.00 | 7,892.53 |
| 2 | 0.25 | 198.46 | 303.35 | 1.26 | 1.52 | 11,948.77 |
| 4 | 0.46 | 147.81 | 410.55 | 1.84 | 2.05 | 16,172.35 |
| 8 | 0.89 | 104.17 | 513.98 | 2.88 | 2.57 | 20,246.87 |
| 16 | 1.68 | 75.21 | 632.25 | 4.53 | 3.16 | 24,904.43 |
| 32 | 3.13 | 49.11 | 725.39 | 7.57 | 3.63 | 28,573.61 |
| 64 | 6.12 | 27.93 | 745.21 | 14.10 | 3.73 | 29,354.61 |
| 128 | 10.91 | 16.76 | 824.68 | 23.41 | 4.12 | 32,484.31 |
| 256 | 28.27 | 23.06 | 878.43 | 37.83 | 4.39 | 34,600.76 |

RAG Scenario 3

The retrieval-augmented generation (RAG) scenario has a large input and a short response, such as summarization use cases. In this scenario, the input length is fixed at 128,000 tokens and the response length is fixed at 200 tokens.

| Concurrency | Time to First Token (TTFT) (seconds) | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPS) | Total Throughput (tokens/second) |
|---|---|---|---|---|---|---|
| 1 | 4.90 | 193.51 | 30.28 | 5.93 | 0.15 | 19,105.23 |
| 2 | 7.54 | 76.41 | 32.29 | 11.57 | 0.16 | 20,375.68 |
| 4 | 12.76 | 34.53 | 33.60 | 22.17 | 0.17 | 21,197.94 |
| 8 | 26.85 | 26.84 | 38.26 | 38.94 | 0.19 | 24,138.69 |
| 16 | 65.93 | 26.53 | 38.17 | 78.04 | 0.19 | 24,087.24 |
| 32 | 139.44 | 26.72 | 37.81 | 151.55 | 0.19 | 23,857.98 |
| 64 | 268.60 | 26.67 | 36.95 | 280.69 | 0.18 | 23,314.67 |
| 128 | 451.10 | 26.82 | 35.89 | 463.13 | 0.18 | 22,643.98 |
| 256 | 592.11 | 26.18 | 32.51 | 604.78 | 0.16 | 20,515.68 |