OpenAI gpt-oss-120b

OCI Generative AI supports access to the pretrained OpenAI gpt-oss-120b model.

The openai.gpt-oss-120b model is an open-weight, text-only language model designed for powerful reasoning and agentic tasks.

Regions for this Model

Important

For supported regions, endpoint types (on-demand or dedicated AI clusters), and hosting (OCI Generative AI or external calls) for this model, see the Models by Region page. For details about the regions, see the Generative AI Regions page.

Key Features

  • Model Name in OCI Generative AI: openai.gpt-oss-120b
  • Model Size: 117 billion parameters
  • Text Mode Only: Enter text and get text output. Image and file inputs, such as audio, video, and document files, aren't supported.
  • Knowledge: Specialized in advanced reasoning and text-based tasks across a wide range of subjects.
  • Context Length: 128,000 tokens (maximum prompt + response length is 128,000 tokens for each run). In the playground, the response length is capped at 16,000 tokens for each run.
  • Excels at These Use Cases: Because of its training data, this model is especially strong in STEM (science, technology, engineering, and mathematics), coding, and general knowledge. Suitable for high-reasoning, production-level tasks.
  • Function Calling: Yes, through the API.
  • Has Reasoning: Yes.
  • Knowledge Cutoff: June 2024

For key feature details, see the OpenAI gpt-oss documentation.
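
As a minimal sketch of calling this model through the API, the following uses the OCI Python SDK in on-demand mode. It assumes a valid ~/.oci/config profile and the Chicago region inference endpoint; the compartment OCID and prompt text are placeholders.

    import oci

    # Create an inference client from the default OCI config profile.
    config = oci.config.from_file()
    client = oci.generative_ai_inference.GenerativeAiInferenceClient(
        config,
        service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
    )

    # Build a single-turn, text-only chat request.
    content = oci.generative_ai_inference.models.TextContent(
        text="Explain tokenization in two sentences."
    )
    message = oci.generative_ai_inference.models.Message(role="USER", content=[content])
    chat_request = oci.generative_ai_inference.models.GenericChatRequest(
        api_format=oci.generative_ai_inference.models.BaseChatRequest.API_FORMAT_GENERIC,
        messages=[message],
        max_tokens=600,
    )

    # On-demand serving mode references the model by its OCI name.
    details = oci.generative_ai_inference.models.ChatDetails(
        compartment_id="ocid1.compartment.oc1..exampleuniqueID",  # placeholder OCID
        serving_mode=oci.generative_ai_inference.models.OnDemandServingMode(
            model_id="openai.gpt-oss-120b"
        ),
        chat_request=chat_request,
    )

    response = client.chat(details)
    print(response.data.chat_response.choices[0].message.content[0].text)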

Dedicated AI Cluster for the Model

Models offered in on-demand mode require no clusters; access them through the Console playground or the API. Models offered in dedicated mode are accessed through endpoints created on dedicated AI clusters. Learn about the Dedicated Mode.
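
In the OCI Python SDK, the serving mode on a chat request is what selects between these two modes. The following is a minimal sketch; the dedicated endpoint OCID is a hypothetical placeholder.

    from oci.generative_ai_inference.models import DedicatedServingMode, OnDemandServingMode

    # On-demand mode: reference the model by name; no cluster is required.
    on_demand = OnDemandServingMode(model_id="openai.gpt-oss-120b")

    # Dedicated mode: reference an endpoint created on a dedicated AI cluster.
    dedicated = DedicatedServingMode(
        endpoint_id="ocid1.generativeaiendpoint.oc1..exampleuniqueID"  # placeholder OCID
    )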

The following lists the hardware unit sizes, available regions, and service limits for dedicated AI clusters. This model isn't available for fine-tuning.

Hardware Unit Size: OAI_A100_80G_X2
  • Available Regions: US Midwest (Chicago)
  • Limit Name: dedicated-unit-a100-80g-count
  • Request Increase by: 2

Hardware Unit Size: OAI_H100_X2
  • Available Regions: Brazil East (Sao Paulo), Germany Central (Frankfurt), India South (Hyderabad), Japan Central (Osaka), UK South (London), US East (Ashburn), US Midwest (Chicago)
  • Limit Name: dedicated-unit-h100-count
  • Request Increase by: 2
Important

  • For hardware pricing, see the Cost estimator.
  • If tenancy limits are insufficient for hosting this model on a dedicated AI cluster, request an increase for the relevant hardware limit. For example, request an increase for the dedicated-unit-h100-count limit by 2. See Requesting a Service Limit Increase.

Cluster Performance Benchmarks

Review the OpenAI gpt-oss-120b cluster performance benchmarks for different use cases.

Model Parameters

To change the model's responses, you can adjust the values of the following parameters in the playground or the API.

Maximum output tokens

The maximum number of tokens that you want the model to generate for each response. Estimate four characters per token. Because you're prompting a chat model, the response depends on the prompt, and each response doesn't necessarily use up the maximum allocated tokens. The maximum prompt + output length is 128,000 tokens for each run. In the playground, the maximum output length is capped at 16,000 tokens for each run.

Tip

For large inputs with difficult problems, set a high value for the maximum output tokens parameter.
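
In the OCI Python SDK, this parameter is assumed to map to the max_tokens field on a GenericChatRequest, as in this sketch (the prompt text is a placeholder):

    from oci.generative_ai_inference.models import (
        BaseChatRequest, GenericChatRequest, Message, TextContent,
    )

    msg = Message(
        role="USER",
        content=[TextContent(text="Solve the following optimization problem step by step.")],
    )
    # Cap the response at 4,000 tokens out of the shared
    # 128,000-token prompt + response budget.
    request = GenericChatRequest(
        api_format=BaseChatRequest.API_FORMAT_GENERIC,
        messages=[msg],
        max_tokens=4000,
    )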
Temperature

The level of randomness used to generate the output text. Min: 0, Max: 2, Default: 1

Tip

Start with the temperature set to 0 or a value less than 1, and increase it as you regenerate the prompts for more creative output. High temperatures can introduce hallucinations and factually incorrect information.
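
Assuming the temperature field on the SDK's GenericChatRequest carries this setting, the following sketches a near-deterministic request next to a more creative one (prompt text is a placeholder):

    from oci.generative_ai_inference.models import (
        BaseChatRequest, GenericChatRequest, Message, TextContent,
    )

    prompt = [Message(role="USER", content=[TextContent(text="Write a product tagline.")])]

    # Near-deterministic output: temperature 0.
    precise = GenericChatRequest(
        api_format=BaseChatRequest.API_FORMAT_GENERIC, messages=prompt, temperature=0,
    )
    # More creative (and more error-prone) output: temperature above 1.
    creative = GenericChatRequest(
        api_format=BaseChatRequest.API_FORMAT_GENERIC, messages=prompt, temperature=1.5,
    )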
Top p

A sampling method that controls the cumulative probability of the top tokens to consider for the next token. Assign p a decimal number between 0 and 1. For example, enter 0.75 to sample only from the tokens whose cumulative probability makes up the top 75 percent. Set p to 1 to consider all tokens. Default: 1
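
A sketch of the same request shape with nucleus sampling, assuming the SDK's top_p field corresponds to this parameter:

    from oci.generative_ai_inference.models import (
        BaseChatRequest, GenericChatRequest, Message, TextContent,
    )

    msg = Message(role="USER", content=[TextContent(text="Suggest a team name.")])
    # Sample only from the smallest set of tokens whose cumulative
    # probability reaches 0.75; top_p=1 would consider all tokens.
    request = GenericChatRequest(
        api_format=BaseChatRequest.API_FORMAT_GENERIC,
        messages=[msg],
        top_p=0.75,
    )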

Frequency penalty

A penalty that's assigned to a token when that token appears frequently. High penalties encourage fewer repeated tokens and produce more varied output. Set to 0 to disable. Default: 0

Presence penalty

A penalty that's assigned to each token that has already appeared in the output, encouraging the model to generate tokens that it hasn't used yet. Set to 0 to disable. Default: 0
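
Both penalties are assumed to map to the frequency_penalty and presence_penalty fields on the SDK's GenericChatRequest; the following sketch nudges the model away from repetition:

    from oci.generative_ai_inference.models import (
        BaseChatRequest, GenericChatRequest, Message, TextContent,
    )

    msg = Message(role="USER", content=[TextContent(text="List ten blog post ideas.")])
    # frequency_penalty scales with how often a token has already appeared;
    # presence_penalty applies once a token has appeared at all.
    # Setting either to 0 disables it.
    request = GenericChatRequest(
        api_format=BaseChatRequest.API_FORMAT_GENERIC,
        messages=[msg],
        frequency_penalty=0.3,
        presence_penalty=0.2,
    )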