OpenAI API Pricing Explained: How to Estimate and Control Your Token Costs

June 29, 2026

8 min read

Amnic

AI and LLM costs

FinOps for AI

A rate card looks simple until your finance team forwards the first five-figure invoice. OpenAI publishes per-token prices that fit on a single page, yet teams shipping production features still get surprised every month. The reason is rarely the posted rate.

It is everything the rate card does not show: which model your code actually called, how many reasoning tokens the answer burned, whether a stray retry loop doubled your input and which customer or feature absorbed the cost.

This guide breaks down current OpenAI API pricing the way a FinOps for AI practitioner would read it. Not as a list of numbers, but as a model of where your money actually goes once code is in production.

What You Actually Pay For

OpenAI bills the API by tokens, not requests, words or minutes. A token is a chunk of text, roughly four characters or three-quarters of an English word on average, per the official tokenization guide.
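For a first-pass budget before any real traffic exists, the four-characters-per-token rule of thumb can be turned into a rough estimator. This is a sketch only: `estimate_tokens` and `estimate_input_tokens` are hypothetical helpers, and real counts come from the model's tokenizer or from the usage data returned with each response. Code and non-English text can deviate noticeably from the average.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token count from character length (rule of thumb, not a tokenizer)."""
    return math.ceil(len(text) / chars_per_token)


def estimate_input_tokens(*parts: str) -> int:
    """Sum the rough estimate over everything sent to the model."""
    return sum(estimate_tokens(part) for part in parts)


# Hypothetical prompt pieces: system prompt, retrieved context, user message.
total = estimate_input_tokens("You are a helpful agent.", "Some context.", "Hi!")
```

Treat the result as an order-of-magnitude figure for planning, then replace it with measured token counts once the feature is live.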

Every call has two billable sides:

Input tokens: Your system prompt, conversation history, retrieved context, function definitions and the user message. Anything you send to the model.
Output tokens: What the model writes back. For reasoning models like the o-series this also includes hidden reasoning tokens that the model generates internally and you still pay for, per the reasoning docs.

Output is the expensive side. In the table below, output runs about six times input across the GPT-5 family (6.25× on nano) and four times input on o3, o4 mini and GPT-4o, per the pricing page. The GPT-5 ratio is steeper than Anthropic's roughly 5× constant, which means output-heavy workloads lean harder on the output rate on OpenAI than on Claude.

The same asymmetry exists on Anthropic's Claude rate card, but the multiplier sits at a different ratio and the tokenizer counts text differently. A full LLM API pricing comparison is the only honest way to forecast across vendors.

Current OpenAI API Pricing by Model

Pricing per 1M tokens, sourced from the developer pricing docs and verified against an independent model index:

Model	Tier	Input	Cached input	Output
GPT-5.5	Flagship	$5.00	$0.50	$30.00
GPT-5.5 pro	Pro reasoning	$30.00	n/a	$180.00
GPT-5.4	Workhorse	$2.50	$0.25	$15.00
GPT-5.4 mini	Mini	$0.75	$0.075	$4.50
GPT-5.4 nano	Nano	$0.20	$0.02	$1.25
o3	Reasoning	$2.00	$0.50	$8.00
o4 mini	Reasoning mini	$0.55	$0.28	$2.20
GPT-4o	Legacy multimodal	$2.50	$1.25	$10.00

Pricing last verified: June 2026

Two patterns to read into this table before you pick a model:

The spread between nano and pro is roughly 150×. GPT-5.4 nano starts at $0.20 input while GPT-5.5 pro reaches $30 input for the same volume. Model choice is the largest lever you have.
Cached input is roughly 90% cheaper than fresh input on the GPT-5 tiers, per the prompt caching announcement. The older models discount less: 75% on o3 and about 50% on o4 mini and GPT-4o. On any GPT-5 workload with a stable system prompt the real input rate is closer to the cache column than the headline column.
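Both patterns can be read straight off the table. The sketch below copies the table's prices into a dictionary and derives the output-to-input ratio, the cache discount and the input spread for each model. The function names are illustrative, and the numbers are only as current as the table itself.

```python
# Prices per 1M tokens, copied from the table above: (input, cached input, output).
# None marks the "n/a" cached-input cell.
PRICES = {
    "GPT-5.5": (5.00, 0.50, 30.00),
    "GPT-5.5 pro": (30.00, None, 180.00),
    "GPT-5.4": (2.50, 0.25, 15.00),
    "GPT-5.4 mini": (0.75, 0.075, 4.50),
    "GPT-5.4 nano": (0.20, 0.02, 1.25),
    "o3": (2.00, 0.50, 8.00),
    "o4 mini": (0.55, 0.28, 2.20),
    "GPT-4o": (2.50, 1.25, 10.00),
}


def output_ratio(model: str) -> float:
    """How many times more expensive output is than input."""
    input_price, _, output_price = PRICES[model]
    return output_price / input_price


def cache_discount(model: str):
    """Fraction saved on cached input, or None when no cached rate is listed."""
    input_price, cached_price, _ = PRICES[model]
    if cached_price is None:
        return None
    return 1.0 - cached_price / input_price


def input_spread() -> float:
    """Ratio between the most and least expensive input rates in the table."""
    inputs = [prices[0] for prices in PRICES.values()]
    return max(inputs) / min(inputs)


for name in PRICES:
    print(name, round(output_ratio(name), 2), cache_discount(name))
```

Re-running this whenever the rate card changes is a cheap way to catch a ratio or discount that has quietly shifted.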

Multimodal endpoints sit outside this table. Realtime audio runs at $32 per 1M audio input tokens and $64 per 1M audio output tokens, transcription via gpt-4o-transcribe at roughly $0.006 per minute and image generation through gpt-image models is billed by image and tile.

The Cost Formula and Three Worked Examples

The math itself is trivial:

cost per call = (input_tokens / 1,000,000 × input_price) + (output_tokens / 1,000,000 × output_price)
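As a minimal sketch, the formula translates directly into a small function. The name `cost_per_call` and its parameters are illustrative; prices are per 1M tokens, as on the rate card.

```python
def cost_per_call(
    input_tokens: int,
    output_tokens: int,
    input_price: float,
    output_price: float,
) -> float:
    """Dollar cost of one call, with prices quoted per 1M tokens."""
    input_cost = input_tokens / 1_000_000 * input_price
    output_cost = output_tokens / 1_000_000 * output_price
    return input_cost + output_cost


# A 500-token input and 300-token reply at $0.75 input / $4.50 output:
single_call = cost_per_call(500, 300, 0.75, 4.50)  # about $0.001725
```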

What is not trivial is what the inputs to that formula actually look like in production. Three scenarios.

Example 1: Customer support chatbot on GPT-5.4 mini

400-token system prompt, 100-token user message, 300-token reply. 10,000 conversations per day.

Input: 500 × 10,000 = 5M tokens/day × $0.75 = $3.75/day
Output: 300 × 10,000 = 3M tokens/day × $4.50 = $13.50/day
Daily total: $17.25. Monthly: about $518.

Same scenario on GPT-5.5 instead: $25 input plus $90 output per day, or roughly $3,450 per month. That is about 6.7 times more for the same conversation, before any quality measurement.

Example 2: RAG knowledge assistant on GPT-5.4

8,000 tokens of retrieved context plus a 200-token question, returning a 600-token answer. 2,000 queries per day.

Input: 8,200 × 2,000 = 16.4M tokens/day × $2.50 = $41/day
Output: 600 × 2,000 = 1.2M tokens/day × $15 = $18/day
Daily total: $59. Monthly: about $1,770.

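Here the input side dominates because of context length. Prompt caching only discounts a prefix that repeats across requests, so it pays off when the same retrieved context recurs. Where it does, the effective input rate falls toward $0.25 per 1M tokens, pulling monthly spend below $700 if the cache hit rate is high.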
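How far the input rate actually falls depends on what share of input tokens is served from cache. A rough way to model that is a blended rate, sketched below. `blended_input_price` is a hypothetical helper, and the hit rate here means the fraction of input tokens billed at the cached price, which is an assumption you would measure from usage data rather than guess.

```python
def blended_input_price(
    input_price: float, cached_price: float, hit_rate: float
) -> float:
    """Effective price per 1M input tokens at a given cached-token share."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cached_price + (1.0 - hit_rate) * input_price


# Example 2 on GPT-5.4, assuming the 8,000 context tokens hit the cache
# and the 200-token question never does.
rate = blended_input_price(2.50, 0.25, hit_rate=8_000 / 8_200)
daily_input_cost = 16.4 * rate  # 16.4M input tokens per day
```

At a hit rate of zero the function returns the headline rate, and at one it returns the cached rate, so it brackets the range a forecast should consider.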

Example 3: High-volume classification on nano plus Batch

A 300-token input classifying 500,000 records per day into 50-token labels. Run through the Batch API at 50% off.

Input: 150M tokens/day × ($0.20 × 0.5) = $15/day
Output: 25M tokens/day × ($1.25 × 0.5) = $15.63/day
Daily total: about $31. Monthly: about $920 for 15 million classifications.

Same workload on GPT-5.4 standard pricing: roughly $22,500/month. Model choice plus Batch is a roughly 24× swing on identical functional output.
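The three examples share one shape: per-call tokens, daily volume, a price pair and an optional discount. A minimal sketch of that projection follows, with `monthly_cost` as a hypothetical helper and a 30-day month assumed. Swapping the price pair or the discount re-runs a scenario on a different model or processing tier.

```python
def monthly_cost(
    input_tokens: int,
    output_tokens: int,
    calls_per_day: int,
    input_price: float,
    output_price: float,
    discount: float = 0.0,
    days: int = 30,
) -> float:
    """Projected monthly spend; prices per 1M tokens, discount as a fraction."""
    per_call = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return per_call * calls_per_day * days * (1.0 - discount)


# Example 3: nano rates with the 50% Batch discount applied.
nano_batch = monthly_cost(300, 50, 500_000, 0.20, 1.25, discount=0.5)

# The same workload at GPT-5.4 standard rates, no discount.
gpt54_standard = monthly_cost(300, 50, 500_000, 2.50, 15.00)

swing = gpt54_standard / nano_batch
```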

Why Your Bill Outruns the Rate Card

The rate card is honest. What inflates real invoices is usually one of these.

Reasoning tokens you cannot see: o-series models generate internal reasoning tokens before producing the visible answer. They are billed as output but never returned. A 300-token answer can carry 2,000 reasoning tokens behind it. On o3 at $8 output that hidden tail dominates.
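To see how much of the output charge is invisible, it helps to separate visible tokens from reasoning tokens in the arithmetic. The sketch below uses the figures from the paragraph above; `output_cost` and `hidden_share` are hypothetical helpers, and in practice the reasoning-token count would come from the usage data returned with each response.

```python
def output_cost(
    visible_tokens: int, reasoning_tokens: int, output_price: float
) -> float:
    """Output charge when reasoning tokens are billed at the output rate."""
    billed_tokens = visible_tokens + reasoning_tokens
    return billed_tokens / 1_000_000 * output_price


def hidden_share(visible_tokens: int, reasoning_tokens: int) -> float:
    """Fraction of billed output tokens the caller never sees."""
    return reasoning_tokens / (visible_tokens + reasoning_tokens)


# A 300-token answer with 2,000 reasoning tokens on o3 at $8 per 1M output.
with_reasoning = output_cost(300, 2_000, 8.00)
visible_only = output_cost(300, 0, 8.00)
share = hidden_share(300, 2_000)  # roughly 0.87
```

Budgeting from the visible answer length alone would understate this call's output cost by a factor of more than seven.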

Vision and audio token math: An image input is converted to tokens based on resolution and detail mode. A single high-detail 1024×1024 image consumes roughly 765 tokens on GPT-4o, per the vision documentation. Realtime audio is denominated in audio tokens at the $32/$64 rates above. Teams budgeting against text-token assumptions get this wrong every time.
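The 765-token figure follows from the tile arithmetic in the vision documentation for GPT-4o: fit the image inside 2048×2048, shrink it so the shortest side is 768 pixels, then charge 170 tokens per 512-pixel tile plus an 85-token base. The sketch below reproduces that arithmetic. `gpt4o_image_tokens` is a hypothetical helper, it assumes images are only ever scaled down and never up, and other models use different constants.

```python
import math


def gpt4o_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Approximate input tokens for one image under GPT-4o tile pricing."""
    if detail == "low":
        return 85  # low detail is a flat base charge

    # Step 1: scale down to fit inside a 2048 x 2048 square.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale

    # Step 2: scale down so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale

    # Step 3: count 512 px tiles; 170 tokens per tile plus an 85-token base.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


square = gpt4o_image_tokens(1024, 1024)  # 4 tiles
tall = gpt4o_image_tokens(2048, 4096)    # 6 tiles
```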

Output cascading into input: Multi-turn conversations, agent loops and chain-of-thought patterns turn last turn's output into next turn's input. The cost compounds. A five-step agent that emits 1,000 tokens per step pays input on 1,000, then 2,000, then 3,000, then 4,000 tokens of accumulated context across steps two through five. That is 10,000 extra input tokens on top of each step's own prompt.
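The compounding is easy to quantify. As a small sketch, assuming every step emits the same number of tokens and the full history is re-sent each time, the extra input grows with the square of the step count. `cascaded_input_tokens` is a hypothetical helper.

```python
def cascaded_input_tokens(tokens_per_step: int, steps: int) -> int:
    """Input tokens spent re-reading earlier output across an agent loop."""
    # Step k (counting from zero) re-reads the output of the k steps before it.
    return sum(tokens_per_step * k for k in range(steps))


five_steps = cascaded_input_tokens(1_000, 5)   # 10,000 tokens
ten_steps = cascaded_input_tokens(1_000, 10)   # 45,000 tokens
```

Doubling the loop length from five to ten steps more than quadruples the re-read context, which is why trimming or summarizing history matters more as agents get longer.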

Model drift: Defaults change. A feature shipped against GPT-4o mini quietly upgrades to GPT-5.4 when the SDK refreshes, and an output rate roughly 25× higher ($15.00 versus GPT-4o mini's $0.60 per 1M tokens) lands in your bill the same week. Where you host the model matters for a similar reason: Azure and Bedrock add a regional, contractual layer that the rate card does not show. On Azure OpenAI specifically, support plans, egress and observability can push the effective cost well above the direct API rate.

Retries and idle context: Failed JSON parses, tool-call timeouts and reconnects all re-send the prompt. Every retry pays full input.
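A rough way to price retries is the expected number of attempts per request, since each attempt re-sends the full prompt. The sketch below assumes failures are independent with a fixed probability, which real outages and malformed-output loops often violate, so treat it as a floor. `expected_attempts` is a hypothetical helper.

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Average attempts per request, each one paying full input."""
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    # Attempt k + 1 only happens if the first k attempts all failed.
    return sum(failure_rate ** k for k in range(max_retries + 1))


# 10% of calls fail to parse, with up to two retries allowed.
multiplier = expected_attempts(0.10, max_retries=2)  # about 1.11
```

Multiplying the forecast input cost by this figure gives a retry-adjusted estimate, and comparing it with the measured ratio of requests sent to requests served is a quick check for a runaway loop.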

For teams running self-hosted alternatives alongside the API the same dynamics show up in GPU spend on training and inference. Different unit, same compounding behavior.

Ways to Cut OpenAI API Costs Without Cutting Quality

Five levers, ranked by realistic impact.

Route by task, not by default: Send classification, extraction and structured-output jobs to nano or mini. Reserve GPT-5.5 and o3 for tasks that genuinely need reasoning. Most production traffic is over-provisioned by one or two tiers.
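In code, routing by task can be as small as a lookup table with a cheap fallback. The sketch below is illustrative only: the task labels and the mapping are assumptions, and the model names are the display names from the pricing table rather than confirmed API identifiers.

```python
# Hypothetical routing table: task type -> model tier from the pricing table.
ROUTES = {
    "classification": "GPT-5.4 nano",
    "extraction": "GPT-5.4 nano",
    "structured_output": "GPT-5.4 mini",
    "complex_reasoning": "o3",
}

# Unknown task types fall back to a cheap tier, not the flagship.
DEFAULT_MODEL = "GPT-5.4 mini"


def pick_model(task_type: str) -> str:
    """Return the model tier for a task, defaulting to the cheap tier."""
    return ROUTES.get(task_type, DEFAULT_MODEL)
```

The design choice worth noting is the default: when a new task type appears without a routing entry, it lands on an inexpensive tier and has to earn an upgrade.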

Make prompt caching automatic: Caching kicks in on prompts of 1,024 tokens or more and applies to the longest shared prefix, per the caching guide. Put your stable system prompt and tool definitions first and the user-specific content last. Cached input is roughly 90% cheaper on the GPT-5 tiers.
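Why ordering matters can be shown with plain strings. The sketch below compares the shared prefix of two requests when the stable content comes first versus last. It measures characters rather than tokens, so it is only a proxy for what the cache sees, and the prompt text and helper names are hypothetical.

```python
from os.path import commonprefix

# Hypothetical stable instructions, repeated to stand in for a long system prompt.
STABLE = "You are a support assistant for Acme. " * 10


def prompt_stable_first(user_message: str) -> str:
    """Stable content first, user-specific content last."""
    return STABLE + "\nUser: " + user_message


def prompt_user_first(user_message: str) -> str:
    """User-specific content first, which breaks the shared prefix early."""
    return "User: " + user_message + "\n" + STABLE


def shared_prefix_chars(a: str, b: str) -> int:
    """Length of the longest common prefix of two prompts, in characters."""
    return len(commonprefix([a, b]))


good = shared_prefix_chars(
    prompt_stable_first("refund?"), prompt_stable_first("shipping?")
)
bad = shared_prefix_chars(
    prompt_user_first("refund?"), prompt_user_first("shipping?")
)
```

With the stable block first, the two requests share the whole system prompt; with the user message first, they diverge after a few characters and nothing after that point can be reused.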

Use Batch and Flex for anything not real-time: Batch gives 50% off with a 24-hour completion window. Flex offers the same 50% discount with variable latency on supported models. Overnight evaluations, embeddings backfills, content classification: almost none of it needs sub-second response.

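Cap output, not input: Set max_output_tokens based on what the downstream UI actually renders. You pay only for tokens the model generates, so the ceiling bounds your exposure rather than setting a fixed charge, but a 4,000-token ceiling on a feature that displays 200 tokens leaves every call open to a 20× overrun. On reasoning models, leave headroom, because hidden reasoning tokens count toward the same limit.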
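The ceiling is best read as a worst-case bound per call. A short sketch of that bound, with `worst_case_output_cost` as a hypothetical helper and the GPT-5.4 output rate from the pricing table used as the example price:

```python
def worst_case_output_cost(max_output_tokens: int, output_price: float) -> float:
    """Upper bound on one call's output charge, price per 1M tokens."""
    return max_output_tokens / 1_000_000 * output_price


loose_cap = worst_case_output_cost(4_000, 15.00)  # ceiling far above the UI limit
tight_cap = worst_case_output_cost(200, 15.00)    # ceiling matched to the UI
exposure_ratio = loose_cap / tight_cap
```

Multiplying the bound by daily call volume gives the most a runaway generation pattern could cost before anyone notices.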

Forecast and budget against unit economics, not vibes: OpenAI's dashboard shows tokens by API key, but it does not show cost per customer, per feature or per workflow. Reading cloud cost forecasting strategies for AI workloads is closer to the right mental model than monthly spend graphs.
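Attribution starts with tagging each call at the point it is made and aggregating cost by that tag afterwards. The sketch below assumes you log one record per call with your own `customer` and `feature` fields next to the token counts. The record shape, the field names and the `cost_by` helper are all hypothetical.

```python
from collections import defaultdict

# Per-1M-token (input, output) prices for the models in use, from the table.
PRICES = {
    "GPT-5.4 mini": (0.75, 4.50),
    "GPT-5.4": (2.50, 15.00),
}


def cost_by(records: list, key: str) -> dict:
    """Total dollar cost grouped by one field of the call records."""
    totals = defaultdict(float)
    for record in records:
        input_price, output_price = PRICES[record["model"]]
        cost = (
            record["input_tokens"] * input_price
            + record["output_tokens"] * output_price
        ) / 1_000_000
        totals[record[key]] += cost
    return dict(totals)


# Hypothetical usage log, one entry per API call.
calls = [
    {"customer": "acme", "feature": "chat", "model": "GPT-5.4 mini",
     "input_tokens": 500, "output_tokens": 300},
    {"customer": "acme", "feature": "search", "model": "GPT-5.4",
     "input_tokens": 8_200, "output_tokens": 600},
    {"customer": "globex", "feature": "chat", "model": "GPT-5.4 mini",
     "input_tokens": 500, "output_tokens": 300},
]

per_customer = cost_by(calls, "customer")
per_feature = cost_by(calls, "feature")
```

The same records answer both questions, which is the point: once the tag is on the call, cost per customer and cost per feature are just different groupings.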

From Rate Card to Unit Economics

A rate card tells you the price of a token. It does not tell you which customer, feature or release is burning your OpenAI budget.

This is the gap Amnic closes. Amnic's AI token cost management attributes every API call to the customer, feature, environment and team it served, surfaces model-routing drift before it hits the invoice and forecasts spend per unit of business value. Not just per million tokens.

For teams that want to evaluate options first, the broader landscape is covered in our guide to dedicated FinOps tools for AI workloads.

Key Takeaways

OpenAI API pricing is per token, split between input and output, with output running four to six times more expensive than input.
The spread across the model lineup is about 150×. Picking the wrong tier is the single most expensive mistake you can make.
Cached input is roughly 90% cheaper than fresh input on the GPT-5 tiers. Structure prompts to maximize cache hits.
Batch and Flex deliver 50% off for non-real-time workloads. Much background and offline traffic qualifies.
Real bills outrun rate cards because of reasoning tokens, vision-token math, output cascade, model drift and retries.
Cost per million tokens is the wrong unit. Cost per customer, per feature, per workflow is the unit that lets you actually decide what to ship.

Frequently Asked Questions

What is the cheapest OpenAI API model?

GPT-5.4 nano at $0.20 input and $1.25 output per 1M tokens is the lowest standard tier. Open-weight variants list lower.

How is OpenAI API pricing calculated?

Cost equals input tokens times input price plus output tokens times output price, both divided by one million.

Why is OpenAI API output more expensive than input?

Output is generated token by token while input is consumed in parallel. On OpenAI specifically the list-price ratio is about 6× across the GPT-5 family and 4× on o3, o4 mini and GPT-4o. Reasoning models like o3 widen the effective gap further because hidden reasoning tokens are billed as output.

What is a token in OpenAI pricing?

A token is a chunk of text roughly equal to four characters or three-quarters of a word in English.

Does OpenAI charge for cached prompts?

Yes, but at roughly 90% off standard input pricing for the cached prefix on the GPT-5 tiers, with smaller discounts on older models. Caching is automatic on prompts of 1,024 tokens or more.

How much discount does the Batch API give?

50% off both input and output, with a 24-hour completion window.

What is Flex pricing on the OpenAI API?

Flex processing matches the Batch 50% discount but runs synchronously with variable latency and possible resource-unavailable responses on supported models.

Do reasoning tokens cost extra on o-series models?

Reasoning tokens are billed as output tokens even though they are not returned in the response, which is why a short visible answer can carry a long invisible cost.

How do I estimate my OpenAI API monthly bill?

Estimate average input and output tokens per call, multiply by daily call volume, apply the per-million rate for your chosen model and multiply by 30. Sanity check against your usage dashboard after week one.

How can I control OpenAI API costs in production?

Route by task to the lowest viable model, structure prompts for cache hits, push non-real-time work to Batch or Flex, cap output tokens and attribute spend to the customer or feature that drove it.

Is OpenAI API cheaper than ChatGPT Plus?

ChatGPT Plus is a flat $20 per month for one user. The API is per-token usage-based, so it is cheaper for low-volume use and more expensive at scale. They serve different buyers.
