Compare Input vs Output Token Pricing: Why Output Costs More and How to Budget for It

July 2, 2026

9 min read

Amnic

Comparisons

Output tokens cost more than input tokens on every major model API, usually 2x to 8x more per million tokens. Input is the text you send in, and output is the text the model generates back. The gap exists because the model reads your whole prompt in one parallel pass but writes its answer one token at a time.

That single asymmetry decides most AI bills. If you price a workload by counting total tokens and multiplying by one rate, your estimate can be wrong by 3x or more. This guide shows the current split across providers, works through real examples, and gives you a way to budget for it. First it helps to know what a token in AI actually represents before you attach a price to it.

What Counts as Input vs Output

Input tokens are everything you send into the model on a request. Output tokens are only what it writes back. The two are metered at different rates on the same call, so knowing which bucket each part of your traffic lands in is the first step to a correct estimate.

Input tokens include:

Your prompt and any user message on that turn
The system prompt, billed on every single call, not once per session
Retrieved documents, tool definitions, and few-shot examples
The full conversation history, re-sent as input on every turn of a chat

Output tokens include:

The visible text the model returns
Reasoning or thinking tokens from a thinking model, billed as output even though you never see them

Two more categories hide inside the same bill and change the math. Cached input tokens are prompt tokens the provider already processed and stored, billed at a steep discount on reuse. Reasoning tokens inflate the output side invisibly. Studying how token economics compound across a session explains why input creeps up faster than teams expect.

A quick pass through a token counter before you ship a feature keeps these categories from becoming a monthly surprise. Measure a typical request end to end, split it into input and output, and you have the raw numbers every calculation below depends on.
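To make the re-sent history concrete, here is a minimal sketch of how billed input grows across a chat. It assumes no caching and no history trimming, and the function name and token counts are made up for illustration.

```python
def chat_token_totals(system_tokens, turns):
    """Return (total_input, total_output) billed across a multi-turn chat.

    turns is a list of (user_tokens, reply_tokens) pairs. On every turn the
    system prompt and the full prior history are re-sent as input.
    """
    history = 0        # tokens accumulated from earlier turns
    total_input = 0
    total_output = 0
    for user_tokens, reply_tokens in turns:
        # Input for this call: system prompt + prior history + new user message.
        total_input += system_tokens + history + user_tokens
        total_output += reply_tokens
        # The user message and the reply both join the history for the next call.
        history += user_tokens + reply_tokens
    return total_input, total_output


# Five identical turns: 100 tokens from the user, 300 tokens back.
total_in, total_out = chat_token_totals(system_tokens=500, turns=[(100, 300)] * 5)
print(total_in, total_out)  # 7000 1500
```

The user only typed 500 tokens in this toy chat, yet 7,000 input tokens are billed, because each call carries the system prompt and everything said before it.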

Why Output Tokens Cost More

The price gap is not arbitrary markup. It comes from how inference runs on a GPU. When you send a prompt, the model processes all input tokens together in a single parallel forward pass called prefill, so the hardware stays busy and cost per input token stays low.

Generating output works differently. The model produces one token, appends it, then runs another full forward pass for the next token, and repeats. A 1,000-token answer means roughly 1,000 sequential passes, bottlenecked by memory bandwidth rather than raw compute. A growing key-value cache also makes batching harder as the response lengthens.

This is why the two directions are priced apart. The mechanics of LLM inference explain why sequential decoding is the expensive half of the bill, and that cost shape flows straight through to your invoice.
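The sequential dependency is easier to see in a toy loop. This is an illustrative sketch only: `toy_model` is a stand-in for a real forward pass, and real servers reuse a key-value cache rather than reprocessing the sequence from scratch.

```python
def generate(next_token, prompt, max_new_tokens):
    """Toy autoregressive loop: each new token needs its own pass."""
    tokens = list(prompt)      # prefill: the whole prompt is available at once
    decode_passes = 0
    for _ in range(max_new_tokens):
        new_token = next_token(tokens)   # depends on every token so far
        tokens.append(new_token)         # must finish before the next step starts
        decode_passes += 1
    return tokens, decode_passes


def toy_model(tokens):
    """Hypothetical stand-in for a model: derives a token from the last one."""
    return (tokens[-1] + 1) % 50


sequence, passes = generate(toy_model, prompt=[3, 4, 5], max_new_tokens=1000)
print(passes)  # 1000
```

No step in the loop can start until the previous one has appended its token, which is why a long answer cannot be parallelized the way a long prompt can.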

Current Input vs Output Pricing by Provider

Most published comparisons still quote an older model generation like GPT-4o and Claude 3.5 Sonnet. The current lineup tells a fresher and wider story. The table below lists standard-tier per-million-token rates from each provider's own pricing pages.

Model	Input ($/M)	Output ($/M)	Output-to-input ratio
DeepSeek V4-Pro	$0.435	$0.87	2.0x
Claude Haiku 4.5	$1.00	$5.00	5.0x
Claude Sonnet	$3.00	$15.00	5.0x
Claude Opus 4.8	$5.00	$25.00	5.0x
GPT-5.4	$2.50	$15.00	6.0x
GPT-5.5	$5.00	$30.00	6.0x
Gemini 2.5 Pro	$1.25	$10.00	8.0x
Gemini 2.5 Flash	$0.30	$2.50	8.3x

The ratio clusters around 5x for Anthropic and 6x for OpenAI, drops to 2x for DeepSeek, and reaches 8x or more for Gemini. There is no universal multiplier, so the model you pick changes the shape of your bill as much as the volume you run.
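The ratio column is simply output rate divided by input rate. A short sketch that recomputes it from the rates in the table above, so you can rerun it when a rate card changes (the dictionary layout is an assumption, not any provider's API):

```python
# (input $/M, output $/M), copied from the table above.
RATES = {
    "DeepSeek V4-Pro": (0.435, 0.87),
    "Claude Haiku 4.5": (1.00, 5.00),
    "Claude Sonnet": (3.00, 15.00),
    "Claude Opus 4.8": (5.00, 25.00),
    "GPT-5.4": (2.50, 15.00),
    "GPT-5.5": (5.00, 30.00),
    "Gemini 2.5 Pro": (1.25, 10.00),
    "Gemini 2.5 Flash": (0.30, 2.50),
}


def output_to_input_ratio(input_rate, output_rate):
    """How many times more an output token costs than an input token."""
    return output_rate / input_rate


for model, (input_rate, output_rate) in RATES.items():
    print(f"{model}: {output_to_input_ratio(input_rate, output_rate):.1f}x")
```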

If you want to weigh models side by side, an LLM cost comparison sets the full rate cards next to each other. A lower headline input rate can still lose to a steep output rate once your workload leans toward generation, so the split decides more than the sticker price does.

For a single provider's full tiering, the OpenAI API pricing breakdown covers cached, batch, and long-context rates. Hosting matters too, since running Claude or Llama through a managed service adds its own layer, which is where Amazon Bedrock cost monitoring tools come in.

Three Real Examples With Calculations

The fastest way to see the asymmetry is to price three common workloads at their real rates. Each uses a different model and a different input-to-output balance, and each shows how the split, not the token count, drives the total.

Example 1: Support chatbot on Claude Sonnet ($3 in / $15 out):
Each query sends 2,000 input tokens and returns 300 output tokens, across 100,000 queries a month. Input costs 200M x $3 = $600. Output costs 30M x $15 = $450. The bot sends almost 7x more tokens than it writes, yet output is still 43% of the $1,050 bill.

Example 2: RAG summarizer on Gemini 2.5 Flash ($0.30 in / $2.50 out):
Each run reads 20,000 input tokens and writes 500 output tokens, across 50,000 runs. Input costs 1,000M x $0.30 = $300. Output costs 25M x $2.50 = $62.50. This is input-dominated work: output is just 17% of the $362.50 total.

Example 3: Coding agent on GPT-5.4 ($2.50 in / $15 out):
Each task reads 3,000 input tokens and generates 4,000 output tokens, across 20,000 tasks. Input costs 60M x $2.50 = $150. Output costs 80M x $15 = $1,200. Output is 89% of the $1,350 bill even though it is barely more than half the tokens.

Workload	Model	In / Out per call	Volume	Input $	Output $	Total	Output share
Support chatbot	Claude Sonnet	2,000 / 300	100k	$600	$450	$1,050	43%
RAG summarizer	Gemini 2.5 Flash	20,000 / 500	50k	$300	$62.50	$362.50	17%
Coding agent	GPT-5.4	3,000 / 4,000	20k	$150	$1,200	$1,350	89%
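The three rows above can be reproduced with one small function. This is a sketch with a hypothetical helper name; it assumes standard-tier rates with no caching or batch discounts.

```python
def monthly_cost(in_tokens, out_tokens, calls, in_rate, out_rate):
    """Price input and output separately, then add them.

    Token counts are per call; rates are dollars per million tokens.
    Returns (input_cost, output_cost, total, output_share_of_cost).
    """
    input_cost = in_tokens * calls / 1_000_000 * in_rate
    output_cost = out_tokens * calls / 1_000_000 * out_rate
    total = input_cost + output_cost
    share = output_cost / total if total else 0.0
    return input_cost, output_cost, total, share


workloads = {
    "Support chatbot": (2_000, 300, 100_000, 3.00, 15.00),
    "RAG summarizer": (20_000, 500, 50_000, 0.30, 2.50),
    "Coding agent": (3_000, 4_000, 20_000, 2.50, 15.00),
}

for name, args in workloads.items():
    input_cost, output_cost, total, share = monthly_cost(*args)
    print(f"{name}: ${total:,.2f} total, output share {share:.0%}")
```

Swapping in your own measured token counts and call volume gives the same split for any feature.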

How to Actually Calculate Your Cost

A common mistake is to add input and output tokens together and apply one price. Say a model charges $1.25 per million input and $10.00 per million output. The cost of one million total tokens is not a single blended figure, because it depends entirely on the split.

Send 800,000 input and 200,000 output: that is $1.00 plus $2.00, so $3.00. Flip it to 200,000 input and 800,000 output: that is $0.25 plus $8.00, so $8.25. Same million tokens, but the second workload costs 2.75x more.

The Gemini API pricing page shows this directional split clearly, and the same logic applies to every model you run. Always price the two directions separately, then add them, and never reason from a single blended rate.
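One way to see why a single blended rate misleads is to write it as a function of the output share. The sketch below uses the $1.25 and $10.00 rates from the example above; the function name is illustrative.

```python
def blended_rate(output_share, in_rate, out_rate):
    """Effective $/M tokens when output_share of all tokens are output."""
    if not 0.0 <= output_share <= 1.0:
        raise ValueError("output_share must be between 0 and 1")
    return (1 - output_share) * in_rate + output_share * out_rate


# Same one million tokens, priced at different splits.
for share in (0.0, 0.2, 0.5, 0.8, 1.0):
    print(f"{share:.0%} output -> ${blended_rate(share, 1.25, 10.00):.2f} per million")
```

At 20% output this gives the $3.00 from the example, and at 80% output it gives $8.25. The blended rate only holds for the split it was computed from, so it stops being valid the moment the mix of traffic shifts.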

The Part Nobody Tells You: Workload Mix

Because output is the expensive direction, a workload's input-to-output ratio predicts its cost better than the token count does. Sorting your traffic by that ratio tells you where the money goes before you write a line of optimization code.

Input-heavy (cheap): retrieval-augmented generation, summarization, classification, extraction. Lots of context in, little text out.
Balanced: chat assistants, Q&A, translation. Moderate on both sides.
Output-heavy (expensive): code generation, long-form writing, agent loops, reasoning tasks. Little in, lots out.

Two features with similar token volumes can cost very different amounts, and the smaller one can even cost more. The examples above show it: the summarizer moves about 1,025M tokens for $362.50, while the coding agent moves only 140M tokens for $1,350. Output is 17% of the first bill and 89% of the second. Classifying workloads this way feeds directly into how to optimize LLM cost once you know which side dominates.
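A rough classifier for the three buckets might look like the sketch below. The 10% and 40% cutoffs are assumptions chosen for illustration, not thresholds from any provider, so tune them to your own traffic.

```python
def classify_workload(input_tokens, output_tokens, low=0.10, high=0.40):
    """Bucket a workload by the share of its tokens that are output.

    low and high are illustrative cutoffs, not industry standards.
    """
    total = input_tokens + output_tokens
    if total == 0:
        raise ValueError("workload has no tokens")
    output_share = output_tokens / total
    if output_share < low:
        return "input-heavy"
    if output_share > high:
        return "output-heavy"
    return "balanced"


print(classify_workload(20_000, 500))   # input-heavy
print(classify_workload(2_000, 300))    # balanced
print(classify_workload(3_000, 4_000))  # output-heavy
```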

Input vs Output Is a Unit-Economics Problem

Most guides stop at "write shorter prompts." That misses the point for anyone running a business on these APIs. The asymmetry means a blended cost-per-token number hides your true margin, because different features consume tokens in opposite directions.

A generous free tier built on an output-heavy feature can quietly erode gross margin while an input-heavy feature on the same plan looks profitable. To see this you must attribute input, output, and cached tokens separately to each feature, customer, and team.

That is a cost allocation exercise, and it separates knowing your average token cost from knowing your per-customer cost of goods sold. This discipline of treating tokens as a metered resource with owners is what teams increasingly call TokenOps.
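As a sketch of what that attribution involves, the snippet below sums cost per feature with each token category priced at its own rate. The record schema, feature names, and rates are all hypothetical, and here "input" means uncached input only.

```python
from collections import defaultdict

# Hypothetical $/M rates for one model; cached input is assumed to be
# billed at a 90% discount to fresh input.
RATES = {"input": 3.00, "cached": 0.30, "output": 15.00}


def cost_by_feature(records, rates=RATES):
    """Sum cost per feature, pricing each token category separately."""
    totals = defaultdict(float)
    for record in records:
        for category, rate in rates.items():
            totals[record["feature"]] += record.get(category, 0) * rate / 1_000_000
    return dict(totals)


# Made-up usage records, e.g. one per request batch.
usage = [
    {"feature": "search_summary", "input": 900_000, "cached": 600_000, "output": 40_000},
    {"feature": "code_assist", "input": 200_000, "cached": 0, "output": 500_000},
    {"feature": "search_summary", "input": 100_000, "cached": 400_000, "output": 10_000},
]

for feature, cost in cost_by_feature(usage).items():
    print(f"{feature}: ${cost:.2f}")
```

In this made-up data, `code_assist` uses about a third of the tokens that `search_summary` does but costs twice as much, which is exactly what a blended per-token figure would hide.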

It is also the ground that a dedicated FinOps for AI practice is built to hold. Amnic tracks input, output, and cached tokens as distinct metrics across providers and attributes them to cost centers, so you can see which feature drives the output-heavy spend.

Amnic reports and allocates that usage rather than telling you to switch models, which keeps engineering decisions with the people who own the product. Purpose-built LLM cost allocation tools make the per-token split visible instead of leaving it averaged inside one dashboard figure.

The Hidden Multipliers: Reasoning and Cached Tokens

Two mechanics can swing your effective ratio well beyond the sticker price. Reasoning tokens from thinking models are billed as output even though the response never shows them, so a model with a modest listed ratio can bill like a far more expensive one on hard problems.

Practitioners consistently report bills several times higher than a naive estimate on reasoning-heavy tasks, and dashboards often omit these tokens, which makes reconciliation hard. Watching for them belongs in any serious cost-visibility setup, Gemini included.
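To gauge how far hidden reasoning tokens can move a bill, compare the estimate based on visible output with the billed amount. The token counts below are invented for illustration, and the rates are the $1.25 and $10.00 per million used elsewhere in this guide.

```python
def request_cost(input_tokens, output_tokens, in_rate, out_rate):
    """Cost of one request in dollars; rates are $ per million tokens."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


def reasoning_adjusted(input_tokens, visible_tokens, reasoning_tokens, in_rate, out_rate):
    """Return (naive_cost, billed_cost); reasoning tokens bill as output."""
    naive = request_cost(input_tokens, visible_tokens, in_rate, out_rate)
    billed = request_cost(input_tokens, visible_tokens + reasoning_tokens, in_rate, out_rate)
    return naive, billed


# Hypothetical request: 1,000 in, 500 visible out, 2,000 hidden reasoning.
naive, billed = reasoning_adjusted(1_000, 500, 2_000, 1.25, 10.00)
print(f"naive ${naive:.5f}, billed ${billed:.5f}, {billed / naive:.1f}x")
```

With these numbers the billed cost is about 4.2x the naive estimate, even though the visible answer is unchanged.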

Cached input pushes the other way. Discounts vary by provider, but rereading a cached prompt typically costs around 90% less than processing it fresh, and where cache writes are billed separately the first write costs about 25% more than a normal input token. DeepSeek illustrates the extreme: its price for cache-hit input is a fraction of a cent per million tokens.

Getting caching right, and confirming it actually fires, is one of the largest levers on the input side. It pairs with the ideas in how to reduce inference cost for the output side.
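The sketch below shows why verifying the hit rate matters. It assumes a 1.25x write multiplier and a 0.10x read multiplier, matching the rough figures above, and treats every miss as a fresh cache write. Real providers differ on multipliers, cache lifetime, and minimum prefix size.

```python
def prefix_cost(prefix_tokens, calls, hits, in_rate, write_mult=1.25, read_mult=0.10):
    """Dollar cost of a shared prompt prefix across many calls.

    hits is the number of calls served from cache; every other call is
    assumed to pay the cache-write price. Multipliers are illustrative.
    """
    if not 0 <= hits <= calls:
        raise ValueError("hits must be between 0 and calls")
    misses = calls - hits
    base_per_call = prefix_tokens * in_rate / 1_000_000
    return base_per_call * (misses * write_mult + hits * read_mult)


# A 2,000-token system prompt sent 100 times at $3 per million input.
uncached = prefix_cost(2_000, 100, 0, 3.00, write_mult=1.0)  # caching off
all_hits = prefix_cost(2_000, 100, 99, 3.00)                 # one write, 99 reads
no_hits = prefix_cost(2_000, 100, 0, 3.00)                   # caching on, never hits
print(f"${uncached:.4f} ${all_hits:.4f} ${no_hits:.4f}")
```

Under these assumptions, caching that always hits cuts the prefix cost by roughly 89%, while caching that never hits costs 25% more than not caching at all.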

Managing the Split

You control the bill by acting on the expensive direction first. The levers below map directly to the two meters, and most teams find the output-side ones move the number fastest.

Measure per feature, not per account, so you know which workloads sit on the expensive side.
Cap output length where verbosity adds no value, since output is where the multiplier bites.
Cache stable context like system prompts and retrieved documents, then verify the hit rate rather than assuming it works.
Route by ratio, sending input-heavy and output-heavy jobs to models whose pricing suits them.
Watch hosted spend, since running models through a platform adds a layer that Amazon Bedrock cost optimization tools help contain.
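Routing by ratio can be sketched as a per-call price comparison across a rate card. The rates below come from the provider table earlier in this guide. The sketch looks at price only and ignores quality, latency, and context limits, which a real router would have to weigh.

```python
# (input $/M, output $/M) for a subset of the models listed earlier.
RATE_CARD = {
    "DeepSeek V4-Pro": (0.435, 0.87),
    "Claude Haiku 4.5": (1.00, 5.00),
    "GPT-5.4": (2.50, 15.00),
    "Gemini 2.5 Flash": (0.30, 2.50),
}


def call_cost(input_tokens, output_tokens, rates):
    """Dollar cost of one call at the given (input, output) $/M rates."""
    in_rate, out_rate = rates
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


def cheapest_model(input_tokens, output_tokens, rate_card=RATE_CARD):
    """Name of the model with the lowest cost for this token split."""
    return min(rate_card, key=lambda m: call_cost(input_tokens, output_tokens, rate_card[m]))


print(cheapest_model(20_000, 500))   # input-heavy job -> Gemini 2.5 Flash
print(cheapest_model(3_000, 4_000))  # output-heavy job -> DeepSeek V4-Pro
```

The lowest input rate wins the input-heavy job and the lowest output rate wins the output-heavy one, so the same rate card produces different answers depending on the split.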

A dedicated AI token management view keeps all of this in one place instead of scattered across provider consoles. Standardizing on one source of truth is what makes the levers repeatable across teams.

Conclusion

Input and output tokens are two separate meters, and output is the one that runs fast. The current spread runs from about 2x on DeepSeek to more than 8x on Gemini, so the model you choose and the direction your workload leans decide the bill together.

Price the two directions apart, classify your workloads by their input-output ratio, and allocate the split to features and customers. Fold it into your wider FinOps practice, and the asymmetry stops being a billing surprise and becomes a lever you control.

FAQs

If I use one million tokens, isn't my cost just the token count times one price?

No. Input and output are billed at different rates on the same request, and output usually costs 2x to 8x more. Your cost depends on how the million splits between the two directions, not the total.

Why do output tokens cost more than input tokens?

Input is processed in one parallel forward pass, so it is cheap per token. Output is generated one token at a time, each needing its own full pass, and that sequential work is limited by memory bandwidth, so each output token ties up the hardware for longer.

Does my conversation history and system prompt count as input on every call?

Yes. The full context, including history and the system prompt, is re-sent and re-billed as input on every turn unless you cache or trim it. Long chats quietly grow the input bill.

Do reasoning or thinking tokens count as input or output?

They are billed as output. The model generates them as intermediate steps before the visible answer, and they are usually hidden from the response, so they inflate output cost without showing up in what you read.

Are cached input tokens cheaper, and by how much?

Yes. Rereading a cached prompt typically costs around 90% less than processing it fresh. The catch is that, where cache writes are billed separately, the first write costs about 25% more than a normal input token.

Which workloads are cheap and which are expensive?

Input-heavy jobs like summarization and retrieval are cheap because they return little text. Output-heavy jobs like code generation and agent loops are expensive because they ride the higher output rate.
