LLM Cost Comparison: OpenAI vs Anthropic vs Gemini vs Mistral API Costs Compared

June 8, 2026

8 min read

Amnic

An LLM cost comparison ranks model APIs by what you actually pay per million tokens, split into input and output rates. The cheapest published rate rarely wins. Your real bill depends on how many output tokens a model generates, whether you use prompt caching or batch discounts, and how each provider counts a token. List price is the starting line, not the finish.

This guide puts the leading commercial providers side by side, from the four majors to fast-growing challengers like Grok, DeepSeek and Perplexity Sonar, explains the cost factors that the rate cards hide, and shows how to pick the cheapest model for a given workload. That decision sits at the center of AI cost management within a FinOps practice, where token spend gets the same scrutiny as cloud spend.

LLM Pricing Comparison Table

Providers price most models per million tokens, with input charged separately from output. The table below uses current published rates for the flagship and budget tiers of each provider.

Model	Input $/1M	Output $/1M	Cached input $/1M	Context	Best for
GPT-5.4	$2.50	$15.00	$0.25	272K	Complex reasoning and agentic tool use
GPT-5.4 mini	$0.75	$4.50	$0.075	400K	High-volume chat at lower cost
Claude Opus 4.8	$5.00	$25.00	$0.50	1M	Hardest reasoning, long agent runs
Claude Sonnet 4.6	$3.00	$15.00	$0.30	1M	Balanced coding and analysis
Claude Haiku 4.5	$1.00	$5.00	$0.10	200K	Fast classification and routing
Gemini 3.1 Pro	$2.00	$12.00	$0.20	1M+	Long-context and multimodal docs
Gemini 3.5 Flash	$1.50	$9.00	$0.15	1M	Balanced speed at scale
Gemini 3.1 Flash-Lite	$0.25	$1.50	$0.025	1M	Ultra-cheap high-volume tasks
Mistral Medium 3.5	$1.50	$7.50	n/a	128K	EU-hosted general workloads
Mistral Large 3	$0.50	$1.50	n/a	128K	Low-cost output, self-host option
Mistral Small 4	$0.10	$0.30	n/a	128K	Cheapest open-weight tasks
Grok 4.3	$1.25	$2.50	$0.20	1M	Real-time reasoning, low output cost
Grok Build 0.1	$1.00	$2.00	$0.20	256K	Code generation and app builds
DeepSeek V4 Pro	$0.44	$0.87	$0.004	1M	Cheap frontier-grade reasoning
DeepSeek V4 Flash	$0.14	$0.28	$0.003	1M	Cheapest reasoning at scale
Perplexity Sonar	$1.00	$1.00	n/a	128K	Answers grounded in live web search
Perplexity Sonar Pro	$3.00	$15.00	n/a	128K	Cited research, deeper retrieval
Perplexity Sonar Reasoning Pro	$2.00	$8.00	n/a	128K	Multi-step reasoning over web data
Perplexity Sonar Deep Research	$2.00	$8.00	n/a	128K	Autonomous long-form research

Rates verified against official pricing documentation for OpenAI, Anthropic, Google, Mistral, xAI, DeepSeek and Perplexity. DeepSeek V4 Pro reflects current promotional pricing, and the Sonar models add a search fee of $5 to $14 per thousand queries on top of these token rates.

Across these providers, input prices span a 50x range, from $0.10 to $5.00 per million tokens. That gap is why a single comparison table cannot pick a winner for you. The right model is the one whose price profile matches your token mix.

The challengers reset the floor. DeepSeek V4 Flash prices reasoning-grade output at $0.28 per million tokens, a fraction of the majors, while V4 Pro stays under $1. Grok 4.3 brings a 1M-token context and $0.20 cached input at a mid-tier rate. Perplexity Sonar folds live web search into the call, but its search fee of $5 to $14 per thousand queries lands on top of tokens, so the sticker rate alone understates the bill.
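To make the search-fee effect concrete, here is a minimal sketch of a per-request cost for Perplexity Sonar. The request size (1,000 input and 500 output tokens) is an illustrative assumption, not a measured value, and the function name is hypothetical. The token rates and the $5 to $14 fee range come from the table and note above.

```python
def sonar_request_cost(input_tokens, output_tokens, input_rate, output_rate,
                       search_fee_per_1k):
    """USD cost of one request: token charges plus the per-query search fee."""
    token_cost = (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
    search_cost = search_fee_per_1k / 1_000  # fee is quoted per thousand queries
    return token_cost + search_cost

# Assumed request size; rates are the Perplexity Sonar row ($1.00 in, $1.00 out).
low = sonar_request_cost(1_000, 500, 1.00, 1.00, search_fee_per_1k=5.00)
high = sonar_request_cost(1_000, 500, 1.00, 1.00, search_fee_per_1k=14.00)
print(f"per request: ${low:.4f} to ${high:.4f}")
```

Under these assumptions the token charge is $0.0015 per request, so the search fee is roughly 3 to 9 times larger than the token cost it sits on top of.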

Why List Price Misleads

Most teams read the input rate, sort low to high, and stop. That misses where the money goes. Five factors decide your effective cost.

Output tokens cost more, often much more: Output runs roughly 2 to 5 times the input rate because each generated token needs a full pass through the model, as token-economics research shows. In the table above the ratio ranges from 1x (Perplexity Sonar) to 6x (GPT-5.4 and the Gemini tiers). A content-generation task with short prompts and long answers behaves nothing like a summarization task with long inputs and short answers, even at the same total token count.

Reasoning tokens bill as output: Models that think before answering charge that hidden thinking at the output rate. In one developer cost breakdown, a $5 estimate for a million tokens turned into a $20 charge because reasoning tokens were counted as output. Agentic workloads make this worse, consuming 5 to 30 times more tokens per task than a standard chatbot, per Gartner-cited analysis.

A token is not a token across vendors: Each provider uses its own tokenizer, so identical text splits into different counts. For English prose the counts differ by low single-digit percentages, for code by 10% to 20%. The provider pricing documentation notes that Claude Opus 4.7 and later use a new tokenizer that may consume up to 35% more tokens for the same text. A model that looks cheaper per token can cost more per request.

Caching and batch change the math: Prompt caching can cut repeat input by up to 90%, and asynchronous batch endpoints halve both input and output on work that does not need a live response. A RAG app with a static system prompt can shift most of its traffic onto cached rates.

Context tiers add surcharges: Some providers raise the rate above a context threshold. Long-document workloads can quietly cross that line and pay more per token than the headline suggests.
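The tokenizer point is the easiest of the five to overlook, so here is a small sketch of it. Both models and their rates are hypothetical, chosen only to show the effect. The 1.35 multiplier mirrors the "up to 35% more tokens" figure quoted above.

```python
def cost_per_request(baseline_tokens, token_multiplier, rate_per_million):
    """USD cost of text that is `baseline_tokens` long on a reference tokenizer."""
    billed_tokens = baseline_tokens * token_multiplier
    return billed_tokens * rate_per_million / 1_000_000

# Hypothetical models: B has the lower list rate, but its tokenizer
# produces 35% more tokens for the same text.
model_a = cost_per_request(10_000, token_multiplier=1.00, rate_per_million=3.00)
model_b = cost_per_request(10_000, token_multiplier=1.35, rate_per_million=2.50)
print(f"A: ${model_a:.5f}  B: ${model_b:.5f}")
```

With these assumed numbers, model B costs more per request ($0.03375 versus $0.03) even though its per-token rate is about 17% lower.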

How to Compare LLM Costs the Right Way

Skip the per-token sort. Compare on cost per completed task instead.

First, define a unit of work, such as one support reply or one document summary. Measure the input and output tokens that unit consumes on each model, since the same task uses different token counts per model. Then apply a simple monthly estimate:

Monthly cost = ((requests x avg input tokens x input rate) + (requests x avg output tokens x output rate)) / 1,000,000

Weight the result by your real input-to-output ratio. RAG is input-heavy, so caching and input price dominate. Chat and agents are output-heavy, so the output rate and any reasoning overhead dominate.
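The monthly estimate translates directly into code. In the sketch below, the function name and the two workloads are illustrative assumptions. The rates are the GPT-5.4 mini row from the table.

```python
def monthly_cost(requests, avg_input_tokens, avg_output_tokens,
                 input_rate, output_rate):
    """Monthly USD cost; rates are USD per million tokens."""
    input_cost = requests * avg_input_tokens * input_rate
    output_cost = requests * avg_output_tokens * output_rate
    return (input_cost + output_cost) / 1_000_000

# Two assumed workloads with the same total tokens per request (2,000),
# priced at GPT-5.4 mini rates ($0.75 input, $4.50 output).
input_heavy = monthly_cost(50_000, 1_800, 200, 0.75, 4.50)
output_heavy = monthly_cost(50_000, 200, 1_800, 0.75, 4.50)
print(f"input-heavy: ${input_heavy:.2f}  output-heavy: ${output_heavy:.2f}")
```

Same model, same request count, same total tokens: the input-heavy mix comes to $112.50 a month and the output-heavy mix to $412.50, which is why the input-to-output ratio has to be measured rather than guessed.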

Factor in quality too. A cheap model that needs human review on 15% of outputs can cost more per finished task than a pricier model with a 3% rework rate. The value question is cost per acceptable answer, not cost per token.
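A rough way to put a number on cost per acceptable answer is to add the expected review cost to the model cost. The sketch below is a simplification: it assumes one human review fixes a rejected output and ignores re-running the model. The per-task model costs and the $0.50 review cost are hypothetical. The 15% and 3% rework rates come from the paragraph above.

```python
def cost_per_acceptable_answer(model_cost_per_task, rework_rate, review_cost):
    """Expected USD per finished task: model spend plus expected human review."""
    return model_cost_per_task + rework_rate * review_cost

# Assumed per-task model costs and review cost; rework rates from the text.
cheap = cost_per_acceptable_answer(0.002, rework_rate=0.15, review_cost=0.50)
pricier = cost_per_acceptable_answer(0.020, rework_rate=0.03, review_cost=0.50)
print(f"cheap model: ${cheap:.3f}  pricier model: ${pricier:.3f}")
```

Under these assumptions the model that is ten times more expensive per call ends up at less than half the cost per finished task, $0.035 versus $0.077.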

Worked Cost Examples: Where the Rankings Flip

The same model can be the cheapest or the most expensive option depending on the workload. Three scenarios, costed from the rates above, show why a single ranking is useless.

RAG support bot (input-heavy): Each query sends 6,000 input tokens, of which 5,000 are a static system prompt, and returns 400 output tokens, across 100,000 queries a month. On Gemini 3.1 Flash-Lite that is roughly 600M input and 40M output, about $210 a month. Cache the static prompt and that portion drops to the $0.025 cached rate, pulling the total under $100. Caching, not the sticker price, is the lever here. The identical workload on Claude Sonnet 4.6 runs about $2,400 uncached, an 11x swing for the same work.

Coding agent (reasoning-heavy): A task sends 8,000 input tokens and returns a 1,000-token answer, but the model also generates 12,000 hidden reasoning tokens billed as output. On Claude Opus 4.8 the real cost is about $0.37 a task, not the $0.07 that a naive estimate counting only the 1,000-token answer implies. Across 50,000 tasks a month that gap is roughly $18,250 versus $3,250. The reasoning multiplier, not the headline rate, sets the bill, which is why a mid tier with capped output often beats a flagship that reasons at length.

Batch summarization (offline): Summarizing 200,000 documents at 2,000 input and 300 output each costs about $190 on Gemini 3.1 Flash-Lite at standard rates. Route it through the batch endpoint and the same job halves to roughly $95, because nothing needs a live response.
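The three scenarios can be reproduced with a few lines of arithmetic. The sketch below uses the table rates and the token counts given above. The helper name is hypothetical, the cached figure assumes every request hits the cache with no cache-write surcharge, and the batch figure assumes a flat 50% discount.

```python
def cost_usd(tokens, rate_per_million):
    """USD cost of `tokens` at a per-million-token rate."""
    return tokens * rate_per_million / 1_000_000

# RAG support bot: 100,000 queries a month.
queries = 100_000
rag_uncached = cost_usd(queries * 6_000, 0.25) + cost_usd(queries * 400, 1.50)
rag_cached = (cost_usd(queries * 5_000, 0.025)   # static prompt at the cached rate
              + cost_usd(queries * 1_000, 0.25)  # remaining fresh input
              + cost_usd(queries * 400, 1.50))   # output is billed at the full rate
rag_sonnet = cost_usd(queries * 6_000, 3.00) + cost_usd(queries * 400, 15.00)

# Coding agent on Claude Opus 4.8: reasoning tokens bill at the output rate.
agent_real = cost_usd(8_000, 5.00) + cost_usd(1_000 + 12_000, 25.00)
agent_naive = cost_usd(8_000, 5.00) + cost_usd(1_000, 25.00)

# Batch summarization on Gemini 3.1 Flash-Lite: 200,000 documents.
docs = 200_000
summ_standard = cost_usd(docs * 2_000, 0.25) + cost_usd(docs * 300, 1.50)
summ_batch = summ_standard * 0.5  # assumed flat 50% batch discount

print(f"RAG: ${rag_uncached:.2f} uncached, ${rag_cached:.2f} cached, "
      f"${rag_sonnet:.2f} on Sonnet")
print(f"Agent: ${agent_real:.3f} real vs ${agent_naive:.3f} naive per task")
print(f"Summaries: ${summ_standard:.2f} standard, ${summ_batch:.2f} batch")
```

Swapping in your own token counts and rates is the quickest way to see which lever (caching, reasoning overhead, or batch) matters most for a given workload.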

The pattern across all three: a cheap classifier in front of a flagship, plus caching and batch where they apply, routinely cuts a production bill by more than half. For platform-level choices about where to run these models, the OpenAI vs Bedrock vs Vertex AI comparison covers hosting venue rather than raw token price.

Provider Snapshots

Each provider owns a different slice of the price-to-capability curve. The links below open the full rate card for that vendor.

OpenAI anchors the premium end with broad model coverage and deep tooling. The full breakdown of OpenAI API pricing lists its flagship, mini and nano tiers side by side.
Anthropic competes at the high end on long context and caching, with Claude Opus, Sonnet and Haiku. The tier-by-tier Anthropic API pricing page covers every rate.
Google Gemini runs the widest price spread, from Flash-Lite at the floor to Pro at the top, with strong long-context options. The full Gemini API pricing breakdown lists every tier.
Mistral undercuts most rivals on output, with Large 3 at $1.50 output per million tokens. The Mistral API pricing page covers its open and commercial models.

Why Your Real Bill Differs From the Table

List price tells you the rate. It does not tell you the bill, because the bill depends on usage you usually cannot see. Output volume, reasoning overhead, retries, and which feature calls which model all compound, and most teams track none of it per model.

The data shows the gap. Among FinOps practitioners, 98% now manage AI spend, up from 31% two years earlier, according to the State of FinOps survey. And 80% of companies miss their AI infrastructure cost forecasts by more than 25%, per an inference-cost study. The problem is rarely the rate card. It is the missing line of sight.

This is where token-level visibility matters. Amnic treats AI providers the way it treats AWS, Azure and GCP, giving you usage and token tracking per model and letting you allocate that spend to the product lines and features driving it.

You see which model burns the budget before the invoice lands, the same way you watch cloud spend. Explore how AI token management brings model spend into one view, or compare the broader category in this guide to FinOps tools for AI cost management.
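As a generic illustration of what per-model, per-feature allocation involves, here is a minimal sketch. It is not Amnic's implementation or API. The record fields, feature names, and model identifier strings are all hypothetical, and the rates are copied from the table above.

```python
from collections import defaultdict

# Rates in USD per million tokens, from the table; the keys are made-up labels,
# not official API model IDs.
RATES = {
    "claude-haiku-4.5": {"input": 1.00, "output": 5.00},
    "gemini-3.1-flash-lite": {"input": 0.25, "output": 1.50},
}

def allocate_spend(usage_records):
    """Sum USD spend per (feature, model) pair from raw usage records."""
    totals = defaultdict(float)
    for rec in usage_records:
        rate = RATES[rec["model"]]
        cost = (rec["input_tokens"] * rate["input"]
                + rec["output_tokens"] * rate["output"]) / 1_000_000
        totals[(rec["feature"], rec["model"])] += cost
    return dict(totals)

# Hypothetical records; in practice these would come from provider usage logs.
records = [
    {"feature": "support-bot", "model": "gemini-3.1-flash-lite",
     "input_tokens": 6_000, "output_tokens": 400},
    {"feature": "support-bot", "model": "gemini-3.1-flash-lite",
     "input_tokens": 6_000, "output_tokens": 400},
    {"feature": "ticket-router", "model": "claude-haiku-4.5",
     "input_tokens": 1_200, "output_tokens": 50},
]
spend = allocate_spend(records)
for (feature, model), usd in sorted(spend.items()):
    print(f"{feature:14s} {model:24s} ${usd:.5f}")
```

The point of the sketch is the grouping key: once every call is tagged with the feature that made it, spend can be rolled up by product line as easily as by model.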

Conclusion

An LLM cost comparison starts with the rate card and ends with your token mix. Read input and output separately, account for caching, batch and reasoning tokens, and measure cost per completed task rather than cost per token. Then pick the cheapest model that clears your quality bar for each workload. Once models are in production, the deciding factor is no longer the published price. It is whether you can see what each model actually spends.

FAQs

Which LLM API is cheapest?

By raw input rate, budget tiers like Mistral Small 4 ($0.10 per million input tokens) and Gemini 3.1 Flash-Lite ($0.25) sit at the floor, with DeepSeek V4 Flash ($0.14) among the cheapest for reasoning work. The cheapest for your case depends on your output volume and quality bar, since output runs 2 to 5 times the input rate.

How do I compare LLM costs across providers?

Define one unit of work, measure the input and output tokens each model uses for it, then multiply by each provider's rates. Compare cost per completed task, not cost per token, because tokenizers and output lengths differ between providers.

Why is my LLM bill higher than the listed price?

Output and reasoning tokens bill at the higher output rate, retries add calls, and each provider counts tokens differently. These compound beyond the headline input rate, so usage you cannot see often drives most of the cost.

Are input and output tokens priced the same?

No. Output tokens cost roughly 2 to 5 times more than input, because each generated token requires a full pass through the model. Output-heavy tasks like chat cost more than input-heavy tasks like summarization at the same token total.

What drives LLM API costs up?

Output and reasoning tokens billed at the higher output rate, long context windows, uncached repeat prompts, and retries on low-quality answers. Caching, batch endpoints, shorter prompts, and matching the model tier to the task all counter these drivers.
