Gemini API Pricing Explained: How to Estimate and Control Your Token Costs

Amnic

Google publishes a Gemini rate card that looks cheaper than every other frontier provider. Then the first production bill lands, and engineering, finance, and the platform team all read different numbers off the same invoice. The headline rate is honest. 

What it hides is the context window your prompts actually crossed, the modality your audio chunks were billed under, the platform your team was routed through, and the grounding calls a single agent quietly ran twenty times per session.

This guide reads Gemini API pricing the way a FinOps for AI practitioner would read it. Not as a list of numbers, but as a model of where your money actually goes once a workload is live.

What You Actually Pay For

Gemini API usage is billed by tokens, not per request (as with AWS API Gateway pricing) and not by words or characters. A token is roughly four characters of English text or one short word, per Google's tokenization documentation. Every call has two billable sides:

  • Input tokens: Your system instructions, conversation history, retrieved context, tool definitions, attached images, audio, and video. Everything you send.

  • Output tokens: What the model writes back.

Output is the expensive side on every Gemini tier. Across the models in the pricing table in the next section, output runs from four to roughly eight times the input rate. That ratio makes prompt design and output capping the two highest-impact levers on the bill.
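The four-characters-per-token rule of thumb can be turned into a rough estimator for early budgeting. This is a minimal sketch: `estimate_tokens` and `CHARS_PER_TOKEN` are hypothetical names, the heuristic only applies to English text, and real counts should come from the API's own token counting.

```python
import math

# Rough heuristic from the text above: about four characters per token (English).
CHARS_PER_TOKEN = 4

def estimate_tokens(text: str) -> int:
    """Return an approximate token count; not a substitute for the real tokenizer."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)

prompt = "Summarize the attached incident report in three bullet points."
print(estimate_tokens(prompt))
```

An estimate like this is only good enough for sizing a forecast; billing is based on the tokenizer's actual count.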

Three things make Gemini billing distinct from OpenAI API pricing and Anthropic API pricing:

  1. A context-length surcharge. On two Pro tiers, the input rate doubles and the output rate rises 1.5× above 200K input tokens.

  2. Per-modality rates. Audio input is billed separately from text, image, and video on several models.

  3. A genuinely usable free tier with rate limits instead of a trial credit.

Each of those is a place real bills diverge from the headline number.

Current Gemini API Pricing by Model

Pricing per 1M tokens, sourced from Google's official Gemini API pricing page:

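| Model                  | Tier               | Input (≤200K) | Input (>200K) | Output (≤200K) | Output (>200K) |
|------------------------|--------------------|---------------|---------------|----------------|----------------|
| Gemini 3.1 Pro Preview | Flagship reasoning | $2.00         | $4.00         | $12.00         | $18.00         |
| Gemini 3.5 Flash       | Workhorse          | $1.50         | $1.50         | $9.00          | $9.00          |
| Gemini 3.1 Flash-Lite  | Cost tier          | $0.25         | $0.25         | $1.50          | $1.50          |
| Gemini 2.5 Pro         | Prior flagship     | $1.25         | $2.50         | $10.00         | $15.00         |
| Gemini 2.5 Flash       | Prior workhorse    | $0.30         | $0.30         | $2.50          | $2.50          |
| Gemini 2.5 Flash-Lite  | Cheapest           | $0.10         | $0.10         | $0.40          | $0.40          |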

Three patterns to read into this table before you pick a model:

  1. The spread between Gemini 2.5 Flash-Lite ($0.40) and Gemini 3.1 Pro Preview ($12.00) is 30× on output, per Google's rate card. Model selection is the single biggest cost lever you have.

  2. The 200K breakpoint only bites the Pro tiers. Flash and Flash-Lite charge the same rate at any context length. A long-context workload on Pro that crosses 200K silently jumps to 2× the headline input rate and 1.5× the headline output rate.

  3. Audio input is priced separately on several models. On Gemini 3.1 Flash-Lite, audio input is $0.50 per 1M tokens against $0.25 for text, image, and video, per Google's pricing page.
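The 200K breakpoint described in the list above can be sketched as a rate lookup. The prices are copied from the table; the dictionary keys, `Rates`, and `rates_for` are hypothetical names chosen for illustration and are not official model identifiers.

```python
from dataclasses import dataclass

LONG_CONTEXT_THRESHOLD = 200_000  # input tokens in a single prompt

@dataclass(frozen=True)
class Rates:
    input_short: float   # USD per 1M input tokens, prompt <= 200K
    input_long: float    # USD per 1M input tokens, prompt > 200K
    output_short: float  # USD per 1M output tokens, prompt <= 200K
    output_long: float   # USD per 1M output tokens, prompt > 200K

# Illustrative keys; values taken from the pricing table above.
RATE_CARD = {
    "gemini-3.1-pro-preview": Rates(2.00, 4.00, 12.00, 18.00),
    "gemini-2.5-pro": Rates(1.25, 2.50, 10.00, 15.00),
    "gemini-2.5-flash": Rates(0.30, 0.30, 2.50, 2.50),
    "gemini-2.5-flash-lite": Rates(0.10, 0.10, 0.40, 0.40),
}

def rates_for(model: str, prompt_tokens: int) -> tuple[float, float]:
    """Return (input_rate, output_rate) for a prompt of the given size."""
    r = RATE_CARD[model]
    if prompt_tokens > LONG_CONTEXT_THRESHOLD:
        return r.input_long, r.output_long
    return r.input_short, r.output_short

print(rates_for("gemini-2.5-pro", 150_000))
print(rates_for("gemini-2.5-pro", 250_000))
```

Keeping the threshold in one place makes it easier to flag prompts that are about to cross it before they are sent.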

Multimodal generation sits outside this table. Image generation through gemini-3.1-flash-image ranges roughly $0.045 to $0.151 per image depending on resolution, and Veo 3.1 video runs $0.05 to $0.60 per second, per Google's pricing page.

The Cost Formula and Three Worked Examples

The math itself is trivial:

cost per call = (input_tokens / 1,000,000 × input_price)

              + (output_tokens / 1,000,000 × output_price)

What is not trivial is what the inputs to that formula look like in production.
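The formula translates directly into a small helper. This is a minimal sketch; `cost_per_call` is a hypothetical name, and the sample rates are the Gemini 2.5 Flash row of the table.

```python
def cost_per_call(input_tokens: int, output_tokens: int,
                  input_price: float, output_price: float) -> float:
    """Cost in USD for one call; prices are USD per 1M tokens."""
    return (input_tokens / 1_000_000 * input_price
            + output_tokens / 1_000_000 * output_price)

# One call with 500 input tokens and 300 output tokens at $0.30 / $2.50.
print(f"${cost_per_call(500, 300, 0.30, 2.50):.6f}")
```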

Example 1: Customer support chatbot on Gemini 2.5 Flash

400-token system prompt, 100-token user message, 300-token reply. 10,000 conversations per day.

  • Input: 500 × 10,000 = 5M tokens/day × $0.30 = $1.50/day

  • Output: 300 × 10,000 = 3M tokens/day × $2.50 = $7.50/day

  • Daily total: $9. Monthly: ~$270.

Same scenario on Gemini 3.1 Pro Preview: 5M input tokens × $2.00 plus 3M output tokens × $12.00 is $46 per day, or roughly $1,380 per month. About 5× the bill for the same conversation, before any quality measurement.
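The chatbot comparison can be reproduced with a few lines of arithmetic. This sketch assumes a 30-day month and the table's standard rates; with those inputs the Gemini 2.5 Flash figure works out to about $270 and the Gemini 3.1 Pro Preview figure to about $1,380. All names are hypothetical.

```python
CONVERSATIONS_PER_DAY = 10_000
INPUT_TOKENS = 400 + 100   # system prompt + user message
OUTPUT_TOKENS = 300        # reply
DAYS_PER_MONTH = 30        # assumption used throughout the examples

def monthly_cost(input_price: float, output_price: float) -> float:
    """Monthly USD cost of the chatbot workload; prices are USD per 1M tokens."""
    daily_input_m = INPUT_TOKENS * CONVERSATIONS_PER_DAY / 1_000_000
    daily_output_m = OUTPUT_TOKENS * CONVERSATIONS_PER_DAY / 1_000_000
    daily = daily_input_m * input_price + daily_output_m * output_price
    return daily * DAYS_PER_MONTH

flash = monthly_cost(0.30, 2.50)          # Gemini 2.5 Flash
pro_preview = monthly_cost(2.00, 12.00)   # Gemini 3.1 Pro Preview
print(f"Flash: ${flash:,.2f}  Pro Preview: ${pro_preview:,.2f}  "
      f"ratio: {pro_preview / flash:.1f}x")
```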

Example 2: RAG knowledge assistant on Gemini 2.5 Pro

8,000 tokens of retrieved context plus a 200-token question, returning a 600-token answer. 2,000 queries per day.

  • Input: 8,200 × 2,000 = 16.4M tokens/day × $1.25 = $20.50/day

  • Output: 600 × 2,000 = 1.2M tokens/day × $10 = $12/day

  • Daily total: $32.50. Monthly: ~$975.

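If 25% of queries were billed at the above-200K rates, with token counts held constant for illustration, the input rate on that slice doubles to $2.50, the output rate rises to $15, and monthly spend climbs to about $1,175. In practice a prompt that crosses 200K carries far more than 8,200 input tokens, so the real increase is larger. Same code, no model change, bigger invoice.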
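The effect of a long-context slice can be sketched as a blended-rate calculation. This deliberately holds per-query token counts constant, which understates the real cost, because a prompt that actually crosses 200K carries far more input tokens. Rates are the Gemini 2.5 Pro row of the table; the function names are hypothetical.

```python
QUERIES_PER_DAY = 2_000
INPUT_TOKENS = 8_200
OUTPUT_TOKENS = 600

SHORT_RATES = (1.25, 10.00)  # (input, output) USD per 1M tokens, prompt <= 200K
LONG_RATES = (2.50, 15.00)   # (input, output) USD per 1M tokens, prompt > 200K

def daily_cost(queries: float, rates: tuple[float, float]) -> float:
    """Daily USD cost for a number of queries billed at one rate pair."""
    input_price, output_price = rates
    per_query = INPUT_TOKENS * input_price + OUTPUT_TOKENS * output_price
    return queries * per_query / 1_000_000

def monthly_cost(long_share: float, days: int = 30) -> float:
    """Monthly cost when `long_share` of queries bill at long-context rates."""
    long_queries = QUERIES_PER_DAY * long_share
    short_queries = QUERIES_PER_DAY - long_queries
    return days * (daily_cost(short_queries, SHORT_RATES)
                   + daily_cost(long_queries, LONG_RATES))

print(f"0% long:  ${monthly_cost(0.0):,.2f}")
print(f"25% long: ${monthly_cost(0.25):,.2f}")
```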

Example 3: High-volume classification on Flash-Lite + Batch

A 300-token input classifying 500,000 records per day into 50-token labels. Run through the Gemini Batch API at 50% off.

  • Input: 150M tokens/day × ($0.10 × 0.5) = $7.50/day

  • Output: 25M tokens/day × ($0.40 × 0.5) = $5/day

  • Daily total: ~$12.50. Monthly: ~$375 for about 15 million classifications.

The same workload on Gemini 3.1 Pro Preview at standard rates is $600 per day, or roughly $18,000 per month. Model choice plus Batch is roughly a 48× swing on identical functional output.
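The Batch comparison can be checked the same way. This sketch applies the 50% Batch discount to both sides and assumes a 30-day month; under the table's rates, the Flash-Lite batch run comes to about $375 per month and Gemini 3.1 Pro Preview at standard rates to about $18,000. Names are hypothetical.

```python
RECORDS_PER_DAY = 500_000
INPUT_TOKENS = 300
OUTPUT_TOKENS = 50
BATCH_DISCOUNT = 0.5  # 50% off input and output, per the text above

def monthly_cost(input_price: float, output_price: float,
                 discount: float = 0.0, days: int = 30) -> float:
    """Monthly USD cost of the classification job; prices are USD per 1M tokens."""
    input_m = INPUT_TOKENS * RECORDS_PER_DAY / 1_000_000    # 150M tokens/day
    output_m = OUTPUT_TOKENS * RECORDS_PER_DAY / 1_000_000  # 25M tokens/day
    daily = (input_m * input_price + output_m * output_price) * (1 - discount)
    return daily * days

flash_lite_batch = monthly_cost(0.10, 0.40, discount=BATCH_DISCOUNT)
pro_standard = monthly_cost(2.00, 12.00)
print(f"Flash-Lite + Batch: ${flash_lite_batch:,.2f}")
print(f"Pro Preview standard: ${pro_standard:,.2f}")
print(f"Swing: {pro_standard / flash_lite_batch:.0f}x")
```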

Why Your Bill Outruns the Rate Card

The rate card is honest. What inflates real invoices is usually one of these.

The 200K context cliff: On Gemini 3.1 Pro Preview and Gemini 2.5 Pro, the input rate doubles and the output rate rises 1.5× once a single prompt crosses 200,000 tokens, per Google's pricing page. RAG pipelines that aggressively pack context, long conversation histories, and document-analysis jobs cross that line constantly. Most teams discover the cliff only after a weekly bill review forces the question.

Modality mismatch on the rate card: Audio input is the most common surprise. On Gemini 3.1 Flash-Lite, audio is twice the text rate, and the Gemini 3.1 Flash Live Preview prices audio input at $3 against $0.75 for text, per Google's pricing page. A voice agent budgeted as text traffic can ship at up to 4× the forecast input cost.
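One way to avoid the mismatch is to forecast input cost per modality rather than with a single text rate. The sketch below uses the Gemini 3.1 Flash-Lite input rates quoted earlier ($0.25 for text, image, and video, $0.50 for audio); the dictionary and function names are hypothetical.

```python
# USD per 1M input tokens on Gemini 3.1 Flash-Lite, as quoted in this guide.
FLASH_LITE_INPUT_RATES = {
    "text": 0.25,
    "image": 0.25,
    "video": 0.25,
    "audio": 0.50,
}

def input_cost(tokens_by_modality: dict[str, int],
               rates: dict[str, float]) -> float:
    """Sum input cost in USD across modalities."""
    return sum(tokens / 1_000_000 * rates[modality]
               for modality, tokens in tokens_by_modality.items())

daily_tokens = {"text": 1_000_000, "audio": 2_000_000}
as_text_only = sum(daily_tokens.values()) / 1_000_000 * FLASH_LITE_INPUT_RATES["text"]
actual = input_cost(daily_tokens, FLASH_LITE_INPUT_RATES)
print(f"Budgeted as text: ${as_text_only:.2f}  Billed by modality: ${actual:.2f}")
```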

The Vertex AI fork in the road: The same Gemini model bills at a different effective cost depending on whether you call it through the Gemini Developer API or through Vertex AI. The base per-token rate is largely aligned, per Google's Vertex AI generative pricing, but Vertex adds Provisioned Throughput, regional egress fees, IAM, and observability. For teams weighing this path, our breakdown of the OpenAI API vs Bedrock vs Vertex AI choice walks through the trade-offs in detail.

Grounding caps that look free until they aren't: Google Search grounding includes 500 to 5,000 free requests per month depending on the model family, per Google's pricing page. At five grounded queries per session, 1,000 sessions exhaust a 5,000-request allowance, which a busy agent can reach inside a single working day.
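A quick way to see how fast a free grounding allowance disappears is to divide it by the grounded queries each session makes. This is a minimal sketch with a hypothetical function name; the allowance figures are the ones quoted above.

```python
def sessions_until_cap(free_requests: int, grounded_queries_per_session: int) -> int:
    """Number of full sessions that fit inside the free grounding allowance."""
    return free_requests // grounded_queries_per_session

for per_session in (5, 20):
    print(per_session, "queries/session ->",
          sessions_until_cap(5_000, per_session), "sessions per month")
```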

Output cascading into input: Multi-turn conversations and agent loops turn last turn's output into next turn's input. A five-step agent that emits 1,000 tokens per step pays input on 1,000, then 2,000, then 3,000, then 4,000 tokens of accumulated context. The same compounding pattern shows up in self-hosted inference workloads, where it surfaces as GPU cost on training and serving.
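The compounding pattern can be sketched as a loop that re-bills the accumulated context at every step. `cumulative_input_tokens` is a hypothetical helper, and `base_prompt_tokens` is an added assumption for a fixed system prompt that is resent each turn.

```python
def cumulative_input_tokens(steps: int, output_tokens_per_step: int,
                            base_prompt_tokens: int = 0) -> int:
    """Total input tokens billed across an agent loop that carries prior output forward."""
    total_input = 0
    context = base_prompt_tokens
    for _ in range(steps):
        total_input += context              # this step pays for everything so far
        context += output_tokens_per_step   # this step's output joins the context
    return total_input

# Five steps at 1,000 output tokens each: 0 + 1,000 + 2,000 + 3,000 + 4,000.
print(cumulative_input_tokens(5, 1_000))
```

Input grows quadratically with the number of steps, which is why trimming or summarizing history pays off more on long loops than on short ones.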

Retries and idle context: Failed JSON parses, function-call timeouts, and reconnects all resend the prompt. Every retry pays full input.
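The retry overhead can be estimated from a failure rate. This sketch assumes each attempt fails independently with the same probability and that the client gives up after a fixed number of retries; both are simplifying assumptions, and the function names are hypothetical.

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Expected number of attempts when each attempt fails independently.

    Attempt k+1 only happens if the first k attempts all failed.
    """
    return sum(failure_rate ** k for k in range(max_retries + 1))

def expected_input_cost(prompt_tokens: int, input_price: float,
                        failure_rate: float, max_retries: int) -> float:
    """Expected input cost in USD per logical call; price is USD per 1M tokens."""
    attempts = expected_attempts(failure_rate, max_retries)
    return attempts * prompt_tokens / 1_000_000 * input_price

# 10% failures with up to two retries on an 8,200-token prompt at $1.25 per 1M.
print(f"{expected_attempts(0.10, 2):.2f} attempts per call")
print(f"${expected_input_cost(8_200, 1.25, 0.10, 2):.6f} expected input cost")
```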

Ways to Cut Gemini API Costs Without Cutting Quality

Five levers, ranked by realistic impact.

  1. Route by task, not by default: Send classification, extraction, summarization, and structured output to Flash-Lite. Reserve Pro tiers for reasoning, code, and long-context synthesis. Most production traffic is over-provisioned by one or two tiers.

  2. Use context caching aggressively: Cached input on Gemini reads at roughly 10% of the base rate, with a storage charge of $1 to $4.50 per 1M tokens per hour depending on model, per Google's caching documentation. On any workload with a stable system prompt above the cache minimum, the effective input rate is closer to the cache column than the headline column. Storage cost outweighs the savings only on prompts that sit idle.

  3. Push everything non-real-time to Batch: The Gemini Batch API gives 50% off both input and output with a 24-hour completion window. Overnight evaluations, embeddings refreshes, content classification, and offline labeling almost never need sub-second response.

  4. Stay below 200K on Pro tiers: Chunk and rerank rather than packing context. If a workload genuinely needs longer context, model down to Flash or Flash-Lite, where prices are flat across context length.

  5. Cap output, not input: Set max_output_tokens against what the downstream UI actually renders. A 4,000-token ceiling on a feature that displays 200 tokens allows up to a 20× overspend on output whenever the model runs long.
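The caching lever above has a break-even point that is easy to compute. This sketch assumes cached reads bill at 10% of the base input rate and that storage is charged per 1M cached tokens per hour, as described in this guide; the function name and the sample rates ($1.25 base with $4.50 storage, $0.30 base with $1.00 storage) are illustrative assumptions rather than a quote of Google's rate card.

```python
CACHED_READ_FRACTION = 0.10  # cached input bills at roughly 10% of the base rate

def break_even_calls_per_hour(base_input_price: float,
                              storage_price_per_hour: float) -> float:
    """Calls per hour at which cache savings equal the storage charge.

    Both prices are per 1M tokens, so the cached prefix size cancels out:
    savings per call = size * base * (1 - 0.10); storage per hour = size * storage.
    """
    savings_per_call = base_input_price * (1 - CACHED_READ_FRACTION)
    return storage_price_per_hour / savings_per_call

print(f"{break_even_calls_per_hour(1.25, 4.50):.1f} calls/hour")
print(f"{break_even_calls_per_hour(0.30, 1.00):.1f} calls/hour")
```

Under these assumptions a cached prefix pays for itself after only a handful of calls per hour, which is why storage cost only dominates on prompts that sit idle.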

From Rate Card to Unit Economics

A rate card tells you the price of a token. It does not tell you which customer, feature, or release is burning your Gemini budget.

This is the gap Amnic closes. Amnic's AI token cost management attributes every API call to the customer, feature, environment, and team it served, surfaces model-routing drift before it lands on the invoice, and forecasts spend per unit of business value, not just per million tokens. For teams sizing up the category first, our guide to dedicated FinOps tools for AI workloads covers the field.

If you are running the same exercise on the rest of your model stack, our companion breakdown of Mistral API pricing walks through the same playbook.

Key Takeaways

  • Gemini API pricing is per token, split between input and output, with output running four to roughly eight times input across the family.

  • The spread from Flash-Lite to Pro Preview is roughly 30× on output. Model choice is the single biggest lever.

  • The 200K context cliff only bites Pro tiers. Crossing it doubles the input rate on that prompt.

  • Audio input is priced separately on several models. A voice workload budgeted as text traffic ships well over forecast.

  • Context caching cuts input cost on cached tokens by roughly 90%, and Batch cuts both sides by 50%. Most production traffic qualifies for at least one.

  • Cost per million tokens is the wrong unit. Cost per customer, per feature, per workflow is the unit that lets you actually decide what to ship.

Frequently Asked Questions

What is the cheapest Gemini API model?

Gemini 2.5 Flash-Lite at $0.10 input and $0.40 output per 1M tokens is the lowest standard rate in the Gemini family, per Google's developer pricing page.

How is Gemini API pricing calculated?

Cost equals input tokens times input price plus output tokens times output price, both divided by one million, per Google's pricing documentation. On Pro tiers, the input rate doubles and the output rate rises 1.5× above 200K input tokens.

Is the Gemini API free to use?

Yes, with rate limits. The free tier offers access to most Gemini models without a credit card, capped at 5 to 15 requests per minute and 250,000 tokens per minute, per Google's rate-limit page.

What is the 200K context cliff on Gemini?

Gemini 3.1 Pro Preview and Gemini 2.5 Pro switch to a higher input and output rate once a single prompt crosses 200,000 tokens, per Google's pricing page. Flash and Flash-Lite charge flat regardless of length.

Does Gemini charge for cached prompts?

Yes, but cached input reads at roughly 10% of the base rate, plus a storage charge of $1 to $4.50 per 1M tokens per hour depending on model, per Google's caching documentation.

How much discount does the Gemini Batch API give?

50% off both input and output with a 24-hour completion window, per Google's batch API documentation.

What is a token in Gemini pricing?

A token is a chunk of text roughly equal to four characters or one short word in English, per Google's tokenization documentation. Image, audio, and video are also denominated in tokens.

Is Gemini API cheaper than Vertex AI?

The base per-token rates align across both, per Google's Vertex AI generative pricing. Vertex adds Provisioned Throughput, IAM, regional egress, and SLA, which raise the effective cost in exchange for enterprise controls.

How do I estimate my Gemini API monthly bill?

Estimate average input and output tokens per call, multiply by daily call volume, apply the per-million rate for your chosen model, and multiply by 30. Sanity check against the usage dashboard after the first week of live traffic.

How can I control Gemini API costs in production?

Route by task to the lowest viable tier, keep Pro prompts below 200K, cache stable prefixes, push non-real-time work to Batch, cap output tokens, and attribute spend to the customer or feature that drove it.

