Vertex AI Pricing: A Complete Breakdown of Costs and How to Control Them

June 24, 2026

Amnic

Vertex AI pricing is pay-as-you-go and charged per service, so your bill is the sum of token usage, training compute, deployed endpoints, and supporting tools rather than one flat subscription. Generative model calls are billed per million tokens, while training and prediction are billed by the hour. That mix is why two teams running the same model can see very different invoices.

The hard part is rarely a single rate. It is that Google splits pricing across several pages with no unified calculator, and some charges keep running even when nothing is in use. This guide breaks down each component with figures pulled straight from Google's own pricing pages, then shows where bills leak and how a FinOps approach keeps Vertex spend predictable.

What Is Vertex AI?

Vertex AI is Google Cloud's managed platform for building, tuning, deploying, and serving machine learning and generative AI models. Google has rebranded the platform as the Gemini Enterprise Agent Platform, so you will see both names across the docs and pricing pages, but the product and billing logic are the same. It brings together Gemini models, custom training, online and batch prediction, vector search, Agent Builder, and notebooks under one roof.

Because it sits inside Google Cloud, Vertex AI inherits GCP billing mechanics like committed use discounts, regional rate differences, and network egress charges. That tight integration is a real advantage for teams already on GCP. It also means Vertex costs rarely appear in isolation, so they need the same discipline you apply to the rest of your cloud bill.

How Vertex AI Pricing Works

Vertex AI gives you four ways to pay, and picking the right one is the single biggest lever on your bill. The token rates themselves follow standard token economics, where input and output are metered separately. The four modes are:

Pay-as-you-go: charges per token or per node-hour with no commitment, ideal for spiky or early-stage workloads.
Provisioned throughput: reserves dedicated capacity for steadier latency at a committed monthly or annual rate.
Batch prediction: runs non-urgent jobs asynchronously at a 50% discount on token rates, with a 24-hour service window.
Context caching: stores repeated input so you are not re-billed full price for the same prompt prefix on every call.

Each mode targets a different workload, and most mature teams blend all four. For example, a nightly report generator can run as a batch job at half price, while a live chatbot uses provisioned throughput for predictable latency. Matching the mode to the job is where the early savings come from.

Vertex AI Gemini Model Pricing

Generative model calls dominate most Vertex bills, and they are priced per million tokens split between input and output. The current published rates are:

Gemini 2.5 Flash-Lite: $0.10 per million input tokens and $0.40 output, the cheapest tier for routine work.
Gemini 2.5 Flash: $0.30 input and $2.50 output, a balance of cost and capability.
Gemini 2.5 Pro: $1.25 input and $10 output for prompts up to 200K tokens, with input rising to $2.50 above that context length.

The spread between Flash-Lite and Pro is 12.5x on input and 25x on output, so model selection moves your bill more than any negotiated discount. A support bot that classifies tickets rarely needs Pro, yet teams often default to it. For a side-by-side view across vendors, this LLM cost comparison shows where Gemini lands against the field.
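To make the effect of model choice concrete, here is a minimal sketch that prices the same workload on each tier using the rates listed above. The dictionary keys are only labels for this example, the workload (one million requests of 1,000 input and 200 output tokens) is invented, and the article does not give a long-context output rate for Pro, so the sketch assumes the $10 figure applies throughout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens


# Rates quoted in this article; verify against the current pricing page.
RATES = {
    "gemini-2.5-flash-lite": Rate(0.10, 0.40),
    "gemini-2.5-flash": Rate(0.30, 2.50),
    "gemini-2.5-pro": Rate(1.25, 10.00),
}
PRO_LONG_CONTEXT_THRESHOLD = 200_000   # tokens
PRO_LONG_CONTEXT_INPUT_PER_M = 2.50    # USD per million input tokens


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one pay-as-you-go request."""
    rate = RATES[model]
    input_rate = rate.input_per_m
    # Only the input tier is modeled; the output rate is an assumption
    # for prompts above the threshold.
    if model == "gemini-2.5-pro" and input_tokens > PRO_LONG_CONTEXT_THRESHOLD:
        input_rate = PRO_LONG_CONTEXT_INPUT_PER_M
    tokens_cost = input_tokens * input_rate + output_tokens * rate.output_per_m
    return tokens_cost / 1_000_000


# Hypothetical workload: 1M requests per month, 1,000 tokens in, 200 out.
for model in RATES:
    monthly = request_cost(model, 1_000, 200) * 1_000_000
    print(f"{model}: ${monthly:,.2f}")
# gemini-2.5-flash-lite: $180.00
# gemini-2.5-flash: $800.00
# gemini-2.5-pro: $3,250.00
```

The same traffic costs about 18 times more on Pro than on Flash-Lite in this example, which is why routing deserves attention before any discount negotiation.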

Context caching is the quiet win. Cached input for Gemini 2.5 Pro drops to $0.13 per million tokens, close to a 90% reduction on repeated prompt prefixes. If you send the same system prompt or reference document across thousands of calls, caching changes the math entirely. The standalone Gemini API pricing uses the same token logic outside the Vertex platform.
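A rough sketch of the caching math follows, using the two Gemini 2.5 Pro input rates quoted above. The workload numbers are invented, and the model is simplified: it treats every call as a cache hit and ignores any fee for keeping the cache stored, so treat the result as an upper bound on savings.

```python
PRO_INPUT_PER_M = 1.25         # USD per million input tokens (from this article)
PRO_CACHED_INPUT_PER_M = 0.13  # USD per million cached input tokens (from this article)


def input_cost(prefix_tokens: int, fresh_tokens: int, calls: int, cached: bool) -> float:
    """Input-side cost in USD for `calls` requests that share one prompt prefix.

    Simplifications: every call is a cache hit, and cache storage is free.
    """
    prefix_rate = PRO_CACHED_INPUT_PER_M if cached else PRO_INPUT_PER_M
    per_call = prefix_tokens * prefix_rate + fresh_tokens * PRO_INPUT_PER_M
    return per_call * calls / 1_000_000


# Hypothetical workload: a 20,000-token reference document reused across
# 10,000 calls, each adding 500 tokens of new input.
uncached = input_cost(20_000, 500, 10_000, cached=False)
with_cache = input_cost(20_000, 500, 10_000, cached=True)
print(f"uncached: ${uncached:.2f}")    # uncached: $256.25
print(f"cached:   ${with_cache:.2f}")  # cached:   $32.25
```

The saving scales with the share of each prompt that is a repeated prefix, so short prefixes with long unique inputs benefit far less.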

Custom Training and GPU Pricing

Custom training is billed per node-hour, and the rate depends on the machine and accelerator you attach. Managed supervised fine-tuning is the exception: it is billed by training tokens, calculated as dataset tokens multiplied by the number of epochs. Custom training jobs add GPU charges on top of the base node, so an A100 or H100 run accrues both compute and accelerator cost for every hour the job is alive.
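The training-token formula can be sketched in a few lines. The per-million rate passed in below is a placeholder, not a Google price, since the article does not quote one; substitute the figure for your model from the pricing page.

```python
def sft_training_tokens(dataset_tokens: int, epochs: int) -> int:
    """Billable training tokens: dataset tokens multiplied by epochs."""
    return dataset_tokens * epochs


def sft_cost(dataset_tokens: int, epochs: int, usd_per_m_training_tokens: float) -> float:
    """Fine-tuning cost in USD at a caller-supplied rate per million training tokens."""
    return sft_training_tokens(dataset_tokens, epochs) * usd_per_m_training_tokens / 1_000_000


# Hypothetical job: 5M-token dataset, 3 epochs, placeholder rate of $1.00 per million.
print(sft_training_tokens(5_000_000, 3))  # 15000000
print(sft_cost(5_000_000, 3, 1.00))       # 15.0
```

Because epochs multiply the bill directly, trimming a dataset or dropping an epoch is a linear saving.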

GPU hours are where training budgets vanish fastest, especially on multi-GPU runs that span days. As an example, a team fine-tuning a model over a weekend on eight H100s can spend more on that single run than on a full month of inference. The practical controls are covered in our guide to AI GPU pricing and the roundup of GPU cost optimization tools, both of which apply directly to Vertex training nodes.

Online Prediction and Endpoint Pricing

This is the section that catches teams off guard. When you deploy a model to a Vertex AI endpoint, you pay for the underlying machine continuously, whether or not it serves a single request. Vertex does not support automatic scale-to-zero for deployed models, so charges keep running until you explicitly undeploy.

That billing model is the root of most surprise invoices. A common example: an A100-backed endpoint left deployed after a Friday demo keeps billing through the entire weekend even with zero traffic. Practitioners on the r/googlecloud community regularly report bills for endpoints they forgot to undeploy, with no line item that makes the cause obvious. The same continuous-billing logic governs the Google Compute Engine costs underneath those endpoints.
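The weekend scenario is easy to quantify. The sketch below assumes a flat hourly rate per replica, and the $4.00 figure is a placeholder rather than a published price; actual rates depend on machine type, accelerator, and region.

```python
from datetime import datetime


def idle_endpoint_cost(deployed_at: datetime, undeployed_at: datetime,
                       hourly_rate_usd: float, replicas: int = 1) -> float:
    """Cost in USD of keeping a model deployed, regardless of traffic."""
    hours = (undeployed_at - deployed_at).total_seconds() / 3600
    return hours * hourly_rate_usd * replicas


# Demo ends Friday 17:00; someone notices Monday 09:00 (64 hours later).
friday_evening = datetime(2026, 6, 26, 17, 0)
monday_morning = datetime(2026, 6, 29, 9, 0)
print(idle_endpoint_cost(friday_evening, monday_morning, hourly_rate_usd=4.00))  # 256.0
```

Multiply by the number of forgotten endpoints and replicas, and a single quiet weekend can outweigh a week of token spend.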

Agent Builder, Vector Search, and Other Vertex AI Services

Beyond models, Vertex bills several supporting services by the hour or by query:

Agent Builder and Agent Engine: charged per vCPU-hour and per GiB-hour of memory, plus per-query rates for search and model calls.
Vector Search: bills for index serving nodes that run continuously, so even a modest index carries a steady monthly floor.
Grounding with Google Search: $35 per 1,000 grounded prompts above the free daily allowance.
Notebooks, Feature Store, and pipelines: each add their own metered charges by the hour or per run.

None of these is large on its own, but together they form the long tail that makes Vertex bills hard to forecast. A retrieval-augmented chatbot, for instance, can quietly carry vector search nodes, grounding queries, and an endpoint all at once. The wider GCP cost optimization tools landscape helps surface them before they compound.
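As one example of how a long-tail item adds up, the sketch below estimates grounding charges from the $35 per 1,000 prompts rate above. The free daily allowance is passed in as a parameter because the article does not state its size; the value of 1,000 used here is a placeholder, and the sketch assumes billing is proportional per prompt.

```python
GROUNDING_USD_PER_1000 = 35.0  # rate quoted in this article


def grounding_cost(daily_prompts: list[int], free_per_day: int) -> float:
    """Cost in USD of grounded prompts above a per-day free allowance."""
    total = 0.0
    for prompts in daily_prompts:
        billable = max(0, prompts - free_per_day)  # unused allowance does not roll over
        total += billable * GROUNDING_USD_PER_1000 / 1000
    return total


# Three hypothetical days of traffic with a placeholder allowance of 1,000 per day.
print(grounding_cost([1_200, 800, 3_000], free_per_day=1_000))  # 77.0
```

Note that the quiet middle day does not offset the busy ones, so averaging monthly volume would understate the charge.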

Vertex AI vs AI Studio: Which Pricing Path Fits

Buyers routinely confuse Vertex AI pricing with the Gemini API through Google AI Studio, and the distinction changes your cost profile. AI Studio is the lightweight developer path with a free tier and simple per-token billing, well-suited to prototyping and small apps. Vertex AI is the enterprise path with the same token rates plus governance, regional control, and the managed infrastructure larger teams need.

The token prices for Gemini models are largely identical across both paths. What differs is everything around them, namely security controls, SLAs, provisioned throughput, and integration with GCP billing and IAM. If you only need the model, AI Studio is cheaper to start; if you need allocation, compliance, and scale, Vertex earns its overhead. For broader context across clouds, compare OpenAI API vs Bedrock vs Vertex AI, and for rival rate cards, see Perplexity API pricing and Grok API pricing.

Hidden Costs That Inflate Your Vertex AI Bill

The figures in pricing tables rarely match the invoice, because some charges have no obvious line item. The usual culprits are:

Idle endpoints: deployed models with no scale-to-zero, billing around the clock with zero traffic.
Network egress: data leaving a region or the cloud, charged per GB, and easy to miss.
Storage: training datasets, model artifacts, and feature data accumulate quietly month over month.
Long conversation context: every chat turn resends prior context as input tokens, so a 50-message support session bills far more than the per-call rate suggests.

Untagged resources make all of this worse, because you cannot attribute spend back to a team or feature. Catching these patterns early depends on reading your Google Cloud billing reports at the SKU level rather than trusting a headline number. A single managed view across providers, like a multi-provider LLM cost management tool, removes the spreadsheet reconciliation entirely.
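The long-context effect is the least intuitive item on that list, so here is a small sketch of it. It assumes a chat client that resends the full history on every turn and that every message, user or model, is the same length; both are simplifications for illustration.

```python
def session_input_tokens(turns: int, tokens_per_message: int, system_tokens: int = 0) -> int:
    """Total billed input tokens when each turn resends the whole history."""
    total = 0
    history = 0
    for _ in range(turns):
        # Prompt = system prompt + all earlier messages + the new user message.
        total += system_tokens + history + tokens_per_message
        # The user message and the model reply both join the history.
        history += 2 * tokens_per_message
    return total


# A 50-message session: 25 user turns and 25 replies, 100 tokens each.
resent = session_input_tokens(turns=25, tokens_per_message=100)
naive = 25 * 100  # what you would expect if only new messages were billed
print(resent, naive, resent // naive)  # 62500 2500 25
```

Input tokens grow with the square of the number of turns under these assumptions, which is why trimming or summarizing history pays off in long sessions.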

How to Reduce Vertex AI Costs

The highest-leverage moves are simple to state, though they take steady upkeep:

Route to the cheapest adequate model, since Flash-Lite handles a large share of tasks at a fraction of Pro's rate.
Push non-urgent jobs to batch for the 50% discount, such as overnight enrichment or report generation.
Cache repeated input aggressively wherever a fixed prompt or document recurs.
Undeploy models from dev and staging endpoints when idle, since deployed models do not scale to zero on their own, and script automatic teardown after a fixed period of inactivity.
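The teardown rule in the last item can be sketched as a simple idle check. The endpoint names and timestamps below are hypothetical, and the sketch only decides which endpoints are stale; fetching last-request times from monitoring and issuing the undeploy call are left out.

```python
from datetime import datetime, timedelta


def endpoints_to_undeploy(last_request_at: dict[str, datetime],
                          now: datetime, max_idle: timedelta) -> list[str]:
    """Return names of endpoints whose last request is older than max_idle."""
    return sorted(
        name for name, seen in last_request_at.items()
        if now - seen > max_idle
    )


now = datetime(2026, 6, 29, 9, 0)
last_request_at = {
    "dev-summarizer": datetime(2026, 6, 26, 17, 0),      # idle since Friday evening
    "staging-classifier": datetime(2026, 6, 29, 8, 30),  # used 30 minutes ago
    "dev-embedder": datetime(2026, 6, 28, 2, 0),
}
stale = endpoints_to_undeploy(last_request_at, now, max_idle=timedelta(hours=12))
print(stale)  # ['dev-embedder', 'dev-summarizer']
```

Running a check like this on a schedule, and restricting it to non-production labels, keeps the policy safe to automate.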

On commitments, our breakdown of GCP CUD vs SUD explains which discount fits a steady production load. Layer in cloud cost forecasting so spikes are predicted rather than discovered on the invoice, and lean on purpose-built Vertex AI cost optimization tools to automate the cleanup.

Why Vertex AI Cost Visibility Matters

Cutting rates only helps if you can see where the money goes, and Vertex spend is hard to read because it spreads across models, endpoints, and supporting services. Granular cost allocation ties each dollar to a team, feature, or customer, which turns one confusing invoice into decisions you can defend. Purpose-built LLM cost allocation tools extend that down to the model and prompt level.

Strong cloud cost allocation also makes regressions visible the moment they appear. Pairing tagged spend with anomaly detection means a forgotten endpoint or a runaway batch job triggers an alert within hours, not at month-end. Wrapping policy around it with AI cost governance tools keeps the controls in place as usage grows.
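As an illustration of the alerting idea, the sketch below flags days whose spend sits well above a trailing baseline. This is a deliberately simple threshold rule chosen for the example, not how any particular tool works, and the window, threshold, and spend figures are all assumptions.

```python
from statistics import mean, stdev


def spend_anomalies(daily_spend: list[float], window: int = 7,
                    threshold: float = 3.0) -> list[int]:
    """Return indexes of days whose spend exceeds the trailing mean by
    more than `threshold` standard deviations of the trailing window."""
    flagged = []
    for i in range(window, len(daily_spend)):
        baseline = daily_spend[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        # Floor sigma at 1% of the mean so a perfectly flat baseline
        # does not turn tiny fluctuations into alerts.
        if daily_spend[i] > mu + threshold * max(sigma, 0.01 * mu):
            flagged.append(i)
    return flagged


# Hypothetical daily spend in USD; the last day includes a forgotten endpoint.
daily = [100, 102, 98, 101, 99, 100, 103, 100, 260]
print(spend_anomalies(daily))  # [8]
```

Running the same check per tag or per SKU, rather than on the total, points the alert at the team or resource responsible.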

Conclusion

Vertex AI pricing is logical once you separate it into models, training, prediction, and supporting services, then watch the charges that accrue whether you use them or not. The token rates are public and competitive, but the bill is shaped by how you deploy, cache, and tear down resources. Treating Vertex as a metered platform rather than a flat product is the mindset that keeps costs in line.

The teams that stay in control pair model discipline with allocation and alerting, which is exactly what mature FinOps for AI looks like in practice. If you are operationalizing that, start with these FinOps tools for AI cost management and build the visibility layer before the next surprise invoice lands.

FAQs

How much does Vertex AI cost?

Vertex AI is pay-as-you-go with no flat fee. You pay per million tokens for Gemini models, per node-hour for training and endpoints, and per query or hour for services like Agent Builder and Vector Search. Your total is the sum of what you actually run.

What is the cheapest Gemini model on Vertex AI?

Gemini 2.5 Flash-Lite is the cheapest, at $0.10 per million input tokens and $0.40 per million output tokens. It handles many routine tasks well, so routing simple work to it instead of Pro is one of the fastest ways to cut costs.

Why is my Vertex AI bill higher than my token usage?

Most often, it is idle endpoints. Vertex does not auto-scale to zero, so a deployed model bills continuously until you undeploy it. Network egress, storage, and long conversation context that resends tokens each turn also inflate bills beyond raw call counts.

Is Vertex AI free to use?

New Google Cloud accounts get $300 in credits valid for 90 days, plus limited free monthly usage on some services. Beyond that, every model call, training job, and deployed endpoint is billed, so the free tier suits testing rather than production.

What is the difference between Vertex AI and Gemini API pricing?

The Gemini API through AI Studio shares similar token rates but is a lightweight developer path with a free tier. Vertex AI adds enterprise governance, SLAs, regional control, and GCP billing integration, which is why teams needing compliance and allocation choose it.

How can I reduce Vertex AI costs?

Route tasks to the cheapest adequate model, use batch prediction for a 50% discount, cache repeated input, undeploy idle endpoints, and apply committed use discounts for steady load. Tagging and allocation make these savings measurable.
