How to Reduce Inference Cost: A FinOps Playbook

Inference cost is the bill you pay every time a deployed model answers a request. It scales with usage, not with a one-time training run, so it becomes the largest line in most AI budgets once a product ships.
The fastest way to bring it down is to measure cost per token and per request, then apply fixes in order of return instead of reaching for the most advanced trick first. A standing AI token management practice is what turns that measurement into a habit rather than a once-a-quarter scramble.
Treat this as a measurement problem before an engineering one. You cannot cut what you cannot see, and a single number on a cloud invoice hides which model, feature, or customer drives the spend. Seeing that clearly is the whole premise of FinOps for AI, and once cost is attributed, the levers below sort into clear priorities so you stop tuning parts of the stack that barely move the bill.
What counts as inference cost
Inference cost breaks into two layers. The first is the per-token price you pay an API provider, or the GPU hours you rent to host a model yourself. The second is efficiency: how many tokens per second your hardware produces, how full your batches run, and how much memory the cache wastes. A clear LLM cost comparison across providers settles the first layer, while the second is where self-hosted teams win or lose.
The unit that matters is cost per million tokens, not GPU hours alone, and a quick token counter turns raw prompts into the token figures every estimate depends on. Without token counts, you cannot tell rising traffic from falling efficiency, and both look identical on the invoice. Engineers also track tokens per second and tail latency at the p95 and p99 marks, because a cheap setup that misses latency targets stops being cheap once you count the users it loses.
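As a rough illustration of why the per-token unit separates traffic from efficiency, the sketch below divides spend by tokens served. The dollar and token figures are hypothetical and chosen only to make the contrast visible; they are not benchmarks.

```python
def cost_per_million_tokens(spend_usd: float, tokens: int) -> float:
    """Spend divided by tokens served, scaled to one million tokens."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return spend_usd / tokens * 1_000_000


# Hypothetical months: the invoice rises to the same total in both later cases.
baseline = cost_per_million_tokens(spend_usd=10_000, tokens=5_000_000_000)
more_traffic = cost_per_million_tokens(spend_usd=15_000, tokens=7_500_000_000)
less_efficient = cost_per_million_tokens(spend_usd=15_000, tokens=5_000_000_000)

print(f"baseline:        ${baseline:.2f} per 1M tokens")
print(f"more traffic:    ${more_traffic:.2f} per 1M tokens")
print(f"less efficient:  ${less_efficient:.2f} per 1M tokens")
```

Both later months show the same invoice total, but only the second one has a higher cost per million tokens, which is the signal that efficiency dropped rather than traffic grew.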
Measure and attribute before you optimize
Start by tagging every inference call with the model, feature, and team that triggered it. This is the discipline that AI cost tracking tools bring to cloud spend, applied to token usage instead. Attribution turns a vague monthly figure into a ranked list of the features and customers that cost the most, which is the only honest basis for deciding what to fix.
At a minimum, attach these tags to every request before it reaches a model:
Model and provider, so a GPT-4 class call is never averaged in with a small open model
Feature or endpoint, so you can see that search costs four times what onboarding does
Team or cost center, so the bill maps to an owner who can act on it
Environment, so a debug loop in staging never hides inside production spend
Input and output tokens, the raw numbers behind every cost per token figure
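One way to carry the tags listed above is a small record attached to each request. This is a minimal sketch: the class name, field names, and per-million prices are illustrative assumptions, not any provider's schema or rate card.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class InferenceTags:
    """Hypothetical tag set attached to every inference call."""
    model: str
    provider: str
    feature: str
    team: str
    environment: str
    input_tokens: int
    output_tokens: int


def request_cost_usd(tags: InferenceTags,
                     input_price_per_m: float,
                     output_price_per_m: float) -> float:
    """Cost of one call from its token counts and assumed per-million prices."""
    return (tags.input_tokens * input_price_per_m
            + tags.output_tokens * output_price_per_m) / 1_000_000


tags = InferenceTags(
    model="example-model", provider="example-provider",
    feature="search", team="growth", environment="production",
    input_tokens=1_200, output_tokens=300,
)
# Prices below are placeholders, not a real rate card.
cost = request_cost_usd(tags, input_price_per_m=3.0, output_price_per_m=15.0)
print(asdict(tags), f"cost=${cost:.4f}")
```

Keeping input and output tokens separate matters because most providers price them differently, so a single token total cannot reproduce the bill.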
Pair tracking with dashboards that refresh in real time so a runaway prompt or a retry loop surfaces in hours, not at the end of the billing cycle. Many teams find that a small share of requests drives most of the spend, usually long prompts or an agent that calls the model far more than expected. Catching that early is exactly what AI cost visibility tools are built for, and fixing those few paths almost always beats any model-level trick.
Attribution also answers the question that decides everything else: which workload to optimize. A batch summarization job and a customer-facing chat endpoint have opposite cost profiles, and one tuning recipe will overspend on the other. Telling those workloads apart per team and per product is the job of LLM cost allocation tools, so each one gets the treatment it actually needs.
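Once calls carry tags, the ranked list is a simple aggregation. The sketch below assumes each record is a dict with a `cost_usd` field plus whatever tag you group by; the record shape and the figures are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rank_spend(records: List[Dict], key: str) -> List[Tuple[str, float, float]]:
    """Return (group, total cost, share of total) sorted by cost, highest first."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[record[key]] += record["cost_usd"]
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [
        (name, cost, cost / grand_total if grand_total else 0.0)
        for name, cost in ranked
    ]


records = [
    {"feature": "search", "team": "growth", "cost_usd": 40.0},
    {"feature": "onboarding", "team": "growth", "cost_usd": 20.0},
    {"feature": "search", "team": "platform", "cost_usd": 40.0},
]
ranking = rank_spend(records, key="feature")
for name, cost, share in ranking:
    print(f"{name:<12} ${cost:>7.2f}  {share:.0%}")
```

Running the same function with `key="team"` gives the per-owner view from the same records, which is why the tags are worth attaching once at the call site.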
Model-level levers
Quantization is usually the highest-return change you can make. Dropping weights from 16-bit to INT8 roughly halves memory and lifts throughput, and large model evaluations across hundreds of thousands of prompts show 8-bit retaining about 99 percent of full-precision quality (source). Quality loss concentrates in math and long-context tasks, so test on your own samples before you ship.
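The memory half of that claim is plain arithmetic on the weights. The sketch below covers weight storage only and ignores the KV cache, activations, and any per-layer quantization metadata; the 7 billion parameter count is an assumed example.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (10^9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


params = 7e9  # hypothetical 7B-parameter model
fp16_gb = weight_memory_gb(params, 16)
int8_gb = weight_memory_gb(params, 8)
print(f"16-bit: {fp16_gb:.1f} GB, INT8: {int8_gb:.1f} GB")
```

The freed memory is what lets a server hold larger batches, which is where the throughput gain comes from.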
Smaller models and distillation cut costs at the source. Match model size to task difficulty rather than sending every request to a frontier model, then use a large teacher to train a smaller student for the routine work. Understanding token economics helps here, because a cheaper model covering the bulk of traffic compounds into savings on every single call.
Prompt caching removes redundant compute. When a long system prompt repeats on every call, caching its computed state means later requests pay a fraction of the input price, and provider docs put cached input near 10 percent of the base rate, a 90 percent saving on that portion (source). The win grows with how often the same prefix repeats across calls. Retrieval augmented generation does the same for data, trimming what each request sends rather than padding your AI workloads.
To make that concrete, picture a support chatbot that ships a 2,000 token system prompt and knowledge base on every one of its million daily calls:
Without caching, all 2,000 prefix tokens bill at full input price on every call, so the prefix alone burns two billion input tokens a day.
With caching, that prefix drops to roughly 10 percent of the rate after the first call, cutting the prefix portion of the bill by about 90 percent.
The user's own question, the only part that actually changes, still bills in full, which is exactly where the spend belongs.
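The chatbot example can be worked through as a short calculation. The input price here is a hypothetical placeholder, the cached rate follows the 10 percent figure above, and the sketch ignores cache write surcharges and cache expiry, which vary by provider.

```python
PREFIX_TOKENS = 2_000          # system prompt plus knowledge base, from the example
DAILY_CALLS = 1_000_000        # from the example
INPUT_PRICE_PER_M = 3.00       # hypothetical USD per million input tokens
CACHED_RATE_FRACTION = 0.10    # cached input billed near 10% of the base rate


def daily_prefix_cost_usd(cache_hit_rate: float) -> float:
    """Daily cost of the repeated prefix for a given share of cache hits."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    prefix_tokens = PREFIX_TOKENS * DAILY_CALLS
    full_price_tokens = prefix_tokens * (1 - cache_hit_rate)
    cached_tokens = prefix_tokens * cache_hit_rate
    billable = full_price_tokens + cached_tokens * CACHED_RATE_FRACTION
    return billable * INPUT_PRICE_PER_M / 1_000_000


no_cache = daily_prefix_cost_usd(0.0)
all_hits = daily_prefix_cost_usd(1.0)
half_hits = daily_prefix_cost_usd(0.5)
print(f"no cache: ${no_cache:,.0f}/day, all hits: ${all_hits:,.0f}/day, "
      f"half hits: ${half_hits:,.0f}/day")
```

The half-hit case is the one to watch in practice: the saving depends on the hit rate you actually achieve, not on the headline discount.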
Serving and runtime levers
The serving engine decides how much capacity you waste. Continuous batching groups tokens from many requests so the GPU rarely sits idle, and it lifts utilization without hurting latency at most percentiles. Teams that already own a GPU fleet can reuse it for serving, which makes batching the single biggest infrastructure win available to them.
Memory management compounds that gain. PagedAttention stores the key value cache in non-contiguous blocks and cuts wasted memory from the 60 to 80 percent typical of older systems to under 4 percent, raising throughput two to four times on the same hardware (source). More room for concurrent requests means more tokens per dollar with no new GPUs.
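The waste figures become easier to picture with a toy accounting model. This sketch compares reserving a full maximum-length slot per sequence against allocating fixed-size blocks on demand; it is a simplified illustration, not the actual PagedAttention implementation, and the sequence lengths, maximum length, and block size are assumptions.

```python
import math
from typing import List


def contiguous_reserved(seq_lens: List[int], max_len: int) -> int:
    """Token slots reserved when every sequence gets a full max_len region."""
    return len(seq_lens) * max_len


def paged_reserved(seq_lens: List[int], block_size: int) -> int:
    """Token slots reserved when each sequence gets only the blocks it needs."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


def waste_fraction(reserved: int, used: int) -> float:
    """Share of reserved slots that hold no tokens."""
    return 1 - used / reserved


seq_lens = [100, 350, 512, 40]   # hypothetical in-flight sequence lengths
used = sum(seq_lens)

contiguous_waste = waste_fraction(contiguous_reserved(seq_lens, max_len=1024), used)
paged_waste = waste_fraction(paged_reserved(seq_lens, block_size=16), used)
print(f"contiguous waste: {contiguous_waste:.1%}, paged waste: {paged_waste:.1%}")
```

With block allocation, the only waste left is the partly filled last block of each sequence, so it shrinks as the block size does.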
Speculative decoding speeds generation by letting a small draft model propose tokens that the large model verifies in parallel. It lowers latency and raises throughput on the same chip, which matters most for interactive endpoints. Strong LLM observability is the prerequisite, because you need per-request token and latency data to confirm these changes pay off rather than assuming they do.
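The draft-and-verify loop can be sketched with toy models. This is the greedy variant only, where a draft token is accepted when it matches the large model's own choice; production systems use a probabilistic acceptance rule for sampling and verify all positions in one batched forward pass. The `draft` and `target` functions are stand-ins, not real models.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Propose k draft tokens, keep the matching run, then add one target token."""
    proposed: List[int] = []
    context = list(prefix)
    for _ in range(k):
        token = draft(context)
        proposed.append(token)
        context.append(token)

    accepted: List[int] = []
    context = list(prefix)
    for token in proposed:
        expected = target(context)   # done in parallel on real hardware
        if expected != token:
            accepted.append(expected)  # target's correction replaces the miss
            return accepted
        accepted.append(token)
        context.append(token)
    accepted.append(target(context))   # all accepted: one extra token for free
    return accepted


def generate(prefix: List[int], draft: NextToken, target: NextToken,
             k: int, n_tokens: int) -> List[int]:
    """Generate n_tokens after prefix using repeated speculative steps."""
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prefix) + n_tokens]


def target(context: List[int]) -> int:
    return (context[-1] + 1) % 10      # toy "large model"


def draft(context: List[int]) -> int:
    # Toy "small model": agrees with the target except after token 4.
    return 0 if context[-1] == 4 else (context[-1] + 1) % 10


print(generate([0], draft, target, k=3, n_tokens=12))
```

In this greedy form the output is identical to what the large model would produce alone; the saving is in how many large-model passes it takes to get there.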
Infrastructure-level levers
Autoscaling stops you from paying for idle GPUs. Scale on queue depth or batch metrics rather than raw GPU utilization, since utilization lags real demand and leaves capacity stranded. This sits at the center of managing infrastructure for generative AI, where a handful of always-on instances can quietly cost more than the model serving itself.
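A queue-based scaling rule can be as small as the sketch below. The function name, the per-replica queue target, and the replica bounds are illustrative assumptions; a real autoscaler would also smooth the signal and add cooldowns.

```python
import math


def desired_replicas(queue_depth: int, target_queue_per_replica: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Replica count that keeps the queue near the per-replica target."""
    if target_queue_per_replica <= 0:
        raise ValueError("target_queue_per_replica must be positive")
    wanted = math.ceil(queue_depth / target_queue_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


for depth in (0, 45, 500):
    replicas = desired_replicas(depth, target_queue_per_replica=10,
                                min_replicas=1, max_replicas=8)
    print(f"queue depth {depth:>3} -> {replicas} replicas")
```

The minimum bound is the always-on cost you accept for latency, and the maximum bound is the spending cap, so both deserve an explicit owner.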
Spot capacity is the largest hourly discount on the table. AWS offers spare capacity at up to 90 percent off on-demand prices, with the catch that instances can be reclaimed on two minutes' notice (source). It fits stateless batch inference well, since a reclaimed request simply gets reissued. That makes it a natural extension of broader GPU cost optimization once you add retry logic and a small on-demand buffer for latency-critical traffic.
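The retry logic and on-demand buffer might look like the sketch below. `SpotInterrupted`, `run_on_spot`, and `run_on_demand` are hypothetical names standing in for whatever your serving layer exposes, and the pattern assumes requests are stateless and safe to reissue.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class SpotInterrupted(Exception):
    """Hypothetical signal that a spot worker was reclaimed mid-request."""


def run_with_fallback(request: str,
                      run_on_spot: Callable[[str], T],
                      run_on_demand: Callable[[str], T],
                      max_spot_attempts: int = 3) -> T:
    """Try spot capacity a few times, then fall back to on-demand."""
    for _ in range(max_spot_attempts):
        try:
            return run_on_spot(request)
        except SpotInterrupted:
            continue  # stateless request, so reissuing it is safe
    return run_on_demand(request)


attempts = {"spot": 0}


def flaky_spot(request: str) -> str:
    """Simulated spot worker that is reclaimed on its first two attempts."""
    attempts["spot"] += 1
    if attempts["spot"] < 3:
        raise SpotInterrupted()
    return f"spot:{request}"


def on_demand(request: str) -> str:
    return f"on-demand:{request}"


print(run_with_fallback("summarize-batch-17", flaky_spot, on_demand))
```

Capping the spot attempts keeps a wave of reclamations from turning into unbounded latency, at the price of occasionally paying the on-demand rate.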
Hardware choice sets the floor under everything else. Newer accelerators produce far more tokens per second per dollar, and a chip that costs more per hour is often cheaper per token because of its higher throughput. Anchoring on an older instance that quietly lost the per-token race is a common trap, which is why disciplined teams watch AI GPU pricing across providers and generations before they commit.
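Converting an hourly rate into a per-token figure takes one division. The prices and throughput numbers below are made up to show the shape of the comparison and do not describe any specific accelerator.

```python
def gpu_cost_per_million_tokens(hourly_price_usd: float,
                                tokens_per_second: float) -> float:
    """Cost per million tokens for an instance kept fully busy."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000


# Hypothetical instances: the newer one costs twice as much per hour.
older = gpu_cost_per_million_tokens(hourly_price_usd=2.00, tokens_per_second=1_000)
newer = gpu_cost_per_million_tokens(hourly_price_usd=4.00, tokens_per_second=3_000)
print(f"older: ${older:.3f} per 1M tokens, newer: ${newer:.3f} per 1M tokens")
```

The calculation assumes full utilization, so a partly idle fleet pays more per token than this figure suggests.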
Application-level levers
Model routing sends each request to the cheapest model that can handle it. Easy, high-frequency queries go to a small model while complex reasoning reaches the frontier tier, and routing alone cuts spend in mixed workloads. Routing pays off only when something compares live price and latency across vendors before each call, which is what a multi-provider LLM cost management tool is built to do.
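A routing rule and its effect on the blended price can be sketched in a few lines. The model names, the length threshold, the `needs_reasoning` flag, and the prices are all hypothetical; real routers usually rely on a classifier or live price and latency data rather than a fixed rule.

```python
def route(prompt: str, needs_reasoning: bool,
          max_small_prompt_chars: int = 2_000) -> str:
    """Pick the cheapest tier that the request is assumed to fit."""
    if needs_reasoning or len(prompt) > max_small_prompt_chars:
        return "frontier-model"
    return "small-model"


def blended_price_per_m(small_share: float, small_price_per_m: float,
                        frontier_price_per_m: float) -> float:
    """Average price per million tokens for a given share of small-model traffic."""
    if not 0.0 <= small_share <= 1.0:
        raise ValueError("small_share must be between 0 and 1")
    return (small_share * small_price_per_m
            + (1 - small_share) * frontier_price_per_m)


print(route("What are your opening hours?", needs_reasoning=False))
print(route("Compare these two contracts clause by clause.", needs_reasoning=True))

# Placeholder prices in USD per million tokens.
blended = blended_price_per_m(small_share=0.7, small_price_per_m=0.5,
                              frontier_price_per_m=10.0)
print(f"blended: ${blended:.2f} per 1M tokens versus $10.00 unrouted")
```

The blended figure shows why the share of traffic the small model can absorb matters more than its exact price.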
On-device inference removes the server bill entirely for some request paths. Routine requests run locally on the user's own hardware, and only the genuinely hard cases get forwarded to the cloud at all. For teams already committed to managed endpoints, knowing when a hybrid split saves money rather than adding complexity for its own sake comes down to a clear view of the true cost of Azure OpenAI.
Self-hosted or managed API
There is no universal answer, only a break-even point. Below your own crossover volume, a managed API wins because the provider already runs inference at a scale you cannot match, and you pay nothing for idle hardware. Above it, self-hosting wins because fixed GPU capacity gets cheaper per token the more you keep it busy. Pricing references such as Amazon Bedrock give you the managed baseline to compare against.
The honest cost of self-hosting includes the work around the model. Spot recovery, monitoring, load balancing, egress, and the engineers who keep it running all add to the per token figure, and they often flip a calculation that looked favorable on raw GPU price alone. A short audit with FinOps tools for AI cost management gives you the fully loaded number on both sides before you commit.
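The break-even point itself is a single ratio once the self-hosting side is fully loaded. This sketch assumes the self-hosted fleet has a fixed monthly cost that already includes GPUs, engineers, monitoring, and egress, and that it can serve the whole volume without growing; all figures are hypothetical.

```python
def break_even_million_tokens(fixed_monthly_usd: float,
                              api_price_per_m_usd: float) -> float:
    """Monthly volume, in millions of tokens, where both options cost the same."""
    if api_price_per_m_usd <= 0:
        raise ValueError("api_price_per_m_usd must be positive")
    return fixed_monthly_usd / api_price_per_m_usd


def cheaper_option(monthly_million_tokens: float, fixed_monthly_usd: float,
                   api_price_per_m_usd: float) -> str:
    """Compare the fixed self-hosting cost with the usage-based API cost."""
    api_cost = monthly_million_tokens * api_price_per_m_usd
    return "self-hosted" if fixed_monthly_usd < api_cost else "managed API"


FIXED_MONTHLY_USD = 20_000   # hypothetical fully loaded self-hosting cost
API_PRICE_PER_M = 5.0        # hypothetical managed price per million tokens

crossover = break_even_million_tokens(FIXED_MONTHLY_USD, API_PRICE_PER_M)
print(f"break-even at {crossover:,.0f}M tokens per month")
print(cheaper_option(1_000, FIXED_MONTHLY_USD, API_PRICE_PER_M))
print(cheaper_option(10_000, FIXED_MONTHLY_USD, API_PRICE_PER_M))
```

Leaving the overhead items out of the fixed cost moves the crossover lower than it really is, which is the flip the paragraph above warns about.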
A sequence that works
Before the order, here is how the main levers compare on what they do and the savings each one is documented to deliver:
| Lever | What it does | Documented impact |
|---|---|---|
| Prompt caching | Reuses the computed state of a repeated prefix | Cached input near 10% of the base rate, about 90% off the repeated portion |
| INT8 quantization | Drops the weights from 16-bit to 8-bit | Roughly half the memory, about 99% of full-precision quality |
| Continuous batching + PagedAttention | Packs many requests and stores the KV cache in blocks | Wasted memory from 60-80% down to under 4%, throughput up 2 to 4 times |
| Spot capacity | Rents spare cloud GPUs | Up to 90% below on-demand prices, with two minutes of interruption notice |
| Model routing | Sends easy queries to a smaller model | Large cut in mixed workloads, no quality loss on simple calls |
Apply the levers in order of return and effort, not novelty:
Attribute spend per model, feature, and team, the foundation every AI cost tracking routine builds on, so you optimize the paths that matter.
Cache repeated prompts and trim context; the cheapest wins with no quality risk.
Quantize and right-size the model after testing quality on your own data.
Move serving onto a continuous batching engine and confirm the gain.
Route easy queries to smaller models across providers.
Add spot capacity and autoscaling for fault-tolerant batch work last.
Run this loop on a schedule, not once. Inference cost drifts as traffic, prompts, and model prices change, so a quarterly pass backed by AI cost governance tools keeps the gains from eroding.
Common mistakes to avoid
The recurring traps are easy to name and easy to walk into:
Optimizing before measuring, so effort lands on a model trick while a retry loop drives the real bill
Judging hardware by the hour, when a pricier GPU is often cheaper per token because it serves more requests at once
Caching nothing, leaving a repeated system prompt to bill at full price on every single call
Routing everything to the frontier tier, paying reasoning prices for queries that a small model would answer
The most expensive mistake is optimizing before measuring. Teams quantize a model or chase a faster engine while a single retry loop or an oversized context window drives the real bill. Reading cost per token and per request first, the way you would read any cloud line item, points you at the few changes that move the number.
The second mistake is judging hardware by its hourly price. A larger or newer GPU can cost more per hour and still be cheaper per token because it serves far more requests at once. Optimizing for tokens per dollar rather than dollars per hour is what separates a real saving from a number that only looks lower on the rate card.
Conclusion
Reducing inference cost is a measurement problem first and an engineering problem second. Attribute the spend, find the few workloads and requests that dominate it, then apply caching, quantization, batching, routing, and spot capacity in the order that returns the most for the least effort. The teams that win treat this as a standing FinOps practice, not a one-time cleanup, and they keep the gains by watching cost per token as closely as they watch latency.
FAQs
What is inference cost?
Inference cost is the recurring spend to run a trained model in production. It is the per-token price charged by an API, or the GPU hours used to self-host, plus the efficiency of how those tokens are served. It scales with usage, unlike a one-time training cost.
What reduces inference cost the most?
Measuring and attributing spend first, then prompt caching, quantization, and continuous batching usually deliver the biggest returns for the least effort. Caching can cut repeated prefill cost by around 90 percent, and continuous batching with PagedAttention raises throughput two to four times on the same GPUs.
Does quantization hurt model quality?
Not much at 8-bit. Evaluations across hundreds of thousands of prompts show INT8 keeping roughly 99 percent of full precision quality. Loss grows at 4 bits and below, and concentrates in math and long context tasks, so test on your own samples before shipping aggressive quantization.
Is self-hosting cheaper than using an API?
Only above a break-even volume. Below it, a managed API wins because providers run inference at scale, and you pay nothing for idle hardware. Above it, self-hosting wins if you keep GPUs busy and account for spot recovery, monitoring, and engineering overhead.
Are spot instances safe for inference?
Yes for stateless, fault-tolerant batch work, where AWS spot capacity runs up to 90 percent below on-demand prices with two minutes of interruption notice. Keep latency-critical, customer-facing traffic on on-demand or reserved capacity, and add retry logic for the spot portion.
How do I measure cost per token?
Tag every inference call with the model, feature, and team that triggered it, then divide spend by tokens served for each segment. Tracking cost per million tokens, rather than GPU hours alone, lets you separate rising traffic from falling efficiency on the same bill.