
What Is LLM Inference? How It Works and What It Costs

June 26, 2026

8 min read

Amnic

AI for FinOps


LLM inference is the process of running a trained large language model to turn a prompt into output tokens. The model reads your input, predicts the next token and repeats until the response is complete. No learning happens during this step. The weights stay frozen, so every prompt you send to a model in production is one inference call.

This is the part of an LLM that actually does the work. Training builds the model once. Inference runs it forever, every time a user types a question, an agent calls a tool, or a feature handles live AI workloads. Understanding how inference works is the first step toward controlling what it costs, which is where most teams get surprised.

What LLM inference means in plain terms

LLM inference is read-only and forward-only. The model takes your prompt, runs it through its transformer layers, and produces output one token at a time, with one forward pass per generated token. It does not update its parameters, store your prompt as new knowledge, or learn from the exchange. It applies patterns it has already learned during training to predict the most likely next token.

A useful way to frame it: training is the work and cost that go into building the model, and inference is the work and cost of running the finished model. The build is a one-time fixed cost. Inference is an ongoing marginal cost that scales with every request, which is why it tends to dominate the budget for an AI feature in production. Teams running real production traffic feel this shift fast.

LLM inference vs training vs fine-tuning

These three get confused constantly, so it helps to separate them cleanly. Training is the initial process that creates the model by adjusting billions of weights across massive datasets. Fine-tuning takes an already trained model and adjusts its weights further on a narrower dataset. Both change the model. Inference changes nothing in the model itself.

Process	What it does	Changes the weights?	How often it runs
Training	Builds the model from scratch on massive datasets	Yes	Once
Fine-tuning	Adapts a trained model on a narrower dataset	Yes	Occasionally
Inference	Runs the finished model to answer a prompt	No	Every single request

Inference is the only one of the three that happens every time someone uses the product. A model is trained once, fine-tuned occasionally, and queried millions of times. That difference in frequency is the reason inference, not training, is where cost accumulates over a model's life. GPU economics for AI training and for inference are not the same problem.

How LLM inference works: prefill and decode

Inference runs in two distinct phases. The prefill phase reads your entire prompt at once. The model processes all input tokens in parallel, works out how each token relates to the others, and stores the resulting attention keys and values in GPU memory, in a structure called the KV cache. This phase is compute-bound and sets your time to first token, the delay before the first word appears.

The decode phase generates the response one token at a time. The model predicts a token, appends it to the running text, and uses that longer sequence to predict the next token. This autoregressive loop repeats until it hits a stop signal or a length limit. Decode is memory-bandwidth-bound, because the GPU must stream the model's weights from memory on every decode step, and each step produces only one token per request.
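The decode loop is easier to see in code. The sketch below is a toy, not a real model: `generate`, `countdown_model`, and the `EOS` token id are made-up names for illustration, and the "model" is a stand-in function that returns the next token id for a given sequence.

```python
from typing import Callable, List

EOS = 0  # hypothetical end-of-sequence token id


def generate(prompt: List[int],
             next_token: Callable[[List[int]], int],
             max_new_tokens: int = 16) -> List[int]:
    """Autoregressive decode: predict, append, repeat until EOS or the length limit."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # one forward pass per generated token
        if tok == EOS:            # stop signal
            break
        tokens.append(tok)        # the longer sequence feeds the next step
    return tokens[len(prompt):]   # only the newly generated tokens


def countdown_model(tokens: List[int]) -> int:
    """Stand-in for a model: counts down from the last token, emits EOS at zero."""
    return max(tokens[-1] - 1, 0)


output = generate([5], countdown_model)
print(output)
```

Each pass through the loop corresponds to one decode step, which is why a longer response costs more time and more output tokens than a longer prompt of the same size.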

The KV cache is what keeps decode from reprocessing your whole prompt on every step. It stores the attention keys and values computed during prefill, plus those of each token generated so far, so the model reuses them instead of recomputing. The cache grows with every token and with every concurrent request, which is why GPU memory, not raw compute, is usually the limit on how many users you can serve at once.
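A rough size estimate shows why the cache becomes the limit. The sketch below counts one key and one value vector per layer, per KV head, per token. The model shape used in the example (32 layers, 8 KV heads, head dimension 128, 16-bit values) is a hypothetical configuration chosen for round numbers, not a specific model.

```python
def kv_cache_bytes(num_layers: int,
                   num_kv_heads: int,
                   head_dim: int,
                   seq_len: int,
                   batch_size: int,
                   bytes_per_value: int = 2) -> int:
    """Approximate KV cache size: keys and values for every layer, head and token."""
    keys_and_values = 2  # one K tensor and one V tensor per layer
    return (keys_and_values * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model shape, 16-bit cache values.
per_token = kv_cache_bytes(32, 8, 128, seq_len=1, batch_size=1)
sixteen_users = kv_cache_bytes(32, 8, 128, seq_len=4096, batch_size=16)

print(per_token / 1024, "KiB per token")
print(sixteen_users / 1024**3, "GiB for 16 requests at 4096 tokens each")
```

The total scales linearly with both context length and concurrency, so doubling either one doubles the memory the cache needs on top of the weights.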

How inference performance is measured

Engineers track a small set of metrics that decide whether an LLM feature feels fast and how much it costs to serve. Time to first token measures the delay between sending a request and seeing the first word, which is dominated by the prefill phase plus any time spent queued. A model that takes five seconds to start feels slow even if it generates quickly after that, so this metric maps directly to perceived responsiveness.

Throughput measures how many tokens the system produces per second across all concurrent requests. Time per output token, sometimes called inter-token latency, measures the gap between each generated token during decode. Together, these define the latency a single user feels and the total load the system can carry. Batching more requests raises throughput but can lower per-user speed once memory runs short, which is why real GPU utilization rarely reaches its theoretical peak.
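As an illustration, all three metrics can be derived from per-token timestamps. The sketch below assumes you log when each request was sent and when each output token arrived; `RequestTrace` and the sample timings are invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    sent_at: float            # seconds, when the request was sent
    token_times: List[float]  # seconds, arrival time of each output token


def time_to_first_token(t: RequestTrace) -> float:
    return t.token_times[0] - t.sent_at


def time_per_output_token(t: RequestTrace) -> float:
    """Mean gap between consecutive output tokens (inter-token latency)."""
    n = len(t.token_times)
    if n < 2:
        return 0.0
    return (t.token_times[-1] - t.token_times[0]) / (n - 1)


def throughput(traces: List[RequestTrace]) -> float:
    """Output tokens per second across all requests in the window."""
    total_tokens = sum(len(t.token_times) for t in traces)
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    return total_tokens / (end - start)


a = RequestTrace(sent_at=0.0, token_times=[0.5, 0.75, 1.0, 1.25, 1.5])
b = RequestTrace(sent_at=0.0, token_times=[1.0, 1.5, 2.0])

print(time_to_first_token(a), time_per_output_token(a), throughput([a, b]))
```

The first two numbers describe what one user feels. The third describes what the system carries, which is why they can move in opposite directions as batch size grows.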

Tokens per second is the number practitioners reach for first when they ask whether performance is good. A chat interface needs to generate faster than a person reads, or the response feels slow, while a batch summarization job can tolerate much slower generation. These numbers drive both user experience and the hardware you need, which is the bridge to cost.

Why LLM inference needs so much compute

Inference is resource-intensive because the decode phase is sequential and memory-bound. Every decode step requires reading billions of parameters from GPU memory, so the GPU often spends more time waiting for data than doing math. Expensive accelerators can sit underused during inference, and serving costs stay high even as published AI GPU pricing per hour drifts down.
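A back-of-the-envelope calculation makes the memory bound concrete. If every decode step has to stream all the weights once, a single request cannot generate faster than memory bandwidth divided by model size. The sketch below is a rough ceiling only: it assumes a dense model, a batch of one, and ignores KV cache reads and compute time. The 7 billion parameters and 1.4 TB/s bandwidth are hypothetical round numbers, not a specific model or GPU.

```python
def decode_ceiling_tokens_per_sec(num_params: float,
                                  bytes_per_param: float,
                                  bandwidth_bytes_per_sec: float) -> float:
    """Upper bound on single-request decode speed when weight reads dominate."""
    model_bytes = num_params * bytes_per_param
    return bandwidth_bytes_per_sec / model_bytes


PARAMS = 7e9         # hypothetical dense model
BANDWIDTH = 1.4e12   # hypothetical accelerator, bytes per second

fp16 = decode_ceiling_tokens_per_sec(PARAMS, 2, BANDWIDTH)  # 16-bit weights
int8 = decode_ceiling_tokens_per_sec(PARAMS, 1, BANDWIDTH)  # 8-bit weights

print(f"16-bit ceiling: {fp16:.0f} tokens/s, 8-bit ceiling: {int8:.0f} tokens/s")
```

Under these assumptions, halving the bytes per weight doubles the ceiling. Batching helps for a related reason: one read of the weights can serve several requests in the same step.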

GPUs are the default hardware because they read from memory and run matrix operations far faster than CPUs. Real-time inference at scale usually depends on them, which ties LLM spend directly to GPU supply and utilization. Anyone budgeting for production inference has to think about GPU cost optimization before committing to an architecture, because self-hosting and managed APIs price the same risk in very different ways.

Provider APIs hide the GPUs and charge per token instead. Services like Amazon Bedrock and the major model APIs let you skip hardware entirely, trading capital cost for a usage-based bill. That bill is easy to start and hard to predict, because token consumption grows with adoption in ways that rarely show up in a small pilot. The work shifts from running servers to governing spend.

What LLM inference actually costs you

Here is what almost no explainer covers. A single user action is rarely a single API call. One action can trigger several model calls once you add retries, tool use, and error handling, and an AI agent can fire dozens of sequential calls to finish one task. Teams that never learned how to track AI cost routinely report inference bills climbing several times over between a pilot and full production as this fan-out compounds.
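The fan-out is simple to estimate once you know roughly how many steps a task takes and how often a call fails. The sketch below assumes each call fails independently with the same probability and is retried up to a fixed limit. The 12 steps, 10% failure rate, and 2 retries are illustrative values, not measurements.

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Expected calls for one step: attempt k+1 only happens if the first k all failed."""
    return sum(failure_rate ** k for k in range(max_retries + 1))


def calls_per_action(steps: int, failure_rate: float, max_retries: int) -> float:
    """Expected model calls behind one user action."""
    return steps * expected_attempts(failure_rate, max_retries)


single_call_feature = calls_per_action(steps=1, failure_rate=0.0, max_retries=0)
agent_task = calls_per_action(steps=12, failure_rate=0.1, max_retries=2)

print(single_call_feature, round(agent_task, 2))
```

A pilot that models one call per action will underestimate the bill by roughly that multiple before any growth in users is counted.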

The trap is treating the price per token as your cost. Price per token is set by the provider. Your actual cost depends on token efficiency, meaning the number of tokens a feature burns to finish a job, which varies wildly across prompts, models and retry behavior. A model with a lower rate that needs three times the tokens is only cheaper if its rate is less than a third of the alternative's. Comparing options well means looking past the sticker rate, which is what an honest LLM cost comparison is for.
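A small comparison shows how the sticker rate can mislead. The sketch below computes cost per completed task from the rate, the tokens used per call, and the average number of calls per task. Every price and token count in it is invented for illustration and does not reflect any provider's actual pricing.

```python
from dataclasses import dataclass


@dataclass
class ModelOption:
    name: str
    price_in_per_mtok: float   # dollars per million input tokens (hypothetical)
    price_out_per_mtok: float  # dollars per million output tokens (hypothetical)
    tokens_in_per_call: int
    tokens_out_per_call: int
    calls_per_task: float      # average, including retries


def cost_per_task(m: ModelOption) -> float:
    per_call = (m.tokens_in_per_call * m.price_in_per_mtok
                + m.tokens_out_per_call * m.price_out_per_mtok) / 1_000_000
    return per_call * m.calls_per_task


cheap_rate = ModelOption("cheap-rate", 0.5, 1.5, 6000, 1500, 1.5)
efficient = ModelOption("efficient", 1.0, 3.0, 2000, 500, 1.0)

for m in (cheap_rate, efficient):
    print(f"{m.name}: ${cost_per_task(m):.6f} per task")
```

In this made-up case the model at half the rate costs more than twice as much per task, because it needs longer prompts, longer outputs, and more retries to finish the same job.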

The deeper problem is allocation, the exact gap that FinOps for AI was built to close. A token has no owner. An inference call leaves no taggable cloud resource behind, so default billing tells you the total but never which feature, customer, or team caused it. To answer that, cost has to be attributed at the application layer by tagging every call with a feature and customer identifier, then aggregating per unit. An AI cost management platform for enterprise turns that attribution into a standing view instead of a quarterly scramble through invoices.
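Application-layer attribution can start as a very small piece of code. The sketch below assumes you record a feature and customer identifier alongside the token counts for each call, with the counts taken from whatever usage figures your provider returns. The record shape, the identifiers, and the prices are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable

PRICE_IN_PER_MTOK = 1.0   # hypothetical dollars per million input tokens
PRICE_OUT_PER_MTOK = 3.0  # hypothetical dollars per million output tokens


@dataclass(frozen=True)
class CallRecord:
    feature: str
    customer: str
    input_tokens: int
    output_tokens: int


def call_cost(r: CallRecord) -> float:
    return (r.input_tokens * PRICE_IN_PER_MTOK
            + r.output_tokens * PRICE_OUT_PER_MTOK) / 1_000_000


def attribute(records: Iterable[CallRecord],
              key: Callable[[CallRecord], str]) -> Dict[str, float]:
    """Sum call costs by any tag on the record, such as feature or customer."""
    totals: Dict[str, float] = defaultdict(float)
    for r in records:
        totals[key(r)] += call_cost(r)
    return dict(totals)


records = [
    CallRecord("search", "acme", 1000, 200),
    CallRecord("search", "globex", 2000, 400),
    CallRecord("summarize", "acme", 10000, 1000),
]

by_feature = attribute(records, lambda r: r.feature)
by_customer = attribute(records, lambda r: r.customer)
print(by_feature, by_customer)
```

The same records answer both questions, which feature and which customer, because the tags are attached at the moment of the call and not reconstructed from an invoice afterwards.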

The metric that matters is cost per outcome, not cost per token. Cost per resolved support ticket, per generated report, or per active customer tells you whether a feature pays for itself. Reaching it requires LLM observability on how tokens flow through each feature in production. Without that visibility, you are optimizing blind and guessing at which workloads drive the bill.
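Once spend is attributed to a feature, cost per outcome is one division away. The sketch below uses invented figures for a support feature: $120 of inference spend, 500 tickets attempted, and 400 actually resolved.

```python
from typing import Optional


def cost_per_outcome(total_cost: float, outcomes: int) -> Optional[float]:
    """Spend divided by successful outcomes; None when nothing succeeded."""
    if outcomes <= 0:
        return None
    return total_cost / outcomes


spend = 120.0            # hypothetical inference spend for the feature
tickets_attempted = 500  # hypothetical
tickets_resolved = 400   # hypothetical

per_attempt = cost_per_outcome(spend, tickets_attempted)
per_resolved = cost_per_outcome(spend, tickets_resolved)
print(per_attempt, per_resolved)
```

Dividing by resolved tickets and not by attempts is the point: failed attempts still burn tokens, so they belong in the cost of the outcomes that did succeed.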

How to control inference cost

Start with the engineering levers, because they cut the bill without touching adoption:

Prompt caching reuses a repeated prefix instead of reprocessing it, so the context you resend on every call stops being charged at full price.
Batch processing trades real-time latency for a sizable discount on jobs that do not need an instant answer, like overnight summarization or evaluation runs.
Quantization stores model weights in lower precision, which reduces the memory each token must read and lifts throughput on the same GPU.
Model routing sends simple requests to a small model and reserves the large one for hard tasks, since most production traffic never needs a frontier model. Disciplined AI token management keeps that routing honest as prompts and features multiply.
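To make the last lever concrete, here is a minimal routing sketch. The rule is a placeholder heuristic based on prompt length and tool use. Real routers typically rely on a classifier or a quality signal, and the tier names, threshold, and per-call costs below are assumptions for illustration.

```python
def route(prompt: str, needs_tools: bool = False, max_small_chars: int = 500) -> str:
    """Pick a hypothetical model tier for a request."""
    if needs_tools or len(prompt) > max_small_chars:
        return "large"
    return "small"


def blended_cost_per_call(share_small: float,
                          cost_small: float,
                          cost_large: float) -> float:
    """Average cost per call when a share of traffic goes to the small model."""
    return share_small * cost_small + (1.0 - share_small) * cost_large


COST_SMALL = 0.001  # hypothetical dollars per call
COST_LARGE = 0.02   # hypothetical dollars per call

all_large = blended_cost_per_call(0.0, COST_SMALL, COST_LARGE)
routed = blended_cost_per_call(0.8, COST_SMALL, COST_LARGE)

print(route("What are your opening hours?"), all_large, round(routed, 4))
```

With these made-up numbers, sending 80% of traffic to the small model cuts the average cost per call to about a quarter. The saving only holds if the small model's answers are good enough that requests do not get retried on the large one.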

The levers only pay off if you can measure them per feature. Pairing every optimization with AI cost visibility tools is what turns a one-time saving into ongoing control, because the tooling ties spend to the exact features and customers that drive it, instead of a single monthly total that hides the detail.

Engineering work lowers the price of one call, but it cannot tell you whether that call was worth serving. Without per-feature visibility, every efficiency gain disappears the moment traffic shifts to a new feature or customer tier. Dedicated LLM cost allocation tools then keep that attribution accurate as the product grows, so a cheaper bill this quarter does not quietly become an untraceable one the next.

Why it matters

LLM inference is where the patterns a model learned become a product people use and where the recurring bill lives. Understanding the prefill and decode phases tells you why it is slow and memory-hungry. Understanding allocation tells you why it is expensive and how to fix it. The teams that treat inference as a FinOps problem, not just an engineering one, are the ones that scale AI features without watching margins quietly erode. That is the difference between a feature that ships and a feature that pays for itself.

FAQs

What is LLM inference in simple terms?

LLM inference is using a trained model to generate output from a prompt. The model predicts tokens one at a time based on patterns learned in training. It does not update its weights, so every request you send to a live model is an inference call.

Is LLM inference the same as running the model?

Yes. Running a trained model to get a response is inference. Every prompt you send to ChatGPT, an API, or a self-hosted model triggers at least one inference call. The model applies what it already learned and produces output without changing itself.

What is the difference between LLM training and inference?

Training builds the model by adjusting its weights on large datasets and it happens once. Inference runs the finished model to answer prompts and it happens every time the product is used. Training is a fixed cost; inference is a recurring one.

Why is LLM inference so expensive?

Inference repeats for every request, so its cost grows with usage, while training is paid once. Generating each output token also means reading the model's weights from GPU memory, which keeps hardware costs high. Over the life of a widely used model, inference spend usually exceeds total training spend.

What are the two phases of LLM inference?

Prefill and decode. Prefill reads the whole prompt in parallel, builds the KV cache and sets time to first token. Decode generates the response one token at a time in a loop. Prefill is compute-bound; decode is memory-bandwidth-bound.

Is a GPU required for LLM inference?

Not strictly, but it is the practical default for real-time use. CPUs can run small models, but generation is slow enough that responses feel sluggish. GPUs read memory and run matrix math fast enough to serve users at scale, which is why production inference relies on them.

Does price per token tell me my inference cost?

No. Price per token is the provider's rate, not your cost. Actual cost depends on token efficiency, how many tokens a feature burns per task, plus retries and agent fan-out. A cheaper rate on a wasteful model can cost more than a pricier, efficient one.
