What Is a Token in AI? Definition, Counting & Cost

A token is the basic unit of text an AI model reads and generates. It can be a whole word, a piece of a word, a single character, or a punctuation mark. A large language model never sees your sentence as words. It sees a sequence of tokens, converts each one into a number and predicts the next token one step at a time.
This matters for more than curiosity. Tokens are also the unit you pay for. Every model provider bills by the token, so the same concept that explains how AI reads language also explains your invoice. If you run AI workloads in production, understanding tokens is the first step toward AI token management and a predictable bill.
One quick clarification before going further: an AI token is not a crypto token. Search results mix the two, but they are unrelated. A crypto token is a tradable digital asset. An AI token is a slice of text that a language model processes. This article is about the second kind.
What is a token in AI?
A token is the smallest chunk of text a language model treats as a single unit. Models do not learn from raw letters or full words. They learn from tokens drawn from a fixed vocabulary, then predict which token is most likely to come next. This is the mechanic behind every chatbot, code assistant and agentic AI system you use today.
Tokens do not map one to one with words. Common short words like "cat" or "the" are usually one token each. Longer or rarer words get split. "Tokenization" often breaks into "token" and "ization", and a made-up string like "Zylphora" can cost five or six tokens. Punctuation frequently gets its own token, and spaces are often attached to the word that follows, which is why "Hello, how are you?" uses more tokens than the four words suggest.
How tokenization works
Tokenization is the step that turns text into numbers a model can compute on. It runs before the model does anything else.
Most modern models use byte-pair encoding (BPE) or a close variant. It starts from individual characters or bytes and repeatedly merges the most common adjacent pairs into larger units, building a vocabulary of frequent fragments. Frequent words survive as single tokens. Rare words fall back to smaller pieces. This keeps the vocabulary small enough to be efficient while still covering any input, including typos, code and languages the model rarely sees.
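The merge loop described above can be sketched in a few lines. This is a toy, character-level version for illustration only: the function names (`learn_bpe_merges`, `merge_pair`, `tokenize`) and the tiny training list are assumptions, and production tokenizers work on bytes, use far larger corpora and apply extra pre-processing.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a tuple of symbols."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words, num_merges):
    """Learn an ordered list of merge rules from a list of training words."""
    # Each distinct word starts as a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the winning pair fused into one symbol.
        rewritten = Counter()
        for symbols, freq in corpus.items():
            rewritten[merge_pair(symbols, best)] += freq
        corpus = rewritten
    return merges


def tokenize(word, merges):
    """Split a word into tokens by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical training data: "token" is frequent, so it becomes one unit.
training_words = ["token"] * 10 + ["tokens"] * 5
rules = learn_bpe_merges(training_words, num_merges=4)
print(tokenize("token", rules))
print(tokenize("tokenization", rules))
```

With this training list, the frequent word ends up as a single token, while the unseen suffix of "tokenization" falls back to individual characters, which mirrors the frequent-versus-rare behavior described above.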
Each token in that vocabulary has a fixed ID number. The model embeds that ID into a vector, processes the whole sequence and outputs one token at a time until it decides the response is complete. The exact split depends on the model's tokenizer, which is why the same sentence can produce different token counts on different providers.
How many tokens are in a word?
A reliable rule of thumb for English is that one token is roughly four characters, or about three-quarters of a word. Put another way, 1,000 words land near 1,300 to 1,500 tokens depending on vocabulary.
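As a rough sketch, the two ratios above translate into a quick estimator. The function name `estimate_tokens` is hypothetical, and the result is only an approximation for English prose, not a real tokenizer count.

```python
def estimate_tokens(text: str) -> dict:
    """Estimate the token count of English text two ways.

    Uses the rules of thumb from the article: about 4 characters per token,
    and about 0.75 words per token. Neither replaces a real tokenizer.
    """
    char_count = len(text)
    word_count = len(text.split())
    return {
        "by_chars": round(char_count / 4),
        "by_words": round(word_count / 0.75),
    }


sample = "Tokens are the unit a language model reads, writes and bills."
print(estimate_tokens(sample))
```

The two estimates rarely agree exactly. Treating the higher one as the working figure is a conservative choice when budgeting.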
That ratio shifts with content type. Code tends to use more tokens for the same number of characters because symbols and indentation are often tokenized separately, and non-Latin scripts such as Japanese or Arabic use more tokens per character than English. So the same idea, written in two languages, can carry very different token counts and very different costs.
When you need an exact number rather than an estimate, use the tokenizer that matches the specific model you are calling: a library such as tiktoken for OpenAI models, or the token-counting tool your provider publishes. Knowing the token count of a prompt tells you whether it fits inside the model limit and what the call will cost, so counting before a call is the basis for budgeting it.
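A minimal sketch of counting tokens with tiktoken follows. It assumes the `cl100k_base` encoding, which is only correct for some OpenAI models, so check which encoding your model actually uses. The helper name `count_tokens` and the fallback to the four-characters-per-token estimate (used when tiktoken is not installed or its vocabulary cannot be loaded) are illustrative choices, not part of any library.

```python
import math


def count_tokens(text: str, encoding_name: str = "cl100k_base"):
    """Return (token_count, method) for the given text.

    Tries an exact count with tiktoken first. If the library is missing or
    the encoding cannot be loaded, falls back to a rough estimate.
    """
    try:
        import tiktoken  # third-party: pip install tiktoken

        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(text)), "tiktoken"
    except Exception:
        # Rule-of-thumb fallback: about 4 characters per token in English.
        return math.ceil(len(text) / 4), "estimate"


count, method = count_tokens("Hello, how are you?")
print(f"{count} tokens (method: {method})")
```

Returning the method alongside the count makes it obvious in logs whether a number is exact or estimated.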
What is a context window?
A context window is the maximum number of tokens a model can handle in a single request, counting both your input and the output it generates. When a conversation crosses that limit, something has to give: depending on the provider and client, the request is rejected or older content is dropped or summarized, usually starting with the oldest messages. Limits vary widely, from around 128,000 tokens on many production models to 200,000 and up to roughly two million on the largest context models. A bigger window lets you pass more context in one call, but it also means a single request can carry far more tokens and far more cost.
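To make the limit concrete, here is a hedged sketch of two checks an application might run before a call. The function names, the 128,000-token window and the crude token counter are assumptions for illustration. Dropping the oldest messages first is one common strategy, not the only one.

```python
def fits_in_window(input_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """True if the prompt plus the reserved output budget fits the window."""
    return input_tokens + max_output_tokens <= context_window


def trim_history(messages, budget, count_tokens):
    """Keep the most recent messages whose combined token count fits `budget`.

    `count_tokens` is any callable that returns a token count for one message.
    """
    kept = []
    used = 0
    # Walk from newest to oldest so the most recent context is kept first.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


def rough_count(message: str) -> int:
    """Crude stand-in for a tokenizer: about 4 characters per token."""
    return len(message) // 4


history = ["a" * 40, "b" * 40, "c" * 40]  # three messages of about 10 tokens each
print(fits_in_window(120_000, 8_000, 128_000))
print(trim_history(history, budget=20, count_tokens=rough_count))
```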
Tokens vs words vs characters
These three are easy to confuse, so here is the clean separation.
| Unit | What it is | Rough size in English |
|---|---|---|
| Character | A single letter, digit, space, or symbol | 1 character (~0.25 tokens) |
| Token | A model-defined fragment, often a sub-word | ~4 characters |
| Word | A human unit separated by spaces | ~1.3 tokens |
Characters are what you type. Words are how you read. Tokens are how the model counts. Billing and context limits both run on tokens, not on the other two, which is why the token view is the one that controls cost.
Input tokens vs output tokens
Every model API splits usage into two buckets that are priced differently. Input tokens cover everything you send, including the system prompt, instructions, context and the user message. Output tokens cover everything the model generates back.
Output tokens almost always cost more, typically three to five times the input rate and sometimes up to eight times. The reason is computational. The tokens in your prompt can be processed in parallel across hardware, while a response is generated one token at a time in sequence, which is slower and more expensive.
The gap shows up directly in published rates. One flagship model lists input near $5.00 and output near $30.00 per million tokens, while lighter models cost a small fraction of that. That is a wide spread, and it is why a head-to-head comparison such as Anthropic vs OpenAI, a single-provider breakdown such as Gemini API pricing, and a broader LLM cost comparison are all worth checking before you commit a workload to one provider.
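The arithmetic behind a single call is simple enough to sketch. The rates below are the example figures quoted above ($5.00 input and $30.00 output per million tokens). The token counts and the function name `call_cost` are hypothetical.

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_rate_per_m: float, output_rate_per_m: float) -> float:
    """Cost in dollars of one call, given per-million-token rates."""
    input_cost = input_tokens * input_rate_per_m
    output_cost = output_tokens * output_rate_per_m
    return (input_cost + output_cost) / 1_000_000


# Hypothetical call: 2,000 input tokens and 500 output tokens.
cost = call_cost(2_000, 500, input_rate_per_m=5.00, output_rate_per_m=30.00)
print(f"${cost:.4f} per call")
```

In this example the 500 output tokens cost more than the 2,000 input tokens, which is the input/output asymmetry in practice.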
Types of tokens you pay for
Modern model APIs price more than just input and output. Across providers, usage can fall into as many as seven distinct token categories, each with its own rate, and overlooking any one of them is a common cause of surprise invoices.
| Token type | What it is | How it is billed |
|---|---|---|
| Standard input | The prompt you send: system message, retrieved context, tools and user query | Per token at the base input rate |
| Standard output | The text the model writes back | Per token, typically 3 to 5 times the input rate and sometimes up to 8 |
| Cached input | A repeated prompt prefix served from a cache instead of recomputed | Sharply discounted on cache hits, often a small fraction of the fresh input rate |
| Reasoning tokens | Hidden step-by-step thinking some models generate before the visible answer | Counted and billed as output tokens, invisible in the response |
| Vision tokens | Images converted into tokens based on resolution and detail level | Billed as input, with a per-image token count set by the model |
| Audio tokens | Speech in or out for realtime and voice APIs | Billed separately from text, usually at a higher per-token rate |
| Embedding tokens | Text converted into vectors by an embedding model for search or retrieval | Billed on a separate, much cheaper embedding endpoint |
A few patterns catch teams out. Cached input only triggers on identical prefixes, so the savings depend on putting your stable system prompt and tools first and the user-specific content last. Reasoning tokens are invisible in the response, so a 200-word answer can carry several thousand billable tokens behind it.
Vision and audio tokens are not interchangeable with text tokens, so a multimodal call can cost more than a longer text-only one. The mix you actually run, not the headline price, is what shapes your bill.
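A short sketch shows how the mix changes a bill. Everything here is hypothetical: the `Usage` fields, the `bill` function and the rates (including the cached-input rate at one tenth of fresh input) are placeholders, so substitute your provider's published figures. It covers only the text categories to keep the example small.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Token counts for one call, split by billing category."""
    fresh_input: int      # input tokens processed at the full input rate
    cached_input: int     # input tokens served from the prompt cache
    visible_output: int   # tokens in the answer the user sees
    reasoning: int        # hidden thinking tokens, billed as output


def bill(usage: Usage, input_rate: float, cached_rate: float, output_rate: float) -> float:
    """Dollar cost of one call, with all rates given per million tokens."""
    billable_output = usage.visible_output + usage.reasoning
    total = (
        usage.fresh_input * input_rate
        + usage.cached_input * cached_rate
        + billable_output * output_rate
    )
    return total / 1_000_000


# A roughly 200-word answer (about 270 tokens) preceded by 4,000 reasoning tokens.
usage = Usage(fresh_input=500, cached_input=3_000, visible_output=270, reasoning=4_000)
cost = bill(usage, input_rate=5.00, cached_rate=0.50, output_rate=30.00)
hidden_share = usage.reasoning / (usage.visible_output + usage.reasoning)
print(f"${cost:.4f} per call, {hidden_share:.0%} of output tokens are hidden reasoning")
```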
Why token counts affect your AI bill
Because providers bill per token, your cost is a direct function of how many tokens move through your application. Three forces push that number up.
First, context. Every message you resend, every retrieved document and every example in the prompt adds input tokens to that specific call. Second, output length. Verbose responses cost more than tight ones. Third, volume. A feature that looks cheap per call becomes a large line item once millions of calls run each day.
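Two of these forces are easy to quantify. The sketch below assumes a chat application that resends the full history on every call and adds a fixed number of tokens per turn, which is a simplification. The function names and all figures are hypothetical.

```python
def cumulative_input_tokens(tokens_per_turn: int, turns: int) -> int:
    """Total input tokens billed over a chat that resends all prior turns."""
    total_billed = 0
    history = 0
    for _ in range(turns):
        history += tokens_per_turn  # the new turn joins the history
        total_billed += history     # the whole history is sent as input
    return total_billed


def projected_spend(cost_per_call: float, calls_per_day: int, days: int = 30) -> float:
    """Spend over a period for a feature with a steady call volume."""
    return cost_per_call * calls_per_day * days


# Ten turns of 200 tokens each: 2,000 tokens of content, far more billed as input.
print(cumulative_input_tokens(200, 10))  # 11000

# A call that costs a fifth of a cent, made two million times a day.
print(f"${projected_spend(0.002, 2_000_000):,.0f} over 30 days")
```

Because the resent history grows with every turn, billed input grows roughly with the square of the conversation length, which is why trimming context pays off quickly.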
This is where tokens connect to FinOps. Token spend behaves like any other variable cloud cost, so it benefits from the same discipline. Tying token usage back to features, teams, or customers turns a vague AI bill into real unit economics, and treating it as a managed cost is the core idea behind FinOps for AI. Token cost is also only one layer. Models that you host yourself add the GPU spend behind AI training and inference, and GPU cost optimization is its own discipline.
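As an illustration of tying usage back to features, the sketch below rolls per-call usage records up into a cost per feature. The record shape (`feature`, `input_tokens`, `output_tokens`), the function name and the rates are assumptions. In practice the tags would come from your own request metadata.

```python
from collections import defaultdict


def cost_by_feature(records, input_rate: float, output_rate: float) -> dict:
    """Sum dollar cost per feature from per-call usage records.

    Rates are per million tokens. Each record is a dict with the keys
    "feature", "input_tokens" and "output_tokens".
    """
    totals = defaultdict(float)
    for record in records:
        cost = (
            record["input_tokens"] * input_rate
            + record["output_tokens"] * output_rate
        ) / 1_000_000
        totals[record["feature"]] += cost
    return dict(totals)


# Hypothetical usage log with two features.
usage_log = [
    {"feature": "search", "input_tokens": 1_000, "output_tokens": 200},
    {"feature": "search", "input_tokens": 1_200, "output_tokens": 150},
    {"feature": "summarize", "input_tokens": 8_000, "output_tokens": 600},
]
for feature, dollars in cost_by_feature(usage_log, 5.00, 30.00).items():
    print(f"{feature}: ${dollars:.4f}")
```

Dividing each total by the number of calls or customers behind it gives the per-unit figure that unit economics depends on.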
How to track and control token costs
Tokens are controllable once you can see them. A few practices do most of the work.
Trim the input: Send only the context a call needs. Shorter system prompts and tighter retrieval cut input tokens on every request.
Cap the output: Set sensible maximum response lengths so a single call cannot run away.
Reuse repeated context: When the same long system prompt or static knowledge appears in many calls, prompt caching reuses the stored prefix and sharply cuts the cost of sending the same context again.
Watch usage continuously: Strong LLM observability turns token spend from a monthly surprise into a live metric you can act on.
Alert on spikes: Pairing usage data with cost anomaly detection catches a runaway prompt loop before it reaches the invoice, and cost forecasting keeps the trend in view.
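The spike-alert practice can be sketched with nothing more than the standard library. This is a deliberately simple z-score check, not how any particular anomaly detection product works. The function name, the threshold of three standard deviations and the daily figures are assumptions.

```python
from statistics import mean, stdev


def is_spike(daily_tokens, today: float, threshold: float = 3.0) -> bool:
    """Flag today's usage if it sits far above the recent average.

    `daily_tokens` is a list of recent daily token totals. Today is flagged
    when it exceeds the mean by more than `threshold` standard deviations.
    """
    if len(daily_tokens) < 2:
        return False  # not enough history to judge
    average = mean(daily_tokens)
    spread = stdev(daily_tokens)
    if spread == 0:
        return today > average  # flat history: any increase stands out
    return (today - average) / spread > threshold


# Hypothetical daily totals, in millions of tokens.
recent_days = [100, 110, 90, 105, 95]
print(is_spike(recent_days, today=104))
print(is_spike(recent_days, today=400))
```

Running a check like this hourly rather than daily shortens the time a runaway loop can burn tokens before someone is alerted.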
Conclusion
A token is the unit AI uses to read, write and bill. It is a fragment of text, usually about four characters, that a model converts into numbers and processes one step at a time. Words help you read and characters help you type, but tokens are the unit that decides both how a model understands language and how much you pay to use it.
For teams running AI in production, that second point is the one that compounds. Input and output tokens carry different prices, context and volume push counts higher, and small per-call costs add up fast. Treating tokens as a managed cost, with the right FinOps tools for AI cost management and a clear FinOps practice behind them, is what keeps an AI roadmap affordable as it scales.
FAQs
What is a token in AI?
A token is the smallest unit of text an AI model processes. It can be a word, part of a word, a character, or punctuation. Models read and generate text token by token and providers bill per token.
How many tokens are in a word?
In English, one token is roughly four characters or about three-quarters of a word. So 1,000 words land near 1,300 to 1,500 tokens, though code and non-Latin scripts run higher.
Is an AI token the same as a crypto token?
No. A crypto token is a tradable digital asset. An AI token is a slice of text a language model reads and generates. They share a name but are unrelated.
Why do output tokens cost more than input tokens?
Output is generated one token at a time in sequence, which is slower than reading input that can be processed in parallel. That extra compute is why output tokens typically cost three to five times the input rate, and sometimes up to eight times.
How do I count tokens before sending a prompt?
Use the tokenizer that matches your specific model, such as the tiktoken library for OpenAI models or your provider's own token-counting tool. It returns the exact token count so you can check the model limit and estimate cost before the call.
Why does my AI bill depend on tokens?
Providers charge per token, so cost rises with longer prompts, longer responses and higher call volume. Tracking token usage by feature or team is how you keep AI spend predictable.