Azure OpenAI Pricing 2026: Models, PTU & Hidden Costs
14 min read
FinOps for AI

Table of Contents
Azure OpenAI pricing is a per-token, consumption-based model that bills separately for input tokens, output tokens, deployment type and a stack of non-token line items most calculators ignore. The problem: token rates match OpenAI's direct API to the cent, but Azure bills land 15-40% higher because of support plans, egress, fine-tune hosting and Log Analytics ingest.
This guide solves three problems teams hit on Azure OpenAI:
What does each model actually cost in 2026?
When is Provisioned Throughput (PTU) cheaper than Pay-As-You-Go?
Why is the invoice always higher than the calculator predicted?
Every dollar figure below is sourced from the Azure OpenAI Service Pricing page and the Microsoft Learn cost management docs. For the broader category, start with our FinOps tools for AI cost management breakdown.
Why Azure OpenAI costs feel unpredictable
Azure OpenAI costs feel unpredictable because the unit of consumption (the token) is not knowable before the request runs. A VM bills by the hour. A storage bucket bills by the gigabyte. Both are countable in advance. Tokens are not.
Five reasons AI cost behaves differently from normal cloud cost:
Input size varies per request. One user message can be 80 or 8,000 tokens depending on system prompt, retrieved context and conversation history.
Output size is non-deterministic. The same prompt can return 200 tokens or 1,200 tokens.
Retries are billable. A failed call still consumes prompt tokens upstream.
Context windows compound. A 50-turn chat on GPT-4.1 can cost 30x the first turn.
Fine-tunes do not scale to zero. A deployed fine-tune bills hourly whether traffic is zero or one million requests.
The fine-tune line is the single most expensive surprise teams hit in their first quarter.
Azure OpenAI pricing by model (2026 rate card)
Azure OpenAI charges per million input tokens and per million output tokens, with output priced 3-4x higher than input on most models. All rates below are Microsoft's published Pay-As-You-Go prices for Global Standard deployments. Regional Standard runs 5-10% higher. Cached input tokens get an automatic 50-90% discount on repeated prefixes.
GPT-5 Model pricing
Model | Input / 1M tokens | Output / 1M tokens | Best fit |
GPT-5 | $5.00 | $20.00 | Complex reasoning, multi-step agents |
GPT-5-mini | $0.50 | $2.00 | Balanced production workloads |
GPT-5-nano | $0.05 | $0.40 | High-volume classification, routing |
GPT-4.1 Model pricing
Model | Input / 1M | Output / 1M | Best fit |
GPT-4.1 | $2.00 | $8.00 | Long context, up to 1M tokens |
GPT-4.1-mini | $0.40 | $1.60 | Default workhorse for chatbots |
GPT-4.1-nano | $0.10 | $0.40 | Tagging, extraction, batch jobs |
GPT-4o and o-series pricing
Model | Input / 1M | Output / 1M |
GPT-4o | $2.50 | $10.00 |
GPT-4o-mini | $0.15 | $0.60 |
o3 | $10.00 | $40.00 |
o4-mini | $1.10 | $4.40 |
Embeddings, image and audio pricing
Service | Rate |
text-embedding-3-small | $0.02 / 1M input tokens |
text-embedding-3-large | $0.13 / 1M input tokens |
Whisper (speech-to-text) | $0.006 / minute |
TTS standard | $15.00 / 1M characters |
TTS HD | $30.00 / 1M characters |
DALL-E 3, 1024x1024 standard | $0.040 / image |
DALL-E 3, 1024x1024 HD | $0.080 / image |
Three rules for reading the rate card:
Output is always more expensive than input. Cap output tokens to compress bills.
Nano and mini tiers are 10x to 50x cheaper than flagship models. Route simple tasks down.
Embeddings are roughly 100x cheaper than chat models. RAG retrieval is rarely the cost driver.
Azure OpenAI deployment types: which one is cheapest for your workload
Azure OpenAI offers five deployment types and choosing the wrong one is the single biggest reason teams overpay. The model is identical across types. The bill is not.
Deployment | Pricing model | Discount vs Standard | When to pick it |
Standard (regional PAYG) | Per-token, single region | Baseline | Strict single-region residency |
Global Standard | Per-token, routed globally | Same or lower than Standard | No residency requirement |
Data Zone Standard | Per-token, EU or US zone | ~5% over Global | Mid-tier residency (EU/US block) |
Provisioned (PTU) | Per PTU-hour, reserved | Up to 70% on steady workloads | Predictable production traffic |
Batch API | Per-token, 24-hr SLA | 50% off | Async work, evals, backfills |
When is PTU cheaper than Pay-As-You-Go?
PTU is cheaper than Pay-As-You-Go when sustained utilization exceeds 50% and monthly token volume is above 150-200M tokens on GPT-4o. Below that line, PAYG wins on flexibility and total cost.
PTU pricing starts at roughly $2,448 per month per unit. Annual commitments save another ~35% versus monthly. Decision rule:
Below 150M tokens/month: stay on PAYG.
150M to 500M tokens/month: model both. PTU usually wins if utilization is steady.
Above 500M tokens/month with steady traffic: PTU is almost always cheaper and gives predictable tail latency.
What practitioners report: "Do not commit to PTU based on calculator estimates. Run Pay-As-You-Go for 30-60 days, measure P95 hourly token throughput, then model PTU against the real numbers. Teams that skip this end up with reserved capacity at 12% utilization and a contract they cannot exit."
For the modeling layer behind a PTU decision, see cloud cost forecasting tools.
Cached input and Batch API: the two free wins
Cached input tokens fire automatically when a prompt prefix repeats above a minimum length. Discount: 50-90% on the cached portion. Best candidates: system prompts, persona definitions, stable RAG framings.
Batch API runs jobs within 24 hours for 50% off. Best fit: evals, nightly summarization, classification on a queue, embedding refreshes.
The seven cost drivers behind every Azure OpenAI bill
Seven line items account for 90% of an Azure OpenAI invoice. Knowing which one is dominant tells you which lever to pull first.
1. Tokens (input and output)
Output tokens cost 3-4x input tokens on most models. Concise prompts that bias the model toward shorter answers save real money.
2. Model choice
The gap between GPT-5-nano and GPT-5 is 100x on input and 50x on output. Routing simple tasks to nano-class models is the highest-leverage decision in the stack.
3. Embeddings and RAG
Cheap per call but volume adds up. Re-embedding the same corpus on every deploy is a common waste pattern. Embed once, store, refresh only on content change.
4. Networking and egress
Cross-region calls bill at $0.087/GB after a 100GB free allowance. Cross-cloud egress (Azure inference into AWS or GCP) is where this explodes.
5. Retries and timeouts
A 500 error still consumed your prompt tokens. Aggressive retry policies can double the input bill without delivering a single extra response.
6. Observability overhead
Log Analytics ingestion is roughly $2.30/GB. Logging every prompt and completion at full fidelity can match the inference bill on chatty workloads.
7. Fine-tuned model hosting (the zombie line item)
Fine-tuned deployments bill hourly as long as they exist, regardless of traffic. There is no scale-to-zero. There is no idle discount. This is where teams lose four-figure and five-figure amounts without noticing.
Hourly rate: $1.70-$3.00/hour depending on base model
Daily cost of one idle deployment: $50-$70
Six months of we will get back to it: $9,000-$12,000 burned on zero requests
Fix: run a fortnightly audit. List every deployment, count requests served in the last 14 days, decommission anything at zero unless an owner can justify the hold. See Azure cost optimization tools for the broader Azure playbook.
Why is Azure OpenAI 15-40% more expensive than the direct OpenAI API?
Azure OpenAI is 15-40% more expensive than direct OpenAI because Azure layers six auxiliary costs on top of identical token rates: support, egress, fine-tune hosting, Private Link, Log Analytics and file storage. Token prices themselves match to the cent.
Hidden line item | Typical range | What triggers it |
Support plan | $100-$1,000+/month | Production SLA tier |
Data egress | $0.087/GB after 100GB free | Cross-region or cross-cloud calls |
Log Analytics ingest | ~$2.30/GB | Default verbose logging |
File search storage | $0.10/GB/day (1GB free) | Assistants API + RAG |
Private Link | Per endpoint-hour | Network isolation requirements |
Fine-tuned model hosting | $1.70-$3.00/hour | One deployed fine-tune |
What practitioners report: A mid-size team modeling 50M GPT-4o tokens/month expected $312.50. The first invoice landed at $485.30, a 55% overshoot once support, egress, storage and Log Analytics were folded in. An enterprise modeling $2,200 closed the month at $7,452 because three idle fine-tunes had been sitting deployed for a quarter.
If you stay on Azure anyway, the reasons are compliance, data residency, or enterprise procurement, not price. Map the trade-off across your stack with multi-cloud cost reporting tools. For regulated workloads, use the patterns in our fintech cloud cost optimization tools guide.
How to calculate cost per outcome on Azure OpenAI
Cost per outcome is the right unit for an executive conversation, not cost per token. Pick the outcome that matters (per ticket resolved, per document classified, per code review delivered) and measure cost against it.
Worked example: a support copilot on GPT-4o
Inputs for one ticket:
1,200-token system prompt
600 tokens of retrieved context
200-token user message
350-token model answer
Token math on GPT-4o:
Input: 2,000 tokens x $2.50/1M = $0.005
Output: 350 tokens x $10.00/1M = $0.0035
Cost per ticket: ~$0.0085
Scaled to 40,000 tickets/month:
Token cost: $340
Support plan: $300
Log Analytics: $60
Egress for embedding refresh: $40
Real monthly cost: ~$740 (2.2x the token math)
The token math is correct. The bill is still more than double. Surface this in dashboards rather than discovering it on the invoice with cloud cost visibility software.
What to measure before you start optimizing
Optimization without measurement is guessing. The minimum instrumentation set:
Tokens per request, split by input and output. Histograms, not averages.
Cost per feature, not per model. A model is shared; a feature is owned.
Cache hit rate on cached input. Below 60% on a stable system prompt means the prompt template is broken.
Retry rate per endpoint. Above 5% is silent overspend.
Fine-tune deployment inventory with last-request timestamps. Weekly.
P95 hourly token throughput. Required input for any PTU decision.
Cost per outcome. Per ticket, per doc, per call. The number you take to finance.
For the dashboards that hold this together, see cloud cost intelligence tools.
Where Azure cost visibility usually breaks
Three patterns repeat in every Azure OpenAI cost review:
Tagging stops at the resource group. Deployments inside resource groups often inherit no feature-level tag, so cost attribution stops at the team boundary.
Fine-tune costs hide in model hosting. They land in a different cost view than per-token charges. Many teams never see them in their primary dashboard.
Log Analytics ingest is billed as monitoring, not as AI cost. Verbose logging defaults ship to production and stay there.
Fix: a flat cost view that joins token spend, fine-tune hosting and observability under the same product/feature tag. The category to evaluate here is cloud cost allocation software.
Eight ways to reduce Azure OpenAI spend without slowing the product
Ranked by leverage, not effort.
1. Route by task complexity, not by team preference
Default everything to GPT-4.1-mini or GPT-5-mini. Promote to flagship only when an eval shows quality degradation. Most chat traffic does not need a flagship model.
2. Enable cached input on every stable system prompt
Anywhere a prefix repeats, the cache fires automatically with 50-90% off. Restructure templates so stable content sits at the front of the prompt.
3. Move async work to the Batch API
50% off, 24-hour SLA. Backfills, evals, classification queues, embedding refreshes. Integration cost is one endpoint and a job poller.
4. Cap output tokens per request
Most chat answers do not need a 4,000-token budget. Cap at 600-800. Tighter answers, smaller bill.
5. Run a fortnightly fine-tune zombie audit
List deployments, join request counts from the last 14 days, decommission zero-request models without an owner override. Returns 5-6 figures a year for teams that have experimented with fine-tuning.
6. Compress system prompts
A 2,000-token system prompt sent on every request is a $0.005 tax on every call. Cut to 800 tokens and save 60% of prompt cost.
7. Bound retries with exponential backoff
Default SDK policies retry 3-5 times. Cap at 2 with backoff. A flapping endpoint can otherwise 3x the input bill.
8. Throttle Log Analytics ingestion
Sample prompts and completions at 5-10% in production. Full-fidelity logging is a dev-time tool, not a production posture.
For deeper tactic coverage across your full cloud footprint, see our cloud cost optimization guide.
How to forecast Azure OpenAI spend accurately
Accurate Azure OpenAI forecasting needs three inputs:
Token forecast, not request forecast. One request can vary 10x in tokens.
Model mix forecast. A 20% traffic shift from GPT-4.1-mini to GPT-4o can double the bill.
Non-token overhead as a fixed 20-40% on top of token spend, sized to your support tier and logging posture.
Reality-check monthly:
Variance over 15%: traffic is mis-modeled or model routing changed.
Variance over 30%: look for fine-tune zombies or runaway retries.
For the toolset that holds the forecasting loop together, see cloud cost budgeting software.
Governance that does not kill momentum
Governance works when it ships dashboards, not memos. Three artifacts are non-negotiable:
A token-per-feature view by service, refreshed daily.
A fine-tune deployment inventory with owners and last-request timestamps, refreshed weekly.
Per-team budgets with anomaly alerts wired into the same Slack or Teams channel where engineers already work.
Three habits keep governance alive:
Engineers see their feature cost on the same dashboard as error rate and latency. Cost becomes a normal engineering signal.
Anomaly alerts are owned by the team that triggered them, not by a central FinOps function.
Quarterly reviews look at cost per outcome, not cost per token.
For the platform layer, evaluate cloud financial management software. For alert wiring specifically, see cloud cost anomaly detection tools.
A five-day checklist to cut your Azure OpenAI bill
Most teams reclaim 15-30% of their Azure OpenAI bill from this audit alone.
Decommission every fine-tuned deployment with zero requests in the last 14 days.
Pull cache hit rate on your top three system prompts. Restructure if below 60%.
Move async workloads from Standard to the Batch API.
Cap output tokens at 600-800 on chat endpoints. Re-run evals.
Cap retries at 2 with exponential backoff.
Sample Log Analytics at 5-10% in production.
Tag every Azure OpenAI deployment with a feature-level tag.
Calculate cost per outcome on your top three AI features. Put it on a dashboard.
Pull 30 days of P95 hourly token throughput. Decide PTU vs PAYG against real data.
Forecast next month with 20-40% non-token overhead included.
FAQ: Azure OpenAI pricing
What is Azure OpenAI pricing in 2026?
Azure OpenAI pricing in 2026 uses two models: Pay-As-You-Go bills per million input and output tokens (GPT-4o is $2.50 input and $10.00 output per 1M tokens) and Provisioned Throughput Units (PTU) bill per reserved capacity hour starting around $2,448/month per unit with up to 70% savings on sustained workloads.
What is the cheapest Azure OpenAI model?
GPT-5-nano is the cheapest Azure OpenAI model at $0.05 input and $0.40 output per 1M tokens, followed by GPT-4.1-nano at $0.10 and $0.40. Both are designed for high-volume, simple tasks like classification, tagging and routing.
Is Azure OpenAI free to try?
Azure OpenAI has no permanent free tier. Some Azure subscriptions include trial credit that covers initial calls, but ongoing usage is billed per token from the first production request.
Is Azure OpenAI cheaper than the direct OpenAI API?
No. Token prices on Azure OpenAI are identical to OpenAI's direct API. Azure adds 15-40% on top through support plans, data egress, fine-tune hosting, Private Link and Log Analytics. Teams choose Azure for compliance and procurement, not price.
How does cached input pricing work on Azure OpenAI?
Cached input applies a 50-90% discount automatically when a prompt prefix repeats across requests above a minimum length. No code changes are required; the cache fires on prefix match.
When should I switch from Pay-As-You-Go to PTU?
Switch from Pay-As-You-Go to PTU when monthly GPT-4o token volume exceeds 150-200M with 50% or better sustained utilization. Validate against 30-60 days of real PAYG telemetry before committing.
What are PTUs in Azure OpenAI?
Provisioned Throughput Units (PTUs) are reserved model capacity on Azure OpenAI. You pay per PTU-hour regardless of tokens processed. PTU delivers predictable latency and up to 70% savings on sustained workloads. Annual commitments save another ~35% versus monthly.
How is Azure OpenAI billing calculated?
Azure OpenAI billing combines per-token charges, per-hour fine-tune hosting, per-image DALL-E charges, per-minute Whisper charges, per-character TTS charges and auxiliary line items for support, egress, storage and observability.
What are the hidden costs of Azure OpenAI?
The six hidden costs that inflate Azure OpenAI bills 15-40%: support plans ($100-$1,000+/month), data egress ($0.087/GB after 100GB free), fine-tuned model hosting ($1.70-$3.00/hour), file search storage ($0.10/GB/day), Log Analytics ingest (~$2.30/GB) and Private Link endpoint hours.
What is the Azure OpenAI Batch API?
The Azure OpenAI Batch API is an asynchronous endpoint that runs jobs within 24 hours at 50% off the Global Standard rate. It fits evals, backfills, nightly summarization, classification queues and embedding refreshes.
How do I track Azure OpenAI costs across teams?
Track Azure OpenAI costs across teams by tagging every deployment with a feature-level identifier and joining token spend, fine-tune hosting and Log Analytics under the same tag for a single cost-per-feature view.
How do I forecast Azure OpenAI spend?
Forecast Azure OpenAI spend by projecting tokens per feature, weighting by model mix, then adding 20-40% for non-token overhead. Reality-check monthly; variance above 30% usually means a fine-tune zombie or a retry storm.
Final word
Azure OpenAI pricing is not complicated; it is layered. Token rates are the headline. Deployment type, cached input, Batch routing, fine-tune hygiene and the six non-token line items are where the real bill is decided.
Teams that run Azure OpenAI cheaply did three things on day one: instrumented cost per outcome, audited fine-tunes every two weeks and capped retries before production launch. Start with our cloud cost management guide to put the same discipline around the rest of your Azure footprint.
FinOps OS powered by context-aware AI agents.
Start with a 30-day no-cost trial.
Read-only.
No credit card.
No commitment.
Want to assess how your FinOps journey can scale?
Benchmark maturity, close governance gaps, and drive ROI in under 20 minutes

Recommended Articles

GPU for AI Training: Pick the Right One Without Overspending
Read More

OpenAI API Pricing Explained: How to Estimate and Control Your Token Costs
Read More

Why Your AI Workloads Are Bleeding Money (And How to Finally Stop It)
Read More

Model Context Protocol: The Open Standard for AI-Driven FinOps
Read More

FinOps for AI: Best Practices, KPIs and Metrics
Read More






