Azure OpenAI Pricing 2026: Models, PTU & Hidden Costs

14 min read

Amnic

Amnic

FinOps for AI

FinOps for AI: Understanding the True Cost of Azure OpenAI

Table of Contents

No headings found on page

Azure OpenAI pricing is a per-token, consumption-based model that bills separately for input tokens, output tokens, deployment type and a stack of non-token line items most calculators ignore. The problem: token rates match OpenAI's direct API to the cent, but Azure bills land 15-40% higher because of support plans, egress, fine-tune hosting and Log Analytics ingest.

This guide solves three problems teams hit on Azure OpenAI:

  • What does each model actually cost in 2026?

  • When is Provisioned Throughput (PTU) cheaper than Pay-As-You-Go?

  • Why is the invoice always higher than the calculator predicted?

Every dollar figure below is sourced from the Azure OpenAI Service Pricing page and the Microsoft Learn cost management docs. For the broader category, start with our FinOps tools for AI cost management breakdown.

Why Azure OpenAI costs feel unpredictable

Azure OpenAI costs feel unpredictable because the unit of consumption (the token) is not knowable before the request runs. A VM bills by the hour. A storage bucket bills by the gigabyte. Both are countable in advance. Tokens are not.

Five reasons AI cost behaves differently from normal cloud cost:

  • Input size varies per request. One user message can be 80 or 8,000 tokens depending on system prompt, retrieved context and conversation history.

  • Output size is non-deterministic. The same prompt can return 200 tokens or 1,200 tokens.

  • Retries are billable. A failed call still consumes prompt tokens upstream.

  • Context windows compound. A 50-turn chat on GPT-4.1 can cost 30x the first turn.

  • Fine-tunes do not scale to zero. A deployed fine-tune bills hourly whether traffic is zero or one million requests.

The fine-tune line is the single most expensive surprise teams hit in their first quarter.

Azure OpenAI pricing by model (2026 rate card)

Azure OpenAI charges per million input tokens and per million output tokens, with output priced 3-4x higher than input on most models. All rates below are Microsoft's published Pay-As-You-Go prices for Global Standard deployments. Regional Standard runs 5-10% higher. Cached input tokens get an automatic 50-90% discount on repeated prefixes.

GPT-5 Model pricing

Model

Input / 1M tokens

Output / 1M tokens

Best fit

GPT-5

$5.00

$20.00

Complex reasoning, multi-step agents

GPT-5-mini

$0.50

$2.00

Balanced production workloads

GPT-5-nano

$0.05

$0.40

High-volume classification, routing

GPT-4.1 Model pricing

Model

Input / 1M

Output / 1M

Best fit

GPT-4.1

$2.00

$8.00

Long context, up to 1M tokens

GPT-4.1-mini

$0.40

$1.60

Default workhorse for chatbots

GPT-4.1-nano

$0.10

$0.40

Tagging, extraction, batch jobs

GPT-4o and o-series pricing

Model

Input / 1M

Output / 1M

GPT-4o

$2.50

$10.00

GPT-4o-mini

$0.15

$0.60

o3

$10.00

$40.00

o4-mini

$1.10

$4.40

Embeddings, image and audio pricing

Service

Rate

text-embedding-3-small

$0.02 / 1M input tokens

text-embedding-3-large

$0.13 / 1M input tokens

Whisper (speech-to-text)

$0.006 / minute

TTS standard

$15.00 / 1M characters

TTS HD

$30.00 / 1M characters

DALL-E 3, 1024x1024 standard

$0.040 / image

DALL-E 3, 1024x1024 HD

$0.080 / image

Three rules for reading the rate card:

  • Output is always more expensive than input. Cap output tokens to compress bills.

  • Nano and mini tiers are 10x to 50x cheaper than flagship models. Route simple tasks down.

  • Embeddings are roughly 100x cheaper than chat models. RAG retrieval is rarely the cost driver.

Azure OpenAI deployment types: which one is cheapest for your workload

Azure OpenAI offers five deployment types and choosing the wrong one is the single biggest reason teams overpay. The model is identical across types. The bill is not.

Deployment

Pricing model

Discount vs Standard

When to pick it

Standard (regional PAYG)

Per-token, single region

Baseline

Strict single-region residency

Global Standard

Per-token, routed globally

Same or lower than Standard

No residency requirement

Data Zone Standard

Per-token, EU or US zone

~5% over Global

Mid-tier residency (EU/US block)

Provisioned (PTU)

Per PTU-hour, reserved

Up to 70% on steady workloads

Predictable production traffic

Batch API

Per-token, 24-hr SLA

50% off

Async work, evals, backfills

When is PTU cheaper than Pay-As-You-Go?

PTU is cheaper than Pay-As-You-Go when sustained utilization exceeds 50% and monthly token volume is above 150-200M tokens on GPT-4o. Below that line, PAYG wins on flexibility and total cost.

PTU pricing starts at roughly $2,448 per month per unit. Annual commitments save another ~35% versus monthly. Decision rule:

  • Below 150M tokens/month: stay on PAYG.

  • 150M to 500M tokens/month: model both. PTU usually wins if utilization is steady.

  • Above 500M tokens/month with steady traffic: PTU is almost always cheaper and gives predictable tail latency.

What practitioners report: "Do not commit to PTU based on calculator estimates. Run Pay-As-You-Go for 30-60 days, measure P95 hourly token throughput, then model PTU against the real numbers. Teams that skip this end up with reserved capacity at 12% utilization and a contract they cannot exit."

For the modeling layer behind a PTU decision, see cloud cost forecasting tools.

Cached input and Batch API: the two free wins

  • Cached input tokens fire automatically when a prompt prefix repeats above a minimum length. Discount: 50-90% on the cached portion. Best candidates: system prompts, persona definitions, stable RAG framings.

  • Batch API runs jobs within 24 hours for 50% off. Best fit: evals, nightly summarization, classification on a queue, embedding refreshes.

The seven cost drivers behind every Azure OpenAI bill

Seven line items account for 90% of an Azure OpenAI invoice. Knowing which one is dominant tells you which lever to pull first.

1. Tokens (input and output)

Output tokens cost 3-4x input tokens on most models. Concise prompts that bias the model toward shorter answers save real money.

2. Model choice

The gap between GPT-5-nano and GPT-5 is 100x on input and 50x on output. Routing simple tasks to nano-class models is the highest-leverage decision in the stack.

3. Embeddings and RAG

Cheap per call but volume adds up. Re-embedding the same corpus on every deploy is a common waste pattern. Embed once, store, refresh only on content change.

4. Networking and egress

Cross-region calls bill at $0.087/GB after a 100GB free allowance. Cross-cloud egress (Azure inference into AWS or GCP) is where this explodes.

5. Retries and timeouts

A 500 error still consumed your prompt tokens. Aggressive retry policies can double the input bill without delivering a single extra response.

6. Observability overhead

Log Analytics ingestion is roughly $2.30/GB. Logging every prompt and completion at full fidelity can match the inference bill on chatty workloads.

7. Fine-tuned model hosting (the zombie line item)

Fine-tuned deployments bill hourly as long as they exist, regardless of traffic. There is no scale-to-zero. There is no idle discount. This is where teams lose four-figure and five-figure amounts without noticing.

  • Hourly rate: $1.70-$3.00/hour depending on base model

  • Daily cost of one idle deployment: $50-$70

  • Six months of we will get back to it: $9,000-$12,000 burned on zero requests

Fix: run a fortnightly audit. List every deployment, count requests served in the last 14 days, decommission anything at zero unless an owner can justify the hold. See Azure cost optimization tools for the broader Azure playbook.

Why is Azure OpenAI 15-40% more expensive than the direct OpenAI API?

Azure OpenAI is 15-40% more expensive than direct OpenAI because Azure layers six auxiliary costs on top of identical token rates: support, egress, fine-tune hosting, Private Link, Log Analytics and file storage. Token prices themselves match to the cent.

Hidden line item

Typical range

What triggers it

Support plan

$100-$1,000+/month

Production SLA tier

Data egress

$0.087/GB after 100GB free

Cross-region or cross-cloud calls

Log Analytics ingest

~$2.30/GB

Default verbose logging

File search storage

$0.10/GB/day (1GB free)

Assistants API + RAG

Private Link

Per endpoint-hour

Network isolation requirements

Fine-tuned model hosting

$1.70-$3.00/hour

One deployed fine-tune

What practitioners report: A mid-size team modeling 50M GPT-4o tokens/month expected $312.50. The first invoice landed at $485.30, a 55% overshoot once support, egress, storage and Log Analytics were folded in. An enterprise modeling $2,200 closed the month at $7,452 because three idle fine-tunes had been sitting deployed for a quarter.

If you stay on Azure anyway, the reasons are compliance, data residency, or enterprise procurement, not price. Map the trade-off across your stack with multi-cloud cost reporting tools. For regulated workloads, use the patterns in our fintech cloud cost optimization tools guide.

How to calculate cost per outcome on Azure OpenAI

Cost per outcome is the right unit for an executive conversation, not cost per token. Pick the outcome that matters (per ticket resolved, per document classified, per code review delivered) and measure cost against it.

Worked example: a support copilot on GPT-4o

Inputs for one ticket:

  • 1,200-token system prompt

  • 600 tokens of retrieved context

  • 200-token user message

  • 350-token model answer

Token math on GPT-4o:

  • Input: 2,000 tokens x $2.50/1M = $0.005

  • Output: 350 tokens x $10.00/1M = $0.0035

  • Cost per ticket: ~$0.0085

Scaled to 40,000 tickets/month:

  • Token cost: $340

  • Support plan: $300

  • Log Analytics: $60

  • Egress for embedding refresh: $40

  • Real monthly cost: ~$740 (2.2x the token math)

The token math is correct. The bill is still more than double. Surface this in dashboards rather than discovering it on the invoice with cloud cost visibility software.

What to measure before you start optimizing

Optimization without measurement is guessing. The minimum instrumentation set:

  • Tokens per request, split by input and output. Histograms, not averages.

  • Cost per feature, not per model. A model is shared; a feature is owned.

  • Cache hit rate on cached input. Below 60% on a stable system prompt means the prompt template is broken.

  • Retry rate per endpoint. Above 5% is silent overspend.

  • Fine-tune deployment inventory with last-request timestamps. Weekly.

  • P95 hourly token throughput. Required input for any PTU decision.

  • Cost per outcome. Per ticket, per doc, per call. The number you take to finance.

For the dashboards that hold this together, see cloud cost intelligence tools.

Where Azure cost visibility usually breaks

Three patterns repeat in every Azure OpenAI cost review:

  1. Tagging stops at the resource group. Deployments inside resource groups often inherit no feature-level tag, so cost attribution stops at the team boundary.

  2. Fine-tune costs hide in model hosting. They land in a different cost view than per-token charges. Many teams never see them in their primary dashboard.

  3. Log Analytics ingest is billed as monitoring, not as AI cost. Verbose logging defaults ship to production and stay there.

Fix: a flat cost view that joins token spend, fine-tune hosting and observability under the same product/feature tag. The category to evaluate here is cloud cost allocation software.

Eight ways to reduce Azure OpenAI spend without slowing the product

Ranked by leverage, not effort.

1. Route by task complexity, not by team preference

Default everything to GPT-4.1-mini or GPT-5-mini. Promote to flagship only when an eval shows quality degradation. Most chat traffic does not need a flagship model.

2. Enable cached input on every stable system prompt

Anywhere a prefix repeats, the cache fires automatically with 50-90% off. Restructure templates so stable content sits at the front of the prompt.

3. Move async work to the Batch API

50% off, 24-hour SLA. Backfills, evals, classification queues, embedding refreshes. Integration cost is one endpoint and a job poller.

4. Cap output tokens per request

Most chat answers do not need a 4,000-token budget. Cap at 600-800. Tighter answers, smaller bill.

5. Run a fortnightly fine-tune zombie audit

List deployments, join request counts from the last 14 days, decommission zero-request models without an owner override. Returns 5-6 figures a year for teams that have experimented with fine-tuning.

6. Compress system prompts

A 2,000-token system prompt sent on every request is a $0.005 tax on every call. Cut to 800 tokens and save 60% of prompt cost.

7. Bound retries with exponential backoff

Default SDK policies retry 3-5 times. Cap at 2 with backoff. A flapping endpoint can otherwise 3x the input bill.

8. Throttle Log Analytics ingestion

Sample prompts and completions at 5-10% in production. Full-fidelity logging is a dev-time tool, not a production posture.

For deeper tactic coverage across your full cloud footprint, see our cloud cost optimization guide.

How to forecast Azure OpenAI spend accurately

Accurate Azure OpenAI forecasting needs three inputs:

  • Token forecast, not request forecast. One request can vary 10x in tokens.

  • Model mix forecast. A 20% traffic shift from GPT-4.1-mini to GPT-4o can double the bill.

  • Non-token overhead as a fixed 20-40% on top of token spend, sized to your support tier and logging posture.

Reality-check monthly:

  • Variance over 15%: traffic is mis-modeled or model routing changed.

  • Variance over 30%: look for fine-tune zombies or runaway retries.

For the toolset that holds the forecasting loop together, see cloud cost budgeting software.

Governance that does not kill momentum

Governance works when it ships dashboards, not memos. Three artifacts are non-negotiable:

  • A token-per-feature view by service, refreshed daily.

  • A fine-tune deployment inventory with owners and last-request timestamps, refreshed weekly.

  • Per-team budgets with anomaly alerts wired into the same Slack or Teams channel where engineers already work.

Three habits keep governance alive:

  • Engineers see their feature cost on the same dashboard as error rate and latency. Cost becomes a normal engineering signal.

  • Anomaly alerts are owned by the team that triggered them, not by a central FinOps function.

  • Quarterly reviews look at cost per outcome, not cost per token.

For the platform layer, evaluate cloud financial management software. For alert wiring specifically, see cloud cost anomaly detection tools.

A five-day checklist to cut your Azure OpenAI bill

Most teams reclaim 15-30% of their Azure OpenAI bill from this audit alone.

  • Decommission every fine-tuned deployment with zero requests in the last 14 days.

  • Pull cache hit rate on your top three system prompts. Restructure if below 60%.

  • Move async workloads from Standard to the Batch API.

  • Cap output tokens at 600-800 on chat endpoints. Re-run evals.

  • Cap retries at 2 with exponential backoff.

  • Sample Log Analytics at 5-10% in production.

  • Tag every Azure OpenAI deployment with a feature-level tag.

  • Calculate cost per outcome on your top three AI features. Put it on a dashboard.

  • Pull 30 days of P95 hourly token throughput. Decide PTU vs PAYG against real data.

  • Forecast next month with 20-40% non-token overhead included.

FAQ: Azure OpenAI pricing

What is Azure OpenAI pricing in 2026?

Azure OpenAI pricing in 2026 uses two models: Pay-As-You-Go bills per million input and output tokens (GPT-4o is $2.50 input and $10.00 output per 1M tokens) and Provisioned Throughput Units (PTU) bill per reserved capacity hour starting around $2,448/month per unit with up to 70% savings on sustained workloads.

What is the cheapest Azure OpenAI model?

GPT-5-nano is the cheapest Azure OpenAI model at $0.05 input and $0.40 output per 1M tokens, followed by GPT-4.1-nano at $0.10 and $0.40. Both are designed for high-volume, simple tasks like classification, tagging and routing.

Is Azure OpenAI free to try?

Azure OpenAI has no permanent free tier. Some Azure subscriptions include trial credit that covers initial calls, but ongoing usage is billed per token from the first production request.

Is Azure OpenAI cheaper than the direct OpenAI API?

No. Token prices on Azure OpenAI are identical to OpenAI's direct API. Azure adds 15-40% on top through support plans, data egress, fine-tune hosting, Private Link and Log Analytics. Teams choose Azure for compliance and procurement, not price.

How does cached input pricing work on Azure OpenAI?

Cached input applies a 50-90% discount automatically when a prompt prefix repeats across requests above a minimum length. No code changes are required; the cache fires on prefix match.

When should I switch from Pay-As-You-Go to PTU?

Switch from Pay-As-You-Go to PTU when monthly GPT-4o token volume exceeds 150-200M with 50% or better sustained utilization. Validate against 30-60 days of real PAYG telemetry before committing.

What are PTUs in Azure OpenAI?

Provisioned Throughput Units (PTUs) are reserved model capacity on Azure OpenAI. You pay per PTU-hour regardless of tokens processed. PTU delivers predictable latency and up to 70% savings on sustained workloads. Annual commitments save another ~35% versus monthly.

How is Azure OpenAI billing calculated?

Azure OpenAI billing combines per-token charges, per-hour fine-tune hosting, per-image DALL-E charges, per-minute Whisper charges, per-character TTS charges and auxiliary line items for support, egress, storage and observability.

What are the hidden costs of Azure OpenAI?

The six hidden costs that inflate Azure OpenAI bills 15-40%: support plans ($100-$1,000+/month), data egress ($0.087/GB after 100GB free), fine-tuned model hosting ($1.70-$3.00/hour), file search storage ($0.10/GB/day), Log Analytics ingest (~$2.30/GB) and Private Link endpoint hours.

What is the Azure OpenAI Batch API?

The Azure OpenAI Batch API is an asynchronous endpoint that runs jobs within 24 hours at 50% off the Global Standard rate. It fits evals, backfills, nightly summarization, classification queues and embedding refreshes.

How do I track Azure OpenAI costs across teams?

Track Azure OpenAI costs across teams by tagging every deployment with a feature-level identifier and joining token spend, fine-tune hosting and Log Analytics under the same tag for a single cost-per-feature view.

How do I forecast Azure OpenAI spend?

Forecast Azure OpenAI spend by projecting tokens per feature, weighting by model mix, then adding 20-40% for non-token overhead. Reality-check monthly; variance above 30% usually means a fine-tune zombie or a retry storm.

Final word

Azure OpenAI pricing is not complicated; it is layered. Token rates are the headline. Deployment type, cached input, Batch routing, fine-tune hygiene and the six non-token line items are where the real bill is decided.

Teams that run Azure OpenAI cheaply did three things on day one: instrumented cost per outcome, audited fine-tunes every two weeks and capped retries before production launch. Start with our cloud cost management guide to put the same discipline around the rest of your Azure footprint.

FinOps OS powered by context-aware AI agents.

Start with a 30-day no-cost trial.

Read-only.

No credit card.

No commitment.

Want to assess how your FinOps journey can scale?

Benchmark maturity, close governance gaps, and drive ROI in under 20 minutes

Can your engineering context keep up with the speed of AI?

Start with a 14-day Runtime Accountability Audit. Read-only access. No commitment.

No credit card · No migration · No agents

STAY AHEAD

Can your engineering context keep up with the speed of AI?

Start with a 14-day Runtime Accountability Audit. Read-only access. No commitment.

No credit card · No migration · No agents

STAY AHEAD

Can your engineering context keep up with the speed of AI?

Start with a 14-day Runtime Accountability Audit. Read-only access. No commitment.

No credit card · No migration · No agents

STAY AHEAD