What Is a Batch API? How Asynchronous Processing Cuts AI Spend in Half
A batch API lets you group many requests into a single file and process them asynchronously in the background, instead of calling the model one request at a time. You submit the file, the provider queues the work, and you collect a finished output file later. You trade latency for price: batch jobs return results within a stated processing window, typically up to 24 hours, rather than instantly.
That trade is the point. Most providers discount batch work by 50% against their synchronous rates, so any job that does not need a live answer becomes half as expensive. Before you size that saving, it helps to understand how tokenization works, since batch billing still meters input and output tokens the same way real-time calls do. The discount applies to the rate, not the token count.
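To make that last point concrete, here is a minimal sketch of the arithmetic. The per-million-token prices are placeholder values chosen for illustration, not any provider's actual rates, and `job_cost` is a hypothetical helper rather than part of any SDK.

```python
def job_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m,
             batch_discount=0.0):
    """Return the cost of a job in dollars.

    Prices are per million tokens. batch_discount is the fraction taken off
    the rate (0.5 for a 50% discount); the token counts are never changed.
    """
    rate_multiplier = 1.0 - batch_discount
    input_cost = input_tokens / 1_000_000 * input_price_per_m * rate_multiplier
    output_cost = output_tokens / 1_000_000 * output_price_per_m * rate_multiplier
    return input_cost + output_cost


# Placeholder prices for illustration only (assumed, not real rates).
INPUT_PRICE = 2.00   # dollars per million input tokens
OUTPUT_PRICE = 8.00  # dollars per million output tokens

# Same token counts in both calls; only the rate differs.
sync_cost = job_cost(40_000_000, 5_000_000, INPUT_PRICE, OUTPUT_PRICE)
batch_cost = job_cost(40_000_000, 5_000_000, INPUT_PRICE, OUTPUT_PRICE,
                      batch_discount=0.5)

print(f"synchronous: ${sync_cost:.2f}")
print(f"batch:       ${batch_cost:.2f}")
```

The same 45 million tokens are metered in both cases, so trimming prompts and outputs still pays off under batch pricing.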
This pattern fits offline work where no one is watching the screen. If a pipeline is waiting, batch is almost always the right call. If a person is waiting, it is the wrong one.
What Is a Batch API?
A batch API is an asynchronous endpoint that accepts a bundle of requests, processes them in the background, and returns all results together once the job finishes. You do not hold an open connection for each reply; at most, you check the status of the job as a whole. You upload, wait, and download.
The defining features are consistent across providers:
Cost savings: Bulk processing is heavily discounted, usually to 50% of standard synchronous rates.
Asynchronous execution: Jobs are queued and run in the background. You fetch a completed file later instead of waiting on a live response.
Separate rate limits: Batch work draws from its own quota, so it never throttles the real-time endpoints your product depends on.
That separate quota matters more than the headline discount for teams running live features. You can push a million-row classification job through batch without slowing the endpoint your app serves. The two workloads stop competing for the same capacity.
How a Batch API Works
The workflow is the same four steps almost everywhere:
Prepare the request file: Build a JSONL file where each line is one request with a custom ID you assign. The custom ID is how you map each response back to its input after the job returns.
Submit the job: Upload the file and start the batch. The provider validates it and queues the work.
Monitor status: Poll or use webhooks as the job moves through states like validating, in progress, finalizing, and completed or failed.
Retrieve results: Download the finished output file once the job ends and parse the responses by custom ID.
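The first step can be sketched as follows. This is a minimal illustration, not any provider's schema: the `body` layout and the `build_jsonl` helper are assumptions, and each provider documents its own request fields, so adapt the shape before submitting anything.

```python
import json


def build_jsonl(rows):
    """Serialize (custom_id, prompt) pairs into JSONL, one request per line.

    The "body" layout is a placeholder; replace it with the request fields
    the provider's documentation specifies.
    """
    seen = set()
    lines = []
    for custom_id, prompt in rows:
        # Duplicate IDs would make responses impossible to map back.
        if custom_id in seen:
            raise ValueError(f"duplicate custom_id: {custom_id}")
        seen.add(custom_id)
        request = {"custom_id": custom_id, "body": {"prompt": prompt}}
        lines.append(json.dumps(request, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


rows = [
    ("txn-0001", "Categorize: COFFEE SHOP 4.50"),
    ("txn-0002", "Categorize: AIRLINE 312.00"),
]
payload = build_jsonl(rows)
print(payload, end="")
```

Deriving the custom ID from a key you already store, such as a transaction or ticket ID, means the output can be joined back to your own records without a separate lookup table.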
Because results arrive as a file and not a stream, your code reads the whole output once rather than handling tokens as they generate. A few requests inside a batch can fail or expire while the rest succeed, so you reconcile by custom ID rather than assuming the batch is all-or-nothing.
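A minimal sketch of that reconciliation step is below. It assumes each output line is a JSON object carrying `custom_id` plus either a `response` or an `error` field; those field names are placeholders, since real output schemas vary by provider. The sketch does not rely on the output arriving in submission order.

```python
import json


def reconcile(submitted_ids, output_jsonl):
    """Split a batch's results into succeeded, failed, and missing custom IDs.

    Field names ("custom_id", "response", "error") are assumed for
    illustration and should be mapped to the provider's actual schema.
    """
    succeeded, failed = {}, {}
    for line in output_jsonl.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        custom_id = record["custom_id"]
        if record.get("error") is not None:
            failed[custom_id] = record["error"]
        else:
            succeeded[custom_id] = record.get("response")
    # Submitted but absent from the output, for example because it expired.
    missing = [cid for cid in submitted_ids
               if cid not in succeeded and cid not in failed]
    return succeeded, failed, missing


output = "\n".join([
    json.dumps({"custom_id": "txn-0002", "response": "travel"}),
    json.dumps({"custom_id": "txn-0001", "error": "invalid request"}),
])
succeeded, failed, missing = reconcile(["txn-0001", "txn-0002", "txn-0003"], output)
print("succeeded:", succeeded)
print("failed:", failed)
print("missing:", missing)
```

The failed and missing IDs together form the list to fix and resubmit in a follow-up batch.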
Batch API vs Synchronous API
A synchronous API answers immediately and bills at full rate. You open a request, the model responds in real time, and the user or service consumes the reply on the spot. This is what you want for chat, copilots, and anything interactive where a delay is visible.
A batch API answers later and bills at roughly half. Latency is measured in minutes to hours, and there is no live token stream. For a smart LLM gateway deployment, the practical pattern is routing interactive traffic to synchronous endpoints and shipping every deferrable job to batch automatically.
The decision rule is simple. If something blocks a human, stay synchronous. If something blocks only a downstream job or a scheduled run, move it to batch and take the discount.
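Expressed as code, the rule is a single branch. The `Job` type and `choose_endpoint` function are hypothetical names used only to illustrate the routing decision; a real gateway would likely weigh deadlines and job size as well.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    blocks_human: bool  # True if a person is waiting on the answer


def choose_endpoint(job):
    """Apply the rule: a waiting human means synchronous, otherwise batch."""
    return "synchronous" if job.blocks_human else "batch"


jobs = [
    Job("support copilot reply", blocks_human=True),
    Job("nightly ticket scoring", blocks_human=False),
]
for job in jobs:
    print(f"{job.name}: {choose_endpoint(job)}")
```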
Real-World Examples of Batch API Workloads
The easiest way to spot a batch candidate is to ask whether anyone is waiting for the answer. These are the jobs that almost always qualify:
Fintech transaction tagging: A payments team categorizes two million historical transactions overnight for a compliance report. Nobody reads the output until morning, so the whole run goes to batch at half price.
E-commerce catalog embeddings: A retailer regenerates product description embeddings for 500,000 SKUs after a catalog refresh. The vectors feed a nightly search reindex, not a live request.
Support ticket scoring: A SaaS team scores a backlog of last quarter's tickets for sentiment and churn risk. The analysis informs a planning review, so a few hours of delay costs nothing.
Model evaluation suites: An ML team runs 50,000 test prompts against a new model version to benchmark quality. Evals are offline by nature and a textbook batch job.
In each case, the work is high-volume, deferrable, and invisible to end users. That is the exact profile where the 50% discount lands cleanly.
Batch APIs Across the Major Providers
The three largest model providers all ship a batch endpoint with the same 50% discount and broadly similar mechanics. The differences are in turnaround, file limits, and how long results stay available. The table below uses figures from each provider's own documentation.
| Provider | Discount | Turnaround | Batch size limit | Notes |
|---|---|---|---|---|
| Amnic | n/a (visibility layer) | Tracks batch vs live spend | n/a | Allocates batch savings to teams and features |
| OpenAI | 50% off synchronous | Within 24 hours | 50,000 requests or 200 MB | JSONL via Files API |
| Gemini (Google) | 50% of the standard cost | Target 24 hours, often faster | 2 GB input file | Inline under 20 MB or file-based |
| Anthropic (Claude) | 50% of standard prices | Most under 1 hour | 100,000 requests or 256 MB | Results retained for 29 days |
OpenAI takes a JSONL file through the Files API and returns a completed output file within 24 hours at half the standard model cost, per its batch guide. Status moves through validating, in progress, finalizing, and completed, and a single batch is capped at 50,000 requests or 200 MB.
Gemini processes large request volumes and embeddings asynchronously at 50% of standard cost, targeting 24 hours but often finishing sooner, according to Google's batch API docs. It accepts inline requests under 20 MB or input files up to 2 GB, and integrates with context caching at standard caching rates.
Anthropic's Message Batches API processes large volumes asynchronously, with most batches finishing in under an hour while charging 50% of standard prices, per its batch processing docs. A batch is limited to 100,000 requests or 256 MB, expires if it does not finish within 24 hours, and keeps results available for 29 days. If you are choosing between vendors here, the throughput and cost differences feed directly into a wider Gemini vs GPT evaluation.
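Because a large job can exceed these caps, it usually has to be split before submission. The sketch below groups request lines greedily under both a request cap and a byte cap, using the OpenAI and Anthropic figures quoted above. It treats 1 MB as 1,000,000 bytes, which is an assumption since the figures do not specify, and `split_into_batches` is a hypothetical helper.

```python
LIMITS = {
    # Figures quoted in this article; MB assumed to mean 1,000,000 bytes.
    "openai": {"max_requests": 50_000, "max_bytes": 200 * 1_000_000},
    "anthropic": {"max_requests": 100_000, "max_bytes": 256 * 1_000_000},
}


def split_into_batches(lines, max_requests, max_bytes):
    """Greedily group request lines so each batch stays within both caps."""
    batches, current, current_bytes = [], [], 0
    for line in lines:
        size = len(line.encode("utf-8")) + 1  # +1 for the newline
        if size > max_bytes:
            raise ValueError("a single request exceeds the size cap")
        # Start a new batch when adding this line would break either cap.
        if current and (len(current) >= max_requests
                        or current_bytes + size > max_bytes):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(line)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


# 120,000 small placeholder requests, split under the OpenAI caps.
lines = ['{"custom_id": "txn-%07d"}' % i for i in range(120_000)]
batches = split_into_batches(lines, **LIMITS["openai"])
print([len(batch) for batch in batches])
```

With requests this small the request cap is the binding limit; with long prompts the byte cap tends to trigger first.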
Where a Batch API Cuts Cost, and Where Visibility Comes In
The savings are real but uneven. A batch discount only helps the portion of your workload that can tolerate delay, so the first job is to find that portion. Evaluations, nightly summarization, bulk tagging, and embedding refreshes are usually safe to defer, and they often sit unnoticed inside a larger bill.
This is the same off-peak logic that makes infrastructure cheaper. Running deferrable compute when capacity is loose is exactly how teams treat spot instances for cloud ROI, and batch APIs apply that thinking to model inference. You pay less by agreeing to wait.
The harder problem is attribution. Once batch and live calls hit the same provider account, a single invoice hides which teams, features, or jobs earned the discount and which never adopted it.
A dedicated AI token management layer breaks that spend down by team and feature, so a finance owner can see batch versus synchronous usage and confirm the savings actually landed. Pairing a batch rollout with Claude usage tracking closes the loop between a switch you flipped and a number on the bill.
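As a rough illustration of that breakdown, the sketch below sums spend per team and mode from a handful of hypothetical usage records, then estimates what the batch portion would have cost at the undiscounted rate. The record fields, team names, and dollar amounts are invented for the example; real data would come from a billing or usage export.

```python
from collections import defaultdict

# Hypothetical usage records, invented for illustration.
records = [
    {"team": "search", "mode": "batch", "cost": 60.0},
    {"team": "search", "mode": "sync", "cost": 15.0},
    {"team": "support", "mode": "sync", "cost": 90.0},
]


def spend_by_team(records):
    """Sum cost per team, split into batch and synchronous spend."""
    totals = defaultdict(lambda: {"batch": 0.0, "sync": 0.0})
    for record in records:
        totals[record["team"]][record["mode"]] += record["cost"]
    return dict(totals)


def batch_savings(batch_cost, discount=0.5):
    """Estimate the undiscounted cost of batch spend minus what was paid."""
    return batch_cost / (1.0 - discount) - batch_cost


for team, spend in spend_by_team(records).items():
    saved = batch_savings(spend["batch"])
    print(f"{team}: batch=${spend['batch']:.2f} "
          f"sync=${spend['sync']:.2f} saved=${saved:.2f}")
```

A team with zero batch spend, like the hypothetical support team here, is either running purely interactive traffic or has deferrable work that never moved over.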
When a Batch API Is the Wrong Choice
Batch is not free of trade-offs. The clearest one is latency, which rules it out for any interactive or user-facing path. A support copilot cannot tell a customer to come back in an hour, so that traffic stays synchronous regardless of price.
There are quieter limits too:
No streaming: Results return as one file, so you cannot show tokens as they generate.
Unsupported features: Some options, like cache pre-warming, are not allowed inside a batch.
Expiration risk: A batch that misses its 24-hour window returns partial results you have to reconcile.
Slower debugging: You lose the tight feedback loop of live calls, so a malformed prompt takes longer to catch.
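One way to soften the slower feedback loop is to check the request file locally before submitting it. The sketch below catches only problems that can be detected offline: invalid JSON, lines that are not objects, and missing or duplicate IDs. The `custom_id` field name and the `validate_jsonl` helper are assumptions, and a provider will still apply its own validation.

```python
import json


def validate_jsonl(text):
    """Return (line_number, problem) pairs for lines likely to be rejected.

    Checks only what is knowable locally: valid JSON, one object per line,
    and a unique, non-empty string "custom_id" (field name assumed).
    """
    problems, seen = [], set()
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            problems.append((number, "blank line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        custom_id = record.get("custom_id")
        if not isinstance(custom_id, str) or not custom_id:
            problems.append((number, "missing or non-string custom_id"))
        elif custom_id in seen:
            problems.append((number, f"duplicate custom_id: {custom_id}"))
        else:
            seen.add(custom_id)
    return problems


sample = '{"custom_id": "a"}\n{"custom_id": "a"}\n{"custom_id": \n[1, 2]\n'
for number, problem in validate_jsonl(sample):
    print(f"line {number}: {problem}")
```

Running a small sample of the same prompts synchronously before committing the full batch is a complementary check for prompt quality, which a structural validator like this cannot assess.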
The honest framing is that batch is a cost lever for a specific class of work, not a default. Apply it to deferrable, high-volume jobs, and you capture a clean 50%. Apply it to interactive traffic, and you break the product to save money you were never spending wastefully.
Conclusion
A batch API is the simplest large lever in AI cost management: take any workload that does not need a live answer, group it into a file, and pay half. OpenAI, Gemini, and Anthropic all offer the same 50% discount with separate rate limits, differing mainly in turnaround and size caps. The win is not flipping the switch; it is knowing which jobs qualify and then proving the savings showed up in the bill.
FAQ
What is a batch API?
A batch API is an asynchronous endpoint that accepts many requests in one file, processes them in the background, and returns all results together. It trades instant responses for a lower price, typically half the synchronous rate.
How much does a batch API save?
OpenAI, Gemini, and Anthropic each discount batch processing to 50% of their standard rates per their documentation. The saving applies only to work that can tolerate delay, not to interactive traffic.
How long does a batch job take?
Plan around a 24-hour window, with many jobs finishing sooner. OpenAI returns results within 24 hours, and Gemini targets 24 hours but often finishes faster. Anthropic reports that most batches finish in under an hour and expires any batch that has not finished within 24 hours.
What file format does a batch API use?
OpenAI takes a JSONL file, where each line is a single request with a custom ID you assign. Gemini accepts either inline requests under 20 MB or an input file of up to 2 GB. In each case you submit the job, wait for it to finish, and then retrieve the completed output.
When should you not use a batch API?
Avoid batch for anything user-facing or interactive, since results are delayed and not streamed. Chat, copilots, and live features should stay on synchronous endpoints despite the higher rate.
Do batch requests use a separate rate limit?
Yes. Batch jobs draw from their own quota, so large offline workloads do not throttle the synchronous endpoints your production traffic depends on.