Spot Instances: How to Maximize Cloud ROI Without the Risk
Spot instances let you run workloads on a cloud provider's spare compute at a discount of up to 90% off on-demand pricing, in exchange for the provider reclaiming that capacity with a short interruption notice. For teams under pressure to lower their cloud bill, that trade is one of the largest single levers on compute spend.
The catch is real but manageable. Spot capacity can be taken back, so the work you place on it has to tolerate being interrupted and resumed. This guide covers what spot instances are, which workloads fit, how the providers differ and how to fold spot into a measured cloud cost optimization program.
What Are Spot Instances?
A spot instance is unused cloud compute capacity sold at a steep discount. The provider runs more physical servers than customers reserve at any moment and spot pricing puts that idle capacity to work. You get the same hardware and performance as an on-demand virtual machine, only cheaper and revocable.
The discount is the headline. AWS quotes savings of up to 90% versus on-demand for spot capacity. That figure is a ceiling rather than a flat rate, because the actual discount varies by instance type, region and availability zone. On a large footprint, it can still shift your AWS EC2 pricing math more than most rightsizing wins ever will.
What you give up is guaranteed availability. The provider can reclaim a spot instance when it needs the capacity back for on-demand customers. That is the whole bargain, and it is why spot sits next to commitment discounts like Savings Plans and Reserved Instances rather than replacing them.
How Do Spot Instances Work?
There is no auction and no bidding. AWS retired the old bid-based model, so you request spot capacity and pay the current spot price for the instance type you choose. Capacity is organized into pools defined by instance type, operating system and availability zone.
When a pool runs short, the provider reclaims instances and sends an interruption notice first. On AWS the warning is two minutes, which is enough time for your automation to drain work and save progress before the instance goes away. Prices move gradually with supply and demand instead of spiking.
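To make the two-minute warning concrete, here is a minimal Python sketch of how automation on an EC2 instance might read the interruption notice. It assumes the instance metadata endpoint `spot/instance-action` and the IMDSv2 token headers as described in AWS's instance metadata documentation, and a notice body of the form `{"action": "terminate", "time": "2017-09-18T08:22:00Z"}`. Verify the paths and payload against the current AWS docs before relying on them. The function names are illustrative, not part of any library.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone
from typing import Optional

# Link-local instance metadata address (assumed per AWS documentation).
IMDS = "http://169.254.169.254/latest"


def seconds_until_interruption(payload: str, now: datetime) -> float:
    """Parse an instance-action document and return the seconds left before reclaim."""
    doc = json.loads(payload)
    # The notice carries a UTC timestamp such as 2017-09-18T08:22:00Z.
    deadline = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()


def fetch_instance_action(timeout: float = 1.0) -> Optional[str]:
    """Return the raw notice, or None when no interruption is scheduled.

    Only meaningful when run on an EC2 instance.
    """
    token_request = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    try:
        with urllib.request.urlopen(token_request, timeout=timeout) as response:
            token = response.read().decode()
        action_request = urllib.request.Request(
            f"{IMDS}/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": token},
        )
        with urllib.request.urlopen(action_request, timeout=timeout) as response:
            return response.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # a 404 here means no interruption is pending
            return None
        raise
```

A worker would typically call `fetch_instance_action()` every few seconds and, once it returns a notice, stop accepting new work and spend the remaining seconds draining and saving progress.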
Interruptions are less frequent than most people expect. AWS reports an average historical interruption frequency of under 5% across all regions and instance types, though your real rate depends on the pools you draw from. Tracking that rate belongs in your cloud utilization metrics so you can see which pools stay stable.
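As a rough illustration of tracking that rate yourself, the sketch below groups launch records by pool and reports the share that ended in an interruption. The `Launch` record and its fields are hypothetical. In practice the data would come from whatever launch and termination logs you already collect.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Launch(NamedTuple):
    """One spot launch, described by the three attributes that define a pool."""
    instance_type: str
    operating_system: str
    zone: str
    interrupted: bool


Pool = Tuple[str, str, str]


def interruption_rate_by_pool(launches: Iterable[Launch]) -> Dict[Pool, float]:
    """Return the fraction of launches in each pool that were reclaimed."""
    totals: Dict[Pool, int] = defaultdict(int)
    reclaimed: Dict[Pool, int] = defaultdict(int)
    for launch in launches:
        pool = (launch.instance_type, launch.operating_system, launch.zone)
        totals[pool] += 1
        if launch.interrupted:
            reclaimed[pool] += 1
    return {pool: reclaimed[pool] / totals[pool] for pool in totals}
```

Pools with a persistently high rate are candidates to drop from the fleet, while stable pools can carry more of the load.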
Spot Instances Use Cases
The rule for fit is simple. If a workload can stop, requeue and resume without corrupting data or breaking a customer promise, it is a candidate for spot. That covers a wide range of non-critical workloads where flexible timing matters more than guaranteed uptime, the kind of cloud resource utilization decision FinOps teams make daily.
The four workload patterns below cover most of where spot earns its keep.
Batch and data processing: Report generation, video encoding, ETL jobs and Spark or Hadoop clusters scale across many nodes and tolerate losing a few. Distributed jobs reschedule lost tasks automatically, so spot cuts the cost of large processing runs without changing the result.
CI/CD pipelines: Build and test runners are short-lived and stateless. Jenkins, GitLab and Buildkite agents run cleanly on spot, so a wave of pull requests does not push your build farm onto full-price compute, and a failed runner simply restarts on fresh capacity.
Machine learning training: GPU training is expensive and long-running, but checkpointing to storage lets a job resume from its last saved state after an interruption. That makes spot one of the better ways for early teams to control AI compute cost and it is a common pattern in AI cost optimization tools for startups where the discount matters most.
Containers and Kubernetes: Stateless pods behind a load balancer reschedule onto new nodes when one disappears. Running them on spot is a core EKS cost optimization move, since the orchestrator already treats node loss as a recoverable event.
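The checkpoint-and-resume pattern behind the training and batch cases above can be sketched in a few lines of Python. This is a simplified illustration, not a training framework: the "work" is a running sum, the `interrupt_at` argument simulates a reclaim, and a local JSON file stands in for external storage such as S3. On a real spot instance the checkpoint has to live off the instance, since local disk may not survive a reclaim.

```python
import json
import os
import tempfile
from typing import Optional


def save_checkpoint(path: str, state: dict) -> None:
    """Write state atomically so a reclaim mid-write never leaves a torn file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
    os.replace(tmp_path, path)  # atomic rename within one filesystem


def load_checkpoint(path: str) -> dict:
    """Return the last saved state, or a fresh state when none exists."""
    if not os.path.exists(path):
        return {"next_step": 0, "total": 0}
    with open(path) as handle:
        return json.load(handle)


def run_job(path: str, total_steps: int, interrupt_at: Optional[int] = None) -> dict:
    """Process steps from the last checkpoint onward, saving after each one."""
    state = load_checkpoint(path)
    for step in range(state["next_step"], total_steps):
        if interrupt_at is not None and step >= interrupt_at:
            break  # simulated spot reclaim
        state["total"] += step  # placeholder for one unit of real work
        state["next_step"] = step + 1
        save_checkpoint(path, state)
    return state
```

Running the job once with a simulated interruption and then again without one produces the same final state as an uninterrupted run, which is the property that makes a workload safe for spot.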
Workloads to Keep Off Spot
Spot is not a universal default. Forcing it onto the wrong workload creates incidents instead of savings. Some systems need guaranteed capacity and the honest answer is to leave them on on-demand or committed pricing rather than chase a discount that will bite you later.
Skip spot for:
Primary production databases: A lost write window or a half-failed-over replica costs more than the discount ever saves, so keep the primary on on-demand or reserved capacity.
Strict-uptime customer-facing services: Anything with a tight SLA that cannot absorb a 30-second to two-minute disruption belongs on guaranteed capacity, not a fleet that can vanish at the worst moment.
Stateful apps without recovery logic: If an interruption corrupts in-memory state and the app has no checkpoint or replication, spot turns small reclaims into multi-hour cleanup work.
Regulated workloads: HIPAA and PCI-DSS environments usually belong on stable capacity with audit trails intact, with disciplined cloud cost management handling the cost side instead.
Spot Instances Across Cloud Providers
Every major provider sells spare capacity, but the terms differ enough to change your architecture. Matching a workload to the right behavior is easier when you track live rates and AWS, Azure and GCP discounts side by side instead of guessing from memory.
| Provider | Discount vs on-demand | Interruption notice | Pricing behavior | Runtime limit |
|---|---|---|---|---|
| AWS EC2 Spot | Up to 90% | 2 minutes | Gradual, supply and demand | None |
| Azure Spot VMs | Up to 90% | 30 seconds | Variable, optional max price | None |
| Google Cloud Spot VMs | Up to 91% | Short shutdown signal | Predictable, changes infrequently | None |
AWS gives the longest warning at two minutes, which makes graceful shutdown the easiest of the three. Azure evicts with 30 seconds' notice and lets you choose a deallocate or delete policy, so you control whether the disk survives an eviction. That shorter window means your shutdown logic has to be fast and tested.
One correction worth flagging on Google. Spot VMs there reach up to 91% off and no longer carry the 24-hour runtime cap that the older preemptible VMs had. If you compared providers a while back and ruled out Google on that limit, the constraint is gone and the option deserves a fresh look.
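One way to act on the notice windows above is to compare your measured drain time against each provider's warning. The sketch below uses only the two figures this article gives in seconds (AWS and Azure). The safety factor is an assumption chosen for illustration, and the function name is hypothetical.

```python
# Interruption notice in seconds, taken from the comparison table above.
NOTICE_SECONDS = {"aws": 120, "azure": 30}


def fits_notice_window(provider: str, drain_seconds: float, safety_factor: float = 0.5) -> bool:
    """Return True when a measured drain time fits inside the usable notice window.

    safety_factor reserves part of the window for delays in receiving the
    notice; 0.5 is an illustrative default, not a provider recommendation.
    """
    budget = NOTICE_SECONDS[provider] * safety_factor
    return drain_seconds <= budget
```

A service that takes 45 seconds to drain passes this check on AWS but fails it on Azure, which is the kind of difference that should shape where a workload runs.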
How to Handle Interruptions
Treat interruptions as expected events, not failures. The teams that win on spot design for reclaim from the start, so a lost instance is a routine reschedule rather than an outage. Four practices do most of the work.
Diversify your pools: Let your fleet draw from many instance types, families and availability zones. The price-capacity-optimized allocation strategy picks pools that are both cheap and deeply available, which lowers interruption rates compared with chasing the lowest price alone.
Checkpoint to external storage: Save progress to S3 or an equivalent so an interrupted job resumes from its last checkpoint instead of starting over. On GPU jobs, pair checkpointing with GPU usage monitoring so an idle node never bills as a busy one.
Enable Capacity Rebalancing: It launches a replacement instance as soon as a rebalance recommendation arrives, often before the two-minute notice fires, which gives in-flight work a chance to drain gracefully instead of dying at the deadline.
Automate node replacement: On Kubernetes, Karpenter watches an interruption queue, drains the doomed node and provisions a fresh one in parallel so pods reschedule without downtime. Blend a small on-demand baseline under the spot fleet for guaranteed floor capacity.
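Three of the practices above (pool diversification, Capacity Rebalancing and an on-demand baseline) can be expressed together in one Auto Scaling group definition. The Python sketch below only builds the request as a dictionary. The keys follow the parameter names of the EC2 Auto Scaling `CreateAutoScalingGroup` API as understood here, so check them against the current reference. The group name, launch template name, subnet IDs and the minimum of two instance types are all illustrative assumptions.

```python
from typing import Dict, List


def build_asg_request(
    group_name: str,
    launch_template_name: str,
    instance_types: List[str],
    subnet_ids: List[str],
    min_size: int,
    max_size: int,
    on_demand_base: int,
) -> Dict:
    """Build a mixed-instances request that favors cheap, deeply available spot pools."""
    if len(instance_types) < 2:
        # Illustrative guard: a single type leaves no pools to diversify across.
        raise ValueError("list at least two instance types to diversify pools")
    return {
        "AutoScalingGroupName": group_name,
        "MinSize": min_size,
        "MaxSize": max_size,
        # One subnet per availability zone spreads the fleet across zones.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        # Launch a replacement when a rebalance recommendation arrives.
        "CapacityRebalance": True,
        "MixedInstancesPolicy": {
            "LaunchTemplate": {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateName": launch_template_name,
                    "Version": "$Latest",
                },
                "Overrides": [{"InstanceType": t} for t in instance_types],
            },
            "InstancesDistribution": {
                # Guaranteed floor of on-demand capacity under the spot fleet.
                "OnDemandBaseCapacity": on_demand_base,
                # Everything above the floor runs on spot.
                "OnDemandPercentageAboveBaseCapacity": 0,
                "SpotAllocationStrategy": "price-capacity-optimized",
            },
        },
    }
```

With boto3 installed and credentials configured, the result could be passed as `boto3.client("autoscaling").create_auto_scaling_group(**request)`. Keeping the builder separate from the API call makes the policy easy to review and unit test before anything is provisioned.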
Common Misconceptions About Spot Instances
Three myths keep teams from savings they could already have. Clearing them up is usually the fastest path to adoption, because the blockers are beliefs rather than technology.
"You have to bid for spot". Spot stopped being an auction when AWS retired the bid-based model, so there is no market to play and no bid to lose. You request capacity and pay the current spot price, which adjusts gradually with supply and demand, for as long as the instance runs.
"Spot is inferior hardware". Spot runs on the same physical hardware as on-demand, so an instance is never slower for being cheaper. The only difference is the provider's right to reclaim the capacity.
"Spot is only for dev environments". Plenty of production traffic already runs behind load balancers and container orchestration built to survive node loss, which is exactly what spot needs. The barrier is architectural readiness, not the environment label, and that readiness shows up clearly in your cloud spending visibility data.
Where Spot Fits in a FinOps Strategy
Spot is one tactic inside a larger plan, not a strategy on its own. The real return comes from layering it correctly: commitments for steady baseline load, spot for interruptible bursts and on-demand only where you truly need guaranteed capacity. A mature FinOps practice decides that mix deliberately rather than by habit.
To run that mix, you need attribution. Knowing the effective savings rate spot delivers per team or service tells you where the discount is real and where interruptions are quietly eroding it. The same allocation discipline applies to AI spend, which is why teams adopting Anthropic cost allocation tools increasingly pair token attribution with compute attribution.
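One possible way to define an effective savings rate is sketched below in Python. It is an assumption for illustration, not a standard FinOps formula: it compares what a team actually paid for spot against what the useful share of that work would have cost on-demand, so hours lost to interruptions and redone count against the discount. The field names and rates are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SpotUsage:
    team: str
    billed_hours: float    # all spot instance hours paid for
    wasted_hours: float    # billed hours whose work was lost to interruptions
    spot_rate: float       # price per hour paid on spot
    on_demand_rate: float  # price per hour for the same instance on-demand


def effective_savings_rate(usage: SpotUsage) -> float:
    """Return savings as a fraction of the on-demand cost of the useful work."""
    useful_hours = usage.billed_hours - usage.wasted_hours
    if useful_hours <= 0:
        raise ValueError("no useful work completed, savings rate is undefined")
    spot_cost = usage.billed_hours * usage.spot_rate
    on_demand_equivalent = useful_hours * usage.on_demand_rate
    return 1 - spot_cost / on_demand_equivalent
```

With made-up rates of 0.03 on spot and 0.10 on-demand, a team that bills 1,000 hours and wastes none saves 70%. If 200 of those hours are lost to interruptions, the same nominal discount works out to 62.5%, which is the quiet erosion the attribution is meant to expose.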
Forecasting is the other half. Feeding spot adoption into cloud cost forecasting keeps projections honest when a chunk of compute is variable by design. For mixed workloads spanning text, vision and audio inference on the same fleet, the same approach extends to multimodal cost optimization tools on top of the spot layer.
The Bottom Line
Spot instances turn a cloud provider's spare capacity into one of the cheapest ways to run interruptible work. The savings are real, the interruption risk is manageable and the workloads that fit are broader than most teams assume.
Start with a non-production pilot, instrument the interruption rate and utilization, then expand into production as confidence grows. Done with attribution and the right automation, spot stops being a gamble and becomes a standing part of how you control cloud cost.
FAQs
What makes spot instances different from on-demand or reserved instances?
Spot uses spare capacity at up to 90% off but can be reclaimed with a short notice. On-demand gives guaranteed availability at full price. Reserved Instances and Savings Plans trade a one- or three-year commitment for a lower steady rate on baseline workloads.
Can I run production workloads on spot instances?
Yes. Stateless services behind load balancers and container platforms built to survive node loss run on spot in production every day. The requirement is fault tolerance and automated recovery, not the environment. Blend a small on-demand baseline for guaranteed capacity.
How much notice do I get before a spot instance is interrupted?
AWS sends a two-minute interruption notice. Azure gives roughly 30 seconds before eviction. Google Spot VMs send a short shutdown signal. Use that window with checkpointing and capacity rebalancing so work drains and resumes cleanly instead of being lost.
How do I reduce the impact of spot interruptions?
Diversify across instance types, families and availability zones, use the price-capacity-optimized allocation strategy, checkpoint progress to external storage and enable Capacity Rebalancing. On Kubernetes, an autoscaler that watches the interruption queue drains and replaces nodes automatically.
Do I still need to bid for spot instances?
No. AWS removed the bidding model, so you request spot capacity and pay the current spot price with no auction to manage. Prices move gradually with supply and demand rather than spiking, which makes spot spend easier to predict.
How do spot instances fit into a FinOps strategy?
Spot covers interruptible bursts while commitments handle steady baseline load and on-demand covers must-never-fail workloads. Tracking the effective savings rate per team and feeding spot into forecasting keeps the discount honest and shows where interruptions erode it.