GPU for AI Training: Pick the Right One Without Overspending

Amnic

  • A training GPU is decided by four levers: VRAM, memory bandwidth, low-precision throughput and interconnect speed.

  • For enterprise training, the working tiers are H100, H200 and B200. For LoRA and QLoRA fine-tuning, A100 80GB, L40S and RTX 5090 cover most workloads.

  • Pick by cost per finished run, not hourly rate. A100 often beats H100 on small workloads because H100 never saturates.

  • Cloud GPU wins for spiky and short workloads. Specialist clouds run 60 to 85 percent cheaper than hyperscalers on the same SKU.

  • Most AI teams waste 30 to 50% of GPU spend on idle capacity. GPU choice is half the battle. The other half is utilization.

Quick Decision Matrix

| Workload | Recommended GPU | Why | Cost envelope |
| --- | --- | --- | --- |
| Pre-training 70B+ from scratch | B200 or H100 SXM multi-node | FP4/FP8 throughput, NVLink, max VRAM | $1M to $5M per run |
| Pre-training 7B from scratch | H100 80GB cluster | Saturates Transformer Engine | $50K to $500K |
| Full fine-tune 7B | A100 80GB or H100 80GB | Fits with headroom | $500 to $5K |
| QLoRA 70B | A100 80GB or H100 80GB | 88 GB VRAM floor | $50 to $500 |
| QLoRA 13B or 34B | L40S 48GB or RTX 5090 | Cost-controlled | $20 to $200 |
| QLoRA 7B | RTX 4070 12GB or RTX 4090 24GB | Fits on consumer | Under $20 |
| Production inference | H100, L40S or L4 | Bandwidth and cost per token | Workload dependent |

The rest of this guide unpacks how to read that matrix: what each lever actually does, how to size VRAM to your model, which shortlist GPU fits which workload and how to turn an hourly rate into the cost of an actual training run. The same selection logic feeds directly into FinOps for AI workloads, because GPU spend is the single largest line item for most AI teams.

What Is a GPU for AI Training?

A GPU is a parallel processor that runs the matrix and tensor math behind model weights at scale, which is why every modern AI training pipeline depends on GPUs. A CPU executes instructions across a handful of cores. A GPU runs thousands of cores in parallel and finishes the same matrix multiplication orders of magnitude faster, as spelled out in this parallel-compute primer.

For AI training specifically, three operations dominate every step:

  • Forward pass: Matrix multiplications between activations and weights, layer by layer.

  • Backward pass: Gradient calculation back through the network, roughly twice the compute of the forward pass.

  • Optimizer step: Weight updates using gradients and stored optimizer state.

All three are embarrassingly parallel inside each layer, which is exactly the workload GPUs were built for. A modern training GPU finishes a transformer training step in milliseconds where a CPU would take minutes.
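The two-to-one ratio between backward and forward compute can be made concrete with a common rule of thumb: about 2 FLOPs per parameter per token for the forward pass and about 4 for the backward pass. The sketch below applies it. The function name and the example model and token counts are illustrative assumptions, and the estimate ignores attention-specific terms and the optimizer step.

```python
def training_flops(params: float, tokens: float) -> dict:
    """Rough FLOP estimate for training on `tokens` tokens.

    Assumes ~2 FLOPs per parameter per token for the forward pass and
    ~4 for the backward pass (a rule of thumb, not a measurement).
    """
    forward = 2.0 * params * tokens
    backward = 4.0 * params * tokens
    return {"forward": forward, "backward": backward, "total": forward + backward}


# Hypothetical example: a 7B-parameter model trained on 1B tokens.
est = training_flops(7e9, 1e9)
print(f"forward  : {est['forward']:.2e} FLOPs")
print(f"backward : {est['backward']:.2e} FLOPs")
print(f"total    : {est['total']:.2e} FLOPs")
```

Dividing the total by a GPU's sustained throughput gives a first-order training time, which is the input the cost sections later in this guide depend on.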

The Four Things That Decide a Training GPU

Whether a GPU is the right pick comes down to four levers. Get them wrong and you either run out of memory mid-step, sit idle waiting on memory, or pay for capability your workload never touches.

VRAM

VRAM caps the model size, batch size and sequence length you can train without offloading to system memory. During training, four things sit in GPU memory at the same time:

  • Model weights (parameters at the chosen precision).

  • Gradients (one per parameter).

  • Optimizer state (Adam needs 8 bytes per parameter at FP32).

  • Activations (intermediate outputs, often larger than weights at long context).

A working rule is roughly 16 GB of VRAM per 1B parameters for full FP16 fine-tuning with Adam, before activations, and about 0.5 GB per 1B for the 4-bit base weights under QLoRA, with adapters, optimizer state and activations on top. The implication is simple. A 7B model needs 80 GB or more to fine-tune fully, but only ~12 GB to fine-tune with QLoRA. That is the difference between needing a data-center GPU and fitting on a desktop card.
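As a rough sketch of how the per-parameter rule turns into a VRAM number, the helper below multiplies parameter count by bytes per parameter. The function names are hypothetical. The 16-byte default takes the rule above literally, the 4-bit figure covers base weights only, and activations, adapters and framework overhead are left out, so real floors differ.

```python
def full_finetune_gb(params_billion: float, bytes_per_param: float = 16.0) -> float:
    """Weights + gradients + optimizer state in GB, excluding activations.

    1B parameters at N bytes each is N GB (decimal gigabytes).
    The 16-byte default is an assumption taken from the working rule.
    """
    return params_billion * bytes_per_param


def qlora_base_gb(params_billion: float, bits: int = 4) -> float:
    """Footprint of the quantized base weights only, in GB."""
    return params_billion * bits / 8.0


for size in (7, 13, 30, 70):
    print(f"{size:>3}B  full: {full_finetune_gb(size):7.1f} GB   "
          f"4-bit base weights: {qlora_base_gb(size):5.1f} GB")
```

Lowering `bytes_per_param` models memory-saving choices such as a smaller optimizer state. The gap between the 4-bit base weights and the practical QLoRA floor is the adapter, optimizer and activation memory.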

Memory bandwidth

Most training is memory-bound, not compute-bound. It does not matter how fast the cores are if they sit idle waiting for data. HBM3 and HBM3e GPUs move data roughly 3 to 6 times faster than the GDDR cards listed below, which shows up directly in tokens-per-second.

Working numbers:

  • HBM3e on H200: 141 GB at 4.8 TB/s.

  • HBM3 on H100: 80 GB at 3.35 TB/s.

  • HBM3 on MI300X: 192 GB at 5.3 TB/s.

  • GDDR6X on RTX 4090: 24 GB at 1.01 TB/s.

  • GDDR6 on L40S: 48 GB at 864 GB/s.

Long-context transformer training spends most of its time in the attention block, which is bandwidth-bound. That is why H200 and MI300X often beat H100 on a per-token basis despite similar compute throughput.
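To see what those bandwidth figures mean in practice, the sketch below computes the best-case time to read a set of weights once from GPU memory. The 14 GB payload (7B parameters at 2 bytes each) is an illustrative assumption, and peak bandwidth is never fully reached in real kernels, so treat the results as lower bounds.

```python
# Peak memory bandwidth in TB/s, taken from the list above.
PEAK_TB_PER_S = {
    "H200": 4.8,
    "H100": 3.35,
    "MI300X": 5.3,
    "RTX 4090": 1.01,
    "L40S": 0.864,
}


def sweep_ms(gigabytes: float, tb_per_s: float) -> float:
    """Best-case milliseconds to read `gigabytes` once at peak bandwidth."""
    seconds = gigabytes / (tb_per_s * 1000.0)  # 1 TB/s = 1000 GB/s
    return seconds * 1000.0


WEIGHTS_GB = 14.0  # assumption: 7B parameters at 2 bytes each

for gpu, bw in sorted(PEAK_TB_PER_S.items(), key=lambda kv: -kv[1]):
    print(f"{gpu:9s} {sweep_ms(WEIGHTS_GB, bw):6.2f} ms per pass over the weights")
```

A memory-bound kernel cannot finish faster than this sweep time regardless of how many Tensor Cores sit behind it.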

Low-precision throughput

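Modern training runs in BF16, FP16, FP8 or FP4. Tensor Core throughput at low precision is what makes a GPU fast on transformers. Each generation has added a lower precision rung:

  • Ampere (A100): BF16, FP16 and TF32, with no FP8.

  • Hopper (H100, H200): adds FP8 via the Transformer Engine.

  • Blackwell (B200): adds FP4.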

FP8 roughly doubles throughput vs FP16 with minimal loss in convergence on transformers. That is why H100 is roughly 3x more cost-efficient than A100 on large LLM training, despite the higher hourly rate. If your training code does not enable FP8, you are paying for a Hopper chip and using it like an Ampere chip.

Interconnect

Single-GPU training tops out fast. Past a single card, the speed at which gradients move between GPUs becomes the bottleneck. The progression across NVIDIA NVLink generations is steep:

  • NVLink 3.0 on A100 SXM: 600 GB/s.

  • NVLink 4.0 on H100 SXM: 900 GB/s.

  • NVLink 5.0 on B200: 1.8 TB/s.

PCIe variants of the same GPU are usually 2 to 3x slower on multi-GPU scaling because a PCIe Gen5 x16 link tops out at about 128 GB/s bidirectional, roughly a seventh of NVLink 4.0's 900 GB/s. If your topology spans nodes, you also need a fast network fabric so cross-node traffic does not starve the GPUs. InfiniBand at 400 Gbps is the production standard for multi-node training.
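A back-of-the-envelope way to compare interconnects is the ring all-reduce cost model, in which each GPU sends and receives about 2 x (N - 1) / N times the gradient payload. The sketch below applies that model. It is idealized: it treats the quoted link figures as achievable throughput and ignores latency and overlap with compute, and the 14 GB payload (7B gradients at 2 bytes each) is an assumption.

```python
def ring_allreduce_seconds(payload_gb: float, n_gpus: int, link_gb_s: float) -> float:
    """Idealized time for one ring all-reduce of `payload_gb` across `n_gpus`.

    Each GPU moves 2 * (N - 1) / N times the payload over its link.
    """
    if n_gpus < 2:
        return 0.0  # nothing to synchronize on a single GPU
    traffic_gb = 2.0 * (n_gpus - 1) / n_gpus * payload_gb
    return traffic_gb / link_gb_s


GRADIENTS_GB = 14.0  # assumption: 7B parameters, 2-byte gradients

for name, gb_s in [("NVLink 3.0", 600), ("NVLink 4.0", 900),
                   ("NVLink 5.0", 1800), ("PCIe Gen5", 128)]:
    ms = ring_allreduce_seconds(GRADIENTS_GB, 8, gb_s) * 1000
    print(f"{name:11s} {ms:7.1f} ms per gradient sync on 8 GPUs")
```

If the sync time approaches the compute time of a training step, the GPUs spend a large share of each step waiting on the interconnect.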

How Much VRAM Do You Actually Need?

Three numbers drive the floor: model parameters, optimizer state and activation memory. Adam optimizer state alone needs 8 bytes per parameter at FP32. Activation memory often exceeds parameter memory at long context lengths.

Full fine-tuning floor (FP16)

| Model size | Min VRAM | Practical GPU |
| --- | --- | --- |
| 7B | ~80 GB | A100 80GB |
| 13B | ~160 GB | 2x A100 80GB |
| 30B | ~320 GB | 4x A100 80GB or 2x B200 |
| 70B | 700+ GB | 8x H100 SXM with NVLink |

LoRA / QLoRA fine-tuning (4-bit base)

| Model size | QLoRA VRAM | Practical GPU |
| --- | --- | --- |
| 7B | ~12 GB | RTX 4070 12GB |
| 13B | ~20 GB | RTX 4090 24GB |
| 30B | ~44 GB | L40S 48GB |
| 70B | ~88 GB | A100 80GB or H100 80GB |

QLoRA, as introduced in the original QLoRA paper, cuts the base model footprint by roughly 75 percent vs FP16 by loading weights in 4-bit precision and training only small low-rank adapter matrices. The practical takeaway is a clean tier ladder:

  • 7B: fits on a consumer GPU.

  • 13B to 34B: fits on a workstation or single data-center GPU.

  • 70B: needs one A100 80GB or H100 80GB.

  • 180B and up: needs multi-GPU with NVLink.
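The tier ladder above amounts to picking the smallest card whose VRAM covers the estimated floor. A minimal sketch of that lookup follows. The card list uses capacities quoted in this guide, while the function name and the optional headroom parameter are illustrative assumptions.

```python
# (name, VRAM in GB) for single cards mentioned in this guide.
GPUS = [
    ("RTX 4070", 12), ("RTX 4090", 24), ("RTX 5090", 32), ("L40S", 48),
    ("A100 80GB", 80), ("H100 80GB", 80), ("H200", 141), ("B200", 192),
]


def smallest_fit(required_gb: float, headroom: float = 0.0):
    """Return the smallest single GPU that holds `required_gb`, or None.

    `headroom` is a fractional safety margin, e.g. 0.1 for 10 percent.
    None means the job needs multiple GPUs.
    """
    need = required_gb * (1.0 + headroom)
    for name, vram in sorted(GPUS, key=lambda g: g[1]):
        if vram >= need:
            return name
    return None


for floor_gb in (12, 20, 44):
    print(f"{floor_gb:>3} GB floor -> {smallest_fit(floor_gb)}")
```

Adding headroom shifts borderline jobs up a tier, which is usually cheaper than an out-of-memory failure partway through a run.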

The Shortlist: Data-Center GPUs for Training

NVIDIA H100

The production workhorse. Most ML teams compare every other option to it.

  • VRAM: 80 GB HBM3.

  • Bandwidth: 3.35 TB/s on the SXM variant.

  • Precision: BF16, FP16, FP8 via Transformer Engine.

  • Interconnect: NVLink 4.0 at 900 GB/s.

  • Best for: production transformer training from 7B to 70B, FP8 pre-training, dense fine-tune workloads.

  • Watch out for: never saturates on small workloads, which is where A100 wins on cost per run.

NVIDIA H200

The H100 with more and faster memory.

  • VRAM: 141 GB HBM3e.

  • Bandwidth: 4.8 TB/s.

  • Best for: long-context training, large batches that would otherwise need sharding, memory-bound attention.

  • Why it matters: the extra VRAM often lets you skip a sharding step, which beats H100 on cost per finished run.

NVIDIA B200 and GB200 NVL72

The frontier choice for 100B+ pre-training.

  • VRAM: up to 192 GB HBM3e.

  • Bandwidth: 8 TB/s.

  • Precision: adds FP4, with the rack-scale GB200 NVL72 packing 72 Blackwell GPUs into a single system.

  • Interconnect: NVLink 5.0 at 1.8 TB/s.

  • Best for: pre-training of frontier models, hyperscaler-grade clusters, inference at scale where FP4 throughput matters.

NVIDIA A100

Older but still rational.

  • VRAM: 40 or 80 GB HBM2e.

  • Bandwidth: 1.94 TB/s on the 80 GB PCIe variant, about 2.04 TB/s on the 80 GB SXM variant.

  • Precision: BF16, FP16, TF32 (no FP8).

  • Best for: small to mid fine-tuning runs, QLoRA on 70B, workloads where H100 never saturates.

  • Surprise: on a nanoGPT-style training benchmark the A100 finished a run at 0.8 cents while the H100 SXM finished at 1.6 cents, because the H100 never saturated.

AMD MI300X

The non-NVIDIA option that actually fits training.

  • VRAM and bandwidth: 192 GB HBM3 at 5.3 TB/s.

  • Stack: ROCm with PyTorch and JAX builds.

  • Best for: teams already running on ROCm and chasing maximum batch and context headroom.

  • Watch out for: smaller ecosystem, fewer pre-built containers, supply concentrated on a handful of clouds.

The Shortlist: Workstation and Consumer GPUs for Fine-Tuning

NVIDIA L40S

The mid-scale sweet spot.

  • VRAM: 48 GB GDDR6.

  • Best for: LoRA and QLoRA on 13B to 34B, multi-tenant fine-tune services, production inference.

  • Why it fits: workstation power envelope, data-center form factor, no consumer driver baggage.

NVIDIA RTX 6000 Ada and RTX A6000

Workstation cards with ECC memory.

  • VRAM: 48 GB GDDR6 with ECC.

  • Best for: under-desk experimentation, model debugging, fine-tune jobs when you cannot rent or wait on capacity.

  • Why it matters: ECC matters for week-long training runs where a single memory flip can kill convergence.

NVIDIA RTX 4090 and RTX 5090

The cost-controlled path for solo developers and small teams.

  • VRAM: 24 GB on the 4090, 32 GB GDDR7 on the 5090.

  • Best for: QLoRA on 7B to 34B, prototype work, low-volume inference.

  • Watch out for: no ECC memory, limited multi-GPU scaling, consumer drivers and power profiles. Ruled out of dense racks.

Training vs Fine-Tuning vs Inference

The four levers shift weight depending on the workload.

Training from scratch

  • Priority: VRAM, then interconnect, then FP8/FP4 throughput.

  • Recommended class: H100 SXM multi-node or B200.

  • Why: pre-training spends weeks at full saturation. Every percent of throughput compounds.

Fine-tuning

  • Priority: VRAM headroom, then memory bandwidth.

  • Recommended class: A100 80GB, H100 80GB or L40S, depending on model size.

  • Why: LoRA and QLoRA cut the memory floor by 4 to 16x. The bottleneck shifts to memory bandwidth, not compute.

Inference

  • Priority: memory bandwidth, latency and cost per token.

  • Recommended class: H100, L40S or L4.

  • Why: inference does not need a backward pass or optimizer state, so raw VRAM matters less. Throughput per dollar wins.

Once a model is trained, distributing it to inference endpoints often runs over a content delivery network to keep first-byte latency low for global users.

What Does a Training Run Actually Cost?

Hourly rates lie. The number that matters is cost per finished run, which combines GPU rate, training time and idle waste.

Hourly GPU rates (cloud)

On-demand pricing varies 3 to 4x across providers for the same SKU:

  • H100 80GB: $2 to $7 per GPU per hour on most providers. Some specialist clouds list H100 as low as $1.38 per hour, while hyperscalers list closer to $7.

  • A100 80GB: $1.30 to $3.90 per hour, with A100 at $1.49 per hour on smaller GPU clouds.

  • B200: rare on hyperscalers, $5 to $9 per hour on specialist clouds.

  • L40S: $0.80 to $1.50 per hour.

  • RTX 4090: $0.30 to $0.70 per hour on community clouds.

Cost per finished run

The hourly rate only tells you what the GPU bills. What you actually pay is rate multiplied by training time. The nanoGPT benchmark exposed the gap:

  • A100: finished the run at 0.8 cents.

  • H100 PCIe: finished at 2 cents.

  • H100 SXM: finished at 1.6 cents, cheaper than the PCIe variant because faster training more than offset the hourly rate, but still twice the A100's cost.

  • RTX 6000: looked cheap on hourly rate but came in at roughly 5x the total cost because the run stretched out.

Two lessons:

  • A newer GPU on a smaller workload is often more expensive per finished run, not less.

  • Within a generation, the higher interconnect variant usually wins on total cost.
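The arithmetic behind cost per finished run is simple enough to sketch. The function below multiplies rate by billed hours, with an optional idle fraction that inflates billed time. All numbers in the example are hypothetical, chosen only to show how a cheaper hourly rate can win even when the run takes longer.

```python
def cost_per_run(hourly_rate: float, hours: float,
                 n_gpus: int = 1, idle_fraction: float = 0.0) -> float:
    """Dollars billed for one finished run.

    `hours` is productive training time. `idle_fraction` is the share of
    billed time in which the GPUs sat idle, so billed time is
    hours / (1 - idle_fraction).
    """
    if not 0.0 <= idle_fraction < 1.0:
        raise ValueError("idle_fraction must be in [0, 1)")
    billed_hours = hours / (1.0 - idle_fraction)
    return hourly_rate * billed_hours * n_gpus


# Hypothetical small workload: the newer GPU is faster, but not by enough.
older = cost_per_run(hourly_rate=1.49, hours=10)
newer = cost_per_run(hourly_rate=3.00, hours=7)
print(f"older GPU: ${older:.2f}   newer GPU: ${newer:.2f}")
```

The newer card only wins when its speedup exceeds its price premium, which is the saturation question in numeric form.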

Hidden costs

The hourly GPU bill is roughly 60 to 70 percent of the actual training cost. The rest sits in:

  • Checkpoint storage: $0.10 to $0.30 per GB per month, multiplied by every saved checkpoint.

  • Cross-region data transfer: $0.02 to $0.09 per GB egress.

  • Idle GPUs: AI teams routinely waste 30 to 50% of GPU budgets on idle capacity, including weekend Jupyter notebooks.

  • Failed runs: an OOM at hour 200 is a full restart.
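A quick way to budget for these extras is to scale the GPU bill by its assumed share of total cost and price checkpoint storage separately. The sketch below does both. The function names, the 65 percent default share and the example checkpoint sizes are assumptions for illustration, not measured values.

```python
def total_from_gpu_bill(gpu_bill: float, gpu_share: float = 0.65) -> float:
    """Estimate total training cost when the GPU bill is `gpu_share` of it."""
    if not 0.0 < gpu_share <= 1.0:
        raise ValueError("gpu_share must be in (0, 1]")
    return gpu_bill / gpu_share


def checkpoint_storage_cost(n_checkpoints: int, size_gb: float,
                            usd_per_gb_month: float, months: float = 1.0) -> float:
    """Storage cost for keeping every checkpoint for `months`."""
    return n_checkpoints * size_gb * usd_per_gb_month * months


gpu_bill = 10_000.0  # hypothetical GPU invoice in dollars
low = total_from_gpu_bill(gpu_bill, 0.70)
high = total_from_gpu_bill(gpu_bill, 0.60)
print(f"estimated total: ${low:,.0f} to ${high:,.0f}")

# Hypothetical: 20 checkpoints of 14 GB each at $0.10 per GB per month.
print(f"checkpoints: ${checkpoint_storage_cost(20, 14.0, 0.10):.2f} per month")
```

Pruning old checkpoints and shutting down idle instances are the two levers here that need no change to the training code.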

To turn this into a real budget, you need GPU spend broken out by team and job, which is where showback and chargeback replace guesswork. Normalizing GPU cost data across AWS, GCP and Azure into a single shape is what the FOCUS open cost and usage specification standardizes. Once the picture is clear, the Amnic GPU cost optimization guide covers the playbook for cutting that spend.

Cloud vs On-Prem vs Hybrid

Cloud

Best for spiky training, experimentation and short runs.

On-prem

Pays back when GPU utilization stays high over multi-year horizons.

  • Capex: an 8x H100 server lands at $300K to $400K.

  • Lead time: weeks to months on Hopper, longer on Blackwell.

  • Depreciation: accelerates as each new generation lands.

  • Power: 700 W per H100 SXM means dense racks need liquid cooling.

Hybrid

Steady-state training on owned hardware, burst to cloud for peak demand.

  • A hybrid cloud setup works when cost model, networking and orchestration line up.

  • Managed training stacks shorten the path for teams that do not want to operate clusters, which fits the platform-as-a-service model.

  • The break-even point is usually 60 percent sustained utilization. Below that, cloud wins. Above it, on-prem and hybrid catch up.
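The break-even utilization can be estimated by comparing the annualized cost of owned hardware with what the same GPUs would cost in the cloud if rented around the clock. The sketch below assumes straight-line depreciation and leaves power, cooling and staff at zero unless supplied, so the real break-even sits higher. The 3-year horizon and the $3 per hour cloud rate are illustrative assumptions.

```python
HOURS_PER_YEAR = 8760


def break_even_utilization(capex: float, years: float, n_gpus: int,
                           cloud_rate: float, annual_opex: float = 0.0) -> float:
    """Utilization at which owning costs the same as renting.

    Owned cost per year is capex / years + annual_opex. Cloud cost per
    year at utilization u is u * cloud_rate * 8760 * n_gpus.
    """
    owned_per_year = capex / years + annual_opex
    cloud_full_year = cloud_rate * HOURS_PER_YEAR * n_gpus
    return owned_per_year / cloud_full_year


# 8x H100 server at $350K (midpoint of the range above), hypothetical terms.
u = break_even_utilization(capex=350_000, years=3, n_gpus=8, cloud_rate=3.0)
print(f"break-even utilization: {u:.0%}")
```

A lower cloud rate or a non-zero `annual_opex` pushes the break-even up, which is why cheap specialist clouds weaken the on-prem case.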

Conclusion

The right GPU for AI training is the smallest tier where the model fits in VRAM with headroom, the Tensor Cores saturate at the precision you train in and the interconnect keeps gradients flowing. Everything past that is wasted dollars. Treat GPU choice as a cost decision, not a hardware decision, and build it into the same operating model that owns FinOps and AI cost intelligence.

FAQs

Which GPU is best for AI training?

H100 SXM and H200 lead enterprise transformer training. B200 leads frontier pre-training. For LoRA and QLoRA fine-tuning, A100 80GB, L40S or RTX 5090 are the cost-controlled picks. The right answer depends on model size, training method and target cost per finished run.

How much VRAM do I need for AI training?

About 16 GB per 1B parameters for full FP16 fine-tuning, roughly 0.5 GB per 1B for QLoRA. A 7B QLoRA fits in 12 GB. A 70B QLoRA needs around 88 GB. Full fine-tuning a 70B model needs 700+ GB across multiple GPUs.

Is H100 always better than A100?

No. On small or unsaturated workloads A100 often wins on cost per finished run because H100 never fills its Tensor Cores. For 70B pre-training and FP8 transformer training, H100 is roughly 3x more cost-efficient than A100.

What does it cost to train a model on H100?

H100 80GB on-demand sits between $2 and $7 per GPU per hour. Pre-training a 7B model from scratch lands in the $50K to $500K range. A 70B+ pre-training run can hit $1M to $5M. Spot pricing cuts the GPU bill by 60 to 91 percent on AWS if restarts are tolerable.

Can I train AI models on a consumer GPU like the RTX 4090?

Yes for QLoRA fine-tuning up to 30B with quantization and for full fine-tuning of 7B with offloading. Pre-training from scratch needs data-center GPUs with NVLink. Consumer cards skip ECC memory and have limited multi-GPU scaling.

Is cloud GPU cheaper than buying one?

Cloud wins for spiky, experimental and short workloads. On-prem wins when GPU utilization stays high over multi-year horizons. Specialist GPU clouds beat hyperscalers by 60 to 85 percent on the same SKU.

What is the cheapest GPU that handles real AI training?

RTX 4090 24GB handles QLoRA on 7B and 13B. RTX 5090 32GB stretches to 34B QLoRA. L40S 48GB covers mid-scale fine-tuning. Below 24 GB VRAM you are limited to tiny models and prototypes.

Do I need NVIDIA, or does AMD MI300X work for training?

MI300X delivers 192 GB HBM3 and works for teams whose stack runs on ROCm. PyTorch and JAX both ship ROCm builds. CUDA still has the largest ecosystem, so NVIDIA remains the default for most teams.
