Code Intermediate medium · 6 min

Cloud GPU cost comparison: A100 vs H100 vs consumer alternatives

What you will learn

Calculate real fine-tuning costs across GPU options to match your budget and training timeline.

Why this matters

Fine-tuning costs can spiral from $50 to $5,000+ per run depending on GPU choice and token volume. A developer who picks the wrong hardware either overspends or waits days for a run that could have finished in hours. This lesson teaches you to model costs before committing.

Skip if: You're fine-tuning on a single GPU you already own, or your organization has a fixed cloud budget and no choice of hardware. Also skip if you're only doing inference (the cost dynamics are different).

Explanation

What it is: A practical cost model that estimates how much you'll spend to fine-tune an LLM on different cloud GPUs. The model accounts for compute cost (hourly rate × runtime), memory requirements (which decide whether the job fits on a given GPU and how large a batch you can run), and total tokens processed.

How it works mechanically: You specify your dataset size (tokens), model size (parameters), batch size, and GPU hourly rates. The code calculates: (1) whether the model, batch, and optimizer state fit in the GPU's memory, (2) training time from the GPU's tokens/second throughput, (3) total cost by multiplying hours × hourly rate. Different GPUs have different throughput, so a job that runs in 2 hours on an H100 takes 5 hours on an A100 at the throughputs assumed in the code below (45,000 vs 18,000 tokens/s), changing the total cost despite similar hourly rates.
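The time and cost arithmetic described above fits in one small function. This is a minimal sketch rather than the lesson's full calculator: `run_cost` is a hypothetical helper name, and the rates and throughputs passed in are the assumed H100 and A100 80GB figures used in this lesson, not measured benchmarks.

```python
from typing import Tuple


def run_cost(total_tokens: int, tokens_per_second: float,
             hourly_rate_usd: float, num_epochs: int = 1) -> Tuple[float, float]:
    """Return (hours, cost_usd) for one fine-tuning run."""
    seconds = total_tokens * num_epochs / tokens_per_second
    hours = seconds / 3600
    return hours, hours * hourly_rate_usd


# Assumed figures from this lesson's GPU table (not benchmarks).
h100_hours, h100_cost = run_cost(10_000_000, 45_000, 2.89)
a100_hours, a100_cost = run_cost(10_000_000, 18_000, 2.48)

print(f"H100:      {h100_hours:.3f} h, ${h100_cost:.2f}")
print(f"A100 80GB: {a100_hours:.3f} h, ${a100_cost:.2f}")
print(f"A100 takes {a100_hours / h100_hours:.1f}x as long as the H100")
```

Because time scales with 1 / throughput and cost with rate / throughput, the GPU with the higher hourly rate can still come out cheaper in total.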

When to use it: Before launching any production fine-tuning job, especially on a team budget or when testing multiple models. Run the calculation for 2–3 GPU options to make an informed decision. Also use it to check whether a smaller batch size would let the job fit on a cheaper GPU (at the price of slower training), or whether LoRA, which trains far fewer parameters, would cut the cost.

Analogy

Choosing GPU hardware is like choosing courier services for shipping: DHL (H100) costs $200/hour but delivers in 2 hours ($400 total). FedEx (A100) costs $150/hour but takes 6 hours ($900 total). UPS Ground (RTX 4090) costs $30/hour but takes 20 hours ($600 total). You need the math to know which is actually cheapest for your package size.

Code

python

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class GPUSpec:
    name: str
    hourly_rate_usd: float
    memory_gb: int
    tokens_per_second: float

@dataclass
class TrainingConfig:
    total_tokens: int
    model_params_billions: float
    batch_size: int
    learning_rate: float = 2e-5
    num_epochs: int = 1

class FinetuningCostCalculator:
    def __init__(self, config: TrainingConfig):
        self.config = config
        self.gpu_specs = {
            'h100_80gb': GPUSpec(
                name='NVIDIA H100 80GB',
                hourly_rate_usd=2.89,
                memory_gb=80,
                tokens_per_second=45000
            ),
            'a100_80gb': GPUSpec(
                name='NVIDIA A100 80GB',
                hourly_rate_usd=2.48,
                memory_gb=80,
                tokens_per_second=18000
            ),
            'a100_40gb': GPUSpec(
                name='NVIDIA A100 40GB',
                hourly_rate_usd=1.45,
                memory_gb=40,
                tokens_per_second=18000
            ),
            'rtx_4090': GPUSpec(
                name='RTX 4090 (Consumer)',
                hourly_rate_usd=0.50,
                memory_gb=24,
                tokens_per_second=3000
            ),
            'l4': GPUSpec(
                name='Google L4 (Budget Cloud)',
                hourly_rate_usd=0.35,
                memory_gb=24,
                tokens_per_second=2000
            )
        }
    
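    def estimate_memory_required(self) -> float:
        # Rough heuristic, in GB: 4 bytes per parameter for the model weights
        model_memory = self.config.model_params_billions * 4
        # Batch-dependent memory, scaled by a heuristic constant
        batch_memory = (self.config.batch_size * self.config.model_params_billions * 12) / 1024
        # Optimizer state is assumed to be the same size as the weights
        optimizer_memory = model_memory
        return model_memory + batch_memory + optimizer_memory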
    
    def calculate_cost(self, gpu_key: str) -> Dict[str, object]:
        gpu = self.gpu_specs[gpu_key]
        
        if self.estimate_memory_required() > gpu.memory_gb:
            return {
                'gpu': gpu.name,
                'status': 'out_of_memory',
                'required_gb': round(self.estimate_memory_required(), 1),
                'available_gb': gpu.memory_gb,
                'cost_usd': None
            }
        
        tokens_per_epoch = self.config.total_tokens
        epochs_needed = self.config.num_epochs
        total_tokens_to_process = tokens_per_epoch * epochs_needed
        
        seconds_needed = total_tokens_to_process / gpu.tokens_per_second
        hours_needed = seconds_needed / 3600
        total_cost = hours_needed * gpu.hourly_rate_usd
        
        return {
            'gpu': gpu.name,
            'status': 'feasible',
            'memory_required_gb': round(self.estimate_memory_required(), 1),
            'memory_available_gb': gpu.memory_gb,
            'hours_needed': round(hours_needed, 2),
            'cost_usd': round(total_cost, 2),
            'cost_per_million_tokens': round((total_cost / (total_tokens_to_process / 1e6)), 2)
        }
    
    def compare_all(self) -> List[Dict]:
        # Evaluate every GPU, then keep only those where the job fits in memory
        results = [self.calculate_cost(gpu_key) for gpu_key in self.gpu_specs]
        feasible = [r for r in results if r['status'] == 'feasible']
        # Cheapest total cost first
        return sorted(feasible, key=lambda x: x['cost_usd'])

training_config = TrainingConfig(
    total_tokens=10_000_000,
    model_params_billions=7,
    batch_size=4,
    num_epochs=1
)

calculator = FinetuningCostCalculator(training_config)
results = calculator.compare_all()

print('Fine-tuning Cost Comparison')
print(f'Dataset: {training_config.total_tokens / 1e6:.1f}M tokens | Model: {training_config.model_params_billions}B params | Batch: {training_config.batch_size}')
print()
for result in results:
    print(f"{result['gpu']}")
    print(f"  Memory: {result['memory_required_gb']}GB required / {result['memory_available_gb']}GB available")
    print(f"  Training time: {result['hours_needed']} hours")
    print(f"  Total cost: ${result['cost_usd']}")
    print(f"  Cost per M tokens: ${result['cost_per_million_tokens']}")
    print()

Output

Fine-tuning Cost Comparison
Dataset: 10.0M tokens | Model: 7B params | Batch: 4

NVIDIA H100 80GB
  Memory: 56.3GB required / 80GB available
  Training time: 0.06 hours
  Total cost: $0.18
  Cost per M tokens: $0.02

NVIDIA A100 80GB
  Memory: 56.3GB required / 80GB available
  Training time: 0.15 hours
  Total cost: $0.38
  Cost per M tokens: $0.04

What just happened?

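The code defined GPU specifications (memory, throughput, hourly rates) and a training configuration. For each GPU, it calculated: (1) whether the model, batch, and optimizer state fit in memory, (2) how long training takes at that GPU's token throughput, and (3) total cost as hours × hourly rate. GPUs that fail the memory check are marked out_of_memory and left out of the comparison; the rest are printed sorted by total cost, cheapest first. The H100 has the highest hourly rate but the lowest total cost, because it processes tokens 2.5× faster than the A100. The L4 and RTX 4090 have the lowest hourly rates, but at the assumed throughputs they are 15–22.5× slower than the H100, so each token they process costs more.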

Common gotcha

Developers often pick the lowest hourly rate (L4 at $0.35/hr) without accounting for throughput. A slow GPU costs less per hour but needs many more hours: at the throughputs assumed here, the L4 takes 22.5× as long as the H100, and the extra wall-clock time also slows down dev iteration. Always compare total_cost, not hourly_rate. Memory size matters too: an A100 40GB may force smaller batches or gradient accumulation steps, adding hidden time overhead.
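To see the gotcha in numbers, the sketch below ranks the same GPUs twice: once by hourly rate and once by cost per million tokens. The rates and throughputs are the assumed values from this lesson's table, `usd_per_million_tokens` is a hypothetical helper name, and memory limits are deliberately ignored here.

```python
# Assumed (hourly_rate_usd, tokens_per_second) pairs from this lesson's table.
gpus = {
    "h100_80gb": (2.89, 45_000),
    "a100_80gb": (2.48, 18_000),
    "a100_40gb": (1.45, 18_000),
    "rtx_4090": (0.50, 3_000),
    "l4": (0.35, 2_000),
}


def usd_per_million_tokens(hourly_rate_usd: float, tokens_per_second: float) -> float:
    """Cost to process one million tokens, ignoring memory limits."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000


by_hourly_rate = sorted(gpus, key=lambda k: gpus[k][0])
by_cost_per_token = sorted(gpus, key=lambda k: usd_per_million_tokens(*gpus[k]))

print("Cheapest per hour first: ", by_hourly_rate)
print("Cheapest per token first:", by_cost_per_token)
```

The two orderings disagree: the L4 leads the first list and trails the second, which is why the hourly rate alone is a poor guide.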

Error recovery

KeyError on gpu_specs

You referenced a GPU key that doesn't exist (e.g., 'h100' instead of 'h100_80gb'). Check the exact key names in the gpu_specs dictionary.
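One way to make this mistake easier to spot is to validate the key before the lookup and suggest near matches. This is a sketch only: `resolve_gpu_key` and `KNOWN_GPU_KEYS` are hypothetical names, not part of the lesson's calculator, and the key list is copied by hand from gpu_specs.

```python
import difflib

# Keys copied from the lesson's gpu_specs dictionary.
KNOWN_GPU_KEYS = ["h100_80gb", "a100_80gb", "a100_40gb", "rtx_4090", "l4"]


def resolve_gpu_key(key: str) -> str:
    """Return key if it is known; otherwise raise KeyError with suggestions."""
    if key in KNOWN_GPU_KEYS:
        return key
    suggestions = difflib.get_close_matches(key, KNOWN_GPU_KEYS, n=3, cutoff=0.5)
    hint = f" Did you mean {suggestions}?" if suggestions else ""
    raise KeyError(f"Unknown GPU key {key!r}.{hint} Valid keys: {KNOWN_GPU_KEYS}")


try:
    resolve_gpu_key("h100")
except KeyError as err:
    message = err.args[0]
    print(message)
```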

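out_of_memory status (simulated)

No exception is raised: calculate_cost returns status 'out_of_memory' when required_gb > available_gb, and compare_all silently omits that GPU. Reduce batch_size, use gradient accumulation, or switch to a GPU with more memory. LoRA also helps, because only a small set of adapter weights needs optimizer state, but the frozen base weights still have to fit on the GPU; lowering model_params_billions in the estimate is only a crude stand-in for it.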
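How much memory a LoRA-style run saves depends on what still has to be loaded. The sketch below reuses this lesson's memory heuristic and adds a hypothetical `trainable_fraction` parameter (an assumption for illustration, not a PEFT setting) that scales only the optimizer term, leaving the base weights untouched.

```python
def estimate_memory_gb(model_params_billions: float, batch_size: int,
                       trainable_fraction: float = 1.0) -> float:
    """Rough memory estimate in GB, following this lesson's heuristic.

    trainable_fraction is a hypothetical knob: 1.0 means full fine-tuning;
    a small value such as 0.01 stands in for a LoRA-style run where only
    the adapter weights carry optimizer state.
    """
    model_memory = model_params_billions * 4  # weights are always loaded
    batch_memory = batch_size * model_params_billions * 12 / 1024
    optimizer_memory = model_memory * trainable_fraction
    return model_memory + batch_memory + optimizer_memory


full = estimate_memory_gb(7, batch_size=4)
lora_like = estimate_memory_gb(7, batch_size=4, trainable_fraction=0.01)
print(f"Full fine-tune: {full:.1f} GB")
print(f"LoRA-like run:  {lora_like:.1f} GB")
```

Under these assumptions the LoRA-like run drops below 40 GB but still exceeds 24 GB, because the weights themselves dominate. Loading the weights in lower precision would shrink that term further, which this heuristic does not model.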

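TypeError when total_tokens is a string

The dataclass does not validate types, so a string such as '10_000_000' is accepted at construction and only fails later, at the division inside calculate_cost. Pass total_tokens as an integer (10_000_000); a float also works. Python's underscore syntax makes large numbers readable.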
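If you would rather fail fast with a clear message, a dataclass can check its own fields in `__post_init__`. The class below is a hypothetical, stricter variant of TrainingConfig written for illustration; it is not the lesson's class, and it rejects floats as well as strings.

```python
from dataclasses import dataclass


@dataclass
class ValidatedTrainingConfig:
    """Hypothetical variant of TrainingConfig that rejects bad inputs early."""
    total_tokens: int
    model_params_billions: float
    batch_size: int
    num_epochs: int = 1

    def __post_init__(self) -> None:
        for name in ("total_tokens", "batch_size", "num_epochs"):
            value = getattr(self, name)
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                raise TypeError(f"{name} must be int, got {type(value).__name__}")
            if value <= 0:
                raise ValueError(f"{name} must be positive, got {value}")


try:
    ValidatedTrainingConfig(total_tokens="10_000_000",
                            model_params_billions=7, batch_size=4)
except TypeError as err:
    error_message = str(err)
    print(error_message)
```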

Experienced dev note

Every team has one person who spends $15K fine-tuning on A100s when a $200 RTX 4090 fine-tune would have given the same result. The gotcha isn't only GPU choice: it's that inference workload ≠ training workload. A slow GPU is often fine for training (batch processing), but terrible for production inference (latency). Separate your thinking: use cheap GPUs for fine-tuning experiments, but H100 or better for inference benchmarking. Also, always include a 20% buffer in your cost estimates; actual training takes 15–25% longer due to checkpoint saves, eval steps, and data loading.

Check your understanding

If you reduced your dataset from 10M tokens to 5M tokens, would the H100 cost exactly half as much? Why or why not?

Show answer hint

A correct answer explains that total cost is hours × rate. Halving the tokens halves the hours, so in this model the cost halves too (apart from rounding in the printed figures). However, there's a caveat: short runs can finish in well under an hour on an H100, and some cloud providers bill in 1-hour or 10-minute increments, so you might still pay for a full increment. The principle is right, but billing details matter.
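The billing caveat can be sketched by rounding usage up to the provider's increment. `billed_cost` is a hypothetical helper, the increment sizes are examples rather than any specific provider's policy, and the H100 rate and throughput are this lesson's assumed values.

```python
import math


def billed_cost(hours_needed: float, hourly_rate_usd: float,
                increment_minutes: int = 60) -> float:
    """Cost when usage is rounded up to whole billing increments."""
    increments = math.ceil(hours_needed * 60 / increment_minutes)
    return increments * increment_minutes / 60 * hourly_rate_usd


H100_RATE = 2.89            # assumed hourly rate from this lesson
H100_TOKENS_PER_S = 45_000  # assumed throughput from this lesson

for tokens in (10_000_000, 5_000_000):
    hours = tokens / H100_TOKENS_PER_S / 3600
    print(f"{tokens:>10,} tokens: "
          f"ideal ${hours * H100_RATE:.2f}, "
          f"hourly billing ${billed_cost(hours, H100_RATE):.2f}, "
          f"10-minute billing ${billed_cost(hours, H100_RATE, 10):.2f}")
```

With these assumptions both runs finish within the first increment, so the billed cost is identical even though the ideal cost halves.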

VERSION The calculator uses only the Python standard library, so library upgrades do not affect it; it works alongside transformers 5.5.x, trl 1.x, and peft 0.11.x. The throughput figures are assumptions for a standard SFTTrainer run with no custom optimizations (Flash Attention, SDPA); enabling those could speed up training by 10–20%.

Next, learn how to reduce memory footprint with LoRA and quantization so you can fine-tune larger models on cheaper GPUs.
