Cloud GPU cost comparison: A100 vs H100 vs consumer alternatives
Why this matters
Fine-tuning costs can spiral from $50 to $5,000+ per run depending on GPU choice and token volume. A developer who picks the wrong hardware wastes budget or ships slow inference. This lesson shows you how to model costs before committing.
Explanation
What it is: A practical cost model that calculates how much you'll spend to fine-tune an LLM on different cloud GPUs. The model accounts for compute cost (hourly rate × runtime), memory requirements (determines batch size and speed), and total tokens processed.
How it works mechanically: You specify your dataset size (tokens), model size (parameters), batch size, and GPU hourly rates. The code calculates: (1) whether the estimated memory fits on each GPU, (2) training time from total tokens and tokens/second throughput, (3) total cost by multiplying hours × hourly rate. Different GPUs have different throughput, so the same fine-tuning job might run in 2 hours on an H100 but 5 hours on an A100 (the 45,000 vs 18,000 tokens/second figures used below imply a 2.5× gap), which changes the total cost even when hourly rates are similar.
When to use it: Before launching any production fine-tuning job, especially for teams or when testing multiple models. Run this calculation for 2–3 GPU options to make an informed decision. Also use it to estimate if you should reduce batch size (slower, cheaper) or use LoRA (drastically cheaper by reducing trainable parameters).
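To see why LoRA cuts cost so sharply, the sketch below counts trainable parameters for a single weight matrix with and without a low-rank adapter. The 4096 × 4096 shape and rank 8 are illustrative assumptions, not values from this lesson, and both helper functions are hypothetical names used only here.

```python
def full_trainable_params(d_out: int, d_in: int) -> int:
    """Parameters updated when the whole d_out x d_in matrix is trained."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in a LoRA adapter: two factors, (d_out x rank) and (rank x d_in)."""
    return rank * (d_out + d_in)


# Assumed example: one 4096 x 4096 projection matrix, LoRA rank 8.
full = full_trainable_params(4096, 4096)
lora = lora_trainable_params(4096, 4096, rank=8)
reduction = full / lora

print(f"full fine-tune: {full:,} trainable params")
print(f"LoRA adapter:   {lora:,} trainable params ({reduction:.0f}x fewer)")
```

Fewer trainable parameters mean less gradient and optimizer-state memory, which is what can let a job fit on a smaller GPU. The frozen base weights still have to be held in memory, so the saving applies to the training state rather than to the model itself.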
Analogy
Choosing GPU hardware is like choosing courier services for shipping: DHL (H100) costs $200/hour but delivers in 2 hours ($400 total). FedEx (A100) costs $150/hour but takes 6 hours ($900 total). UPS Ground (RTX 4090) costs $30/hour but takes 20 hours ($600 total). You need the math to know which is actually cheapest for your package size.
Code
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class GPUSpec:
    name: str
    hourly_rate_usd: float
    memory_gb: int
    tokens_per_second: float  # assumed sustained training throughput


@dataclass
class TrainingConfig:
    total_tokens: int
    model_params_billions: float
    batch_size: int
    learning_rate: float = 2e-5  # recorded for completeness; not used by the cost model
    num_epochs: int = 1


class FinetuningCostCalculator:
    def __init__(self, config: TrainingConfig):
        self.config = config
        # Example rates and throughputs; replace them with your provider's numbers.
        self.gpu_specs = {
            'h100_80gb': GPUSpec(
                name='NVIDIA H100 80GB',
                hourly_rate_usd=2.89,
                memory_gb=80,
                tokens_per_second=45000
            ),
            'a100_80gb': GPUSpec(
                name='NVIDIA A100 80GB',
                hourly_rate_usd=2.48,
                memory_gb=80,
                tokens_per_second=18000
            ),
            'a100_40gb': GPUSpec(
                name='NVIDIA A100 40GB',
                hourly_rate_usd=1.45,
                memory_gb=40,
                tokens_per_second=18000
            ),
            'rtx_4090': GPUSpec(
                name='RTX 4090 (Consumer)',
                hourly_rate_usd=0.50,
                memory_gb=24,
                tokens_per_second=3000
            ),
            'l4': GPUSpec(
                name='Google L4 (Budget Cloud)',
                hourly_rate_usd=0.35,
                memory_gb=24,
                tokens_per_second=2000
            )
        }

    def estimate_memory_required(self) -> float:
        """Rough memory estimate in GB: weights + batch activations + optimizer state."""
        model_memory = self.config.model_params_billions * 4  # 4 bytes per parameter
        batch_memory = (self.config.batch_size * self.config.model_params_billions * 12) / 1024
        optimizer_memory = model_memory  # heuristic: one extra copy of the weights
        return model_memory + batch_memory + optimizer_memory

    def calculate_cost(self, gpu_key: str) -> Dict[str, Any]:
        gpu = self.gpu_specs[gpu_key]
        required_gb = self.estimate_memory_required()

        # Bail out early if the job does not fit on this GPU.
        if required_gb > gpu.memory_gb:
            return {
                'gpu': gpu.name,
                'status': 'out_of_memory',
                'required_gb': round(required_gb, 1),
                'available_gb': gpu.memory_gb,
                'cost_usd': None
            }

        # Time is total tokens divided by throughput; cost is hours times rate.
        total_tokens_to_process = self.config.total_tokens * self.config.num_epochs
        seconds_needed = total_tokens_to_process / gpu.tokens_per_second
        hours_needed = seconds_needed / 3600
        total_cost = hours_needed * gpu.hourly_rate_usd

        return {
            'gpu': gpu.name,
            'status': 'feasible',
            'memory_required_gb': round(required_gb, 1),
            'memory_available_gb': gpu.memory_gb,
            'hours_needed': round(hours_needed, 2),
            'cost_usd': round(total_cost, 2),
            'cost_per_million_tokens': round(total_cost / (total_tokens_to_process / 1e6), 2)
        }

    def compare_all(self) -> List[Dict[str, Any]]:
        """Return only the GPUs where the job fits, cheapest total cost first."""
        results = [self.calculate_cost(gpu_key) for gpu_key in self.gpu_specs]
        feasible = [r for r in results if r['status'] == 'feasible']
        return sorted(feasible, key=lambda r: r['cost_usd'])


training_config = TrainingConfig(
    total_tokens=10_000_000,
    model_params_billions=7,
    batch_size=4,
    num_epochs=1
)

calculator = FinetuningCostCalculator(training_config)
results = calculator.compare_all()

print('Fine-tuning Cost Comparison')
print(f'Dataset: {training_config.total_tokens / 1e6:.1f}M tokens | Model: {training_config.model_params_billions}B params | Batch: {training_config.batch_size}')
print()

for result in results:
    print(result['gpu'])
    print(f" Memory: {result['memory_required_gb']}GB required / {result['memory_available_gb']}GB available")
    print(f" Training time: {result['hours_needed']} hours")
    print(f" Total cost: ${result['cost_usd']}")
    print(f" Cost per M tokens: ${result['cost_per_million_tokens']}")
    print()
Output
Fine-tuning Cost Comparison
Dataset: 10.0M tokens | Model: 7B params | Batch: 4

NVIDIA H100 80GB
 Memory: 56.3GB required / 80GB available
 Training time: 0.06 hours
 Total cost: $0.18
 Cost per M tokens: $0.02

NVIDIA A100 80GB
 Memory: 56.3GB required / 80GB available
 Training time: 0.15 hours
 Total cost: $0.38
 Cost per M tokens: $0.04

The A100 40GB, RTX 4090, and L4 do not appear because the estimated 56.3 GB exceeds their memory, so compare_all filters them out.
What just happened?
The code defined GPU specifications (memory, throughput, hourly rates) and a training configuration. For each GPU, it calculated: (1) whether the model + batch fits in memory, (2) how many seconds the training takes based on token throughput, (3) total cost by multiplying hours × hourly rate. It then printed only the feasible GPUs, sorted by total cost from cheapest to most expensive. The H100 is the fastest and the most expensive per hour, yet it comes out cheapest overall. The L4 and RTX 4090 have the lowest hourly rates, but their throughput in these specs is 15× to 22× lower, so the same job costs more on them whenever it fits at all.
Common gotcha
Developers often pick the cheapest hourly rate (L4 at $0.35/hr) without accounting for throughput. A slow GPU costs less per hour but, at the throughputs listed above, needs about 22 times as many hours as an H100 to process the same tokens, so it costs more in total and wastes money on extended cloud infrastructure overhead and dev iteration time. Always compare total_cost, not hourly_rate. Memory size matters too: an A100 40GB may force smaller batches or gradient accumulation steps, adding hidden time overhead.
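The gap between "cheapest per hour" and "cheapest per token" can be checked directly. The sketch below reuses the hourly rates and throughputs from this lesson's gpu_specs and ranks the GPUs both ways. It deliberately ignores the memory check, and `cost_per_million_tokens` is a helper name introduced just for this illustration.

```python
# (hourly_rate_usd, tokens_per_second) copied from the gpu_specs in this lesson.
GPUS = {
    "h100_80gb": (2.89, 45000),
    "a100_80gb": (2.48, 18000),
    "a100_40gb": (1.45, 18000),
    "rtx_4090": (0.50, 3000),
    "l4": (0.35, 2000),
}


def cost_per_million_tokens(hourly_rate: float, tokens_per_second: float) -> float:
    """Dollars needed to process one million tokens at a sustained throughput."""
    hours = 1_000_000 / tokens_per_second / 3600
    return hours * hourly_rate


# Rank once by hourly rate and once by what a token actually costs.
by_rate = sorted(GPUS, key=lambda k: GPUS[k][0])
by_cost = sorted(GPUS, key=lambda k: cost_per_million_tokens(*GPUS[k]))

print("Cheapest per hour: ", by_rate)
print("Cheapest per token:", by_cost)
```

With these figures the two rankings are nearly mirror images: the L4 leads on hourly rate and comes last on cost per token, while the H100 does the opposite.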
Error recovery
KeyError on gpu_specs
OutOfMemoryError simulated
TypeError: total_tokens must be int
Experienced dev note
Every team has one person who spends $15K fine-tuning on A100s when a $200 run on an RTX 4090 would have given a comparable result. By the throughput figures above the 4090 is slower, not faster; its appeal is that it is cheap enough for experiments that fit in 24 GB. The gotcha isn't GPU choice: it's that inference workload ≠ training workload. A slow GPU is fine for training (batch processing) but terrible for production inference (latency). Separate your thinking: use cheap GPUs for fine-tuning experiments, but an H100 or better for inference benchmarking. Also, always include a 20% buffer in your cost estimates; actual training takes 15–25% longer than the model predicts because of checkpoint saves, eval steps, and data loading.
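The buffer advice is simple to apply in code. The following is a minimal sketch, assuming the modeled cost is already known; `buffered_estimate` is a hypothetical helper and the $100 base cost is a made-up input, not a figure from this lesson.

```python
def buffered_estimate(base_cost_usd: float, buffer: float = 0.20) -> float:
    """Inflate a modeled cost by a fractional safety buffer (default 20%)."""
    if buffer < 0:
        raise ValueError("buffer must be non-negative")
    return round(base_cost_usd * (1 + buffer), 2)


# Hypothetical modeled cost of $100 for one run.
low = buffered_estimate(100.0, 0.15)   # optimistic end of the 15-25% range
default = buffered_estimate(100.0)     # the suggested 20% buffer
high = buffered_estimate(100.0, 0.25)  # pessimistic end of the range

print(f"budget between ${low} and ${high}, planning figure ${default}")
```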
Check your understanding
If you reduced your dataset from 10M tokens to 5M tokens, would the H100 cost exactly half as much? Why or why not?
Answer hint
A correct answer explains that total cost is hours × rate. Reducing tokens reduces hours linearly, so in this model the cost halves exactly. However, there's a caveat: smaller datasets may finish in under 1 hour on an H100, and some cloud providers bill in 1-hour or 10-minute increments, so you might still pay for a full increment. The principle is right, but implementation details matter.
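The billing-increment caveat can be sketched as a rounding step on top of the cost model. The $3.00/hour rate and the run lengths below are assumptions chosen for clean arithmetic, and `billed_cost` is a hypothetical helper; check how your own provider actually meters usage.

```python
import math


def billed_cost(hours_used: float, hourly_rate_usd: float,
                increment_minutes: int = 60) -> float:
    """Cost when usage is rounded up to the next whole billing increment."""
    if increment_minutes <= 0:
        raise ValueError("increment_minutes must be positive")
    minutes_used = hours_used * 60
    billed_minutes = math.ceil(minutes_used / increment_minutes) * increment_minutes
    return round(billed_minutes / 60 * hourly_rate_usd, 2)


# Assumed example: a 30-minute run, then the same job on half the data (15 minutes).
hourly_full = billed_cost(0.5, 3.00)       # billed as one full hour
hourly_half = billed_cost(0.25, 3.00)      # still one full hour
ten_min_full = billed_cost(0.5, 3.00, 10)  # 30 minutes billed
ten_min_half = billed_cost(0.25, 3.00, 10) # 15 minutes rounds up to 20

print(hourly_full, hourly_half)
print(ten_min_full, ten_min_half)
```

Under hourly billing the halved dataset costs the same as the full one, and under 10-minute billing it costs two thirds rather than half, which is the gap between the linear model and the invoice.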