Self-hosting Llama 3 70B: cheapest providers per million tokens
Eight providers, the same prompt distribution, ranked by cost per million output tokens.
Tags: guide · gpu · llama · inference
If you’re running Llama 3 70B yourself rather than paying a managed API, the question is which provider gets you the lowest cost per million tokens. We measured.
The setup
A single H100 80GB where the provider offers one (alternate hardware noted in the ranking), vLLM, batch size tuned per host (shared hosts vary in contention), and 4-bit AWQ quantisation. The prompt mix was derived from a real production traffic sample: ~512 input tokens and ~256 output tokens on average.
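For concreteness, here's a minimal sketch of that configuration using vLLM's offline `LLM` entry point. The checkpoint name is a placeholder (use whichever 4-bit AWQ Llama 3 70B weights you trust), and `max_num_seqs` is the knob we mean by "batch size tuned per host":

```python
from vllm import LLM, SamplingParams

# Sketch of the benchmark configuration. The model ID is a placeholder,
# not the exact checkpoint we benchmarked.
llm = LLM(
    model="your-org/llama-3-70b-instruct-awq",  # placeholder AWQ checkpoint
    quantization="awq",
    dtype="float16",
    max_model_len=4096,           # comfortably fits the ~512-in / ~256-out mix
    gpu_memory_utilization=0.92,  # fraction of the 80GB vLLM may claim for weights + KV cache
    max_num_seqs=64,              # "batch size tuned per host": raise or lower per machine
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain KV caching in two sentences."], params)
print(outputs[0].outputs[0].text)
```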
The ranking
All figures are cost per million output tokens; a sketch of how an hourly price converts into that number follows the list.
- Runpod Community Cloud H100: $0.41 / 1M output tokens
- Vast.ai 3× RTX 4090 (tensor parallel): $0.49
- Hetzner GPU box (RTX 6000 Ada): $0.62
- Lambda H100 on-demand: $0.71
- Modal Serverless H100: $0.84
- Paperspace A100 80GB: $1.04
- AWS p5 H100 on-demand: $1.83
- Replicate (managed): $2.40+ depending on quota
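Each figure above comes from a single conversion: effective hourly price divided by sustained output throughput. A sketch with illustrative inputs (these are not the measured numbers behind the table, though they happen to land near the Runpod figure):

```python
def usd_per_million_output_tokens(usd_per_hour: float, output_tok_per_s: float) -> float:
    """Convert an hourly GPU price and sustained output throughput
    into cost per million output tokens."""
    tokens_per_hour = output_tok_per_s * 3600
    return usd_per_hour / tokens_per_hour * 1_000_000

# Illustrative: a $2.79/hr GPU sustaining ~1,900 output tok/s across the batch
print(f"${usd_per_million_output_tokens(2.79, 1900):.2f} per 1M output tokens")  # ~$0.41
```

Input tokens cost compute too, but since the prompt mix is fixed (~512 in / ~256 out), folding everything into cost per output token keeps the comparison fair across providers.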
The takeaway: rolling your own on Runpod is roughly 6× cheaper than the cheapest managed option here ($0.41 vs $2.40+) at meaningful volume. Below ~50M tokens/month it’s probably not worth your engineering time; above that, it’s an obvious move.
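To see where that ~50M tokens/month threshold comes from, treat break-even as the volume at which gross savings cover your self-hosting overhead. The overhead figure below is an assumption, not something we measured:

```python
RUNPOD = 0.41     # $/1M output tokens, from the ranking above
MANAGED = 2.40    # $/1M output tokens, the cheapest managed floor above
OVERHEAD = 100.0  # assumed $/month of amortised ops/engineering time

for millions in (10, 50, 100, 500):
    net = (MANAGED - RUNPOD) * millions - OVERHEAD
    print(f"{millions:>4}M tok/mo: net ${net:,.2f}/mo")
```

At these prices, ~50M tokens/month corresponds to roughly $100/month of amortised overhead; if you value the engineering time higher, the break-even volume scales up proportionally.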