
Cold start times: Runpod Serverless vs Modal vs Replicate

Three serverless GPU platforms, 1,000 cold-start invocations each, the same Llama-3 8B container.

Tobias · 8 min read
  • gpu
  • comparison
  • serverless
  • runpod
  • modal
  • replicate

Cold start is the difference between “we can run inference serverlessly” and “we can’t.” We benchmarked the same container — quantised Llama-3 8B, 4.2GB image — across three serverless GPU platforms.
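The harness is straightforward: hit each endpoint after it has scaled to zero, time the round trip, and repeat. Here's a minimal sketch of the measurement loop; the endpoint URL and the scale-to-zero wait are placeholders, and the real runs obviously added auth headers and retries per platform.

```python
import time
import statistics
import urllib.request

def cold_start_latency(url: str) -> float:
    """Time one request against an endpoint assumed to have scaled to zero,
    so the measured latency is dominated by cold start."""
    start = time.monotonic()
    with urllib.request.urlopen(url) as resp:
        resp.read()
    return time.monotonic() - start

def summarize(samples: list[float]) -> dict[str, float]:
    """Compute the p50/p99 figures quoted below."""
    qs = statistics.quantiles(samples, n=100)  # 99 cut points
    return {"p50": qs[49], "p99": qs[98]}

if __name__ == "__main__":
    # Hypothetical endpoint; COOLDOWN must exceed the platform's
    # scale-to-zero window or you're measuring warm starts.
    ENDPOINT = "https://example.invalid/llama3-8b/infer"
    COOLDOWN = 600  # seconds
    samples = []
    for _ in range(1000):
        samples.append(cold_start_latency(ENDPOINT))
        time.sleep(COOLDOWN)
    print(summarize(samples))
```

The sleep between invocations is what makes every sample a cold start; skip it and the percentiles collapse to warm-path latency.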

The numbers

  • Runpod Serverless: p50 2.4s, p99 2.8s
  • Modal: p50 3.1s, p99 4.6s
  • Replicate: p50 7.8s, p99 22.4s (yes, really)

Why the spread

Runpod warms the host's image cache aggressively. Modal does too, but its scheduler is more willing to evict. Replicate optimises for cost over latency by default: it does offer a tier whose cold starts match the others, but you have to opt in.

If cold start is your headline metric, Runpod Serverless is the pick. If you want a Python-native SDK with deployment idioms baked in, Modal is worth the extra second.