Runpod Serverless deep-dive: cold starts, queueing, billing edges
We pushed Runpod Serverless to its limits over a 30-day production deployment. The good and the gotchas.
- gpu
- review
- runpod
- serverless
We covered Runpod broadly in our main review. This post is the deep-dive on Serverless specifically, after running it as the inference backend for a real product for 30 days.
Cold start, again
Already covered in our dedicated cold-start test: 2.5 s p50, 2.8 s p99. Those numbers held up in production too. Worth stressing again: cold start is the headline number for serverless GPU, and Runpod's is honest.
Queueing
When concurrent requests exceed the number of warm workers, Runpod queues them. The queue is per-endpoint and visible in the dashboard. Median queue depth across the test was 0.4 requests; the maximum was 14, during a traffic spike.
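You don't have to watch the dashboard to track this. Queue depth is also exposed over the endpoint API. A minimal sketch, assuming your endpoint ID and API key live in environment variables (names here are our own choice); the `/health` route and its `jobs.inQueue` / `workers` fields match Runpod's docs at the time of writing, but verify against the current API reference before wiring this into alerting:

```python
import os

import requests

# Hypothetical env-var names; substitute your own endpoint ID and API key.
ENDPOINT_ID = os.environ["RUNPOD_ENDPOINT_ID"]
API_KEY = os.environ["RUNPOD_API_KEY"]


def queue_depth() -> dict:
    """Fetch queue and worker counts for one Serverless endpoint.

    Field names (jobs.inQueue, workers.idle, workers.running) follow
    Runpod's documented /health response; double-check them against
    the current API reference before relying on this.
    """
    resp = requests.get(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health",
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()
    return {
        "in_queue": data["jobs"]["inQueue"],
        "workers_idle": data["workers"]["idle"],
        "workers_running": data["workers"]["running"],
    }


if __name__ == "__main__":
    print(queue_depth())
```

Polling this once a minute and alerting when `in_queue` stays above zero is how we caught the spike that hit a depth of 14.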
Billing edges
Two gotchas worth knowing about:
- Idle workers cost money. If you set min_workers > 0, you pay for every second those workers are alive, whether or not they're serving traffic. Default to min_workers = 0 unless you have hard latency requirements (see the cost sketch after this list).
- Cold starts bill from container start. Your "free" cold start period is bundled into your first request. This isn't a scam; it's how serverless GPU has to work. But the billing dashboard rounds aggressively in Runpod's favour.
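To make the first gotcha concrete, here's the back-of-envelope arithmetic. The per-second rate below is a placeholder, not a quoted Runpod price; plug in the rate your endpoint's pricing page actually shows:

```python
# Back-of-envelope idle-worker cost. The rate is a PLACEHOLDER for
# illustration, not a quoted Runpod price; use your endpoint's real rate.
PRICE_PER_WORKER_SECOND = 0.00044  # hypothetical $/s


def idle_cost(min_workers: int, hours: float,
              rate: float = PRICE_PER_WORKER_SECOND) -> float:
    """Dollars spent keeping min_workers alive for `hours`, serving zero traffic."""
    return min_workers * hours * 3600 * rate


# One always-warm worker for a 30-day month, at the placeholder rate:
print(f"${idle_cost(min_workers=1, hours=30 * 24):,.2f}")  # ≈ $1,140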
Verdict
8.0. If we were starting an LLM-backed product today, this is where we’d put the inference.
Related
- Runpod review: bare-metal H100s without the enterprise tax
- Runpod Serverless Cold Starts: A Thousand Invocations, Three Weeks Later
- Runpod Bare-Metal vs Serverless: Llama 3 8B Cost and Latency