
Runpod Serverless deep-dive: cold starts, queueing, billing edges

We pushed Runpod Serverless to its limits over a 30-day production deployment. The good and the gotchas.

Tobias · 11 min read · 8.0
  • gpu
  • review
  • runpod
  • serverless

We covered Runpod broadly in our main review. This post is the deep-dive on Serverless specifically, after running it as the inference backend for a real product for 30 days.

Cold start, again

Already covered in the main review: 2.5s p50 and 2.8s p99 in our test, and production matched. Worth re-stressing: cold start is the headline number for any serverless GPU platform, and Runpod's is honest.
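
For reference, a minimal sketch of the kind of measurement behind those numbers. The endpoint ID, API key, and input payload are placeholders, and /runsync is Runpod's synchronous invocation route at the time of writing; check the current docs before relying on it.

```python
# Time a cold request (first hit against a scaled-to-zero endpoint) versus
# warm follow-ups. Cold-start overhead is roughly cold minus warm p50,
# since both timings include inference itself.
import time
import statistics
import requests

ENDPOINT_ID = "your-endpoint-id"   # hypothetical placeholder
API_KEY = "your-runpod-api-key"    # hypothetical placeholder
URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def timed_request() -> float:
    t0 = time.monotonic()
    r = requests.post(URL, headers=HEADERS,
                      json={"input": {"prompt": "ping"}}, timeout=120)
    r.raise_for_status()
    return time.monotonic() - t0

cold = timed_request()                       # first hit, no warm workers
warm = [timed_request() for _ in range(20)]  # subsequent hits reuse the worker
print(f"cold: {cold:.2f}s")
print(f"warm p50: {statistics.median(warm):.2f}s")
```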

Queueing

When concurrent requests exceed the number of warm workers, Runpod queues the overflow. The queue is per-endpoint and visible in the dashboard. Median queue depth across the 30 days was 0.4 requests; the maximum was 14, hit during a traffic spike.
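
If you want the same visibility outside the dashboard, the per-endpoint health route exposes queue counts. A minimal polling sketch follows; the response field names are our assumption based on Runpod's documented health response, so verify them against the current API docs.

```python
# Sample per-endpoint queue depth once per second for a minute via /health.
import time
import requests

ENDPOINT_ID = "your-endpoint-id"   # hypothetical placeholder
API_KEY = "your-runpod-api-key"    # hypothetical placeholder
URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

samples = []
for _ in range(60):
    r = requests.get(URL, headers=HEADERS, timeout=10)
    r.raise_for_status()
    jobs = r.json().get("jobs", {})   # assumed shape: {"inQueue": 3, "inProgress": 2, ...}
    samples.append(jobs.get("inQueue", 0))
    time.sleep(1)

samples.sort()
print(f"median queue depth: {samples[len(samples) // 2]}")
print(f"max queue depth:    {samples[-1]}")
```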

Billing edges

Two gotchas worth knowing about:

  1. Idle workers cost money. If you set min_workers > 0, you pay for every second those workers are alive, whether or not they're serving traffic. Default to min_workers = 0 unless you have hard latency requirements; see the cost sketch after this list.
  2. Cold starts bill from container start. Your “free” cold start period is bundled with your first request. This isn’t a scam — it’s how serverless GPU has to work — but the billing dashboard rounds aggressively in their favour.
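
To make gotcha #1 concrete, here is a back-of-envelope sketch of idle spend. The per-second rate is a placeholder for illustration, not Runpod's actual pricing; plug in the rate for your GPU type from their pricing page.

```python
# Monthly cost of keeping min_workers alive for the fraction of the month
# they sit idle. Rate below is hypothetical.
PRICE_PER_WORKER_SECOND = 0.00040  # USD, placeholder rate
SECONDS_PER_MONTH = 30 * 24 * 3600

def idle_cost(min_workers: int, utilization: float) -> float:
    """Idle spend per month; utilization = share of time actually serving."""
    idle_seconds = SECONDS_PER_MONTH * (1.0 - utilization)
    return min_workers * idle_seconds * PRICE_PER_WORKER_SECOND

# One always-on worker that is busy 10% of the time:
print(f"${idle_cost(min_workers=1, utilization=0.10):,.2f}/month idle spend")
# ~= $933/month of pure idle time at this hypothetical rate
```

Run the numbers for your own traffic shape before reaching for min_workers; at low utilization the idle bill dwarfs the serving bill.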

Verdict

8.0. If we were starting an LLM-backed product today, this is where we’d put the inference.