
Runpod Serverless deep-dive: cold starts, queueing, billing edges

We pushed Runpod Serverless to its limits over a 30-day production deployment. The good and the gotchas.

Tobias · 11 min read · 8.0
  • gpu
  • review
  • runpod
  • serverless

We covered Runpod broadly in our main review. This post is the deep-dive on Serverless specifically, after running it as the inference backend for a real product for 30 days.

Cold start, again

Already covered in the main review: 2.5s p50 and 2.8s p99 in our test, and production matched. Worth re-stressing: cold start is the headline number for any serverless GPU platform, and Runpod's is honest.
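
For reference, a minimal sketch of the kind of measurement behind those numbers. The endpoint ID, API key, and input payload are placeholders, and /runsync is Runpod's synchronous invocation route at the time of writing; check the current docs before relying on it.

```python
# Time a cold request (first hit against a scaled-to-zero endpoint) versus
# warm follow-ups. Cold-start overhead is roughly cold minus warm p50,
# since both timings include inference itself.
import time
import statistics
import requests

ENDPOINT_ID = "your-endpoint-id"   # hypothetical placeholder
API_KEY = "your-runpod-api-key"    # hypothetical placeholder
URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def timed_request() -> float:
    t0 = time.monotonic()
    r = requests.post(URL, headers=HEADERS,
                      json={"input": {"prompt": "ping"}}, timeout=120)
    r.raise_for_status()
    return time.monotonic() - t0

cold = timed_request()                       # first hit, no warm workers
warm = [timed_request() for _ in range(20)]  # subsequent hits reuse the worker
print(f"cold: {cold:.2f}s")
print(f"warm p50: {statistics.median(warm):.2f}s")
```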

Queueing

When concurrent requests exceed the number of warm workers, Runpod queues the overflow. The queue is per-endpoint and visible in the dashboard. Median queue depth across the 30 days was 0.4 requests; the maximum was 14, hit during a traffic spike.
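
If you want the same visibility outside the dashboard, the per-endpoint health route exposes queue counts. A minimal polling sketch follows; the response field names are our assumption based on Runpod's documented health response, so verify them against the current API docs.

```python
# Sample per-endpoint queue depth once per second for a minute via /health.
import time
import requests

ENDPOINT_ID = "your-endpoint-id"   # hypothetical placeholder
API_KEY = "your-runpod-api-key"    # hypothetical placeholder
URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

samples = []
for _ in range(60):
    r = requests.get(URL, headers=HEADERS, timeout=10)
    r.raise_for_status()
    jobs = r.json().get("jobs", {})   # assumed shape: {"inQueue": 3, "inProgress": 2, ...}
    samples.append(jobs.get("inQueue", 0))
    time.sleep(1)

samples.sort()
print(f"median queue depth: {samples[len(samples) // 2]}")
print(f"max queue depth:    {samples[-1]}")
```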

Billing edges

Two gotchas worth knowing about:

  1. Idle workers cost money. If you set min_workers > 0, you pay for every second those workers are alive, whether or not they're serving traffic. Default to min_workers = 0 unless you have hard latency requirements; see the cost sketch after this list.
  2. Cold starts bill from container start. Your “free” cold start period is bundled with your first request. This isn’t a scam — it’s how serverless GPU has to work — but the billing dashboard rounds aggressively in their favour.
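
To make gotcha #1 concrete, here is a back-of-envelope sketch of idle spend. The per-second rate is a placeholder for illustration, not Runpod's actual pricing; plug in the rate for your GPU type from their pricing page.

```python
# Monthly cost of keeping min_workers alive for the fraction of the month
# they sit idle. Rate below is hypothetical.
PRICE_PER_WORKER_SECOND = 0.00040  # USD, placeholder rate
SECONDS_PER_MONTH = 30 * 24 * 3600

def idle_cost(min_workers: int, utilization: float) -> float:
    """Idle spend per month; utilization = share of time actually serving."""
    idle_seconds = SECONDS_PER_MONTH * (1.0 - utilization)
    return min_workers * idle_seconds * PRICE_PER_WORKER_SECOND

# One always-on worker that is busy 10% of the time:
print(f"${idle_cost(min_workers=1, utilization=0.10):,.2f}/month idle spend")
# ~= $933/month of pure idle time at this hypothetical rate
```

Run the numbers for your own traffic shape before reaching for min_workers; at low utilization the idle bill dwarfs the serving bill.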

Verdict

8.0. If we were starting an LLM-backed product today, this is where we’d put the inference.