Deployment & Infrastructure

Cold Start (Serverless AI)

Eliminating the Latency Tax of On-Demand AI Infrastructure

In a Nutshell

A cold start occurs when a serverless AI inference request arrives at an instance with no model loaded in memory, triggering a multi-step initialization sequence — container startup, model weight download, GPU memory loading — before the first token can be generated. For the enterprise, cold starts are a critical SLO risk: a model that normally responds in 200ms can take 30 seconds on a cold start, breaking user experience contracts and making serverless economics viable only with deliberate mitigation.

The Concept, Explained

Cold starts in serverless AI follow a predictable sequence: the platform receives a request with no warm worker available, provisions a new container (2–10 seconds), downloads model weights from object storage to local disk (10–120 seconds depending on model size), loads weights from disk into GPU VRAM (5–30 seconds), and finally processes the request. For small models (7B parameters, ~14GB in FP16), this sequence typically takes 15–30 seconds. For large models (70B+), it can exceed two minutes, a commercial non-starter for interactive applications.
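The stage-by-stage sequence above can be turned into a rough latency budget. The sketch below is illustrative only: the function name, the fixed container time, and both throughput figures are assumptions chosen for the example, not measurements from any platform.

```python
from dataclasses import dataclass


@dataclass
class ColdStartEstimate:
    container_s: float  # container provisioning
    download_s: float   # object storage -> local disk
    load_s: float       # local disk -> GPU VRAM

    @property
    def total_s(self) -> float:
        return self.container_s + self.download_s + self.load_s


def estimate_cold_start(
    weights_gb: float,
    container_s: float = 5.0,            # assumed; the text cites 2-10 s
    download_gb_per_s: float = 1.0,      # assumed object-storage throughput
    disk_to_vram_gb_per_s: float = 2.0,  # assumed effective load throughput
) -> ColdStartEstimate:
    """Sum the three initialization stages for a model of the given size."""
    return ColdStartEstimate(
        container_s=container_s,
        download_s=weights_gb / download_gb_per_s,
        load_s=weights_gb / disk_to_vram_gb_per_s,
    )


small = estimate_cold_start(weights_gb=14)   # ~7B parameters in FP16
large = estimate_cold_start(weights_gb=140)  # ~70B parameters in FP16
print(f"7B:  {small.total_s:.0f} s")
print(f"70B: {large.total_s:.0f} s")
```

With these assumed throughputs the estimate comes to 26 seconds for the 7B model and 215 seconds for the 70B model, in line with the ranges quoted above. Substituting measured throughput from your own platform is what makes the budget useful.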

Four mitigation strategies have emerged as production-ready. **Minimum instance counts** maintain a warm floor of pre-loaded replicas, eliminating cold starts for predictable baseline traffic at the cost of idle GPU spend. **Predictive pre-warming** uses historical traffic patterns and calendar events to pre-scale infrastructure before demand arrives. **Model weight caching** co-locates weights on high-speed local NVMe storage attached to the inference instance, so a cold start skips the network download and pays only for the disk-to-VRAM transfer. **Lazy loading and quantization** reduce the volume of data that must be loaded before the first request: a 4-bit quantized model loads in a fraction of the time of its FP16 counterpart.
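Of the four strategies, predictive pre-warming is the most algorithmic. A minimal sketch, assuming traffic history is available as (hour of day, requests per second) samples and that each replica sustains a known request rate, might look like the following. The function names, the per-replica rate, and the sample data are all hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean


def hourly_forecast(history):
    """Average observed requests/second for each hour of day.

    `history` is an iterable of (hour_of_day, requests_per_second) samples.
    """
    buckets = defaultdict(list)
    for hour, rps in history:
        buckets[hour].append(rps)
    return {hour: mean(samples) for hour, samples in buckets.items()}


def replicas_to_prewarm(forecast, next_hour, rps_per_replica, min_replicas=1):
    """Replicas to have warm before `next_hour` begins."""
    expected_rps = forecast.get(next_hour, 0.0)  # no history -> warm floor only
    needed = math.ceil(expected_rps / rps_per_replica)
    return max(min_replicas, needed)


# Hypothetical history: quiet at 08:00, busy at 09:00.
history = [(8, 2.0), (8, 4.0), (9, 18.0), (9, 22.0)]
forecast = hourly_forecast(history)
print(replicas_to_prewarm(forecast, next_hour=9, rps_per_replica=5.0))
```

A scheduler would call `replicas_to_prewarm` early enough before each hour to cover the cold start duration itself. A production forecast would also account for weekday effects and the calendar events mentioned above, which this averaging does not.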

The enterprise trade-off is between cost and availability. True scale-to-zero (zero idle spend) always carries cold start risk. The optimal architecture for most enterprises is a hybrid: a minimum warm instance pool sized to handle baseline traffic, with serverless burst capacity that accepts cold start latency for infrequent overflow requests. Monitoring cold start frequency and duration in production, and exposing them as first-class SLO metrics, is the prerequisite for making this trade-off deliberately.
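One way to size the warm pool in such a hybrid is to pick the smallest replica count that keeps overflow below a tolerated fraction of observed traffic. The sketch below assumes concurrency is sampled at regular intervals; the sample values, the 10% tolerance, and the hourly GPU price are hypothetical.

```python
def overflow_fraction(samples, warm_replicas, slots_per_replica=1):
    """Fraction of samples whose concurrency exceeds warm capacity."""
    if not samples:
        raise ValueError("need at least one concurrency sample")
    capacity = warm_replicas * slots_per_replica
    return sum(1 for c in samples if c > capacity) / len(samples)


def smallest_warm_pool(samples, max_overflow, slots_per_replica=1):
    """Smallest warm replica count keeping overflow at or below max_overflow."""
    if max_overflow < 0:
        raise ValueError("max_overflow must be non-negative")
    replicas = 0
    while overflow_fraction(samples, replicas, slots_per_replica) > max_overflow:
        replicas += 1
    return replicas


# Hypothetical concurrency samples with one burst (9 concurrent requests).
samples = [1, 2, 2, 3, 2, 1, 2, 9, 2, 3]
pool = smallest_warm_pool(samples, max_overflow=0.1)
idle_cost_per_day = pool * 2.50 * 24  # assumed $2.50 per GPU-hour
print(pool, idle_cost_per_day)
```

Here three warm replicas cover every sample except the burst, which is left to serverless capacity and accepts a cold start. Tightening `max_overflow` to zero would require nine replicas, which is the cost side of the trade-off.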

The Toolchain in Focus

Type	Tools
Serverless Inference Platforms	Modal, Replicate, Baseten, AWS Lambda (CPU-only; no GPU support)
Model Optimization (Reduce Load Time)	GGUF / llama.cpp, vLLM, ONNX Runtime
Observability	Datadog, Prometheus / Grafana

Enterprise Considerations

SLO Definition: Cold start latency should be a named SLO dimension, not an afterthought. Define p99 latency targets that account for cold start probability, and surface cold-start-vs-warm latency as separate metrics in your observability dashboard. Alerting only on aggregate p99 can mask a cold start problem that affects a predictable subset of users.
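The masking effect is easy to reproduce with a small calculation. The sketch below uses a nearest-rank percentile and a hypothetical sample of 1,000 requests, 5 of which hit a cold start; the latencies mirror the 200ms and 30 second figures used earlier.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile (p in 0-100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 1,000 requests, 5 of which hit a cold start.
warm = [0.2] * 995  # seconds
cold = [30.0] * 5   # seconds

aggregate_p99 = percentile(warm + cold, 99)
cold_p99 = percentile(cold, 99)
cold_rate = len(cold) / (len(warm) + len(cold))

print(f"aggregate p99: {aggregate_p99} s")
print(f"cold-only p99: {cold_p99} s")
print(f"cold start rate: {cold_rate:.1%}")
```

The aggregate p99 stays at 0.2 seconds even though 0.5% of requests waited 30 seconds, because the cold starts sit above the 99th percentile. Tracking cold-only latency and cold start rate as their own series makes the problem visible.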

Model Size & Architecture Trade-offs: Every byte of model weight contributes to cold start time. This creates a genuine business case for model distillation and quantization beyond pure inference speed: a 4-bit quantized 7B model may perform near-equivalently to its FP16 counterpart for your use case while loading roughly four times faster, because its weights are about a quarter of the size. Benchmark load time as a first-class metric alongside accuracy during model selection.
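The size arithmetic behind that claim is short enough to write down. This sketch assumes load time is proportional to weight size at a fixed throughput; the 2 GB/s figure is an assumption for illustration.

```python
def weights_gb(params_billion, bits_per_param):
    """Approximate weight size in GB, ignoring format overhead."""
    return params_billion * bits_per_param / 8


def load_seconds(size_gb, throughput_gb_per_s):
    """Time to move the weights at a fixed, assumed throughput."""
    return size_gb / throughput_gb_per_s


fp16_gb = weights_gb(7, 16)  # 7B parameters at 16 bits each
int4_gb = weights_gb(7, 4)   # 7B parameters at 4 bits each

throughput = 2.0  # GB/s, assumed effective disk-to-VRAM rate
speedup = load_seconds(fp16_gb, throughput) / load_seconds(int4_gb, throughput)
print(fp16_gb, int4_gb, speedup)
```

The idealized ratio is 4x (14 GB versus 3.5 GB). Real 4-bit formats also store scaling factors and often keep some tensors at higher precision, so the measured ratio is usually somewhat lower, which is another reason to benchmark load time directly.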

Multi-Region Strategy: Cold starts are more likely in secondary or disaster-recovery regions where traffic is low and instances rarely stay warm. Design your multi-region architecture with explicit warm-instance policies per region, accepting higher baseline spend in exchange for consistent cross-region availability — especially for customer-facing applications where a cold start in a failover scenario can compound an already degraded user experience.
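An explicit per-region policy can be expressed as plain data and checked automatically. The following is a minimal sketch; the `RegionPolicy` type, the region names, the replica counts, and the hourly price are all hypothetical and not tied to any provider's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionPolicy:
    region: str
    role: str               # "primary" or "failover"
    min_warm_replicas: int


def regions_at_risk(policies):
    """Regions whose first request after idle would hit a cold start."""
    return [p.region for p in policies if p.min_warm_replicas == 0]


def monthly_warm_cost(policies, gpu_hourly_usd, hours=730):
    """Baseline spend for keeping every region's warm floor running."""
    return sum(p.min_warm_replicas for p in policies) * gpu_hourly_usd * hours


policies = [
    RegionPolicy("us-east", "primary", 4),
    RegionPolicy("eu-west", "failover", 0),
]
print(regions_at_risk(policies))
print(monthly_warm_cost(policies, gpu_hourly_usd=2.0))
```

Running a check like `regions_at_risk` in CI or a deployment pipeline turns the warm-instance policy into something that is reviewed alongside the cost it implies, rather than discovered during a failover.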

Related Tools

Modal

Serverless GPU platform with model weight caching and sub-second warm-start architecture for production inference.

vLLM

High-throughput inference engine that uses PagedAttention to manage KV-cache memory and raise token generation throughput.

Baseten

Inference platform with configurable minimum replica counts and cold start monitoring to manage serverless SLOs.

Replicate

Cloud inference platform offering warm deployment options and detailed cold start latency visibility per model.

Cold Start, Serverless AI, Inference Latency, GPU Serverless, Model Loading, SLO, Deployment