Serverless Endpoints
Deploy and manage serverless GPU endpoints that scale to zero
Deploy ML applications as serverless GPU endpoints. Endpoints scale to zero when idle and wake on request — you only pay for compute time.
GPU CLI wraps RunPod Serverless with a simple CLI and config-driven workflow. RunPod handles scaling, provisioning, and cold start optimization. You focus on your application.
Prerequisites
- GPU CLI installed and authenticated
- RunPod API key configured (`gpu auth login`)
Quick Start
1. Create Configuration
In your project directory, create or update gpu.jsonc with a serverless block:
```jsonc
{
  "$schema": "https://gpu-cli.sh/schema/v1/gpu.json",
  "serverless": {
    "template": "comfyui",
    "gpu_type": "NVIDIA GeForce RTX 4090",
    "scaling": {
      "min_workers": 0,
      "max_workers": 3,
      "idle_timeout": 5
    }
  }
}
```

2. Deploy
```bash
gpu serverless deploy
```

GPU CLI will:
- Resolve the template (official RunPod worker image)
- Create or reuse a network volume for model storage
- Create the serverless endpoint with your scaling config
- Return the endpoint URL
3. Call Your Endpoint
Once deployed, send requests to your endpoint using the RunPod API:
```bash
curl https://api.runpod.ai/v2/{endpoint-id}/runsync \
  -H "Authorization: Bearer $RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input": {"prompt": "a cat in space"}}'
```

Templates
Serverless endpoints use official RunPod worker images. Specify a template in your config or via the `--template` flag.
| Template | Worker Image | Use Case |
|---|---|---|
| `auto` | Auto-detected from project | Default — inspects project files |
| `comfyui` | RunPod ComfyUI Worker | Image generation workflows |
| `vllm` | RunPod vLLM Worker | LLM inference (OpenAI-compatible API) |
| `whisper` | RunPod Whisper Worker | Audio transcription |
| `custom-image` | Your Docker image | Custom serverless workers |
Configuration
The serverless block in gpu.jsonc controls deployment. Fields are split into portable settings (work across providers) and RunPod-specific settings.
See the full Configuration Reference for all options.
Minimal Configuration
```jsonc
{
  "$schema": "https://gpu-cli.sh/schema/v1/gpu.json",
  "serverless": {
    "template": "vllm",
    "gpu_type": "NVIDIA A100 80GB PCIe"
  }
}
```

Full Configuration
```jsonc
{
  "$schema": "https://gpu-cli.sh/schema/v1/gpu.json",
  "serverless": {
    "template": "comfyui",
    "gpu_type": "NVIDIA GeForce RTX 4090",
    "gpu_types": ["NVIDIA L4", "NVIDIA RTX A4000"],
    "scaling": {
      "min_workers": 0,
      "max_workers": 5,
      "idle_timeout": 10
    },
    "volume": {
      "name": "my-project-vol",
      "size_gb": 200,
      "mount_path": "/runpod-volume"
    },
    "prewarm": {
      "enabled": true,
      "mode": "cpu"
    },
    "runpod": {
      "flashboot": true,
      "scaler_type": "queue_delay",
      "scaler_value": 4,
      "execution_timeout_ms": 600000,
      "container_disk_gb": 50,
      "data_center_ids": ["US-TX-3", "CA-MTL-1"],
      "env": {
        "MODEL_NAME": "meta-llama/Llama-3.1-8B-Instruct"
      }
    }
  }
}
```

GPU Fallback
Specify multiple GPU types in priority order. The first available GPU is used:
```jsonc
{
  "serverless": {
    "template": "vllm",
    "gpu_type": "NVIDIA A100 80GB PCIe",
    "gpu_types": ["NVIDIA L4", "NVIDIA GeForce RTX 4090"]
  }
}
```

The `gpu_type` field is the primary choice. The `gpu_types` array provides fallbacks if the primary GPU is unavailable.
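Conceptually, the selection tries the configured types in priority order and takes the first one that is available. The helper below is an illustrative sketch of that rule, not the CLI's actual scheduling code:

```python
def pick_gpu(primary, fallbacks, available):
    """Return the first configured GPU type that is currently available.

    Mirrors the documented priority: `gpu_type` first, then `gpu_types`
    in order. Illustrative only — not the CLI's real implementation.
    """
    for candidate in [primary, *fallbacks]:
        if candidate in available:
            return candidate
    return None  # nothing available

# Example: the A100 is unavailable, so the first fallback wins.
choice = pick_gpu(
    "NVIDIA A100 80GB PCIe",
    ["NVIDIA L4", "NVIDIA GeForce RTX 4090"],
    available={"NVIDIA L4", "NVIDIA GeForce RTX 4090"},
)
print(choice)  # NVIDIA L4
```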
Managing Endpoints
List Endpoints
```bash
# List all endpoints for current project
gpu serverless list

# List all endpoints across all projects
gpu serverless list --all

# JSON output for scripting
gpu serverless list --json
```

Check Status
```bash
# Status by endpoint name or ID
gpu serverless status my-endpoint

# JSON output
gpu serverless status my-endpoint --json
```

Status shows worker count, scaling config, queue depth, and endpoint URL.
Pre-Warm Endpoints
Reduce cold-start latency by pre-warming your endpoint. This downloads models to the network volume before the first request.
```bash
# CPU warming (recommended) — cheap, caches models to volume
gpu serverless warm my-endpoint --cpu

# GPU warming — tests full inference pipeline
gpu serverless warm my-endpoint

# Custom timeout
gpu serverless warm my-endpoint --timeout 900
```

CPU mode ($0.06/hr) uses a cheap CPU pod to download models. GPU mode ($0.40/hr) spins up a real GPU worker. CPU mode is recommended for most use cases.
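Using the rates above, the savings are easy to estimate. This sketch assumes a 20-minute model download (an illustrative duration, not a measured figure):

```python
CPU_RATE = 0.06   # $/hr for the CPU warming pod (rate from the text above)
GPU_RATE = 0.40   # $/hr for a GPU worker (rate from the text above)

warm_minutes = 20  # assumed model-download time, illustrative only

# Warming cost = hourly rate * fraction of an hour spent downloading.
cpu_cost = CPU_RATE * warm_minutes / 60
gpu_cost = GPU_RATE * warm_minutes / 60

print(f"CPU warm: ${cpu_cost:.3f}")  # $0.020
print(f"GPU warm: ${gpu_cost:.3f}")  # $0.133
```

Since both modes end with the same models cached on the network volume, the cheaper CPU pod is usually the better choice unless you also want to exercise the GPU inference path.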
Delete Endpoints
```bash
# Interactive deletion with TUI confirmation
gpu serverless delete my-endpoint

# Force delete (skip confirmation)
gpu serverless delete my-endpoint --force

# Interactive selection from all endpoints
gpu serverless delete
```

Network volumes are preserved when deleting endpoints.
Delete Templates
Clean up user-owned serverless templates:
```bash
# Interactive selection
gpu serverless template delete

# Delete specific template
gpu serverless template delete tmpl_123

# Force delete
gpu serverless template delete tmpl_123 --force
```

View Logs

```bash
gpu serverless logs my-endpoint
```

Note: Serverless logs are currently available via the RunPod dashboard. Full CLI log streaming is planned for a future release.
Calling Your Endpoint
vLLM (OpenAI-Compatible)
vLLM endpoints expose an OpenAI-compatible API. Use the standard OpenAI SDK:
TypeScript / JavaScript:
```typescript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.RUNPOD_API_KEY,
  baseURL: `https://api.runpod.ai/v2/${process.env.ENDPOINT_ID}/openai/v1`,
});

const response = await client.chat.completions.create({
  model: "meta-llama/Llama-3.1-8B-Instruct",
  messages: [{ role: "user", content: "Hello!" }],
  stream: true,
});

for await (const chunk of response) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}
```

Python:
```python
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["RUNPOD_API_KEY"],
    base_url=f"https://api.runpod.ai/v2/{os.environ['ENDPOINT_ID']}/openai/v1",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)

for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")
```

ComfyUI
ComfyUI endpoints accept workflow JSON via the RunPod API:
```bash
curl https://api.runpod.ai/v2/{endpoint-id}/runsync \
  -H "Authorization: Bearer $RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "workflow": { ... }
    }
  }'
```

TypeScript with RunPod SDK:
```typescript
import runpodSdk from "runpod-sdk";

// The SDK exports a factory function that takes your API key.
const runpod = runpodSdk(process.env.RUNPOD_API_KEY);
const endpoint = runpod.endpoint(process.env.ENDPOINT_ID);

const result = await endpoint.runSync({
  input: {
    workflow: myWorkflowJson,
  },
});

console.log(result.output.images);
```

Whisper
```bash
curl https://api.runpod.ai/v2/{endpoint-id}/runsync \
  -H "Authorization: Bearer $RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "audio": "https://example.com/audio.mp3"
    }
  }'
```

Generic (curl)
```bash
# Synchronous (wait for result)
curl https://api.runpod.ai/v2/{endpoint-id}/runsync \
  -H "Authorization: Bearer $RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input": { ... }}'

# Asynchronous (get job ID, poll for result)
curl https://api.runpod.ai/v2/{endpoint-id}/run \
  -H "Authorization: Bearer $RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input": { ... }}'

# Check async job status
curl https://api.runpod.ai/v2/{endpoint-id}/status/{job-id} \
  -H "Authorization: Bearer $RUNPOD_API_KEY"
```

Scaling & Costs
How Scaling Works
| Setting | Default | Description |
|---|---|---|
| `min_workers` | 0 | Minimum active workers. Set to 0 for scale-to-zero. Set to 1+ to eliminate cold starts. |
| `max_workers` | 3 | Maximum concurrent workers. Limits cost and concurrency. |
| `idle_timeout` | 5 | Seconds a worker waits for new requests before shutting down. |
Scale-to-zero (min_workers: 0): No cost when idle, but first requests have a cold start (~30s-5min depending on model size and FlashBoot).
Always warm (min_workers: 1): No cold starts, but you pay for at least one worker continuously.
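The trade-off can be made concrete with back-of-the-envelope math. The sketch below assumes a $0.40/hr GPU rate and a light, bursty workload (both numbers are illustrative, not quoted pricing):

```python
RATE_PER_HOUR = 0.40                      # assumed GPU rate, illustrative
RATE_PER_SEC = RATE_PER_HOUR / 3600

requests_per_day = 200                    # assumed traffic
seconds_per_request = 30                  # assumed average handling time

# Scale-to-zero: pay only for the compute seconds actually used.
scale_to_zero = requests_per_day * seconds_per_request * RATE_PER_SEC

# Always warm (min_workers: 1): pay for one worker around the clock.
always_warm = 24 * RATE_PER_HOUR

print(f"scale-to-zero: ${scale_to_zero:.2f}/day")  # $0.67/day
print(f"always warm:   ${always_warm:.2f}/day")    # $9.60/day
```

At this traffic level scale-to-zero is roughly 14x cheaper; as utilization approaches 100%, the two converge and an always-warm worker starts to win on latency with no cost penalty.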
Cost Comparison
| Approach | Cost Model | Cold Starts | Management |
|---|---|---|---|
| RunPod Pods (`gpu run`) | Pay while pod is active | None (pod stays warm) | Manual stop/start |
| Serverless (`gpu serverless`) | Pay per compute second | Yes (configurable) | Automatic scaling |
| Replicate / Modal | Pay per request | Yes | Fully managed |
Serverless is best for bursty or unpredictable traffic where you don't want to pay for idle GPU time.
FlashBoot
FlashBoot (enabled by default) caches the container filesystem for faster cold starts. When a worker scales up, it boots from the cached image instead of pulling fresh — reducing cold starts from minutes to seconds for many workloads.
```jsonc
{
  "serverless": {
    "runpod": {
      "flashboot": true
    }
  }
}
```

Cached Models
For vLLM workloads, RunPod's cached model feature pre-downloads HuggingFace models to host machines and schedules workers on machines that already have the model. This dramatically reduces cold starts.
```jsonc
{
  "serverless": {
    "runpod": {
      "cached_model": "meta-llama/Llama-3.1-8B-Instruct"
    }
  }
}
```

Recipes
ComfyUI with FLUX
```jsonc
{
  "$schema": "https://gpu-cli.sh/schema/v1/gpu.json",
  "serverless": {
    "template": "comfyui",
    "gpu_type": "NVIDIA GeForce RTX 4090",
    "scaling": {
      "min_workers": 0,
      "max_workers": 5,
      "idle_timeout": 10
    },
    "volume": {
      "name": "comfyui-models",
      "size_gb": 200
    },
    "prewarm": {
      "enabled": true,
      "mode": "cpu"
    },
    "runpod": {
      "flashboot": true
    }
  }
}
```

```bash
gpu serverless deploy
gpu serverless warm my-endpoint --cpu
```

vLLM with Llama
```jsonc
{
  "$schema": "https://gpu-cli.sh/schema/v1/gpu.json",
  "serverless": {
    "template": "vllm",
    "gpu_type": "NVIDIA A100 80GB PCIe",
    "gpu_types": ["NVIDIA L4"],
    "scaling": {
      "min_workers": 1,
      "max_workers": 3,
      "idle_timeout": 30
    },
    "runpod": {
      "cached_model": "meta-llama/Llama-3.1-8B-Instruct",
      "env": {
        "MODEL_NAME": "meta-llama/Llama-3.1-8B-Instruct"
      }
    }
  }
}
```

```bash
gpu serverless deploy
```

Then use the OpenAI SDK to call the endpoint (see the vLLM section above).
Whisper Transcription
```jsonc
{
  "$schema": "https://gpu-cli.sh/schema/v1/gpu.json",
  "serverless": {
    "template": "whisper",
    "gpu_type": "NVIDIA GeForce RTX 4090",
    "scaling": {
      "min_workers": 0,
      "max_workers": 3,
      "idle_timeout": 5
    }
  }
}
```

```bash
gpu serverless deploy
```

Custom Docker Image
For custom serverless workers, use the `custom-image` template with your Docker image:
```jsonc
{
  "$schema": "https://gpu-cli.sh/schema/v1/gpu.json",
  "serverless": {
    "template": "custom-image",
    "gpu_type": "NVIDIA GeForce RTX 4090",
    "scaling": {
      "min_workers": 0,
      "max_workers": 3
    },
    "runpod": {
      "image_name": "your-registry/your-worker:latest",
      "ports": ["8080/http"],
      "env": {
        "MODEL_PATH": "/models/my-model"
      },
      "container_disk_gb": 50
    }
  }
}
```

JSON Output
All serverless commands support `--json` for machine-readable output:
```bash
# Deploy with JSON output
gpu serverless deploy --json

# List as JSON
gpu serverless list --json

# Status as JSON
gpu serverless status my-endpoint --json
```

This is useful for CI/CD pipelines, scripting, and integration with other tools.
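For example, a script might filter the list output for running endpoints. The field names below (`name`, `status`, and the `RUNNING` value) are assumptions about the JSON schema — inspect the real output of your CLI version before relying on them:

```python
import json

def running_endpoints(raw):
    """Return names of endpoints reported as RUNNING.

    `raw` is the stdout of `gpu serverless list --json`, assumed here
    (hypothetically) to be a JSON array of objects with `name` and
    `status` fields. In a real script, capture it with subprocess, e.g.:
      raw = subprocess.run(["gpu", "serverless", "list", "--json"],
                           capture_output=True, text=True, check=True).stdout
    """
    return [e["name"] for e in json.loads(raw) if e.get("status") == "RUNNING"]

# Demo with a hand-written sample payload instead of a live CLI call.
sample = '[{"name": "my-endpoint", "status": "RUNNING"}, {"name": "old", "status": "STOPPED"}]'
print(running_endpoints(sample))  # ['my-endpoint']
```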
Non-Interactive Mode
For CI/CD and automation, commands work without a TTY:
```bash
# Deploy without confirmation prompt
gpu serverless deploy -y

# Deploy with JSON output (no TUI)
gpu serverless deploy --json

# Delete without confirmation
gpu serverless delete my-endpoint --force
```

Next Steps
- Configuration Reference — Full serverless config options
- Commands Reference — All serverless CLI flags
- Quickstart — Getting started with GPU CLI
- Troubleshooting — Common issues and solutions