Serverless Endpoints

Deploy and manage serverless GPU endpoints that scale to zero

Deploy ML applications as serverless GPU endpoints. Endpoints scale to zero when idle and wake on request — you only pay for compute time.

GPU CLI wraps RunPod Serverless with a simple CLI and config-driven workflow. RunPod handles scaling, provisioning, and cold start optimization. You focus on your application.

Prerequisites

You need GPU CLI installed and a RunPod API key exported as RUNPOD_API_KEY; the commands and API calls on this page assume both.

Quick Start

1. Create Configuration

In your project directory, create or update gpu.jsonc with a serverless block:

{
  "$schema": "https://gpu-cli.sh/schema/v1/gpu.json",
  "serverless": {
    "template": "comfyui",
    "gpu_type": "NVIDIA GeForce RTX 4090",
    "scaling": {
      "min_workers": 0,
      "max_workers": 3,
      "idle_timeout": 5
    }
  }
}

2. Deploy

gpu serverless deploy

GPU CLI will:

  1. Resolve the template (official RunPod worker image)
  2. Create or reuse a network volume for model storage
  3. Create the serverless endpoint with your scaling config
  4. Return the endpoint URL

3. Call Your Endpoint

Once deployed, send requests to your endpoint using the RunPod API:

curl https://api.runpod.ai/v2/{endpoint-id}/runsync \
  -H "Authorization: Bearer $RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input": {"prompt": "a cat in space"}}'
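The same request can be made from code. This is a minimal Python sketch using only the standard library; ENDPOINT_ID stands in for the endpoint ID returned by gpu serverless deploy.

```python
# Minimal Python sketch of the curl call above (standard library only).
# ENDPOINT_ID and RUNPOD_API_KEY are assumed to be set in the environment.
import json
import os
import urllib.request

def build_runsync_request(endpoint_id: str, payload: dict):
    """Build the (url, headers, body) for a synchronous /runsync call."""
    url = f"https://api.runpod.ai/v2/{endpoint_id}/runsync"
    headers = {
        "Authorization": f"Bearer {os.environ.get('RUNPOD_API_KEY', '')}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"input": payload}).encode()
    return url, headers, body

if __name__ == "__main__":
    url, headers, body = build_runsync_request(
        os.environ["ENDPOINT_ID"], {"prompt": "a cat in space"}
    )
    req = urllib.request.Request(url, data=body, headers=headers)
    with urllib.request.urlopen(req, timeout=600) as resp:
        print(json.load(resp))
```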

Templates

Serverless endpoints use official RunPod worker images. Specify a template in your config or via the --template flag.

| Template | Worker Image | Use Case |
|---|---|---|
| auto | Auto-detected from project | Default — inspects project files |
| comfyui | RunPod ComfyUI Worker | Image generation workflows |
| vllm | RunPod vLLM Worker | LLM inference (OpenAI-compatible API) |
| whisper | RunPod Whisper Worker | Audio transcription |
| custom-image | Your Docker image | Custom serverless workers |

Configuration

The serverless block in gpu.jsonc controls deployment. Fields are split into portable settings (work across providers) and RunPod-specific settings.

See the full Configuration Reference for all options.

Minimal Configuration

{
  "$schema": "https://gpu-cli.sh/schema/v1/gpu.json",
  "serverless": {
    "template": "vllm",
    "gpu_type": "NVIDIA A100 80GB PCIe"
  }
}

Full Configuration

{
  "$schema": "https://gpu-cli.sh/schema/v1/gpu.json",
  "serverless": {
    "template": "comfyui",
    "gpu_type": "NVIDIA GeForce RTX 4090",
    "gpu_types": ["NVIDIA L4", "NVIDIA RTX A4000"],
    "scaling": {
      "min_workers": 0,
      "max_workers": 5,
      "idle_timeout": 10
    },
    "volume": {
      "name": "my-project-vol",
      "size_gb": 200,
      "mount_path": "/runpod-volume"
    },
    "prewarm": {
      "enabled": true,
      "mode": "cpu"
    },
    "runpod": {
      "flashboot": true,
      "scaler_type": "queue_delay",
      "scaler_value": 4,
      "execution_timeout_ms": 600000,
      "container_disk_gb": 50,
      "data_center_ids": ["US-TX-3", "CA-MTL-1"],
      "env": {
        "MODEL_NAME": "meta-llama/Llama-3.1-8B-Instruct"
      }
    }
  }
}

GPU Fallback

Specify multiple GPU types in priority order. The first available GPU is used:

{
  "serverless": {
    "template": "vllm",
    "gpu_type": "NVIDIA A100 80GB PCIe",
    "gpu_types": ["NVIDIA L4", "NVIDIA GeForce RTX 4090"]
  }
}

The gpu_type field is the primary choice. The gpu_types array provides fallbacks if the primary GPU is unavailable.
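As an illustration (not GPU CLI's actual implementation, which happens provider-side), the priority-order selection amounts to:

```python
# Illustrative sketch of priority-order GPU selection; the real
# resolution is done by the provider and may differ.
def resolve_gpu(serverless_config: dict, available: set):
    candidates = [
        serverless_config.get("gpu_type"),
        *serverless_config.get("gpu_types", []),
    ]
    for gpu in candidates:
        if gpu in available:
            return gpu
    return None  # none of the configured GPUs is currently available

config = {
    "gpu_type": "NVIDIA A100 80GB PCIe",
    "gpu_types": ["NVIDIA L4", "NVIDIA GeForce RTX 4090"],
}
# A100 unavailable, so the first fallback wins:
print(resolve_gpu(config, {"NVIDIA L4", "NVIDIA GeForce RTX 4090"}))
```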

Managing Endpoints

List Endpoints

# List all endpoints for current project
gpu serverless list

# List all endpoints across all projects
gpu serverless list --all

# JSON output for scripting
gpu serverless list --json

Check Status

# Status by endpoint name or ID
gpu serverless status my-endpoint

# JSON output
gpu serverless status my-endpoint --json

Status shows worker count, scaling config, queue depth, and endpoint URL.

Pre-Warm Endpoints

Reduce cold-start latency by pre-warming your endpoint. This downloads models to the network volume before the first request.

# CPU warming (recommended) — cheap, caches models to volume
gpu serverless warm my-endpoint --cpu

# GPU warming — tests full inference pipeline
gpu serverless warm my-endpoint

# Custom timeout
gpu serverless warm my-endpoint --timeout 900

CPU mode ($0.06/hr) uses a cheap CPU pod to download models. GPU mode ($0.40/hr) spins up a real GPU worker. CPU mode is recommended for most use cases.
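To put those rates in perspective, here is a back-of-envelope comparison, assuming a hypothetical 20-minute model download:

```python
# Back-of-envelope warm-up cost using the rates quoted above.
# The 20-minute download time is a made-up assumption for illustration.
CPU_RATE = 0.06   # $/hr, CPU warming
GPU_RATE = 0.40   # $/hr, GPU warming
warm_hours = 20 / 60

cpu_cost = CPU_RATE * warm_hours
gpu_cost = GPU_RATE * warm_hours
print(f"CPU warm: ${cpu_cost:.3f}, GPU warm: ${gpu_cost:.3f}")
# CPU warm: $0.020, GPU warm: $0.133
```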

Delete Endpoints

# Interactive deletion with TUI confirmation
gpu serverless delete my-endpoint

# Force delete (skip confirmation)
gpu serverless delete my-endpoint --force

# Interactive selection from all endpoints
gpu serverless delete

Network volumes are preserved when deleting endpoints.

Delete Templates

Clean up user-owned serverless templates:

# Interactive selection
gpu serverless template delete

# Delete specific template
gpu serverless template delete tmpl_123

# Force delete
gpu serverless template delete tmpl_123 --force

View Logs

gpu serverless logs my-endpoint

Note: Serverless logs are currently available via the RunPod dashboard. Full CLI log streaming is planned for a future release.

Calling Your Endpoint

vLLM (OpenAI-Compatible)

vLLM endpoints expose an OpenAI-compatible API. Use the standard OpenAI SDK:

TypeScript / JavaScript:

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.RUNPOD_API_KEY,
  baseURL: `https://api.runpod.ai/v2/${process.env.ENDPOINT_ID}/openai/v1`,
});

const response = await client.chat.completions.create({
  model: "meta-llama/Llama-3.1-8B-Instruct",
  messages: [{ role: "user", content: "Hello!" }],
  stream: true,
});

for await (const chunk of response) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}

Python:

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["RUNPOD_API_KEY"],
    base_url=f"https://api.runpod.ai/v2/{os.environ['ENDPOINT_ID']}/openai/v1",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)

for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")

ComfyUI

ComfyUI endpoints accept workflow JSON via the RunPod API:

curl https://api.runpod.ai/v2/{endpoint-id}/runsync \
  -H "Authorization: Bearer $RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "workflow": { ... }
    }
  }'

TypeScript with RunPod SDK:

import runpodSdk from "runpod-sdk";

const runpod = runpodSdk(process.env.RUNPOD_API_KEY);
const endpoint = runpod.endpoint(process.env.ENDPOINT_ID);

const result = await endpoint.runSync({
  input: {
    workflow: myWorkflowJson,
  },
});

console.log(result.output.images);

Whisper

curl https://api.runpod.ai/v2/{endpoint-id}/runsync \
  -H "Authorization: Bearer $RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "audio": "https://example.com/audio.mp3"
    }
  }'

Generic (curl)

# Synchronous (wait for result)
curl https://api.runpod.ai/v2/{endpoint-id}/runsync \
  -H "Authorization: Bearer $RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input": { ... }}'

# Asynchronous (get job ID, poll for result)
curl https://api.runpod.ai/v2/{endpoint-id}/run \
  -H "Authorization: Bearer $RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input": { ... }}'

# Check async job status
curl https://api.runpod.ai/v2/{endpoint-id}/status/{job-id} \
  -H "Authorization: Bearer $RUNPOD_API_KEY"
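The async flow above can be wrapped in a small Python helper: submit via /run, then poll /status/{job-id} until the job reaches a terminal state. The response field names ("id", "status", "output") follow RunPod's documented format, but verify them against the current API reference.

```python
# Sketch of the asynchronous run/status flow (standard library only).
import json
import os
import time
import urllib.request

BASE = "https://api.runpod.ai/v2"
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}

def is_terminal(state: str) -> bool:
    """True once a job can no longer change state."""
    return state in TERMINAL_STATES

def _call(url: str, body=None) -> dict:
    headers = {
        "Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}",
        "Content-Type": "application/json",
    }
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(url, data=data, headers=headers)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def run_async(endpoint_id: str, payload: dict, poll_seconds: float = 2.0) -> dict:
    """Submit a job via /run and poll /status until it completes or fails."""
    job = _call(f"{BASE}/{endpoint_id}/run", {"input": payload})
    while True:
        status = _call(f"{BASE}/{endpoint_id}/status/{job['id']}")
        if is_terminal(status["status"]):
            return status
        time.sleep(poll_seconds)
```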

Scaling & Costs

How Scaling Works

| Setting | Default | Description |
|---|---|---|
| min_workers | 0 | Minimum active workers. Set to 0 for scale-to-zero; set to 1+ to eliminate cold starts. |
| max_workers | 3 | Maximum concurrent workers. Limits cost and concurrency. |
| idle_timeout | 5 | Seconds a worker waits for new requests before shutting down. |

Scale-to-zero (min_workers: 0): No cost when idle, but the first request after an idle period incurs a cold start (roughly 30 seconds to 5 minutes, depending on model size and whether FlashBoot is enabled).

Always warm (min_workers: 1): No cold starts, but you pay for at least one worker continuously.
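The trade-off is easy to quantify. The numbers below (hourly rate, request volume, per-request time) are made-up assumptions for illustration only:

```python
# Hypothetical monthly-cost comparison: min_workers 1 vs scale-to-zero.
rate_per_hour = 0.60      # hypothetical serverless GPU rate, $/hr
hours_per_month = 730

# Always warm: one worker billed around the clock.
always_warm = rate_per_hour * hours_per_month

# Scale-to-zero: pay only for compute, e.g. 2,000 requests/month at
# ~15 s each (inference plus the idle_timeout tail).
requests_per_month = 2000
seconds_per_request = 15
scale_to_zero = rate_per_hour * requests_per_month * seconds_per_request / 3600

print(f"always warm: ${always_warm:.0f}/mo, scale-to-zero: ${scale_to_zero:.2f}/mo")
# always warm: $438/mo, scale-to-zero: $5.00/mo
```

Under these assumptions, scale-to-zero is two orders of magnitude cheaper; the gap narrows as traffic becomes steadier.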

Cost Comparison

| Approach | Cost Model | Cold Starts | Management |
|---|---|---|---|
| RunPod Pods (gpu run) | Pay while pod is active | None (pod stays warm) | Manual stop/start |
| Serverless (gpu serverless) | Pay per compute second | Yes (configurable) | Automatic scaling |
| Replicate / Modal | Pay per request | Yes | Fully managed |

Serverless is best for bursty or unpredictable traffic where you don't want to pay for idle GPU time.

FlashBoot

FlashBoot (enabled by default) caches the container filesystem for faster cold starts. When a worker scales up, it boots from the cached image instead of pulling fresh — reducing cold starts from minutes to seconds for many workloads.

{
  "serverless": {
    "runpod": {
      "flashboot": true
    }
  }
}

Cached Models

For vLLM workloads, RunPod's cached model feature pre-downloads HuggingFace models to host machines and schedules workers on machines that already have the model. This dramatically reduces cold starts.

{
  "serverless": {
    "runpod": {
      "cached_model": "meta-llama/Llama-3.1-8B-Instruct"
    }
  }
}

Recipes

ComfyUI with FLUX

{
  "$schema": "https://gpu-cli.sh/schema/v1/gpu.json",
  "serverless": {
    "template": "comfyui",
    "gpu_type": "NVIDIA GeForce RTX 4090",
    "scaling": {
      "min_workers": 0,
      "max_workers": 5,
      "idle_timeout": 10
    },
    "volume": {
      "name": "comfyui-models",
      "size_gb": 200
    },
    "prewarm": {
      "enabled": true,
      "mode": "cpu"
    },
    "runpod": {
      "flashboot": true
    }
  }
}
gpu serverless deploy
gpu serverless warm my-endpoint --cpu

vLLM with Llama

{
  "$schema": "https://gpu-cli.sh/schema/v1/gpu.json",
  "serverless": {
    "template": "vllm",
    "gpu_type": "NVIDIA A100 80GB PCIe",
    "gpu_types": ["NVIDIA L4"],
    "scaling": {
      "min_workers": 1,
      "max_workers": 3,
      "idle_timeout": 30
    },
    "runpod": {
      "cached_model": "meta-llama/Llama-3.1-8B-Instruct",
      "env": {
        "MODEL_NAME": "meta-llama/Llama-3.1-8B-Instruct"
      }
    }
  }
}
gpu serverless deploy

Then use the OpenAI SDK to call the endpoint (see vLLM section above).

Whisper Transcription

{
  "$schema": "https://gpu-cli.sh/schema/v1/gpu.json",
  "serverless": {
    "template": "whisper",
    "gpu_type": "NVIDIA GeForce RTX 4090",
    "scaling": {
      "min_workers": 0,
      "max_workers": 3,
      "idle_timeout": 5
    }
  }
}
gpu serverless deploy

Custom Docker Image

For custom serverless workers, use the custom-image template with your Docker image:

{
  "$schema": "https://gpu-cli.sh/schema/v1/gpu.json",
  "serverless": {
    "template": "custom-image",
    "gpu_type": "NVIDIA GeForce RTX 4090",
    "scaling": {
      "min_workers": 0,
      "max_workers": 3
    },
    "runpod": {
      "image_name": "your-registry/your-worker:latest",
      "ports": ["8080/http"],
      "env": {
        "MODEL_PATH": "/models/my-model"
      },
      "container_disk_gb": 50
    }
  }
}

JSON Output

All serverless commands support --json for machine-readable output:

# Deploy with JSON output
gpu serverless deploy --json

# List as JSON
gpu serverless list --json

# Status as JSON
gpu serverless status my-endpoint --json

This is useful for CI/CD pipelines, scripting, and integration with other tools.
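For example, a script can shell out to the CLI and parse the result. Note that the JSON shape assumed here (an "endpoints" array with "name" fields) is a guess for illustration; inspect real --json output before relying on it.

```python
# Scripting sketch around `gpu serverless list --json`. The JSON schema
# ("endpoints" array with "name" fields) is an assumption, not documented here.
import json
import subprocess

def parse_endpoints(raw: str) -> list:
    """Accept either a bare JSON array or an {"endpoints": [...]} wrapper."""
    data = json.loads(raw)
    if isinstance(data, list):
        return data
    return data.get("endpoints", [])

def list_endpoint_names() -> list:
    out = subprocess.run(
        ["gpu", "serverless", "list", "--json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [ep["name"] for ep in parse_endpoints(out)]
```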

Non-Interactive Mode

For CI/CD and automation, commands work without a TTY:

# Deploy without confirmation prompt
gpu serverless deploy -y

# Deploy with JSON output (no TUI)
gpu serverless deploy --json

# Delete without confirmation
gpu serverless delete my-endpoint --force
