# LLM Inference

Run Ollama or vLLM on GPU pods with local routing and wake-on-request behavior.
GPU CLI ships a pod-based LLM workflow for Ollama and vLLM.
Use it when you want:
- a local Web UI backed by a remote GPU pod
- a local forwarded API port
- wake-on-request behavior after the pod cools down
- a reusable template/session flow without building a serverless endpoint
If you want a deployed RunPod endpoint instead, use Serverless Endpoints.
## Choose a Workflow

### `gpu llm run`

The guided path. It launches the LLM wizard, writes a local project directory, and starts the pod-backed service.

```shell
gpu llm run
```

### `gpu use ollama` / `gpu use vllm`

The direct template path. Use this when you want to work with the official templates yourself.

```shell
gpu use ollama
gpu use vllm
```

## Quick Start

### Interactive wizard

```shell
gpu llm run
```

The wizard walks you through:
- choosing Ollama or vLLM
- selecting a model
- reviewing the generated template files and launch settings
### Direct launch

Skip the wizard when you already know the engine and model.

```shell
gpu llm run --ollama --model deepseek-r1:8b -y
gpu llm run --vllm --url meta-llama/Llama-3.1-8B-Instruct -y
```

### Model lookup

```shell
gpu llm info deepseek-r1:70b
gpu llm info --url meta-llama/Llama-3.1-8B-Instruct
gpu llm info --url meta-llama/Llama-3.1-8B-Instruct --json
```

## Engine Differences
| Engine | Best for | API surface | Notes |
|---|---|---|---|
| Ollama | multi-model experimentation | Ollama API plus OpenAI-compatible /v1/* endpoints | supports multiple pulled models in one session |
| vLLM | single-model high-throughput inference | OpenAI-compatible API | optimized for one loaded model at a time |
## Port Layout

### Ollama

| Port | Purpose |
|---|---|
| 8080 | Web UI |
| 11434 | Ollama API and OpenAI-compatible API |

### vLLM

| Port | Purpose |
|---|---|
| 8080 | Web UI |
| 8000 | OpenAI-compatible API |
These ports are forwarded to localhost, so your client code talks to local URLs while the model runs on the remote pod.
## Generated Files and Session State

The LLM workflow generates a local project directory such as `llm-ollama` or `llm-vllm` and stores template session state in `.gpu/template.json`.
That gives you:
- a reusable generated template
- normal `gpu use` resume behavior
- persistent project-local config and startup files
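Putting the pieces named above together, a generated project looks roughly like this (a sketch only — the exact set of generated files varies by engine and is not listed here):

```text
llm-ollama/              # generated project directory (llm-vllm for vLLM)
└── .gpu/
    └── template.json    # template session state used by `gpu use` resume
```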
## Wake-on-Request and Persistent Proxy

By default, GPU CLI keeps the local proxy listening after the pod stops. When a new request arrives, GPU CLI resumes the pod and shows a loading page while the service comes back.

### Default behavior

- the pod cools down after `keep_alive_minutes`
- the local forwarded port stays bound
- a new request resumes the pod
- requests in the normal stopped/resuming path get a loading page and should retry while the pod comes back
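Because a request that arrives during resume may hit the loading page rather than the real service, client code should retry. A minimal client-side sketch, not part of GPU CLI: `do_fetch` stands in for any HTTP call returning a response with an `ok` flag, and treating a non-`ok` response as "still resuming" is an assumption about how the loading page answers.

```python
import time

def fetch_with_wake(do_fetch, retries=10, delay_s=3.0):
    """Retry do_fetch() until the pod is serving real responses again.

    do_fetch: callable returning a response object with an `ok` attribute.
    Raises TimeoutError if the pod never comes back within `retries` tries.
    """
    for _attempt in range(retries):
        response = do_fetch()
        if response.ok:          # real response: the pod is up
            return response
        time.sleep(delay_s)      # loading page / resuming: wait and retry
    raise TimeoutError("pod did not resume in time")
```

The injectable `do_fetch` keeps the retry policy separate from the HTTP library, so the same loop works with `requests`, `httpx`, or anything else.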
### Disable it

For one-off runs, you can opt out:

```shell
gpu run --no-persistent-proxy python app.py
```

Or in config:

```json
{
  "persistent_proxy": false
}
```

## Activity Routing for Long-Lived Apps

Polling-heavy apps can stay awake forever if every request counts as activity. Use rich `ports` rules to define which requests reset the cooldown timer.
```json
{
  "keep_alive_minutes": 20,
  "persistent_proxy": true,
  "ports": [
    {
      "port": 8080,
      "description": "ui",
      "http": {
        "activity_paths": ["/api/chat", "/api/generate"],
        "ignore_paths": ["/health", "/queue", "/metrics"],
        "ignore_methods": ["OPTIONS", "HEAD"]
      },
      "websocket": {
        "data_frames_are_activity": true,
        "ping_pong_is_activity": false
      }
    }
  ]
}
```

### Rules of thumb

- Use `activity_paths` for the requests that represent real user work.
- Put health checks, queue polling, and metrics in `ignore_paths`.
- Leave WebSocket ping/pong as non-activity.
- Keep `health_check_paths` aligned with any service-level health routes.
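One way to reason about these rules is as a classifier that decides whether a single HTTP request resets the cooldown timer. The sketch below is a conceptual model only, not GPU CLI's actual implementation; in particular, prefix matching on paths is an assumption.

```python
def is_activity(method, path, http_rules):
    """Conceptual model of the `http` rules block.

    A request counts as activity only if its method and path are not
    ignored, and, when activity_paths is set, the path matches one of
    them (prefix match assumed). With no activity_paths, any remaining
    request counts.
    """
    if method.upper() in {m.upper() for m in http_rules.get("ignore_methods", [])}:
        return False
    if any(path.startswith(p) for p in http_rules.get("ignore_paths", [])):
        return False
    activity = http_rules.get("activity_paths")
    if activity:
        return any(path.startswith(p) for p in activity)
    return True
```

Under this model, with the config above, a `POST /api/chat` resets the timer, while `GET /health`, `OPTIONS /api/chat`, and `GET /queue/status` do not — which is why a polling-heavy UI no longer keeps the pod awake.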
## Storage Notes
LLM workflows are built around persistent storage so large model downloads survive restarts. That is the main reason the generated flows feel faster after the first run.
Use Configuration if you need to override volume behavior.
## Calling the APIs

### Local Ollama example

```shell
curl http://127.0.0.1:11434/api/tags
```

### Local vLLM example

```shell
curl http://127.0.0.1:8000/v1/models
```

### OpenAI SDK against local vLLM

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "not-used-locally",
  baseURL: "http://127.0.0.1:8000/v1",
});
```

## When to Use Serverless Instead
Choose Serverless Endpoints when you need:
- a deployed RunPod endpoint URL
- endpoint autoscaling instead of a pod-backed local proxy
- a production-facing remote API surface
Stay with `gpu llm` when you want interactive work, local forwarding, and wake-on-request behavior.