LLM Inference

Run Ollama or vLLM on GPU pods with local routing and wake-on-request behavior

GPU CLI ships a pod-based LLM workflow for Ollama and vLLM.

Use it when you want:

  • a local Web UI backed by a remote GPU pod
  • a local forwarded API port
  • wake-on-request behavior after the pod cools down
  • a reusable template/session flow without building a serverless endpoint

If you want a deployed RunPod endpoint instead, use Serverless Endpoints.

Choose a Workflow

gpu llm run

The guided path. It launches the LLM wizard, writes a local project directory, and starts the pod-backed service.

gpu llm run

gpu use ollama / gpu use vllm

The direct template path. Use this when you want to work with the official templates yourself.

gpu use ollama
gpu use vllm

Quick Start

Interactive wizard

gpu llm run

The wizard walks you through:

  • choosing Ollama or vLLM
  • selecting a model
  • reviewing the generated template files and launch settings

Direct launch

Skip the wizard when you already know the engine and model.

gpu llm run --ollama --model deepseek-r1:8b -y
gpu llm run --vllm --url meta-llama/Llama-3.1-8B-Instruct -y

Model lookup

gpu llm info deepseek-r1:70b
gpu llm info --url meta-llama/Llama-3.1-8B-Instruct
gpu llm info --url meta-llama/Llama-3.1-8B-Instruct --json

Engine Differences

Engine | Best for                               | API surface                                      | Notes
-------|----------------------------------------|--------------------------------------------------|------
Ollama | multi-model experimentation            | Ollama API plus OpenAI-compatible /v1/* endpoints | supports multiple pulled models in one session
vLLM   | single-model high-throughput inference | OpenAI-compatible API                             | optimized for one loaded model at a time

Port Layout

Ollama

Port  | Purpose
------|--------
8080  | Web UI
11434 | Ollama API and OpenAI-compatible API

vLLM

Port | Purpose
-----|--------
8080 | Web UI
8000 | OpenAI-compatible API

These ports are forwarded to localhost, so your client code talks to local URLs while the model runs on the remote pod.

Generated Files and Session State

The LLM workflow generates a local project directory such as llm-ollama or llm-vllm and stores template session state in .gpu/template.json.

That gives you:

  • a reusable generated template
  • normal gpu use resume behavior
  • persistent project-local config and startup files

Wake-on-Request and Persistent Proxy

By default, GPU CLI keeps the local proxy listening after the pod stops. When a new request arrives, GPU CLI resumes the pod and shows a loading page while the service comes back.

Default behavior

  • pod cools down after keep_alive_minutes
  • local forwarded port stays bound
  • a new request resumes the pod
  • requests that arrive while the pod is stopped or resuming receive a loading page and should retry until the service is back
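
Because a request can hit the local proxy while the pod is still resuming, client code that calls the forwarded port can treat that window as a retryable condition. The sketch below is illustrative, not part of GPU CLI: the fetchWithWake helper, the assumption that the loading page returns a non-200 status, and the retry timings are all invented for the example.

```typescript
// Retry a request while the pod resumes. `attempt` is any function that
// resolves to an HTTP-like status code (e.g. a wrapper around fetch).
// Assumption: the loading page served during resume returns a non-200 status.
async function fetchWithWake(
  attempt: () => Promise<number>,
  maxRetries = 5,
  delayMs = 2000,
): Promise<number> {
  for (let i = 0; i < maxRetries; i++) {
    const status = await attempt();
    if (status === 200) return status; // service is back up
    // Pod is still stopped or resuming; wait before retrying.
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error("pod did not resume in time");
}
```

In practice the first request after cooldown pays the resume latency; subsequent requests go straight through.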

Disable it

For one-off runs, you can opt out:

gpu run --no-persistent-proxy python app.py

Or in config:

{
  "persistent_proxy": false
}

Activity Routing for Long-Lived Apps

Polling-heavy apps can stay awake forever if every request counts as activity. Use the per-port http and websocket rules in your ports config to define which requests reset the cooldown timer.

{
  "keep_alive_minutes": 20,
  "persistent_proxy": true,
  "ports": [
    {
      "port": 8080,
      "description": "ui",
      "http": {
        "activity_paths": ["/api/chat", "/api/generate"],
        "ignore_paths": ["/health", "/queue", "/metrics"],
        "ignore_methods": ["OPTIONS", "HEAD"]
      },
      "websocket": {
        "data_frames_are_activity": true,
        "ping_pong_is_activity": false
      }
    }
  ]
}

Rules of thumb

  • Use activity_paths for the requests that represent real user work.
  • Put health checks, queue polling, and metrics in ignore_paths.
  • Leave WebSocket ping/pong as non-activity.
  • Keep health_check_paths aligned with any service-level health routes.
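
The rules above can be modeled as a simple predicate over each incoming request. This is an illustrative model of the classification logic, not GPU CLI's implementation; the prefix matching and the precedence order (ignore rules first, then activity_paths) are assumptions made for the sketch, though the field names mirror the config shown above.

```typescript
interface HttpRules {
  activity_paths?: string[];
  ignore_paths?: string[];
  ignore_methods?: string[];
}

// Decide whether an HTTP request should reset the cooldown timer.
// Ignore rules win first; then, if activity_paths is set, only listed
// path prefixes count; otherwise every remaining request is activity.
function isActivity(method: string, path: string, rules: HttpRules): boolean {
  if (rules.ignore_methods?.includes(method.toUpperCase())) return false;
  if (rules.ignore_paths?.some((p) => path.startsWith(p))) return false;
  if (rules.activity_paths) {
    return rules.activity_paths.some((p) => path.startsWith(p));
  }
  return true;
}
```

Under this model, with the config above, a POST to /api/chat resets the timer, while a GET to /health or an OPTIONS preflight does not.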

Storage Notes

LLM workflows are built around persistent storage so large model downloads survive restarts. That is the main reason the generated flows feel faster after the first run.

Use Configuration if you need to override volume behavior.

Calling the APIs

Local Ollama example

curl http://127.0.0.1:11434/api/tags

Local vLLM example

curl http://127.0.0.1:8000/v1/models

OpenAI SDK against local vLLM

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "not-used-locally",
  baseURL: "http://127.0.0.1:8000/v1",
});

// Any SDK call now goes through the forwarded local port.
const models = await client.models.list();

When to Use Serverless Instead

Choose Serverless Endpoints when you need:

  • a deployed RunPod endpoint URL
  • endpoint autoscaling instead of a pod-backed local proxy
  • a production-facing remote API surface

Stay with gpu llm when you want interactive work, local forwarding, and wake-on-request behavior.
