GPU CLI

Troubleshooting

Common issues and current limitations when using GPU CLI

Use this page for the issues that show up most often in the current GPU CLI workflow.

Daemon and Connection Issues

Daemon not running

Symptoms

  • Failed to connect to daemon
  • Connection refused
  • commands hang before job submission

Fix

gpu daemon start            # start the daemon if it is not running
gpu daemon status           # confirm it is up and healthy
gpu daemon logs --tail 50   # inspect recent daemon logs for errors
gpu daemon restart          # restart if status or logs look wrong

SSH connection failures

Symptoms

  • Host key verification failed
  • Permission denied (publickey)
  • repeated SSH retries during pod setup

Fix

gpu auth status                      # check current credentials
gpu auth login                       # re-authenticate
gpu auth login --generate-ssh-keys   # regenerate SSH keys if they are missing or stale

Also verify that the provider API key you entered is valid and that the pod actually reached a running state.

Headless and CI Issues

USER is missing

Symptoms

  • Failed to detect project: Cannot detect username
  • gpu run fails early in CI, Railway, or containers

Fix

Set USER explicitly to a stable service identity:

export USER=railway-image-generator

Good values:

  • service account name
  • service name
  • infrastructure workload name

Sync Issues

Files are missing on the pod

GPU CLI excludes files that match .gitignore unless you explicitly include them.

Fix

  • check .gitignore
  • use include in gpu.jsonc for gitignored files you still want synced
  • use gpu run --force-sync ... when you need a clean full sync
  • use gpu run --show-sync ... for detailed sync progress
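As a sketch, an include rule in gpu.jsonc for gitignored files might look like this (the include key is the one referenced above; the array-of-globs shape is an assumption, so check your Configuration reference):

```jsonc
{
  // Sync these paths to the pod even though .gitignore excludes them.
  "include": ["models/weights.bin", ".env"]
}
```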

Outputs are not syncing back

Make sure the output path is covered by outputs, not ignored by exclude_outputs, and still exists inside the remote workspace.

Useful commands:

gpu run --outputs python train.py
gpu logs --type sync
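For reference, an outputs rule in gpu.jsonc could look like the sketch below (outputs and exclude_outputs are the keys named above; the glob patterns and array shape are assumptions):

```jsonc
{
  // Pull these paths back from the remote workspace after the job.
  "outputs": ["results/**", "checkpoints/*.pt"],
  // Skip temporary files even if they match an outputs pattern.
  "exclude_outputs": ["checkpoints/tmp-*"]
}
```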

Wake-on-Request and Port Routing

A web UI or API keeps the pod awake forever

By default, all HTTP requests on a forwarded port count as activity. That means polling paths such as /health, /queue, or /metrics can keep the cooldown timer alive.

Fix

Use rich ports rules:

{
  "ports": [
    {
      "port": 8080,
      "http": {
        "activity_paths": ["/api/chat"],
        "ignore_paths": ["/health", "/queue"],
        "ignore_methods": ["OPTIONS", "HEAD"]
      }
    }
  ]
}

See LLM Inference and Configuration for the full model.

My forwarded app does not wake back up

Check these points:

  • make sure you did not use gpu run --no-persistent-proxy
  • make sure persistent_proxy is not disabled in config
  • if you use activity_paths, confirm the incoming request path actually matches
  • if the app relies on WebSocket traffic only, remember that WebSocket frames keep the connection warm but do not reset the HTTP cooldown timer by themselves
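Putting the checklist above together, a minimal gpu.jsonc that keeps wake-on-request working might look like this (assuming persistent_proxy is a top-level boolean, as the second bullet implies):

```jsonc
{
  // Keep the proxy listening while the pod sleeps so an
  // incoming HTTP request can wake it back up.
  "persistent_proxy": true,
  "ports": [{ "port": 8080 }]
}
```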

Serverless Limitations

CPU warmup is not working

gpu serverless warm --cpu is not implemented in the current runtime.

Use GPU warmup instead:

gpu serverless warm <ENDPOINT_ID> --gpu

Deploy-time --warm or --write-ids has no effect

Those flags are accepted by the CLI but are not wired through the current runtime yet.

gpu serverless status or gpu serverless warm fails for a name

Treat both commands as endpoint-ID-driven today:

gpu serverless status <ENDPOINT_ID>
gpu serverless warm <ENDPOINT_ID> --gpu

I expected CLI log streaming for serverless

gpu serverless logs currently points you to the RunPod dashboard instead of streaming filtered endpoint logs in the CLI.

Storage and Keychain

Keychain corruption (aead::Error)

If every gpu command fails with Decryption failed: aead::Error, the encrypted keychain file is corrupted.

rm ~/.gpu-cli-dev/keychain.enc
gpu auth login

In production mode the keychain lives at ~/.gpu-cli/keychain.enc; delete that file instead.

Then restart the daemon so it picks up the fresh keychain:

pkill -f gpud
gpu daemon start

Still Stuck?

  • gpu doctor
  • gpu agent-docs
  • gpu issue
  • gpu support
