Troubleshooting
Common issues and current limitations when using GPU CLI
Use this page for the issues that show up most often in the current GPU CLI workflow.
Daemon and Connection Issues
Daemon not running
Symptoms
- Failed to connect to daemon
- Connection refused
- commands hang before job submission
Fix
gpu daemon start
gpu daemon status
gpu daemon logs --tail 50
gpu daemon restart
SSH connection failures
Symptoms
- Host key verification failed
- Permission denied (publickey)
- repeated SSH retries during pod setup
Fix
gpu auth status
gpu auth login
gpu auth login --generate-ssh-keys
Also verify that the provider API key you entered is valid and that the pod actually reached a running state.
Headless and CI Issues
USER is missing
Symptoms
- Failed to detect project: Cannot detect username
- gpu run fails early in CI, Railway, or containers
Fix
Set USER explicitly to a stable service identity:
export USER=railway-image-generator
Good values:
- service account name
- service name
- infrastructure workload name
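In containers and CI runners, USER is often unset rather than wrong, so a wrapper script can supply a stable fallback before invoking gpu run. A minimal sketch; the fallback name is just an example:

```shell
#!/bin/sh
# Fall back to a stable service identity when USER is unset
# (common in containers and CI runners). The name is illustrative.
export USER="${USER:-railway-image-generator}"
echo "running as $USER"
```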
Sync Issues
Files are missing on the pod
GPU CLI excludes files that match .gitignore unless you explicitly include them.
Fix
- check .gitignore
- use include in gpu.jsonc for gitignored files you still want synced
- use gpu run --force-sync ... when you need a clean full sync
- use gpu run --show-sync ... for detailed sync progress
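For a gitignored file that must still reach the pod, an include entry in gpu.jsonc re-adds it to the sync set. A hedged sketch; the paths are made up and the exact schema may differ in your version:

```jsonc
{
  // Re-include specific gitignored paths in the sync set.
  // These paths are examples only.
  "include": [
    ".env.production",
    "models/base-weights.bin"
  ]
}
```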
Outputs are not syncing back
Make sure the output path is covered by outputs, not ignored by exclude_outputs, and still exists inside the remote workspace.
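The same checklist in config form looks roughly like this (paths are illustrative and the exact schema may differ): the output path must match an outputs pattern and must not match exclude_outputs.

```jsonc
{
  // Sync these remote paths back after the job finishes...
  "outputs": ["results/"],
  // ...but skip anything matching these patterns.
  "exclude_outputs": ["results/tmp/"]
}
```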
Useful commands:
gpu run --outputs python train.py
gpu logs --type sync
Wake-on-Request and Port Routing
A web UI or API keeps the pod awake forever
By default, all HTTP requests on a forwarded port count as activity. That means polling paths such as /health, /queue, or /metrics can keep the cooldown timer alive.
Fix
Use rich ports rules:
{
"ports": [
{
"port": 8080,
"http": {
"activity_paths": ["/api/chat"],
"ignore_paths": ["/health", "/queue"],
"ignore_methods": ["OPTIONS", "HEAD"]
}
}
]
}
See LLM Inference and Configuration for the full model.
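The rules above boil down to: ignored methods and ignored paths never count; if activity_paths is set, only those paths count; otherwise every request counts. A minimal sketch of that decision, not the actual implementation (exact-path matching is an assumption; real matching may support prefixes or globs):

```python
def counts_as_activity(method: str, path: str, http_rule: dict) -> bool:
    """Return True if this request should reset the cooldown timer.

    Sketch of the rules described above. Exact-path matching is an
    assumption made for illustration.
    """
    if method in http_rule.get("ignore_methods", []):
        return False
    if path in http_rule.get("ignore_paths", []):
        return False
    activity = http_rule.get("activity_paths")
    if activity is not None:
        # An allowlist is configured: only these paths count.
        return path in activity
    return True  # default: all HTTP requests count as activity


rule = {
    "activity_paths": ["/api/chat"],
    "ignore_paths": ["/health", "/queue"],
    "ignore_methods": ["OPTIONS", "HEAD"],
}
print(counts_as_activity("GET", "/health", rule))     # False: ignored path
print(counts_as_activity("POST", "/api/chat", rule))  # True: allowlisted
```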
My forwarded app does not wake back up
Check these points:
- make sure you did not use gpu run --no-persistent-proxy
- make sure persistent_proxy is not disabled in config
- if you use activity_paths, confirm the incoming request path actually matches
- if the app relies on WebSocket traffic only, remember that WebSocket frames keep the connection warm but do not reset the HTTP cooldown timer by themselves
Serverless Limitations
CPU warmup is not working
gpu serverless warm --cpu is not implemented in the current runtime.
Use GPU warmup instead:
gpu serverless warm <ENDPOINT_ID> --gpu
Deploy-time --warm or --write-ids has no effect
Those flags are accepted by the CLI but are not wired through the current runtime yet.
gpu serverless status or gpu serverless warm fails for a name
Treat both commands as endpoint-ID-driven today:
gpu serverless status <ENDPOINT_ID>
gpu serverless warm <ENDPOINT_ID> --gpu
I expected CLI log streaming for serverless
gpu serverless logs currently points you to the RunPod dashboard instead of streaming filtered endpoint logs in the CLI.
Storage and Keychain
Keychain corruption (aead::Error)
If every gpu command fails with Decryption failed: aead::Error, the encrypted keychain file is corrupted.
rm ~/.gpu-cli-dev/keychain.enc
gpu auth login
For production mode, use ~/.gpu-cli/keychain.enc instead.
Then restart the daemon:
pkill -f gpud
Still Stuck?
gpu doctor
gpu agent-docs
gpu issue
gpu support