Troubleshooting

Common issues and their solutions when using GPU CLI.

Connection Issues

Daemon not running

Symptoms:

  • "Failed to connect to daemon"
  • "Connection refused"
  • Commands hang or timeout

Solutions:

  1. Start the daemon manually:
gpu daemon start
  2. Check daemon status:
gpu daemon status
  3. View daemon logs for errors:
gpu daemon logs --tail 50
  4. Restart the daemon:
gpu daemon restart

SSH connection failures

Symptoms:

  • "Connection refused"
  • "Host key verification failed"
  • "Permission denied (publickey)"

Solutions:

  1. Verify your RunPod API key is valid:
gpu auth status
  2. Re-authenticate if needed:
gpu auth login
  3. Check that the pod is running in the RunPod console
  4. Generate new SSH keys if corrupted:
gpu auth login --generate-ssh-keys
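
If you suspect the network rather than your keys, a bare TCP reachability check against the pod's SSH endpoint can rule that out. The sketch below uses only the Python standard library; the host and port are placeholders you would copy from the RunPod console:

# ssh_reachability.py - can we open a TCP connection to the pod's SSH endpoint?
import socket

HOST = "203.0.113.10"  # placeholder - the pod's public IP from the RunPod console
PORT = 22              # placeholder - use the SSH port shown in the console (may not be 22)

try:
    with socket.create_connection((HOST, PORT), timeout=5):
        print(f"TCP connection to {HOST}:{PORT} succeeded - the SSH port is reachable")
except OSError as exc:
    print(f"Could not connect to {HOST}:{PORT}: {exc}")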

Timeout errors

Symptoms:

  • "Operation timed out"
  • Long waits followed by failures

Solutions:

  1. Check your network connectivity

  2. Verify RunPod service status at status.runpod.io

  3. Increase verbosity to see where it's hanging:

gpu run -vvv python test.py
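
If you want to script the connectivity check, a minimal standard-library sketch that simply tries to reach status.runpod.io (the status page mentioned above) looks like this; the 5-second timeout is an arbitrary choice:

# connectivity_check.py - rough network sanity check using only the standard library
import urllib.request

URL = "https://status.runpod.io"  # the status page mentioned above

try:
    with urllib.request.urlopen(URL, timeout=5) as resp:
        print(f"Reached {URL} (HTTP {resp.status})")
except Exception as exc:
    print(f"Could not reach {URL}: {exc}")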

Pod Provisioning

No GPUs available

Symptoms:

  • "No pods available matching criteria"
  • "Unable to find available GPU"

Solutions:

  1. Check current GPU availability:
gpu inventory --available
  2. Try a different GPU type:
gpu run --gpu-type "RTX 3090" python train.py
  3. Use min-vram for flexibility:
gpu run --min-vram 24 python train.py
  4. Remove datacenter constraints (if you're using a network volume in a specific datacenter, GPU availability may be limited)
  5. Wait and retry - GPU availability changes frequently

Datacenter constraint failures

Symptoms:

  • "No pods available in datacenter X"
  • Pod fails to start despite GPUs showing available

Cause: When using a network volume, pods must be created in the same datacenter as the volume.

Solutions:

  1. Check your volume's datacenter:
gpu volume list --detailed
  2. Check GPU availability in that datacenter:
gpu inventory --available --region <DATACENTER>
  3. Create a new volume in a datacenter with better GPU availability:
gpu volume create --name new-volume --datacenter US-OR-1
gpu volume set-global new-volume
  4. Temporarily disable the network volume in gpu.jsonc:
{
  "volume_mode": "none"
}

Pod starts but command fails immediately

Symptoms:

  • Pod provisions successfully but job exits instantly
  • "Command not found" errors

Solutions:

  1. Check that your command is correct:
# Wrong - missing python
gpu run train.py

# Correct
gpu run python train.py
  2. Verify dependencies are installed in your Dockerfile or gpu.jsonc:
{
  "environment": {
    "python": {
      "requirements": "requirements.txt"
    }
  }
}
  3. Check that files synced correctly:
gpu run -i bash
# Then inspect /workspace
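
If you'd rather not open an interactive shell, a small diagnostic script can list what actually landed on the pod. This is only a sketch; check_workspace.py is a hypothetical file name and /workspace is assumed to be the sync destination, as in the example above:

# check_workspace.py (hypothetical) - list the files that were synced to the pod
from pathlib import Path

root = Path("/workspace")  # assumed sync destination, as in the example above
for path in sorted(root.rglob("*")):
    if path.is_file():
        print(path.relative_to(root), "-", path.stat().st_size, "bytes")

Run it like any other command, for example: gpu run python check_workspace.py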

File Sync Issues

Files not syncing to pod

Symptoms:

  • "File not found" errors on pod
  • Old versions of files running

Causes:

  • File excluded by .gitignore
  • Large files taking time to sync
  • Sync errors

Solutions:

  1. Check .gitignore patterns - gitignored files don't sync:
cat .gitignore
  2. Force a full sync:
gpu run --force-sync python train.py
  3. Show detailed sync progress:
gpu run --show-sync python train.py
  4. For large files, use network volumes or downloads instead:
{
  "download": [
    { "strategy": "hf", "source": "model-name" }
  ]
}
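
To pin down which .gitignore rule is excluding a particular file (step 1 above), you can ask git directly. The sketch below shells out to git check-ignore, which reports the matching pattern; it assumes your project is a git repository and that git is on your PATH, and the default path is just an illustration:

# why_ignored.py (hypothetical) - show which .gitignore rule matches a file
import subprocess
import sys

# assumes the project is a git repository and git is on PATH
path = sys.argv[1] if len(sys.argv) > 1 else "data/model.bin"  # illustrative default
result = subprocess.run(["git", "check-ignore", "-v", path],
                        capture_output=True, text=True)
if result.returncode == 0:
    # output format: <source>:<line>:<pattern>	<path>
    print("Ignored by:", result.stdout.strip())
else:
    print(path, "is not ignored")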

Outputs not syncing back

Symptoms:

  • Job completes but output files are missing locally
  • Only some outputs appear

Solutions:

  1. Check your output patterns in gpu.jsonc:
{
  "outputs": [
    "outputs/",
    "checkpoints/",
    "*.pt",
    "*.safetensors"
  ]
}
  2. Verify patterns match actual paths (relative to workspace):
# If your script writes to ./results/model.pt
# Pattern should be:
"outputs": ["results/"]
  3. Debug output syncing:
gpu run --show-outputs python train.py
  4. Ensure outputs aren't excluded:
{
  "exclude_outputs": [
    "*.tmp",
    "*.log"
  ]
}
  5. Wait for sync to complete:
gpu run --sync python train.py
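
The simplest way to keep patterns and paths aligned is to have your script write into a directory you list under outputs. A minimal sketch (the file names are illustrative, and it assumes PyTorch is available in the pod environment):

# train.py (sketch) - write results somewhere the "outputs" patterns will match
from pathlib import Path
import torch

out_dir = Path("outputs")      # matched by the "outputs/" pattern above
out_dir.mkdir(exist_ok=True)

model = torch.nn.Linear(4, 2)  # stand-in for a real model
torch.save(model.state_dict(), out_dir / "model.pt")  # also matched by "*.pt"
print("Saved", out_dir / "model.pt")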

Sync is slow

Causes:

  • Large number of small files
  • Large files being synced repeatedly

Solutions:

  1. Add unnecessary files to .gitignore:
__pycache__/
.venv/
node_modules/
*.pyc
  2. Use network volumes for large datasets instead of syncing
  3. Use the download feature for models:
{
  "download": [
    { "strategy": "hf", "source": "stabilityai/sdxl-base-1.0" }
  ]
}
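
If it isn't obvious what is making the sync heavy, a quick local tally of file counts and sizes per top-level directory can point at the culprits. A rough sketch using only the Python standard library; run it from your project root (sync_audit.py is a hypothetical name):

# sync_audit.py (hypothetical) - tally file count and size per top-level directory
import os

totals = {}  # top-level entry -> (file count, total bytes)
for dirpath, _dirnames, filenames in os.walk("."):
    rel = os.path.relpath(dirpath, ".")
    top = rel.split(os.sep)[0] if rel != "." else "."
    count, size = totals.get(top, (0, 0))
    for name in filenames:
        size += os.path.getsize(os.path.join(dirpath, name))
    totals[top] = (count + len(filenames), size)

for top, (count, size) in sorted(totals.items(), key=lambda kv: -kv[1][1]):
    print(f"{top:30s} {count:7d} files  {size / 1e6:10.1f} MB")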

Volume Issues

Stale volume reference

Symptoms:

  • "Volume not found"
  • "Invalid volume ID"
  • Errors after deleting volume via RunPod console

Cause: Local config has a reference to a volume that was deleted externally.

Solution: GPU CLI auto-reconciles volume references. Simply run:

gpu volume list

This triggers reconciliation and clears stale references automatically.

Volume not mounting

Symptoms:

  • /runpod-volume is empty
  • "Volume not attached" errors

Solutions:

  1. Verify the volume exists:
gpu volume list
  2. Check that the pod is in the same datacenter as the volume:
gpu volume list --detailed
  3. Ensure the volume isn't attached to another active pod (volumes can only attach to one pod at a time)
  4. Check volume_mode in your config:
{
  "volume_mode": "global"
}
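
For a quick check from inside the pod (for example via gpu run -i bash), you can verify whether /runpod-volume is actually a mount point and whether it contains anything. A minimal standard-library sketch; check_volume.py is just an illustrative name:

# check_volume.py (hypothetical) - is /runpod-volume mounted and non-empty?
import os

path = "/runpod-volume"  # mount path used elsewhere on this page
print("exists:  ", os.path.isdir(path))
print("is mount:", os.path.ismount(path))
print("entries: ", os.listdir(path) if os.path.isdir(path) else "n/a")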

Volume full

Symptoms:

  • "No space left on device"
  • Write operations failing

Solutions:

  1. Check volume usage:
gpu volume status
  2. Clean up unused files:
gpu run -i bash
# Then: rm -rf /runpod-volume/unused-model/
  3. Extend the volume:
gpu volume extend <VOLUME> --size 500
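
Before extending, it can help to see where the space went. The sketch below reports free space and the size of each top-level entry on the volume; it uses only the Python standard library and assumes the /runpod-volume mount path used elsewhere on this page:

# volume_usage.py (hypothetical) - free space and largest top-level entries
import os
import shutil

VOLUME = "/runpod-volume"

usage = shutil.disk_usage(VOLUME)
print(f"used {usage.used / 1e9:.1f} GB of {usage.total / 1e9:.1f} GB "
      f"({usage.free / 1e9:.1f} GB free)")

def tree_size(path):
    # sum file sizes under path, skipping anything unreadable
    total = 0
    for dirpath, _dirs, files in os.walk(path, onerror=lambda e: None):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass
    return total

for entry in sorted(os.listdir(VOLUME)):
    full = os.path.join(VOLUME, entry)
    size = tree_size(full) if os.path.isdir(full) else os.path.getsize(full)
    print(f"{entry:40s} {size / 1e9:8.2f} GB")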

Volume reconciliation

GPU CLI automatically syncs volume state with the cloud provider:

  • Deleted volumes: If you delete a volume via the RunPod console, GPU CLI detects this and clears stale references
  • Metadata updates: Volume name and datacenter changes are synced automatically
  • When it runs: On gpu volume list, gpu run, daemon startup, and every 15 minutes

This means you can safely manage volumes via the RunPod web console.


Authentication Issues

API key errors

Symptoms:

  • "Unauthorized"
  • "Invalid API key"
  • "Authentication failed"

Solutions:

  1. Check authentication status:
gpu auth status
  2. Re-authenticate:
gpu auth login
  3. Verify your API key in RunPod Settings
  4. Ensure the API key has the correct permissions (read/write access to pods)

Hub credential errors

Symptoms:

  • "401 Unauthorized" when downloading from HuggingFace
  • "Invalid token" for Civitai downloads

Solutions:

  1. Add or update credentials:
gpu auth add hf
gpu auth add civitai
  2. Verify your token has required scopes:
    • HuggingFace: Needs read access for gated models
    • Civitai: API key from account settings
  3. Check configured hubs:
gpu auth hubs
  4. Remove and re-add credentials:
gpu auth remove hf
gpu auth add hf
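
To confirm a HuggingFace token is valid independently of GPU CLI, the huggingface_hub library (if it's installed) can report which account the token belongs to. A minimal sketch; reading the token from the HF_TOKEN environment variable is just one option:

# check_hf_token.py (hypothetical) - sanity-check a HuggingFace token
import os
from huggingface_hub import HfApi  # assumes huggingface_hub is installed

token = os.environ.get("HF_TOKEN")  # or paste the token directly for a one-off check
try:
    info = HfApi().whoami(token=token)
    print("Token is valid for user:", info.get("name"))
except Exception as exc:
    print("Token check failed:", exc)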

Build Issues

Dockerfile build fails

Symptoms:

  • "Build failed"
  • Docker layer errors

Solutions:

  1. Validate your Dockerfile locally first:
docker build -t test .
  2. Check for common issues:
    • Missing base image
    • Invalid commands
    • Network issues during the build
  3. Force a rebuild:
gpu run --rebuild python train.py
  4. Check daemon logs for detailed errors:
gpu daemon logs

Build is slow

Solutions:

  1. Use a pre-built base image:
{
  "docker_image": "runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04"
}
  2. Order Dockerfile commands for better caching (put rarely-changing commands first)
  3. Use network volumes to persist installed packages

Performance Issues

Job running slower than expected

Solutions:

  1. Verify GPU is being used:
gpu run python -c "import torch; print(torch.cuda.is_available())"
  2. Check GPU utilization:
gpu run -i bash
# Then: nvidia-smi
  3. Ensure CUDA is enabled in your code:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
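
For a slightly fuller check than the one-liner in step 1, the sketch below (assuming PyTorch is installed in the pod image) reports the device PyTorch sees and confirms that a small workload actually allocates GPU memory; gpu_check.py is just an illustrative name:

# gpu_check.py (hypothetical) - quick PyTorch GPU diagnostic
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    x = torch.randn(4096, 4096, device="cuda")
    y = x @ x                  # small matmul to exercise the GPU
    torch.cuda.synchronize()   # wait for the kernel so the numbers below are meaningful
    print("Checksum:", float(y.sum()))
    print("Memory allocated:", round(torch.cuda.memory_allocated() / 1e6), "MB")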

High costs

Solutions:

  1. Enable auto-stop (default 5 minutes):
{
  "cooldown_minutes": 5
}
  2. Use spot/community instances:
gpu run --cloud-type community python train.py
  3. Choose an appropriate GPU size - don't overprovision
  4. Stop pods when not needed:
gpu stop

Getting More Help

  1. Verbose output: Add -v, -vv, or -vvv for more details:
gpu run -vvv python train.py
  2. Daemon logs: Check for backend errors:
gpu daemon logs --tail 100
  3. Status check: See current state:
gpu status
gpu daemon status
  4. Community support: Join our Discord for help and announcements
  5. Reddit: Ask questions or share workflows in r/gpucli
  6. Report issues: github.com/gpu-cli/gpu
