Troubleshooting

Common issues and their solutions when using GPU CLI.

Connection Issues

Daemon not running

Symptoms:

  • "Failed to connect to daemon"
  • "Connection refused"
  • Commands hang or timeout

Solutions:

  1. Start the daemon manually:
gpu daemon start
  2. Check daemon status:
gpu daemon status
  3. View daemon logs for errors:
gpu daemon logs --tail 50
  4. Restart the daemon:
gpu daemon restart

SSH connection failures

Symptoms:

  • "Connection refused"
  • "Host key verification failed"
  • "Permission denied (publickey)"

Solutions:

  1. Verify your RunPod API key is valid:
gpu auth status
  2. Re-authenticate if needed:
gpu auth login
  3. Check that the pod is running in the RunPod console
  4. Generate new SSH keys if corrupted:
gpu auth login --generate-ssh-keys
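
If you suspect the network rather than your keys, a bare TCP reachability check against the pod's SSH endpoint can rule that out. The sketch below uses only the Python standard library; the host and port are placeholders you would copy from the RunPod console:

# ssh_reachability.py - can we open a TCP connection to the pod's SSH endpoint?
import socket

HOST = "203.0.113.10"  # placeholder - the pod's public IP from the RunPod console
PORT = 22              # placeholder - use the SSH port shown in the console (may not be 22)

try:
    with socket.create_connection((HOST, PORT), timeout=5):
        print(f"TCP connection to {HOST}:{PORT} succeeded - the SSH port is reachable")
except OSError as exc:
    print(f"Could not connect to {HOST}:{PORT}: {exc}")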

Timeout errors

Symptoms:

  • "Operation timed out"
  • Long waits followed by failures

Solutions:

  1. Check your network connectivity

  2. Verify RunPod service status at status.runpod.io

  3. Increase verbosity to see where it's hanging:

gpu run -vvv python test.py
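
If you want to script the connectivity check, a minimal standard-library sketch that simply tries to reach status.runpod.io (the status page mentioned above) looks like this; the 5-second timeout is an arbitrary choice:

# connectivity_check.py - rough network sanity check using only the standard library
import urllib.request

URL = "https://status.runpod.io"  # the status page mentioned above

try:
    with urllib.request.urlopen(URL, timeout=5) as resp:
        print(f"Reached {URL} (HTTP {resp.status})")
except Exception as exc:
    print(f"Could not reach {URL}: {exc}")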

Pod Provisioning

No GPUs available

Symptoms:

  • "No pods available matching criteria"
  • "Unable to find available GPU"

Solutions:

  1. Check current GPU availability:
gpu inventory --available
  2. Try a different GPU type:
gpu run --gpu-type "RTX 3090" python train.py
  3. Use min-vram for flexibility:
gpu run --min-vram 24 python train.py
  4. Remove datacenter constraints (if you're using a network volume in a specific datacenter, GPU availability may be limited)
  5. Wait and retry - GPU availability changes frequently

Datacenter constraint failures

Symptoms:

  • "No pods available in datacenter X"
  • Pod fails to start despite GPUs showing available

Cause: When using a network volume, pods must be created in the same datacenter as the volume.

Solutions:

  1. Check your volume's datacenter:
gpu volume list --detailed
  2. Check GPU availability in that datacenter:
gpu inventory --available --region <DATACENTER>
  3. Create a new volume in a datacenter with better GPU availability:
gpu volume create --name new-volume --datacenter US-OR-1
gpu volume set-global new-volume
  4. Temporarily disable the network volume in gpu.jsonc:
{
  "volume_mode": "none"
}

Pod starts but command fails immediately

Symptoms:

  • Pod provisions successfully but job exits instantly
  • "Command not found" errors

Solutions:

  1. Check that your command is correct:
# Wrong - missing python
gpu run train.py

# Correct
gpu run python train.py
  2. Verify dependencies are installed in your Dockerfile or gpu.jsonc:
{
  "environment": {
    "python": {
      "requirements": "requirements.txt"
    }
  }
}
  3. Check that files synced correctly:
gpu run -i bash
# Then inspect /workspace
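
If you'd rather not open an interactive shell, a small diagnostic script can list what actually landed on the pod. This is only a sketch; check_workspace.py is a hypothetical file name and /workspace is assumed to be the sync destination, as in the example above:

# check_workspace.py (hypothetical) - list the files that were synced to the pod
from pathlib import Path

root = Path("/workspace")  # assumed sync destination, as in the example above
for path in sorted(root.rglob("*")):
    if path.is_file():
        print(path.relative_to(root), "-", path.stat().st_size, "bytes")

Run it like any other command, for example: gpu run python check_workspace.py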

File Sync Issues

Files not syncing to pod

Symptoms:

  • "File not found" errors on pod
  • Old versions of files running

Causes:

  • File excluded by .gitignore
  • Large files taking time to sync
  • Sync errors

Solutions:

  1. Check .gitignore patterns - gitignored files don't sync:
cat .gitignore
  2. Force a full sync:
gpu run --force-sync python train.py
  3. Show detailed sync progress:
gpu run --show-sync python train.py
  4. For large files, use network volumes or downloads instead:
{
  "download": [
    { "strategy": "hf", "source": "model-name" }
  ]
}
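
To pin down which .gitignore rule is excluding a particular file (step 1 above), you can ask git directly. The sketch below shells out to git check-ignore, which reports the matching pattern; it assumes your project is a git repository and that git is on your PATH, and the default path is just an illustration:

# why_ignored.py (hypothetical) - show which .gitignore rule matches a file
import subprocess
import sys

# assumes the project is a git repository and git is on PATH
path = sys.argv[1] if len(sys.argv) > 1 else "data/model.bin"  # illustrative default
result = subprocess.run(["git", "check-ignore", "-v", path],
                        capture_output=True, text=True)
if result.returncode == 0:
    # output format: <source>:<line>:<pattern>	<path>
    print("Ignored by:", result.stdout.strip())
else:
    print(path, "is not ignored")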

Outputs not syncing back

Symptoms:

  • Job completes but output files are missing locally
  • Only some outputs appear

Solutions:

  1. Check your output patterns in gpu.jsonc:
{
  "outputs": [
    "outputs/",
    "checkpoints/",
    "*.pt",
    "*.safetensors"
  ]
}
  2. Verify patterns match actual paths (relative to workspace):
# If your script writes to ./results/model.pt
# Pattern should be:
"outputs": ["results/"]
  3. Debug output syncing:
gpu run --show-outputs python train.py
  4. Ensure outputs aren't excluded:
{
  "exclude_outputs": [
    "*.tmp",
    "*.log"
  ]
}
  5. Wait for sync to complete:
gpu run --sync python train.py
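
The simplest way to keep patterns and paths aligned is to have your script write into a directory you list under outputs. A minimal sketch (the file names are illustrative, and it assumes PyTorch is available in the pod environment):

# train.py (sketch) - write results somewhere the "outputs" patterns will match
from pathlib import Path
import torch

out_dir = Path("outputs")      # matched by the "outputs/" pattern above
out_dir.mkdir(exist_ok=True)

model = torch.nn.Linear(4, 2)  # stand-in for a real model
torch.save(model.state_dict(), out_dir / "model.pt")  # also matched by "*.pt"
print("Saved", out_dir / "model.pt")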

Sync is slow

Causes:

  • Large number of small files
  • Large files being synced repeatedly

Solutions:

  1. Add unnecessary files to .gitignore:
__pycache__/
.venv/
node_modules/
*.pyc
  2. Use network volumes for large datasets instead of syncing
  3. Use the download feature for models:
{
  "download": [
    { "strategy": "hf", "source": "stabilityai/sdxl-base-1.0" }
  ]
}
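
If it isn't obvious what is making the sync heavy, a quick local tally of file counts and sizes per top-level directory can point at the culprits. A rough sketch using only the Python standard library; run it from your project root (sync_audit.py is a hypothetical name):

# sync_audit.py (hypothetical) - tally file count and size per top-level directory
import os

totals = {}  # top-level entry -> (file count, total bytes)
for dirpath, _dirnames, filenames in os.walk("."):
    rel = os.path.relpath(dirpath, ".")
    top = rel.split(os.sep)[0] if rel != "." else "."
    count, size = totals.get(top, (0, 0))
    for name in filenames:
        size += os.path.getsize(os.path.join(dirpath, name))
    totals[top] = (count + len(filenames), size)

for top, (count, size) in sorted(totals.items(), key=lambda kv: -kv[1][1]):
    print(f"{top:30s} {count:7d} files  {size / 1e6:10.1f} MB")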

Volume Issues

Stale volume reference

Symptoms:

  • "Volume not found"
  • "Invalid volume ID"
  • Errors after deleting volume via RunPod console

Cause: Local config has a reference to a volume that was deleted externally.

Solution: GPU CLI auto-reconciles volume references. Simply run:

gpu volume list

This triggers reconciliation and clears stale references automatically.

Volume not mounting

Symptoms:

  • /runpod-volume is empty
  • "Volume not attached" errors

Solutions:

  1. Verify the volume exists:
gpu volume list
  2. Check that the pod is in the same datacenter as the volume:
gpu volume list --detailed
  3. Ensure the volume isn't attached to another active pod (volumes can only attach to one pod at a time)
  4. Check volume_mode in your config:
{
  "volume_mode": "global"
}
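
For a quick check from inside the pod (for example via gpu run -i bash), you can verify whether /runpod-volume is actually a mount point and whether it contains anything. A minimal standard-library sketch; check_volume.py is just an illustrative name:

# check_volume.py (hypothetical) - is /runpod-volume mounted and non-empty?
import os

path = "/runpod-volume"  # mount path used elsewhere on this page
print("exists:  ", os.path.isdir(path))
print("is mount:", os.path.ismount(path))
print("entries: ", os.listdir(path) if os.path.isdir(path) else "n/a")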

Volume full

Symptoms:

  • "No space left on device"
  • Write operations failing

Solutions:

  1. Check volume usage:
gpu volume status
  2. Clean up unused files:
gpu run -i bash
# Then: rm -rf /runpod-volume/unused-model/
  3. Extend the volume:
gpu volume extend <VOLUME> --size 500
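
Before extending, it can help to see where the space went. The sketch below reports free space and the size of each top-level entry on the volume; it uses only the Python standard library and assumes the /runpod-volume mount path used elsewhere on this page:

# volume_usage.py (hypothetical) - free space and largest top-level entries
import os
import shutil

VOLUME = "/runpod-volume"

usage = shutil.disk_usage(VOLUME)
print(f"used {usage.used / 1e9:.1f} GB of {usage.total / 1e9:.1f} GB "
      f"({usage.free / 1e9:.1f} GB free)")

def tree_size(path):
    # sum file sizes under path, skipping anything unreadable
    total = 0
    for dirpath, _dirs, files in os.walk(path, onerror=lambda e: None):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass
    return total

for entry in sorted(os.listdir(VOLUME)):
    full = os.path.join(VOLUME, entry)
    size = tree_size(full) if os.path.isdir(full) else os.path.getsize(full)
    print(f"{entry:40s} {size / 1e9:8.2f} GB")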

Volume reconciliation

GPU CLI automatically syncs volume state with the cloud provider:

  • Deleted volumes: If you delete a volume via the RunPod console, GPU CLI detects this and clears stale references
  • Metadata updates: Volume name and datacenter changes are synced automatically
  • When it runs: On gpu volume list, gpu run, daemon startup, and every 15 minutes

This means you can safely manage volumes via the RunPod web console.


Authentication Issues

API key errors

Symptoms:

  • "Unauthorized"
  • "Invalid API key"
  • "Authentication failed"

Solutions:

  1. Check authentication status:
gpu auth status
  2. Re-authenticate:
gpu auth login
  3. Verify your API key in RunPod Settings
  4. Ensure the API key has the correct permissions (read/write access to pods)

Hub credential errors

Symptoms:

  • "401 Unauthorized" when downloading from HuggingFace
  • "Invalid token" for Civitai downloads

Solutions:

  1. Add or update credentials:
gpu auth add hf
gpu auth add civitai
  2. Verify your token has required scopes:
    • HuggingFace: Needs read access for gated models
    • Civitai: API key from account settings
  3. Check configured hubs:
gpu auth hubs
  4. Remove and re-add credentials:
gpu auth remove hf
gpu auth add hf
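
To confirm a HuggingFace token is valid independently of GPU CLI, the huggingface_hub library (if it's installed) can report which account the token belongs to. A minimal sketch; reading the token from the HF_TOKEN environment variable is just one option:

# check_hf_token.py (hypothetical) - sanity-check a HuggingFace token
import os
from huggingface_hub import HfApi  # assumes huggingface_hub is installed

token = os.environ.get("HF_TOKEN")  # or paste the token directly for a one-off check
try:
    info = HfApi().whoami(token=token)
    print("Token is valid for user:", info.get("name"))
except Exception as exc:
    print("Token check failed:", exc)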

Build Issues

Dockerfile build fails

Symptoms:

  • "Build failed"
  • Docker layer errors

Solutions:

  1. Validate your Dockerfile locally first:
docker build -t test .
  2. Check for common issues:
    • Missing base image
    • Invalid commands
    • Network issues during the build
  3. Force a rebuild:
gpu run --rebuild python train.py
  4. Check daemon logs for detailed errors:
gpu daemon logs

Build is slow

Solutions:

  1. Use a pre-built base image:
{
  "docker_image": "runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04"
}
  2. Order Dockerfile commands for better caching (put rarely-changing commands first)
  3. Use network volumes to persist installed packages

Performance Issues

Job running slower than expected

Solutions:

  1. Verify GPU is being used:
gpu run python -c "import torch; print(torch.cuda.is_available())"
  2. Check GPU utilization:
gpu run -i bash
# Then: nvidia-smi
  3. Ensure CUDA is enabled in your code:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
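
For a slightly fuller check than the one-liner in step 1, the sketch below (assuming PyTorch is installed in the pod image) reports the device PyTorch sees and confirms that a small workload actually allocates GPU memory; gpu_check.py is just an illustrative name:

# gpu_check.py (hypothetical) - quick PyTorch GPU diagnostic
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    x = torch.randn(4096, 4096, device="cuda")
    y = x @ x                  # small matmul to exercise the GPU
    torch.cuda.synchronize()   # wait for the kernel so the numbers below are meaningful
    print("Checksum:", float(y.sum()))
    print("Memory allocated:", round(torch.cuda.memory_allocated() / 1e6), "MB")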

High costs

Solutions:

  1. Enable auto-stop (default 5 minutes):
{
  "cooldown_minutes": 5
}
  2. Use spot/community instances:
gpu run --cloud-type community python train.py
  3. Choose an appropriate GPU size - don't overprovision
  4. Stop pods when not needed:
gpu stop

Getting More Help

  1. Verbose output: Add -v, -vv, or -vvv for more details:
gpu run -vvv python train.py
  2. Daemon logs: Check for backend errors:
gpu daemon logs --tail 100
  3. Status check: See current state:
gpu status
gpu daemon status
  4. Community support: Join our Discord for help and announcements
  5. Reddit: Ask questions or share workflows in r/gpucli
  6. Report issues: github.com/gpu-cli/gpu
