# Troubleshooting
Common issues and their solutions when using GPU CLI.
## Connection Issues

### Daemon not running
Symptoms:
- "Failed to connect to daemon"
- "Connection refused"
- Commands hang or timeout
Solutions:
- Start the daemon manually:

```bash
gpu daemon start
```

- Check daemon status (a generic process check also follows this list):

```bash
gpu daemon status
```

- View daemon logs for errors:

```bash
gpu daemon logs --tail 50
```

- Restart the daemon:

```bash
gpu daemon restart
```
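If the daemon subcommands themselves hang, a plain process check can confirm whether anything is running at all. This is ordinary shell, not part of GPU CLI, and the process name pattern is an assumption:

```bash
# Look for a running daemon process; the name pattern is a guess - adjust if needed
pgrep -fl "gpu daemon"
```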
### SSH connection failures

Symptoms:
- "Connection refused"
- "Host key verification failed"
- "Permission denied (publickey)"
Solutions:
- Verify your RunPod API key is valid:

```bash
gpu auth status
```

- Re-authenticate if needed:

```bash
gpu auth login
```

- Check that the pod is running in the RunPod console
- Generate new SSH keys if corrupted:

```bash
gpu auth login --generate-ssh-keys
```

### Timeout errors
Symptoms:
- "Operation timed out"
- Long waits followed by failures
Solutions:
- Check your network connectivity (a quick check follows this list)
- Verify RunPod service status at status.runpod.io
- Increase verbosity to see where a command is hanging:

```bash
gpu run -vvv python test.py
```
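If timeouts persist, a basic reachability check helps separate local network problems from service-side issues. This is plain shell, with `curl` and `ping` assumed to be installed:

```bash
# Can the RunPod status page be reached over HTTPS?
curl -I https://status.runpod.io

# Is general connectivity healthy?
ping -c 4 1.1.1.1
```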
## Pod Provisioning

### No GPUs available
Symptoms:
- "No pods available matching criteria"
- "Unable to find available GPU"
Solutions:
- Check current GPU availability:

```bash
gpu inventory --available
```

- Try a different GPU type:

```bash
gpu run --gpu-type "RTX 3090" python train.py
```

- Use `--min-vram` for flexibility:

```bash
gpu run --min-vram 24 python train.py
```

- Remove datacenter constraints (if you use a network volume in a specific datacenter, available GPUs may be limited)
- Wait and retry - GPU availability changes frequently (a retry sketch follows this list)
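If you hit capacity often, a small fallback loop over acceptable GPU types can automate the retry. This is a plain shell sketch rather than a built-in feature; it assumes `gpu run` exits with a nonzero status when provisioning fails, and the GPU names are only examples:

```bash
# Try acceptable GPU types in order of preference until one provisions
for gpu_type in "RTX 4090" "RTX 3090" "A100 80GB"; do
  if gpu run --gpu-type "$gpu_type" python train.py; then
    break
  fi
done
```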
### Datacenter constraint failures
Symptoms:
- "No pods available in datacenter X"
- Pod fails to start despite GPUs showing available
Cause: When using a network volume, pods must be created in the same datacenter as the volume.
Solutions:
- Check your volume's datacenter:

```bash
gpu volume list --detailed
```

- Check GPU availability in that datacenter:

```bash
gpu inventory --available --region <DATACENTER>
```

- Create a new volume in a datacenter with better GPU availability:

```bash
gpu volume create --name new-volume --datacenter US-OR-1
gpu volume set-global new-volume
```

- Temporarily disable the network volume:

```json
{
  "volume_mode": "none"
}
```

### Pod starts but command fails immediately
Symptoms:
- Pod provisions successfully but job exits instantly
- "Command not found" errors
Solutions:
- Check that your command is correct:

```bash
# Wrong - missing python
gpu run train.py

# Correct
gpu run python train.py
```

- Verify dependencies are installed in your Dockerfile or gpu.jsonc:

```json
{
  "environment": {
    "python": {
      "requirements": "requirements.txt"
    }
  }
}
```

- Check that files synced correctly (see the sketch after this list):

```bash
gpu run -i bash
# Then inspect /workspace
```
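Inside the interactive shell, a couple of standard commands confirm the workspace and interpreter look the way the job expects. These are generic shell commands; adjust the paths to your project:

```bash
# Confirm your project files actually landed in the workspace
ls -la /workspace

# Confirm the interpreter the command relies on exists on PATH
which python && python --version
```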
## File Sync Issues

### Files not syncing to pod
Symptoms:
- "File not found" errors on pod
- Old versions of files running
Causes:
- File excluded by .gitignore
- Large files taking time to sync
- Sync errors
Solutions:
- Check `.gitignore` patterns - gitignored files don't sync (see the check after this list):

```bash
cat .gitignore
```

- Force a full sync:

```bash
gpu run --force-sync python train.py
```

- Show detailed sync progress:

```bash
gpu run --show-sync python train.py
```

- For large files, use network volumes or downloads instead:

```json
{
  "download": [
    { "strategy": "hf", "source": "model-name" }
  ]
}
```
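If a specific file never appears on the pod, git can tell you exactly which ignore rule excludes it. This is standard git, and the path is only an example:

```bash
# Prints the .gitignore file, line number, and pattern that excludes the path
git check-ignore -v data/model.pt
```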
### Outputs not syncing back

Symptoms:
- Job completes but output files are missing locally
- Only some outputs appear
Solutions:
- Check your output patterns in gpu.jsonc:

```json
{
  "outputs": [
    "outputs/",
    "checkpoints/",
    "*.pt",
    "*.safetensors"
  ]
}
```

- Verify patterns match actual paths (relative to the workspace):

```jsonc
// If your script writes to ./results/model.pt,
// the pattern should be:
"outputs": ["results/"]
```

- Debug output syncing:

```bash
gpu run --show-outputs python train.py
```

- Ensure outputs aren't excluded:

```json
{
  "exclude_outputs": [
    "*.tmp",
    "*.log"
  ]
}
```

- Wait for sync to complete:

```bash
gpu run --sync python train.py
```

### Sync is slow
Causes:
- Large number of small files
- Large files being synced repeatedly
Solutions:
- Add unnecessary files to `.gitignore` (the sketch after this list shows how to measure what would sync):

```
__pycache__/
.venv/
node_modules/
*.pyc
```

- Use network volumes for large datasets instead of syncing
- Use the `download` feature for models:

```json
{
  "download": [
    { "strategy": "hf", "source": "stabilityai/sdxl-base-1.0" }
  ]
}
```
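To see how much work the sync actually has to do, you can approximate the synced file set with git, since gitignored files are skipped. These are standard git and GNU tools; treating "everything git would track" as the synced set is an assumption:

```bash
# Count tracked plus untracked-but-not-ignored files - roughly what gets synced
git ls-files --cached --others --exclude-standard | wc -l

# Show the largest of those files
git ls-files --cached --others --exclude-standard | xargs -d '\n' du -h 2>/dev/null | sort -h | tail
```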
## Volume Issues

### Stale volume reference
Symptoms:
- "Volume not found"
- "Invalid volume ID"
- Errors after deleting volume via RunPod console
Cause: Local config has a reference to a volume that was deleted externally.
Solution: GPU CLI auto-reconciles volume references. Simply run:

```bash
gpu volume list
```

This triggers reconciliation and clears stale references automatically.
### Volume not mounting
Symptoms:
- `/runpod-volume` is empty
- "Volume not attached" errors
Solutions:
- Verify the volume exists:

```bash
gpu volume list
```

- Check that the pod is in the same datacenter as the volume:

```bash
gpu volume list --detailed
```

- Ensure the volume isn't attached to another active pod (volumes can only attach to one pod at a time)
- Check `volume_mode` in your config:

```json
{
  "volume_mode": "global"
}
```

### Volume full
Symptoms:
- "No space left on device"
- Write operations failing
Solutions:
- Check volume usage:

```bash
gpu volume status
```

- Clean up unused files (the sketch after this list shows how to find what is using space):

```bash
gpu run -i bash
# Then: rm -rf /runpod-volume/unused-model/
```

- Extend the volume:

```bash
gpu volume extend <VOLUME> --size 500
```
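To find out what is actually filling the volume, standard disk tools inside an interactive session work well. These are generic Linux commands, run inside `gpu run -i bash`:

```bash
# Overall usage of the mounted volume
df -h /runpod-volume

# Largest top-level directories on the volume
du -h --max-depth=1 /runpod-volume | sort -h | tail
```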
### Volume reconciliation

GPU CLI automatically syncs volume state with the cloud provider:
- Deleted volumes: If you delete a volume via the RunPod console, GPU CLI detects this and clears stale references
- Metadata updates: Volume name and datacenter changes are synced automatically
- When it runs: On `gpu volume list`, `gpu run`, daemon startup, and every 15 minutes
This means you can safely manage volumes via the RunPod web console.
## Authentication Issues

### API key errors
Symptoms:
- "Unauthorized"
- "Invalid API key"
- "Authentication failed"
Solutions:
- Check authentication status:

```bash
gpu auth status
```

- Re-authenticate:

```bash
gpu auth login
```

- Verify your API key in RunPod Settings
- Ensure the API key has the correct permissions (read/write access to pods)
### Hub credential errors
Symptoms:
- "401 Unauthorized" when downloading from HuggingFace
- "Invalid token" for Civitai downloads
Solutions:
- Add or update credentials:

```bash
gpu auth add hf
gpu auth add civitai
```

- Verify your token has the required scopes (a quick local check follows this list):
  - HuggingFace: needs read access for gated models
  - Civitai: API key from account settings
- Check configured hubs:

```bash
gpu auth hubs
```

- Remove and re-add credentials:

```bash
gpu auth remove hf
gpu auth add hf
```
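To confirm a HuggingFace token is valid independently of GPU CLI, you can query the Hub directly. This assumes the `huggingface_hub` package is installed locally:

```bash
# Prints the account the token belongs to, or an error if the token is invalid
# (recent huggingface_hub versions also read the HF_TOKEN environment variable)
huggingface-cli whoami
```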
## Build Issues

### Dockerfile build fails
Symptoms:
- "Build failed"
- Docker layer errors
Solutions:
- Validate your Dockerfile locally first (a fuller local check follows this list):

```bash
docker build -t test .
```

- Check for common issues:
  - Missing base image
  - Invalid commands
  - Network issues during the build
- Force a rebuild:

```bash
gpu run --rebuild python train.py
```

- Check daemon logs for detailed errors:

```bash
gpu daemon logs
```
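A slightly fuller local check catches most build problems before they ever reach a pod. This is plain Docker rather than a GPU CLI command, and the `torch` import is only an example dependency:

```bash
# Build the image locally using the same Dockerfile
docker build -t gpu-cli-build-test .

# Run a throwaway container and confirm a key dependency imports cleanly
docker run --rm gpu-cli-build-test python -c "import torch; print(torch.__version__)"
```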
### Build is slow

Solutions:
- Use a pre-built base image:

```json
{
  "docker_image": "runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04"
}
```

- Order Dockerfile commands for better caching (put rarely-changing commands first)
- Use network volumes to persist installed packages
## Performance Issues

### Job running slower than expected
Solutions:
- Verify the GPU is being used (see also the check after this list):

```bash
gpu run python -c "import torch; print(torch.cuda.is_available())"
```

- Check GPU utilization:

```bash
gpu run -i bash
# Then: nvidia-smi
```

- Ensure CUDA is enabled in your code:

```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
```
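If CUDA is available but throughput is still low, it helps to confirm which device and CUDA build the job actually sees. This reuses the `gpu run python -c` pattern above and only calls standard PyTorch APIs:

```bash
# Print the visible GPU, its total memory in MiB, and the CUDA version PyTorch was built against
gpu run python -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_properties(0).total_memory // 2**20, 'MiB', torch.version.cuda)"
```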
### High costs

Solutions:
- Enable auto-stop (default 5 minutes):

```json
{
  "cooldown_minutes": 5
}
```

- Use spot/community instances:

```bash
gpu run --cloud-type community python train.py
```

- Choose an appropriate GPU size - don't overprovision
- Stop pods when not needed:

```bash
gpu stop
```

## Getting More Help
- Verbose output: Add `-v`, `-vv`, or `-vvv` for more details:

```bash
gpu run -vvv python train.py
```

- Daemon logs: Check for backend errors:

```bash
gpu daemon logs --tail 100
```

- Status check: See current state:

```bash
gpu status
gpu daemon status
```

- Community support: Join our Discord for help and announcements
- Reddit: Ask questions or share workflows in r/gpucli
- Report issues: github.com/gpu-cli/gpu