Troubleshooting

Unknown Runtime Specified `nvidia`

Symptom: Docker Compose fails with an error similar to:

Error response from daemon: Unknown runtime specified nvidia

Cause: The NVIDIA Container Runtime is not installed or not configured as a Docker runtime.

Fix:

sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Then retry `docker compose up`.
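
To confirm that the runtime is registered after the restart, one option is to inspect the runtimes Docker reports. The sketch below is not part of this project: `has_nvidia_runtime` is a hypothetical helper name, and it assumes the runtime is registered under the name `nvidia`, as in the error message above.

```shell
# Hypothetical helper: succeeds if the runtime list on stdin mentions "nvidia".
has_nvidia_runtime() {
  grep -q '"nvidia"'
}

# Usage (requires a running Docker daemon):
#   docker info --format '{{json .Runtimes}}' | has_nvidia_runtime \
#     && echo "nvidia runtime registered" \
#     || echo "nvidia runtime missing"
```

If the runtime is still missing, inspect `/etc/docker/daemon.json`, which is where `nvidia-ctk runtime configure` writes the runtime entry by default.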

Out-of-Memory (OOM)

Symptom: The inference container is killed by the OOM killer, or you see errors such as `CUDA out of memory` or `cudaMalloc failed`.

Cause: The model weights and/or the KV cache for the context window exceed the memory available on the Jetson, where the CPU and GPU share the same physical RAM.

Fix:

Reduce N_GPU_LAYERS to offload fewer layers to the GPU (e.g., change from -1 to 20).
Reduce N_CTX to lower the KV cache memory footprint (e.g., 4096 → 2048).
Switch to a smaller or more aggressively quantized model (e.g., Q4_K_M → Q2_K, or 7B → 3B).
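
To see why lowering N_CTX helps, a rough KV cache estimate can be worked out from the model dimensions. The sketch below assumes an unquantized 16-bit KV cache and a hypothetical 7B-class model with 32 layers, 32 KV heads, and a head dimension of 128. These numbers are illustrative and not taken from this project; models that use grouped-query attention have fewer KV heads and a correspondingly smaller cache.

```shell
# Rough KV cache size in MiB:
#   2 (keys and values) * layers * context * kv_heads * head_dim * bytes per element
kv_cache_mib() {
  local n_layers=$1 n_ctx=$2 n_kv_heads=$3 head_dim=$4 bytes_per_elem=$5
  echo $(( 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem / 1024 / 1024 ))
}

# Hypothetical 7B-class dimensions, 16-bit cache (2 bytes per element)
kv_cache_mib 32 4096 32 128 2   # prints 2048
kv_cache_mib 32 2048 32 128 2   # prints 1024
```

Under these assumptions, halving N_CTX from 4096 to 2048 frees about 1 GiB. The model weights take up additional memory on top of the cache.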

See Model Selection & Sizing for recommended settings.

Inference Is Very Slow

Symptom: Token generation is much slower than expected (e.g., < 1 token/second).

Cause: The model is running on the CPU only, the device is in a reduced power mode, or clock speeds are throttled due to thermal limits.

Fix:

Verify that N_GPU_LAYERS is set to a non-zero value (use -1 to offload all layers).
Check that the NVIDIA Container Runtime is active and the container has GPU access:
```
docker exec <inference-container-name> nvidia-smi
```
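Set the maximum power mode and optionally lock clocks at their maximum. Mode 0 is the highest-performance mode on many modules, but mode IDs vary by module and JetPack release; confirm the active mode with `sudo nvpmodel -q`: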
```
sudo nvpmodel -m 0
sudo jetson_clocks
```
Check for thermal throttling: an overheating device reduces its clock speeds. Ensure adequate cooling.
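
One way to check temperatures is to read the standard Linux thermal sysfs entries, which report millidegrees Celsius. The sketch below assumes the Jetson exposes its sensors under `/sys/class/thermal` (zone names vary by module); `millideg_to_c` and `print_thermal_zones` are hypothetical helper names. NVIDIA's `tegrastats` utility reports similar readings alongside clock and memory usage.

```shell
# Convert a sysfs reading in millidegrees Celsius to degrees with one decimal.
millideg_to_c() {
  awk -v t="$1" 'BEGIN { printf "%.1f\n", t / 1000 }'
}

# Print every readable thermal zone as "<type>: <temp> C".
print_thermal_zones() {
  local zone raw name
  for zone in /sys/class/thermal/thermal_zone*; do
    # Skip zones that do not exist or cannot be read
    raw=$(cat "$zone/temp" 2>/dev/null) || continue
    name=$(cat "$zone/type" 2>/dev/null || echo unknown)
    echo "$name: $(millideg_to_c "$raw") C"
  done
}

print_thermal_zones
```

Running this while the model is generating tokens shows whether temperatures climb toward the point where the device starts reducing clocks.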

Model Not Found

Symptom: The inference server fails to start with a message such as `model file not found` or `no such file or directory`.

Cause: The GGUF model file is not present at the path specified by MODEL_PATH.

Fix:

Confirm the model file exists on the host:
```
ls -lh ./models/
```
Check that MODEL_PATH in your .env file matches the actual filename, including the full path relative to the project root.
Ensure the models/ directory is correctly mounted into the container (check the volumes section in docker-compose.jetson.yml).
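
The first two checks can be combined into a small script run from the project root. This is a sketch under stated assumptions: it reads MODEL_PATH from a plain `KEY=value` line in `.env` (no quoting or variable expansion), treats the value as a host path relative to the current directory, and relies on GGUF files beginning with the four ASCII bytes `GGUF`. `env_value` and `is_gguf` are hypothetical helper names.

```shell
# Print the value of KEY from a simple KEY=value env file (last match wins).
env_value() {
  local key=$1 file=$2
  grep "^${key}=" "$file" 2>/dev/null | tail -n 1 | cut -d= -f2-
}

# Succeed if the file starts with the GGUF magic bytes.
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

model_path=$(env_value MODEL_PATH .env)
if [ -z "$model_path" ]; then
  echo "MODEL_PATH is not set in .env"
elif [ ! -f "$model_path" ]; then
  echo "missing: $model_path"
elif ! is_gguf "$model_path"; then
  echo "not a GGUF file (possibly an incomplete download): $model_path"
else
  echo "ok: $model_path"
fi
```

The magic-byte check catches a file that exists under the right name but is not a usable model, such as a partial download.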

CUDA Driver Version Mismatch

Symptom: Container logs show an error such as:

CUDA driver version is insufficient for CUDA runtime version

Cause: The JetPack / L4T version on the host does not match the CUDA version expected by the container image.

Fix:

Verify your JetPack version:
```
cat /etc/nv_tegra_release
```
Ensure you are running JetPack 6.x (L4T R36.x). Earlier JetPack versions are not supported.
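
The release check can also be scripted. The sketch below assumes the first line of `/etc/nv_tegra_release` has the usual form `# R36 (release), REVISION: ...`; `l4t_major` is a hypothetical helper name.

```shell
# Extract the L4T major release (e.g. 36) from nv_tegra_release text on stdin.
l4t_major() {
  sed -n 's/^# R\([0-9][0-9]*\) (release).*/\1/p' | head -n 1
}

if [ -r /etc/nv_tegra_release ]; then
  major=$(l4t_major < /etc/nv_tegra_release)
  if [ "$major" = "36" ]; then
    echo "L4T R36 detected (JetPack 6.x)"
  else
    echo "L4T R${major:-unknown} detected; JetPack 6.x (L4T R36.x) is required"
  fi
else
  echo "/etc/nv_tegra_release not found; is this a Jetson host?"
fi
```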

If you have recently updated JetPack, rebuild the inference image to pick up the correct L4T base:

docker compose -f docker-compose.yml -f docker-compose.jetson.yml build inference
docker compose -f docker-compose.yml -f docker-compose.jetson.yml up -d
