Why I Built This Script
I run Ollama models on a local server with a single NVIDIA GPU. The setup works well most of the time, but I kept hitting the same problem: models would crash silently, usually from out-of-memory kills or CUDA context errors. The server would keep running, but the model would be dead. I’d only notice when a request timed out or when I checked logs hours later.
I needed something that could detect these failures automatically and restart the model without me having to SSH in and manually fix it. This wasn’t about uptime for a production service—it was about not losing time to preventable crashes when I’m testing workflows or running batch jobs overnight.
What Actually Causes These Crashes
Two specific failure modes kept appearing in my logs:
OOM Kills
When the Linux kernel decides a process is using too much memory, it sends a SIGKILL via the OOM killer. Ollama doesn’t get a chance to clean up—it just disappears. The parent systemd service or Docker container might still be running, but the actual model process is gone.
I confirmed this by checking dmesg after crashes:
```
[12345.678] Out of memory: Killed process 9876 (ollama) total-vm:16777216kB
```
This happened most often when I loaded a model that was slightly too large for available VRAM, or when I ran multiple models simultaneously without thinking about total memory usage.
CUDA Context Errors
The other failure mode was CUDA contexts getting corrupted or stuck. This showed up in Ollama’s logs as errors like:
```
CUDA error: an illegal memory access was encountered
CUDA error: context is destroyed
```
These happened after the GPU had been under load for a while, or sometimes after the system resumed from sleep (which I don’t do often, but it happened). The Ollama process would still be running, but it couldn’t actually serve requests anymore.
My Detection Strategy
I needed the script to detect both types of failures reliably without false positives.
Checking for OOM Kills
The kernel logs OOM kills to dmesg. I parse the last few minutes of kernel messages and look for the Ollama process name:
```bash
check_oom_kill() {
    local process_name="ollama"

    # tail -n 200 approximates "the last few minutes" of kernel messages
    if dmesg -T 2>/dev/null | tail -n 200 | grep -q "Out of memory.*Killed process.*${process_name}"; then
        return 0  # OOM kill detected
    fi
    return 1
}
```
I use dmesg -T for human-readable timestamps, then check the last 200 lines, which covers roughly the last few minutes on my system. If the grep matches, Ollama was OOM-killed recently. One caveat: the same kill line keeps matching until it scrolls out of that 200-line window, so the monitor can trigger a second restart for an event it already handled.
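If the tail-based window feels too loose, `journalctl -k --since` can bound the query to an exact time range. This is a sketch, assuming systemd-journald is capturing kernel messages on your system; the match logic is factored into a separate function so it can be tested without root:

```bash
# Check a kernel-log stream (stdin) for an OOM kill of a given process.
oom_killed() {
    local process_name="$1"
    grep -q "Out of memory:.*Killed process.*(${process_name})"
}

# Time-bounded variant of check_oom_kill: only look at the last 5 minutes.
check_oom_kill_journal() {
    journalctl -k --since "5 minutes ago" 2>/dev/null | oom_killed "ollama"
}
```

The time bound makes the window explicit instead of depending on how fast your kernel log scrolls, though it still re-matches a handled event until the 5 minutes pass.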
Checking CUDA Context Health
For CUDA errors, I check two things: whether the Ollama process is running, and whether it can actually respond to a simple health check request.
```bash
check_cuda_context() {
    # First, verify the process is running
    if ! pgrep -x ollama > /dev/null; then
        return 1  # Process not running
    fi

    # Try a simple API call with a short timeout
    if ! curl -s --max-time 3 http://localhost:11434/api/tags > /dev/null 2>&1; then
        # Process exists but not responding - likely a CUDA issue
        return 1
    fi
    return 0
}
```
The /api/tags endpoint is lightweight and should respond quickly if the CUDA context is healthy. If the process exists but this call times out or fails, something is wrong with the GPU state.
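The single health check collapses two different failures into one return code. When debugging, it can help to tell them apart using curl's documented exit codes (7 = failed to connect, 28 = operation timed out). A sketch, with the mapping factored out so it is testable on its own:

```bash
# Map a curl exit code to a diagnosis. Exit code 7 means the connection
# was refused (process likely gone); 28 means the request timed out
# (process up but wedged, consistent with a stuck CUDA context).
diagnose_health_failure() {
    case "$1" in
        0)  echo "healthy" ;;
        7)  echo "connection refused - process likely dead" ;;
        28) echo "timeout - process alive but unresponsive" ;;
        *)  echo "curl failed with exit code $1" ;;
    esac
}

check_with_diagnosis() {
    curl -s --max-time 3 http://localhost:11434/api/tags > /dev/null 2>&1
    diagnose_health_failure "$?"
}
```

Logging the diagnosis instead of a bare failure makes the monitor log much more useful when you review it later.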
The Recovery Process
Once a failure is detected, recovery has to happen in the right order.
For OOM Kills
If the OOM killer took out Ollama, the process is already gone. I just need to restart it. But I also log the event so I can track how often this happens:
```bash
restart_after_oom() {
    echo "[$(date)] OOM kill detected for Ollama" >> /var/log/ollama-monitor.log

    # Clear any stale PID files
    rm -f /var/run/ollama.pid

    # Restart via systemd
    systemctl restart ollama

    # Wait for the service to be ready
    sleep 5

    # Verify it came back up
    if systemctl is-active --quiet ollama; then
        echo "[$(date)] Ollama restarted successfully after OOM" >> /var/log/ollama-monitor.log
    else
        echo "[$(date)] Failed to restart Ollama after OOM" >> /var/log/ollama-monitor.log
    fi
}
```
I use systemd to restart because that’s how I run Ollama. If you’re running it in Docker, you’d use docker restart instead.
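For the Docker case, a rough equivalent of the function above might look like this. The container name `ollama` is an assumption about your setup; adjust it to match:

```bash
# Docker equivalent of the systemd restart path.
# Assumes the container is named "ollama".
restart_after_oom_docker() {
    echo "[$(date)] OOM kill detected for Ollama" >> /var/log/ollama-monitor.log

    docker restart ollama
    sleep 5

    # "docker inspect -f '{{.State.Running}}'" prints true/false
    if [ "$(docker inspect -f '{{.State.Running}}' ollama 2>/dev/null)" = "true" ]; then
        echo "[$(date)] Ollama container restarted after OOM" >> /var/log/ollama-monitor.log
    else
        echo "[$(date)] Failed to restart Ollama container" >> /var/log/ollama-monitor.log
    fi
}
```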
For CUDA Context Errors
This is trickier. Just restarting the Ollama process often doesn’t help because the CUDA context is stuck at the driver level. I have to reset the GPU state first.
My approach:
```bash
reset_cuda_context() {
    echo "[$(date)] CUDA context error detected, resetting GPU" >> /var/log/ollama-monitor.log

    # Stop Ollama first
    systemctl stop ollama
    sleep 2

    # Kill any remaining GPU processes
    pkill -9 -f ollama
    sleep 1

    # Reset NVIDIA GPU (requires nvidia-smi)
    if command -v nvidia-smi &> /dev/null; then
        nvidia-smi --gpu-reset
        sleep 3
    fi

    # Restart Ollama
    systemctl start ollama
    sleep 5

    # Verify recovery
    if curl -s --max-time 5 http://localhost:11434/api/tags > /dev/null 2>&1; then
        echo "[$(date)] CUDA context reset successful" >> /var/log/ollama-monitor.log
    else
        echo "[$(date)] CUDA context reset failed" >> /var/log/ollama-monitor.log
    fi
}
```
The nvidia-smi --gpu-reset command resets the GPU state, but it only succeeds when no processes are still using the device, which is why Ollama is stopped and killed first. It also requires root privileges, which is why I run the monitoring script as root via cron or systemd.
Important limitation: GPU reset will kill any other processes using the GPU. If you’re running multiple things on the same GPU, this approach won’t work cleanly. In that case, you’d need a more selective recovery strategy.
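One way to guard against that: ask nvidia-smi which compute processes currently hold the GPU, and skip the reset if anything besides Ollama shows up. A sketch, assuming the standard `--query-compute-apps` CSV output; the parsing is deliberately simplistic (it matches the string `ollama` anywhere in a line):

```bash
# Given "pid, process_name" CSV lines on stdin (the format produced by
# nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader),
# print the PIDs that do NOT belong to Ollama.
non_ollama_gpu_pids() {
    grep -v "ollama" | cut -d',' -f1
}

# Refuse the reset when something other than Ollama holds the GPU.
safe_gpu_reset() {
    local others
    others=$(nvidia-smi --query-compute-apps=pid,process_name \
                 --format=csv,noheader 2>/dev/null | non_ollama_gpu_pids)
    if [ -n "$others" ]; then
        echo "[$(date)] Skipping GPU reset; other GPU processes: $others" \
            >> /var/log/ollama-monitor.log
        return 1
    fi
    nvidia-smi --gpu-reset
}
```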
The Main Loop
I run this as a systemd service that checks every 60 seconds:
```bash
#!/bin/bash

LOG_FILE="/var/log/ollama-monitor.log"
CHECK_INTERVAL=60

# (check_oom_kill, check_cuda_context, restart_after_oom, and
# reset_cuda_context from above are defined here)

while true; do
    # Check for OOM kills first
    if check_oom_kill; then
        restart_after_oom
    # Otherwise check GPU health: process missing or unresponsive
    elif ! check_cuda_context; then
        reset_cuda_context
    fi
    sleep "$CHECK_INTERVAL"
done
```
The systemd unit file looks like this:
```ini
[Unit]
Description=Ollama Monitor and Auto-Restart
After=ollama.service
Requires=ollama.service

[Service]
Type=simple
ExecStart=/usr/local/bin/ollama-monitor.sh
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```
I set it to start after the Ollama service and restart automatically if the monitor script itself crashes.
What Didn’t Work
My first version tried to be smarter about detecting CUDA errors by parsing Ollama’s log files in real time. This was unreliable because:
- Log rotation could happen mid-check
- Ollama’s log format changed between versions
- Some CUDA errors appeared in stderr, not the main log
The simple health check approach (just try to call the API) turned out to be much more reliable.
I also tried using nvidia-smi to detect GPU errors proactively, but this gave too many false positives. Temporary GPU load spikes or memory warnings didn’t necessarily mean Ollama was broken.
Current Limitations
This script works for my single-GPU, single-model setup. It has clear limitations:
- GPU reset kills everything on the GPU, not just Ollama
- No handling for multi-model scenarios
- Assumes systemd for service management
- Requires root for GPU reset
- No notification system (just logs)
I’m okay with these trade-offs because they match my actual use case. If I needed something more sophisticated, I’d probably move to a proper monitoring tool like Prometheus with custom exporters.
Key Takeaways
Simple health checks beat complex log parsing. The API endpoint test catches more real failures with fewer false positives than trying to parse error messages.
OOM kills and CUDA errors need different recovery strategies. One is just a process restart, the other requires GPU-level reset.
Running the monitor as a systemd service means it restarts automatically and starts in the right order relative to Ollama itself.
Logging every action made debugging much easier. When I see a pattern of OOM kills, I know I need to reduce model size or add more RAM. When I see CUDA resets, I check for driver issues or cooling problems.
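To actually spot those patterns, a small helper can tally OOM events per day from the monitor log. This is a sketch that assumes the `[$(date)] ... OOM kill detected` lines written by the functions above, with `date` producing its default `Day Mon DD ...` format:

```bash
# Summarize OOM events per day from the monitor log (read from stdin).
# Extracts the "Day Mon DD" prefix of each bracketed timestamp and counts
# how many OOM-kill lines share it.
oom_events_per_day() {
    grep "OOM kill detected" | awk -F'[][]' '{ split($2, d, " "); print d[1], d[2], d[3] }' | sort | uniq -c
}
```

Typical use would be `oom_events_per_day < /var/log/ollama-monitor.log`; a cluster of counts on one day usually points at a specific model or batch job.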
The script isn’t elegant, but it solved the actual problem: I don’t lose hours to silent model crashes anymore.