
Building a bash script to auto-restart crashed Ollama models: detecting OOM kills and resetting CUDA contexts

Why I Built This Script

I run Ollama models on a local server with a single NVIDIA GPU. The setup works well most of the time, but I kept hitting the same problem: models would crash silently, usually from out-of-memory kills or CUDA context errors. The server would keep running, but the model would be dead. I’d only notice when a request timed out or when I checked logs hours later.

I needed something that could detect these failures automatically and restart the model without me having to SSH in and manually fix it. This wasn’t about uptime for a production service—it was about not losing time to preventable crashes when I’m testing workflows or running batch jobs overnight.

What Actually Causes These Crashes

Two specific failure modes kept appearing in my logs:

OOM Kills

When the Linux kernel decides a process is using too much memory, it sends a SIGKILL via the OOM killer. Ollama doesn’t get a chance to clean up—it just disappears. The parent systemd service or Docker container might still be running, but the actual model process is gone.

I confirmed this by checking dmesg after crashes:

[12345.678] Out of memory: Killed process 9876 (ollama) total-vm:16777216kB

This happened most often when I loaded a model that was slightly too large for available VRAM, or when I ran multiple models simultaneously without thinking about total memory usage.
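One mitigation I could bolt on is a pre-flight VRAM check before loading a model. This is a hedged sketch, not part of the monitor itself: `query_free_vram` and `vram_has_room` are helper names I invented, and it assumes a single GPU visible to nvidia-smi.

```shell
# Hypothetical pre-flight check: only proceed if enough VRAM is free
query_free_vram() {
    # Free VRAM of the first GPU, in MiB (plain number, no units)
    nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -n 1
}

vram_has_room() {
    local needed_mib="$1"
    local free_mib
    free_mib=$(query_free_vram 2>/dev/null)
    # Treat a failed query as "no room" rather than guessing
    [ "${free_mib:-0}" -ge "$needed_mib" ]
}

# Example: skip loading a large model if fewer than 10000 MiB are free
# vram_has_room 10000 || echo "not enough VRAM, skipping model load"
```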

CUDA Context Errors

The other failure mode was CUDA contexts getting corrupted or stuck. This showed up in Ollama’s logs as errors like:

CUDA error: an illegal memory access was encountered
CUDA error: context is destroyed

These happened after the GPU had been under load for a while, or sometimes after the system resumed from sleep (which I don’t do often, but it happened). The Ollama process would still be running, but it couldn’t actually serve requests anymore.
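The scan for these signatures can be isolated into a small helper that reads log text from stdin. This is a sketch under my setup's assumptions (Ollama under systemd, so stderr lands in the journal); `has_cuda_error` is a name I made up.

```shell
# Scan log text on stdin for the CUDA failure signatures shown above
has_cuda_error() {
    grep -qiE "CUDA error: (an illegal memory access|context is destroyed)"
}

# Usage against the journal (assumes Ollama runs as a systemd unit):
# journalctl -u ollama --since "1 hour ago" --no-pager | has_cuda_error \
#     && echo "CUDA fault found in recent logs"
```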

My Detection Strategy

I needed the script to detect both types of failures reliably without false positives.

Checking for OOM Kills

The kernel logs OOM kills to dmesg. I parse the last few minutes of kernel messages and look for the Ollama process name:

check_oom_kill() {
    local process_name="ollama"

    # Scan recent kernel messages for an OOM kill of the Ollama process
    if dmesg -T 2>/dev/null | tail -n 200 | grep -q "Out of memory.*Killed process.*${process_name}"; then
        return 0  # OOM kill detected
    fi
    return 1
}

I use dmesg -T to get human-readable timestamps, then check the last 200 lines. This covers roughly the last few minutes on my system. If the grep matches, I know Ollama was OOM killed recently.
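The pattern match can also be factored out so it's testable against captured log text. This variant is equivalent to the check above, just with the dmesg pipeline separated from the grep:

```shell
# Same pattern match as check_oom_kill, reading kernel messages from stdin
oom_killed_recently() {
    local process_name="${1:-ollama}"
    grep -q "Out of memory.*Killed process.*${process_name}"
}

# In the monitor, feed it the recent kernel log:
# dmesg -T 2>/dev/null | tail -n 200 | oom_killed_recently ollama
```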

Checking CUDA Context Health

For CUDA errors, I check two things: whether the Ollama process is running, and whether it can actually respond to a simple health check request.

check_cuda_context() {
    # First, verify process is running
    if ! pgrep -x ollama > /dev/null; then
        return 1  # Process not running
    fi
    
    # Try a simple API call with short timeout
    if ! curl -s --max-time 3 http://localhost:11434/api/tags > /dev/null 2>&1; then
        # Process exists but not responding - likely CUDA issue
        return 1
    fi
    
    return 0
}

The /api/tags endpoint is lightweight and should respond quickly if the CUDA context is healthy. If the process exists but this call times out or fails, something is wrong with the GPU state.
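One caveat: if a reverse proxy or anything else ever answered on that port, a pure connectivity check could pass while Ollama is still broken. A slightly stricter variant also inspects the body; my assumption here is that `/api/tags` returns JSON with a top-level `models` key, which it does on the versions I've used.

```shell
# Check that the response body looks like Ollama's /api/tags JSON
tags_response_ok() {
    grep -q '"models"'
}

# curl -s --max-time 3 http://localhost:11434/api/tags | tags_response_ok \
#     || echo "endpoint reachable but response doesn't look like Ollama"
```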

The Recovery Process

Once a failure is detected, recovery has to happen in the right order.

For OOM Kills

If the OOM killer took out Ollama, the process is already gone. I just need to restart it. But I also log the event so I can track how often this happens:

restart_after_oom() {
    echo "[$(date)] OOM kill detected for Ollama" >> /var/log/ollama-monitor.log
    
    # Clear any stale PID files
    rm -f /var/run/ollama.pid
    
    # Restart via systemd
    systemctl restart ollama
    
    # Wait for service to be ready
    sleep 5
    
    # Verify it came back up
    if systemctl is-active --quiet ollama; then
        echo "[$(date)] Ollama restarted successfully after OOM" >> /var/log/ollama-monitor.log
    else
        echo "[$(date)] Failed to restart Ollama after OOM" >> /var/log/ollama-monitor.log
    fi
}

I use systemd to restart because that’s how I run Ollama. If you’re running it in Docker, you’d use docker restart instead.
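For the Docker case, the same recovery might look like this. This is a hedged sketch, not something I run: `restart_after_oom_docker` is a name I made up, and it assumes the container is literally named ollama.

```shell
restart_after_oom_docker() {
    local log_file="${LOG_FILE:-/var/log/ollama-monitor.log}"
    echo "[$(date)] OOM kill detected for Ollama (Docker)" >> "$log_file"

    # docker restart replaces the systemctl restart step
    docker restart ollama
    sleep 5

    # Verify the container actually came back up
    if [ "$(docker inspect -f '{{.State.Running}}' ollama 2>/dev/null)" = "true" ]; then
        echo "[$(date)] Ollama container restarted after OOM" >> "$log_file"
    else
        echo "[$(date)] Failed to restart Ollama container" >> "$log_file"
    fi
}
```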

For CUDA Context Errors

This is trickier. Just restarting the Ollama process often doesn’t help because the CUDA context is stuck at the driver level. I have to reset the GPU state first.

My approach:

reset_cuda_context() {
    echo "[$(date)] CUDA context error detected, resetting GPU" >> /var/log/ollama-monitor.log
    
    # Stop Ollama first
    systemctl stop ollama
    sleep 2
    
    # Kill any remaining GPU processes
    pkill -9 -f ollama
    sleep 1
    
    # Reset the NVIDIA GPU (requires nvidia-smi and root; -i 0 targets
    # the single GPU in this setup)
    if command -v nvidia-smi &> /dev/null; then
        nvidia-smi --gpu-reset -i 0
        sleep 3
    fi
    
    # Restart Ollama
    systemctl start ollama
    sleep 5
    
    # Verify recovery
    if curl -s --max-time 5 http://localhost:11434/api/tags > /dev/null 2>&1; then
        echo "[$(date)] CUDA context reset successful" >> /var/log/ollama-monitor.log
    else
        echo "[$(date)] CUDA context reset failed" >> /var/log/ollama-monitor.log
    fi
}

The nvidia-smi --gpu-reset command resets the GPU state. This requires root privileges, which is why I run the monitoring script as root via cron or systemd.

Important limitation: GPU reset will kill any other processes using the GPU. If you’re running multiple things on the same GPU, this approach won’t work cleanly. In that case, you’d need a more selective recovery strategy.
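One guard I'd consider in that scenario: refuse to reset if anything other than Ollama currently holds the GPU, and just log the condition instead. `safe_to_reset_gpu` is an invented helper; it assumes nvidia-smi's `--query-compute-apps` output, one "pid, name" line per GPU process.

```shell
list_gpu_apps() {
    # One "pid, process_name" line per compute process on the GPU
    nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader 2>/dev/null
}

safe_to_reset_gpu() {
    # Safe only if no non-Ollama process currently holds the GPU
    [ "$(list_gpu_apps | grep -v ollama | grep -c .)" -eq 0 ]
}

# In reset_cuda_context, this would gate the GPU reset call:
# safe_to_reset_gpu || { echo "other GPU users present, skipping reset"; return 1; }
```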

The Main Loop

I run this as a systemd service that checks every 60 seconds:

#!/bin/bash

LOG_FILE="/var/log/ollama-monitor.log"
CHECK_INTERVAL=60

# check_oom_kill, check_cuda_context, restart_after_oom, and
# reset_cuda_context from the sections above are defined here

while true; do
    # Check for OOM kills first
    if check_oom_kill; then
        restart_after_oom
    # Otherwise verify process and CUDA health (fails if the process
    # is dead or the API stops responding)
    elif ! check_cuda_context; then
        reset_cuda_context
    fi

    sleep "$CHECK_INTERVAL"
done

The systemd unit file looks like this:

[Unit]
Description=Ollama Monitor and Auto-Restart
After=ollama.service
Requires=ollama.service

[Service]
Type=simple
ExecStart=/usr/local/bin/ollama-monitor.sh
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

I set it to start after the Ollama service and restart automatically if the monitor script itself crashes.
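For completeness, installing and enabling the unit looks roughly like this (the filenames are assumptions; the destination path matches the ExecStart above):

```shell
sudo install -m 755 ollama-monitor.sh /usr/local/bin/ollama-monitor.sh
sudo cp ollama-monitor.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now ollama-monitor.service
```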

What Didn’t Work

My first version tried to be smarter about detecting CUDA errors by parsing Ollama’s log files in real time. This was unreliable because:

  • Log rotation could happen mid-check
  • Ollama’s log format changed between versions
  • Some CUDA errors appeared in stderr, not the main log

The simple health check approach (just try to call the API) turned out to be much more reliable.

I also tried using nvidia-smi to detect GPU errors proactively, but this gave too many false positives. Temporary GPU load spikes or memory warnings didn’t necessarily mean Ollama was broken.

Current Limitations

This script works for my single-GPU, single-model setup. It has clear limitations:

  • GPU reset kills everything on the GPU, not just Ollama
  • No handling for multi-model scenarios
  • Assumes systemd for service management
  • Requires root for GPU reset
  • No notification system (just logs)

I’m okay with these trade-offs because they match my actual use case. If I needed something more sophisticated, I’d probably move to a proper monitoring tool like Prometheus with custom exporters.
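That said, if I wanted alerts without reaching for Prometheus, a minimal hook would probably be enough. This is a sketch, not part of the current script; `notify` and the ntfy.sh topic are made-up placeholders.

```shell
notify() {
    local message="$1"
    local log_file="${LOG_FILE:-/var/log/ollama-monitor.log}"
    echo "[$(date)] ${message}" >> "$log_file"
    # Push to a phone via ntfy.sh; a failure here must never break recovery
    curl -s --max-time 5 -d "$message" "https://ntfy.sh/my-ollama-alerts" > /dev/null 2>&1 || true
}

# notify "Ollama restarted after OOM kill"
```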

Key Takeaways

Simple health checks beat complex log parsing. The API endpoint test catches more real failures with fewer false positives than trying to parse error messages.

OOM kills and CUDA errors need different recovery strategies. One is just a process restart, the other requires GPU-level reset.

Running the monitor as a systemd service means it restarts automatically and starts in the right order relative to Ollama itself.

Logging every action made debugging much easier. When I see a pattern of OOM kills, I know I need to reduce model size or add more RAM. When I see CUDA resets, I check for driver issues or cooling problems.

The script isn’t elegant, but it solved the actual problem: I don’t lose hours to silent model crashes anymore.
