Why I Started Monitoring GPU Memory in My Self-Hosted AI Stack
I run several AI models locally on my Proxmox setup—mostly for text generation and image processing. These models live in Docker containers, and for months, everything worked fine. Then I started noticing something frustrating: after a few days of uptime, inference would slow down. Sometimes requests would hang completely. A container restart always fixed it, but I needed to catch the problem before it affected actual use.
The culprit was GPU memory leaks. Not dramatic crashes, just gradual accumulation until the model couldn't allocate what it needed. I needed a way to detect this automatically and restart containers before they became unusable.
My Setup and What I Actually Run
Here's what I'm working with:
- Proxmox host with an NVIDIA GPU passed through to a VM
- Docker containers running llama.cpp-based models
- Models exposed via OpenAI-compatible APIs
- n8n workflows that depend on these models being responsive
I don't use Docker Model Runner or the newer Compose models syntax. My setup predates that, and I haven't migrated because what I have works. Each model runs as a standard service in Docker Compose.
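For context, a stripped-down version of one of these services looks roughly like this. Treat it as a sketch: the image name and port are placeholders, and the GPU reservation assumes the NVIDIA Container Toolkit is installed in the VM.

services:
  llama-model:
    image: my-llama-image        # placeholder: your own llama.cpp server image
    ports:
      - "8080:8080"              # OpenAI-compatible API
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped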
The Health Check Strategy That Actually Works
Docker Compose has built-in health checks, but a simple is-the-API-up probe wasn't enough on its own. I needed to check two things:
- Is the API responding at all?
- Is GPU memory usage climbing beyond safe limits?
Here's the health check section I added to my Compose file:
services:
  llama-model:
    image: my-llama-image
    healthcheck:
      test: ["CMD-SHELL", "/health-check.sh"]
      interval: 2m
      timeout: 10s
      retries: 2
      start_period: 1m
The start_period matters because model initialization takes time. I don't want false failures during startup.
The Health Check Script
I wrote a simple shell script that runs inside the container. It checks both API health and GPU memory:
#!/bin/bash

# Check if the API responds at all
curl -f http://localhost:8080/health > /dev/null 2>&1
API_STATUS=$?

if [ $API_STATUS -ne 0 ]; then
  echo "API health check failed"
  exit 1
fi

# Check GPU memory if nvidia-smi is available
if command -v nvidia-smi &> /dev/null; then
  USED_MEM=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits)
  TOTAL_MEM=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits)

  # Calculate percentage used
  PERCENT=$((USED_MEM * 100 / TOTAL_MEM))

  # Fail if memory usage exceeds 90%
  if [ "$PERCENT" -gt 90 ]; then
    echo "GPU memory usage too high: ${PERCENT}%"
    exit 1
  fi
fi

exit 0
This script is copied into the container during the build. The 90% threshold is what worked for my 12GB GPU. Your number will vary based on your hardware and model size.
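The Dockerfile side is just a copy and a permission bit. The base image name below is a placeholder; use whatever your llama.cpp image builds from, and make sure bash, curl, and nvidia-smi are actually available inside the container.

# placeholder base image
FROM my-llama-base-image

# The health check script needs bash, curl, and nvidia-smi in the container.
COPY health-check.sh /health-check.sh
RUN chmod +x /health-check.sh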
What Didn't Work Initially
My first attempt used only API health checks. The problem: the API would still respond even when GPU memory was nearly full. Requests would succeed but take minutes instead of seconds. Users would think the service was broken, but Docker saw it as healthy.
I also tried monitoring from outside the container using nvidia-smi on the host, but attributing memory usage to specific containers was messy. Running the check inside the container was cleaner: the container doing the work is the one whose health status flips, so the restart hits the right service without any PID-to-container mapping.
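For the curious, the host-side attempt looked roughly like this: query per-process GPU memory, then map each PID back to a container through its cgroup. It works, but the parsing is fragile, which is why I abandoned it. This is a sketch, not something I still run.

#!/bin/bash
# Host-side attribution attempt: per-process GPU memory -> owning container.
nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader,nounits \
  | while IFS=', ' read -r pid used_mib; do
      # A containerized process has its 64-char container ID in its cgroup path.
      cid=$(grep -o -m1 '[0-9a-f]\{64\}' "/proc/$pid/cgroup" 2>/dev/null)
      name=$(docker inspect --format '{{.Name}}' "$cid" 2>/dev/null)
      echo "PID $pid uses ${used_mib} MiB (container: ${name:-unknown})"
    done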
The Restart Policy Problem
Docker's restart policies only react to the container process exiting; they don't trigger on health check failures. A container can sit unhealthy indefinitely without ever restarting. I needed something external to act on the health status.
Automatic Restarts with a Monitoring Script
I run a simple monitoring script on the Docker host via cron every 5 minutes:
#!/bin/bash

UNHEALTHY=$(docker ps --filter health=unhealthy --format "{{.Names}}")

if [ -n "$UNHEALTHY" ]; then
  for container in $UNHEALTHY; do
    echo "$(date): Restarting unhealthy container: $container"
    docker restart "$container"
    # Log to a file for tracking
    echo "$(date): Restarted $container" >> /var/log/docker-health-restarts.log
  done
fi
This runs outside Docker Compose, which means it survives Compose restarts and works across multiple Compose projects.
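For reference, the crontab entry is a single line. The script path below is just an example; point it at wherever you keep the script.

# Run the health monitor every 5 minutes; capture any output alongside the restart log
*/5 * * * * /usr/local/bin/docker-health-monitor.sh >> /var/log/docker-health-restarts.log 2>&1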
Why Not Use Docker Events?
I considered using docker events to listen for health status changes, but cron was simpler and more reliable. A long-running event listener that dies stays dead until someone notices; a cron job just runs again at the next interval. And if I need to debug, the log file shows exactly what happened and when.
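For comparison, the event-driven version would be a long-running listener along these lines. It's a sketch of the approach I rejected, not something I run.

#!/bin/bash
# Restart a container as soon as Docker reports it unhealthy.
# If this loop dies, nothing brings it back -- which is why I stuck with cron.
docker events --filter 'type=container' --format '{{.Status}} {{.Actor.Attributes.name}}' \
  | while read -r status detail name; do
      if [ "$status $detail" = "health_status: unhealthy" ]; then
        echo "$(date): Restarting $name"
        docker restart "$name"
      fi
    done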
GPU Memory Leak Patterns I've Observed
After months of logs, I noticed patterns:
- Memory usage climbs slowly over 3-5 days of normal use
- Large context windows (4k+ tokens) accelerate the leak
- Batch processing multiple requests back-to-back makes it worse
- Simple inference requests don't trigger leaks as quickly
These aren't bugs I can fix—they're characteristics of the inference engine I'm using. The health check system just makes them manageable.
Real-World Impact
Before implementing this:
- I manually restarted containers every few days
- Users would report slow responses, and I'd have to investigate
- Overnight processing jobs would sometimes fail silently
After:
- Containers restart automatically when memory usage gets dangerous
- Restarts happen within minutes of a container going unhealthy, and in practice they tend to land outside peak use (the 2-minute health check interval plus the 5-minute cron pass means failures are caught quickly, but not instantly)
- I get logs showing when and why restarts happened
The system isn't perfect. A restart still causes a brief service interruption. But it's better than gradual degradation that affects all users.
Things I'm Still Figuring Out
This setup works, but there are gaps:
- I don't have a good way to alert on frequent restarts. If a container restarts every hour, something deeper is wrong. (A rough counting sketch follows this list.)
- The health check doesn't distinguish between different types of memory leaks. GPU memory and system memory issues look the same.
- I haven't tested this with multiple GPUs or GPU partitioning (MIG). The nvidia-smi queries might need adjustment.
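The restart-frequency gap at least has an obvious starting point: the log the cron job already writes. Something like the following could run daily and flag containers that show up too often. The threshold is a placeholder, it counts over the whole log rather than a time window, and it only prints a warning instead of alerting anywhere useful.

#!/bin/bash
# Rough restart-frequency report: count how often each container appears
# in the restart log and flag anything above a placeholder threshold.
LOG=/var/log/docker-health-restarts.log
THRESHOLD=5

grep -o 'Restarted .*' "$LOG" | sort | uniq -c | sort -rn \
  | while read -r count _ container; do
      if [ "$count" -gt "$THRESHOLD" ]; then
        echo "WARNING: $container has been restarted $count times (see $LOG)"
      fi
    done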
Key Takeaways from Running This
Health checks need to match your actual failure modes. Generic API pings aren't enough for GPU workloads. You need to check the resources that actually constrain your service.
Restart automation should be simple and external. Don't rely on Docker Compose or orchestration features that might not be available everywhere. A cron job works on any Linux system.
Logs matter more than you think. When a container restarts at 3 AM, you need to know why. Timestamp everything.
Thresholds are environment-specific. My 90% GPU memory limit works for my hardware and models. Yours will be different. Start conservative and adjust based on actual behavior.
Gradual failures are harder to catch than crashes. A memory leak that takes days to manifest won't trigger normal error handling. You need time-based monitoring, not just error-based monitoring.