Why I Started Monitoring GPU Memory in My Self-Hosted AI Stack
I run several AI models locally on my Proxmox setup—mostly for text generation and image processing. These models live in Docker containers, and for months, everything worked fine. Then I started noticing something frustrating: after a few days of uptime, inference would slow down. Sometimes requests would hang completely. A container restart always fixed it, but I needed to catch the problem before it affected actual use.
The culprit was GPU memory leaks. Not dramatic crashes, just gradual accumulation until the model couldn't allocate what it needed. I needed a way to detect this automatically and restart containers before they became unusable.
My Setup and What I Actually Run
Here's what I'm working with:
- Proxmox host with an NVIDIA GPU passed through to a VM
- Docker containers running llama.cpp-based models
- Models exposed via OpenAI-compatible APIs
- n8n workflows that depend on these models being responsive
I don't use Docker Model Runner or the newer Compose models syntax. My setup predates that, and I haven't migrated because what I have works. Each model runs as a standard service in Docker Compose.
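For context, a stripped-down version of one of these services looks roughly like this. Treat it as a sketch: the image name and port are placeholders, and the GPU reservation assumes the NVIDIA Container Toolkit is installed in the VM.

services:
  llama-model:
    image: my-llama-image        # placeholder: your own llama.cpp server image
    ports:
      - "8080:8080"              # OpenAI-compatible API
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped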
The Health Check Strategy That Actually Works
Docker Compose has built-in health checks, but a simple is-the-API-up probe wasn't enough on its own. I needed to check two things:
- Is the API responding at all?
- Is GPU memory usage climbing beyond safe limits?
Here's the health check section I added to my Compose file:
services:
  llama-model:
    image: my-llama-image
    healthcheck:
      test: ["CMD-SHELL", "/health-check.sh"]
      interval: 2m
      timeout: 10s
      retries: 2
      start_period: 1m
The start_period matters because model initialization takes time. I don't want false failures during startup.
The Health Check Script
I wrote a simple shell script that runs inside the container. It checks both API health and GPU memory:
#!/bin/bash

# Check if the API responds at all
curl -f http://localhost:8080/health > /dev/null 2>&1
API_STATUS=$?

if [ $API_STATUS -ne 0 ]; then
  echo "API health check failed"
  exit 1
fi

# Check GPU memory if nvidia-smi is available
if command -v nvidia-smi &> /dev/null; then
  USED_MEM=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits)
  TOTAL_MEM=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits)

  # Calculate percentage used
  PERCENT=$((USED_MEM * 100 / TOTAL_MEM))

  # Fail if memory usage exceeds 90%
  if [ "$PERCENT" -gt 90 ]; then
    echo "GPU memory usage too high: ${PERCENT}%"
    exit 1
  fi
fi

exit 0
This script is copied into the container during the build. The 90% threshold is what worked for my 12GB GPU. Your number will vary based on your hardware and model size.
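The Dockerfile side is just a copy and a permission bit. The base image name below is a placeholder; use whatever your llama.cpp image builds from, and make sure bash, curl, and nvidia-smi are actually available inside the container.

# placeholder base image
FROM my-llama-base-image

# The health check script needs bash, curl, and nvidia-smi in the container.
COPY health-check.sh /health-check.sh
RUN chmod +x /health-check.sh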
What Didn't Work Initially
My first attempt used only API health checks. The problem: the API would still respond even when GPU memory was nearly full. Requests would succeed but take minutes instead of seconds. Users would think the service was broken, but Docker saw it as healthy.
I also tried monitoring from outside the container using nvidia-smi on the host, but attributing memory usage to specific containers was messy. Running the check inside the container was cleaner: the container doing the work is the one whose health status flips, so the restart hits the right service without any PID-to-container mapping.
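For the curious, the host-side attempt looked roughly like this: query per-process GPU memory, then map each PID back to a container through its cgroup. It works, but the parsing is fragile, which is why I abandoned it. This is a sketch, not something I still run.

#!/bin/bash
# Host-side attribution attempt: per-process GPU memory -> owning container.
nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader,nounits \
  | while IFS=', ' read -r pid used_mib; do
      # A containerized process has its 64-char container ID in its cgroup path.
      cid=$(grep -o -m1 '[0-9a-f]\{64\}' "/proc/$pid/cgroup" 2>/dev/null)
      name=$(docker inspect --format '{{.Name}}' "$cid" 2>/dev/null)
      echo "PID $pid uses ${used_mib} MiB (container: ${name:-unknown})"
    done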
The Restart Policy Problem
Docker's restart policies only react to the container process exiting; they don't trigger on health check failures. A container can sit unhealthy indefinitely without ever restarting. I needed something external to act on the health status.
Automatic Restarts with a Monitoring Script
I run a simple monitoring script on the Docker host via cron every 5 minutes:
#!/bin/bash

UNHEALTHY=$(docker ps --filter health=unhealthy --format "{{.Names}}")

if [ -n "$UNHEALTHY" ]; then
  for container in $UNHEALTHY; do
    echo "$(date): Restarting unhealthy container: $container"
    docker restart "$container"
    # Log to a file for tracking
    echo "$(date): Restarted $container" >> /var/log/docker-health-restarts.log
  done
fi
This runs outside Docker Compose, which means it survives Compose restarts and works across multiple Compose projects.
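For reference, the crontab entry is a single line. The script path below is just an example; point it at wherever you keep the script.

# Run the health monitor every 5 minutes; capture any output alongside the restart log
*/5 * * * * /usr/local/bin/docker-health-monitor.sh >> /var/log/docker-health-restarts.log 2>&1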
Why Not Use Docker Events?
I considered using docker events to listen for health status changes, but cron was simpler and more reliable. A long-running event listener that dies stays dead until someone notices; a cron job just runs again at the next interval. And if I need to debug, the log file shows exactly what happened and when.
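For comparison, the event-driven version would be a long-running listener along these lines. It's a sketch of the approach I rejected, not something I run.

#!/bin/bash
# Restart a container as soon as Docker reports it unhealthy.
# If this loop dies, nothing brings it back -- which is why I stuck with cron.
docker events --filter 'type=container' --format '{{.Status}} {{.Actor.Attributes.name}}' \
  | while read -r status detail name; do
      if [ "$status $detail" = "health_status: unhealthy" ]; then
        echo "$(date): Restarting $name"
        docker restart "$name"
      fi
    done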
GPU Memory Leak Patterns I've Observed
After months of logs, I noticed patterns:
- Memory usage climbs slowly over 3-5 days of normal use
- Large context windows (4k+ tokens) accelerate the leak
- Batch processing multiple requests back-to-back makes it worse
- Simple inference requests don't trigger leaks as quickly
These aren't bugs I can fix—they're characteristics of the inference engine I'm using. The health check system just makes them manageable.
Real-World Impact
Before implementing this:
- I manually restarted containers every few days
- Users would report slow responses, and I'd have to investigate
- Overnight processing jobs would sometimes fail silently
After:
- Containers restart automatically when memory usage gets dangerous
- Restarts happen within minutes of a container going unhealthy, and in practice they tend to land outside peak use (the 2-minute health check interval plus the 5-minute cron pass means failures are caught quickly, but not instantly)
- I get logs showing when and why restarts happened
The system isn't perfect. A restart still causes a brief service interruption. But it's better than gradual degradation that affects all users.
Things I'm Still Figuring Out
This setup works, but there are gaps:
- I don't have a good way to alert on frequent restarts. If a container restarts every hour, something deeper is wrong. (A rough counting sketch follows this list.)
- The health check doesn't distinguish between different types of memory leaks. GPU memory and system memory issues look the same.
- I haven't tested this with multiple GPUs or GPU partitioning (MIG). The nvidia-smi queries might need adjustment.
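The restart-frequency gap at least has an obvious starting point: the log the cron job already writes. Something like the following could run daily and flag containers that show up too often. The threshold is a placeholder, it counts over the whole log rather than a time window, and it only prints a warning instead of alerting anywhere useful.

#!/bin/bash
# Rough restart-frequency report: count how often each container appears
# in the restart log and flag anything above a placeholder threshold.
LOG=/var/log/docker-health-restarts.log
THRESHOLD=5

grep -o 'Restarted .*' "$LOG" | sort | uniq -c | sort -rn \
  | while read -r count _ container; do
      if [ "$count" -gt "$THRESHOLD" ]; then
        echo "WARNING: $container has been restarted $count times (see $LOG)"
      fi
    done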
Key Takeaways from Running This
Health checks need to match your actual failure modes. Generic API pings aren't enough for GPU workloads. You need to check the resources that actually constrain your service.
Restart automation should be simple and external. Don't rely on Docker Compose or orchestration features that might not be available everywhere. A cron job works on any Linux system.
Logs matter more than you think. When a container restarts at 3 AM, you need to know why. Timestamp everything.
Thresholds are environment-specific. My 90% GPU memory limit works for my hardware and models. Yours will be different. Start conservative and adjust based on actual behavior.
Gradual failures are harder to catch than crashes. A memory leak that takes days to manifest won't trigger normal error handling. You need time-based monitoring, not just error-based monitoring.