
Debugging Docker Container OOM Kills Using cgroup v2 Memory Pressure Notifications and eBPF Tracing

Why I Started Digging Into OOM Kills

I run a Proxmox cluster at home with dozens of Docker containers handling everything from automation workflows to AI workloads. For months, I dealt with random container restarts that Docker logs barely explained. The kernel would just say “killed process” with an OOM score, and I’d be left guessing which service actually ran out of memory and why.

Standard monitoring tools like docker stats showed memory usage at the moment I checked, but they couldn’t tell me what happened 10 minutes ago when the container died. I needed something that could catch memory pressure before the kill happened and trace exactly what triggered it.

My Setup and What I Actually Used

My host runs Proxmox 8.x with kernel 6.5, which mounts cgroup v2 (the unified hierarchy) by default. This matters because cgroup v2 exposes memory pressure notifications that v1 doesn't have. I verified it by checking which filesystem is actually mounted at /sys/fs/cgroup:

stat -fc %T /sys/fs/cgroup/

If this prints cgroup2fs, you're on v2. If it prints tmpfs, you're still on the legacy v1 layout and these techniques won't work without migration. (Grepping /proc/filesystems isn't enough here: it only tells you the kernel supports cgroup2, not which hierarchy is mounted.)

For the actual debugging, I used:

  • bpftrace for eBPF tracing of memory allocation calls
  • systemd-cgtop to watch live cgroup resource usage
  • memory.pressure files in cgroup v2 to detect pressure before OOM
  • Docker’s --memory and --oom-kill-disable flags for controlled testing
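The scripts below all need a container's cgroup path. Under the systemd cgroup driver, that path is derived from the full (untruncated) container ID; the ID in this sketch is a made-up placeholder, and the docker inspect command in the comment is how you'd obtain a real one:

```shell
# Placeholder ID for illustration; on a real host, get the full ID with:
#   docker inspect --format '{{.Id}}' <container_name>
CONTAINER_ID="abc123"

# systemd cgroup driver layout on cgroup v2:
CGROUP_PATH="/sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope"
echo "${CGROUP_PATH}/memory.pressure"
# prints: /sys/fs/cgroup/system.slice/docker-abc123.scope/memory.pressure
```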

I did not use Netdata or any commercial monitoring platform for this specific debugging session. I wanted direct access to kernel interfaces without interpretation layers.

How cgroup v2 Memory Pressure Actually Works

Every Docker container on cgroup v2 gets a memory.pressure file under /sys/fs/cgroup/system.slice/docker-<container_id>.scope/. The file reports pressure-stall information (PSI) on two lines:

  • some: the share of time at least one task was stalled waiting for memory
  • full: the share of time all non-idle tasks were stalled at once

Each line carries avg10, avg60, and avg300 fields (averages over 10-, 60-, and 300-second windows) plus a cumulative total in microseconds.
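For reference, a raw read of one of these files looks like the sample below (the numbers are invented for illustration), along with the awk extraction of the 10-second "some" average that the polling script below keys off:

```shell
# Sample memory.pressure contents in the kernel's PSI format
# (values invented for illustration):
sample='some avg10=2.04 avg60=0.75 avg300=0.40 total=157622151
full avg10=1.79 avg60=0.68 avg300=0.36 total=146304462'

# Extract the 10-second "some" average:
echo "$sample" | awk '/^some/ { split($2, kv, "="); print kv[2] }'
# prints: 2.04
```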

I wrote a simple bash script to poll this file every second and log when pressure crossed 10%:

#!/bin/bash
CONTAINER_ID="your_container_id_here"
CGROUP_PATH="/sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope"

while true; do
  if [ -f "${CGROUP_PATH}/memory.pressure" ]; then
    PRESSURE=$(grep "some avg10=" "${CGROUP_PATH}/memory.pressure" | awk '{print $2}' | cut -d'=' -f2)
    if (( $(echo "$PRESSURE > 10" | bc -l) )); then
      echo "$(date): Memory pressure at ${PRESSURE}%"
    fi
  fi
  sleep 1
done

This caught pressure spikes 30-60 seconds before the actual OOM kill, giving me a window to investigate.

Using eBPF to Trace the Actual Allocations

Memory pressure told me when the problem happened, but not what caused it. For that, I used bpftrace to hook the kmem:mm_page_alloc tracepoint, which fires every time the kernel allocates memory pages for a process.

I installed bpftrace on the Proxmox host:

apt install bpftrace

Then ran this script while the container was under pressure:

bpftrace -e 'tracepoint:kmem:mm_page_alloc {
  @allocs[comm, pid] = count();
}
interval:s:5 {
  print(@allocs);
  clear(@allocs);
}'

This printed process names and PIDs with their page-allocation counts every 5 seconds. In my case, I found that a Python process inside my n8n container was allocating memory in a loop while processing a large JSON file I was scraping.

The output looked like this:

@allocs[python3, 1234]: 45892
@allocs[node, 5678]: 3421
@allocs[dockerd, 91011]: 892

The Python process was allocating 10x more than anything else. That was my smoking gun.

What Didn’t Work

I tried running docker stats --no-stream in a loop and piping the output to a log file, but it only sampled every 2 seconds and missed the spikes entirely. By the time I saw high memory usage, the container was already killed.

I also attempted to use perf record to profile the container, but it required kernel symbols that weren’t available in my Proxmox setup without recompiling the kernel. I abandoned that approach because I wasn’t willing to break my production host for debugging.

Setting --oom-kill-disable on the container was useful for testing, but dangerous. The container would freeze instead of dying, which meant I had to manually restart it. This is fine for debugging but not for production.

The Memory Limit Trap

I initially set Docker memory limits too aggressively (512MB for a container that regularly processed 2GB files). This caused constant pressure even under normal load. I raised the limit to 2GB and added swap, which reduced OOM kills but didn’t eliminate them.

The real issue was that my Python script wasn’t streaming the JSON—it loaded the entire file into memory. I rewrote it to use ijson for iterative parsing, which dropped peak memory usage from 1.8GB to under 300MB.
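My actual fix used Python's ijson, but the buffering-versus-streaming difference is easy to sketch with jq, which supports both modes (this is an illustrative stand-in, not the code that ran in the container):

```shell
# Buffered: jq parses the whole document into memory before filtering.
printf '[{"id":1},{"id":2},{"id":3}]' | jq '.[].id'

# Streamed: --stream emits [path, value] events as the parser walks the
# input, so peak memory stays roughly flat no matter how large the file is.
printf '[{"id":1},{"id":2},{"id":3}]' | jq -c --stream \
  'select(length == 2 and .[0][1] == "id") | .[1]'

# both commands print: 1 2 3 (one value per line)
```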

Connecting Pressure Notifications to eBPF Data

The most useful technique I found was running both tools simultaneously. I kept the pressure monitoring script running in one terminal and bpftrace in another. When pressure spiked, I could immediately see which process was allocating.

For automation, I modified the pressure script to trigger bpftrace only when pressure exceeded 20%:

if (( $(echo "$PRESSURE > 20" | bc -l) )); then
  echo "$(date): Pressure at ${PRESSURE}%, starting eBPF trace"
  timeout 30 bpftrace trace.bt > /var/log/oom-trace-$(date +%s).log &
fi

This avoided running eBPF constantly (which has overhead) and only captured traces during actual incidents.
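The trace.bt file referenced above never appears in the post; assuming it simply holds the body of the earlier one-liner, it could be generated like this:

```shell
# Hypothetical reconstruction of trace.bt: the post doesn't show its
# contents, so this reuses the body of the earlier bpftrace one-liner.
cat > trace.bt <<'EOF'
tracepoint:kmem:mm_page_alloc {
  @allocs[comm, pid] = count();
}

interval:s:5 {
  print(@allocs);
  clear(@allocs);
}
EOF

grep -c 'tracepoint' trace.bt
# prints: 1
```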

Key Takeaways

cgroup v2 is not optional for this. If you’re still on v1, you don’t get memory.pressure files and you’re stuck with post-mortem analysis from kernel logs.

eBPF overhead is real. Running bpftrace continuously on a production system will slow things down. I saw 2-5% CPU overhead on my Proxmox host when tracing page allocations. Use it only when you need it.

Memory limits need headroom. If your container regularly uses 80% of its limit, you’re already in pressure territory. I now set limits at 2x the normal working set and monitor pressure instead of absolute usage.

Streaming beats buffering. Most of my OOM issues came from loading entire datasets into memory instead of processing them incrementally. This is a code problem, not a memory problem.

Logs are useless after the fact. By the time the kernel kills your container, the evidence is gone. You need live monitoring or you’re guessing.

What I Still Don’t Know

I haven’t figured out how to reliably trace memory allocations inside containers without running eBPF on the host. There’s probably a way to do this with Docker’s user namespaces, but I haven’t tested it.

I also don’t know if memory.pressure is accurate under heavy swap usage. My system rarely swaps, so I can’t verify how the metrics behave when memory pressure is being relieved by disk I/O.
