

# Debugging Docker Compose Health Check Failures in Multi-Architecture Deployments: ARM64 vs AMD64 Timing Differences and Timeout Tuning

## Why I Started Looking at This

I run Docker containers across multiple machines in my home lab—some on ARM64 (Raspberry Pi 4, Orange Pi 5) and others on AMD64 (Proxmox nodes with Intel CPUs). For months, I used the same `docker-compose.yml` files across all of them without issues. Then I started seeing intermittent health check failures on ARM devices that never appeared on my AMD64 hosts.

The containers would start, the services would work, but Docker would mark them as unhealthy. Restarting them sometimes fixed it. Other times, they’d stay unhealthy for minutes before suddenly passing. This wasn’t a service problem—it was a timing problem.

## My Real Setup

I have three main Docker hosts:

- **Proxmox VM (AMD64)**: Intel i5, 8GB RAM, running Ubuntu 22.04
- **Raspberry Pi 4 (ARM64)**: 8GB model, running Raspberry Pi OS 64-bit
- **Orange Pi 5 (ARM64)**: 16GB, running Armbian

All run Docker Engine 24.x and Docker Compose v2. I use Portainer to manage stacks, but I also deploy directly via SSH when testing.

The problem showed up most consistently with:

- PostgreSQL containers
- Redis containers
- Custom Node.js apps with HTTP health checks

## What I Observed First

On my AMD64 VM, a typical PostgreSQL health check config looked like this:

```yaml
healthcheck:
  test: ["CMD-SHELL", "pg_isready -U postgres"]
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 30s
```

This worked perfectly. The container would start, wait 30 seconds, then pass health checks within the first or second attempt.
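
As a sanity check on these numbers, the rough worst case before Docker flips a container to unhealthy is `start_period` plus `retries` failing attempts, each costing roughly `interval + timeout`. This is a back-of-the-envelope bound, not Docker's exact scheduling:

```shell
# Rough upper bound (seconds) before "unhealthy" with the config above
START_PERIOD=30 INTERVAL=10 TIMEOUT=5 RETRIES=5
echo $(( START_PERIOD + RETRIES * (INTERVAL + TIMEOUT) ))  # prints 105
```

So with these settings a genuinely broken container takes up to about two minutes to be flagged, which is worth knowing before tightening any of the values.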

On the Raspberry Pi 4 with the **exact same config**, I’d see:

```
unhealthy: Health check failed
```

Looking at `docker inspect` output for the container, the health log showed:

```
Health check exceeded timeout (5s)
```
But the service itself was responding. I could manually run `pg_isready` inside the container, and it returned instantly.

## The ARM64 Slowdown

I started timing things manually. On the Pi, I ran:

```bash
time docker exec <postgres-container> pg_isready -U postgres
```

Result: **6-8 seconds** on the first check after container start.

On the AMD64 VM, the same command took **1-2 seconds**.

This wasn’t about PostgreSQL being slow. It was about how long it took Docker to:

1. Fork the health check process
2. Execute the command inside the container
3. Return the result

On ARM64 devices, especially the Pi 4, this overhead was significantly higher. The Orange Pi 5 was better (3-4 seconds) but still slower than AMD64.

## What Didn’t Work

### Increasing `interval`

I tried bumping `interval: 30s` thinking maybe the checks were too frequent. This didn’t help. The **first** health check after `start_period` would still fail due to timeout.

### Lowering `retries`

I reduced `retries: 3` hoping fewer attempts would reduce load. This made things worse—containers would be marked unhealthy faster.

### Using `curl` instead of native checks

For my Node.js apps, I switched from:

```yaml
test: ["CMD", "node", "healthcheck.js"]
```

to:

```yaml
test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
```

This helped slightly on AMD64 but made ARM64 **worse** because `curl` had additional startup overhead.

## What Actually Worked

### 1. Increasing `timeout` for ARM64

I created architecture-specific overrides. My base `docker-compose.yml` stayed generic:

```yaml
services:
  postgres:
    image: postgres:15
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 15s
      timeout: 10s
      retries: 5
      start_period: 40s
```

For ARM64 hosts, I added a `docker-compose.arm64.yml` override:

```yaml
services:
  postgres:
    healthcheck:
      timeout: 15s
      start_period: 60s
```

Deployed with:

```bash
docker compose -f docker-compose.yml -f docker-compose.arm64.yml up -d
```

This fixed most failures. The key was giving ARM devices **more time per check**, not more checks.
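
For reference, `docker compose config` prints the merged result. With the two files above, my reading of Compose's merge rules (override keys win, key by key) is that the effective ARM64 health check comes out as:

```yaml
services:
  postgres:
    image: postgres:15
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 15s
      timeout: 15s       # from the arm64 override
      retries: 5
      start_period: 60s  # from the arm64 override
```

Running `docker compose config` before deploying is a cheap way to confirm the override actually applied.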

### 2. Using `exec` form instead of `shell` form

I changed:

```yaml
test: ["CMD-SHELL", "pg_isready -U postgres"]
```

to:

```yaml
test: ["CMD", "pg_isready", "-U", "postgres"]
```

This removed the shell invocation overhead. On ARM64, this saved 1-2 seconds per check. On AMD64, it made no noticeable difference.
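
The difference is easy to see outside Docker: the `CMD-SHELL` form wraps the command in `sh -c`, which forks one extra process on every check, while the `CMD` (exec) form runs the binary directly. A stand-in demo with `echo`, not Docker internals:

```shell
# CMD-SHELL form: Docker runs /bin/sh -c '<command>', one extra fork per check
sh -c 'echo shell-form'   # prints shell-form

# CMD (exec) form: the binary is executed directly, no shell in between
echo exec-form            # prints exec-form
```

On a fast x86_64 box that extra fork is noise; on a Pi 4 under load it is measurable.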

### 3. Simplifying health check scripts

For my Node.js apps, I was running a full HTTP request in the health check script:

```javascript
const http = require('http');
const options = { host: 'localhost', port: 3000, path: '/health', timeout: 2000 };
http.get(options, (res) => process.exit(res.statusCode === 200 ? 0 : 1))
  .on('error', () => process.exit(1)); // exit non-zero if the request itself fails
```

I replaced this with a simple file check:

```javascript
const fs = require('fs');
process.exit(fs.existsSync('/tmp/app-ready') ? 0 : 1);
```

The app writes `/tmp/app-ready` when it’s fully initialized. This reduced health check time from 3-4 seconds to under 1 second on ARM64.
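
The pattern is easy to prototype in plain shell; this sketch stands in for the container's lifecycle, with a `mktemp -u` path playing the role of `/tmp/app-ready`:

```shell
# Sketch of the file-marker readiness pattern described above
READY_FILE=$(mktemp -u)   # path only; the file does not exist yet

check() { [ -f "$READY_FILE" ] && echo healthy || echo unhealthy; }

check                # prints unhealthy (app not initialized yet)
touch "$READY_FILE"  # the app does this once fully initialized
check                # prints healthy
rm -f "$READY_FILE"
```

On the Docker side, the health check can then shrink to `test: ["CMD", "test", "-f", "/tmp/app-ready"]`, which needs no Node.js startup at all, assuming the image ships a `test` binary (coreutils or busybox both provide one).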

### 4. Adjusting `start_period` based on service type

I stopped using a one-size-fits-all `start_period`. For databases and heavy services on ARM64:

- PostgreSQL: `start_period: 60s`
- Redis: `start_period: 30s`
- Node.js apps: `start_period: 45s`

On AMD64, I kept these at 30s, 15s, and 30s respectively.

## Measuring the Difference

I wrote a small script to log health check timing across hosts:

```bash
#!/bin/bash
CONTAINER=$1
for i in {1..10}; do
  START=$(date +%s%N)
  docker exec "$CONTAINER" pg_isready -U postgres > /dev/null 2>&1
  END=$(date +%s%N)
  DIFF=$(( (END - START) / 1000000 ))
  echo "Check $i: ${DIFF}ms"
  sleep 2
done
```

**Results on Raspberry Pi 4:**

```
Check 1: 7823ms
Check 2: 4521ms
Check 3: 3102ms
Check 4: 2876ms
Check 5: 2654ms
```

**Results on AMD64 VM:**

```
Check 1: 1456ms
Check 2: 1203ms
Check 3: 1187ms
Check 4: 1165ms
Check 5: 1198ms
```

The first check on ARM64 was consistently 5-6x slower than AMD64. Subsequent checks improved but never matched AMD64 speed.
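
To compare hosts at a glance, the script's output can be piped through a one-line awk average (the field splitting assumes the exact `Check N: Xms` format the script prints):

```shell
# Average the "Check N: Xms" lines; sample numbers are the Pi 4 results above
printf 'Check 1: 7823ms\nCheck 2: 4521ms\nCheck 3: 3102ms\n' |
  awk -F'[: m]+' '{ sum += $3; n++ } END { printf "avg: %dms\n", sum / n }'
# prints: avg: 5148ms
```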

## What I Learned About Docker on ARM64

### CPU instruction differences matter

ARM64 CPUs handle certain system calls differently than x86_64. Docker’s health check mechanism involves process forking, namespace switching, and I/O operations that are optimized for x86_64.

### Storage speed affects health checks

My Pi 4 runs off an SD card (even a fast one). My AMD64 VM uses NVMe. When Docker forks a health check process, it briefly touches disk for container metadata. This adds latency on slower storage.

I tested this by moving a container’s data to a USB 3.0 SSD on the Pi. Health check times dropped from 7s to 4s on first check.

### Docker Engine builds differ

Running `docker version` on both hosts showed the same version number, but `docker info` revealed different storage drivers:

- **AMD64**: `overlay2` with native overlay support
- **Pi 4**: `overlay2` but with older kernel (5.15 vs 6.1 on AMD64)

The older kernel on the Pi added overhead to namespace operations.

## My Current Health Check Strategy

I now maintain two sets of health check configs:

**For AMD64 hosts** (default):

```yaml
healthcheck:
  timeout: 5s
  interval: 10s
  retries: 3
  start_period: 30s
```

**For ARM64 hosts** (override):

```yaml
healthcheck:
  timeout: 15s
  interval: 15s
  retries: 5
  start_period: 60s
```

I use environment variables to auto-select:

```bash
ARCH=$(uname -m)
if [ "$ARCH" = "aarch64" ]; then
  docker compose -f docker-compose.yml -f docker-compose.arm64.yml up -d
else
  docker compose up -d
fi
```

This runs in a simple deploy script I keep in each project’s repo.
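
One small hardening worth considering: `uname -m` reports `arm64` rather than `aarch64` on some systems (macOS, some BSDs), so a case statement covers both. A sketch of that variant:

```shell
# Map `uname -m` output to the compose override file used in this post
arch_override() {
  case "$1" in
    aarch64|arm64) echo "docker-compose.arm64.yml" ;;
    *)             echo "" ;;  # AMD64 and everything else: base file only
  esac
}

arch_override aarch64   # prints docker-compose.arm64.yml
arch_override x86_64    # prints an empty line (use the base file alone)
```

The deploy script then passes the extra `-f` flag only when `arch_override "$(uname -m)"` returns a non-empty value.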

## When Health Checks Still Fail

Even with tuned timeouts, I occasionally see failures on ARM64 during high system load. If the Pi is running:

- Multiple containers starting simultaneously
- Heavy disk I/O (like database migrations)
- CPU-intensive tasks

Health checks can still time out. I added a Cronicle job that checks for unhealthy containers and logs the system state:

```bash
#!/bin/bash
UNHEALTHY=$(docker ps --filter health=unhealthy --format '{{.Names}}')
if [ -n "$UNHEALTHY" ]; then
  echo "Unhealthy containers: $UNHEALTHY"
  uptime
  free -m
  iostat -x 1 3
fi
```

This helps me correlate failures with resource constraints rather than health check misconfiguration.

## Key Takeaways

1. **ARM64 Docker overhead is real**. Health checks take 3-5x longer than AMD64 on equivalent services.

2. **Timeout is more critical than interval**. Giving each check more time matters more than how often you check.

3. **Use `CMD` form over `CMD-SHELL`** for health checks on ARM64. The shell invocation overhead is significant.

4. **Tune `start_period` per service**, not per architecture. Heavy services need more startup time regardless of CPU.

5. **Storage speed affects health checks**. SD cards add latency. USB SSDs or NVMe make a measurable difference.

6. **Don’t assume cross-architecture compatibility**. Test your health checks on actual ARM64 hardware before deploying.

I still run the same services on both architectures, but I no longer expect identical behavior. The tuning I’ve done makes deployments reliable across my entire lab, even if the timing characteristics differ.
