
Debugging Bash Script Race Conditions in Parallel Container Deployments: Fixing Lock File Issues When Auto-scaling Docker Services

Why I Had to Debug This

I run a small cluster of Docker services on Proxmox that auto-scale based on load. Each service has an entrypoint script that handles initialization—things like waiting for dependencies, setting up config files, and registering the container with a central service registry I built using a simple file-based system.

When Docker Swarm scaled a service from 1 to 5 replicas during a traffic spike, I started seeing duplicate entries in my registry, corrupted state files, and containers that thought they were the “first” instance when they weren’t. The problem was obvious: my bash scripts had race conditions. Multiple containers were running the same initialization code at the exact same time, all trying to read and write the same shared files mounted via NFS.

I needed to fix this without rewriting everything in a different language or adding external dependencies like Redis. The solution had to work with what I already had: bash, shared storage, and containers that could appear or disappear at any moment.

What I Was Actually Dealing With

My setup uses Docker Swarm with services that mount a shared NFS volume at /shared. The entrypoint script in each container does this:

  • Reads a counter from /shared/instance_count
  • Increments it
  • Writes it back
  • Uses that number as its instance ID

When one container runs this, it works fine. When five start simultaneously, they all read “0”, all write “1”, and I end up with five containers claiming to be instance 1.
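The lost update is easy to reproduce locally. This is a minimal simulation of the race (a temp dir stands in for the real /shared mount, and the sleep just widens a window that exists anyway): both "containers" read the counter before either writes it back, so both claim instance 1.

```shell
#!/bin/bash
# Simulated lost-update race: both jobs read the counter before either
# writes it back, so one increment is lost and both claim the same ID.
DEMO=$(mktemp -d)   # stand-in for the /shared NFS mount
echo 0 > "$DEMO/instance_count"

claim_id() {
    local current next
    current=$(cat "$DEMO/instance_count")   # both jobs read "0" here
    sleep 1                                 # widen the race window
    next=$((current + 1))
    echo "$next" > "$DEMO/instance_count"
    echo "container $1 claims instance $next"
}

claim_id A &
claim_id B &
wait
FINAL=$(cat "$DEMO/instance_count")
echo "final counter: $FINAL"   # 1, not 2 - one update was lost
```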

I initially tried using a simple lock file approach I found online—something like checking if a file exists, creating it if not, doing work, then deleting it. That failed immediately because the check and create weren’t atomic. Two containers would both see no lock file, both create one, and both proceed.
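A sketch of why that pattern fails (temp paths instead of the real lock file on /shared): the existence check and the create are two separate operations, and anything can happen between them.

```shell
#!/bin/bash
# The naive check-then-create lock: the test and the touch are separate
# syscalls, so two processes can both pass the check and both "acquire".
LOCKFILE=$(mktemp -u)   # demo path; the real lock file lived on /shared
LOGFILE=$(mktemp)

try_lock() {
    if [ ! -e "$LOCKFILE" ]; then   # check...
        sleep 1                     # ...window where the other process also checks...
        touch "$LOCKFILE"           # ...create: NOT atomic with the check
        echo "process $1 acquired the lock" >> "$LOGFILE"
    fi
}

try_lock A &
try_lock B &
wait
WINNERS=$(wc -l < "$LOGFILE")
echo "$WINNERS processes think they hold the lock"
```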

What Actually Worked

The solution that worked for me uses mkdir as the locking mechanism. Unlike creating a file, mkdir is atomic on most filesystems, including NFS (though NFS has caveats I’ll mention later).

Here’s the core pattern I ended up using:

#!/bin/bash

LOCK_DIR="/shared/.lock"
MAX_WAIT=30
WAIT_COUNT=0

# Try to acquire lock
while ! mkdir "$LOCK_DIR" 2>/dev/null; do
    sleep 1
    WAIT_COUNT=$((WAIT_COUNT + 1))
    if [ $WAIT_COUNT -ge $MAX_WAIT ]; then
        echo "Failed to acquire lock after ${MAX_WAIT} seconds" >&2
        exit 1
    fi
done

# Ensure lock is released on exit
trap 'rmdir "$LOCK_DIR" 2>/dev/null' EXIT

# Critical section - only one container executes this at a time
if [ -f /shared/instance_count ]; then
    CURRENT=$(cat /shared/instance_count)
else
    CURRENT=0
fi

NEXT=$((CURRENT + 1))
echo $NEXT > /shared/instance_count
MY_INSTANCE=$NEXT

# Lock automatically released by trap when script exits
echo "I am instance $MY_INSTANCE"

The key parts:

  • mkdir fails if the directory already exists, and this check-and-create is atomic
  • The trap ensures the lock directory gets removed even if the script crashes or is killed
  • The timeout prevents infinite waiting if something goes wrong

I tested this by manually starting 10 containers simultaneously using a bash loop, and each one got a unique instance number. No duplicates, no corruption.
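The same test can be reproduced without Docker at all. This sketch runs the mkdir-lock-plus-counter logic in ten background jobs against a temp dir standing in for /shared (a simplified simulation, not my exact entrypoint):

```shell
#!/bin/bash
# Local simulation of the 10-container test: ten background jobs run the
# same mkdir lock + counter increment against a temp dir.
SHARED=$(mktemp -d)   # stand-in for the /shared NFS mount
echo 0 > "$SHARED/instance_count"

init_instance() {
    local lock="$SHARED/.lock" tries=0 current next
    # Acquire the mkdir lock, with a bounded wait
    while ! mkdir "$lock" 2>/dev/null; do
        sleep 0.1
        tries=$((tries + 1))
        [ "$tries" -ge 100 ] && { echo "lock timeout" >&2; return 1; }
    done
    # Critical section: read, increment, write, record the claimed ID
    current=$(cat "$SHARED/instance_count")
    next=$((current + 1))
    echo "$next" > "$SHARED/instance_count"
    echo "$next" >> "$SHARED/claimed_ids"
    rmdir "$lock"
}

for i in $(seq 1 10); do
    init_instance &
done
wait
echo "final counter: $(cat "$SHARED/instance_count")"
sort -n "$SHARED/claimed_ids"   # ten IDs, no duplicates
```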

What Didn’t Work

My first attempt used ln to create a hard link as a lock, based on Stack Overflow advice. The idea was that ln would fail if the target already exists. This worked in local testing but failed completely on NFS. Hard links behave unpredictably across NFS mounts, and I got inconsistent results depending on which Proxmox node the container landed on.
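For reference, the hard-link pattern looked roughly like this (a sketch with temp paths, not my original script) — it does behave atomically on local filesystems, which is exactly why local testing gave me false confidence:

```shell
#!/bin/bash
# Hard-link lock: ln fails if the link name already exists. Atomic on
# local filesystems, but unreliable across NFSv3 client caches.
LOCKFILE=$(mktemp -u)   # demo path; the real one lived on /shared
touch "$LOCKFILE.src"

if ln "$LOCKFILE.src" "$LOCKFILE" 2>/dev/null; then FIRST=acquired; else FIRST=failed; fi
if ln "$LOCKFILE.src" "$LOCKFILE" 2>/dev/null; then SECOND=acquired; else SECOND=held; fi

echo "first attempt: $FIRST, second attempt: $SECOND"
```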

I also tried using flock, which is the “proper” way to do file locking in Linux. The problem: flock doesn’t work reliably on NFS unless you have very specific NFS configurations (NFSv4, or NFSv3 with working NLM lock daemons). My NFS setup is NFSv3 because that’s what my Synology supports without jumping through hoops, and flock just silently failed to actually lock anything.
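On local storage or a properly configured NFSv4 mount, the flock pattern itself is sound. This is roughly the shape I was using (a sketch with a temp file standing in for the shared counter); on my NFSv3 mount the same code returned success without actually excluding anyone:

```shell
#!/bin/bash
# The flock pattern (util-linux): hold a file descriptor open on a lock
# file for the duration of a subshell, and lock that descriptor.
COUNTER=$(mktemp)   # stand-in for /shared/instance_count
echo 0 > "$COUNTER"

(
    # fd 200 stays open on the lock file for the whole subshell
    flock -w 30 200 || { echo "lock timeout" >&2; exit 1; }
    # Critical section: same read-increment-write as before
    CURRENT=$(cat "$COUNTER")
    echo $((CURRENT + 1)) > "$COUNTER"
) 200>"$COUNTER.lock"

echo "counter is now $(cat "$COUNTER")"
```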

I briefly considered using a directory with a PID file inside it to detect stale locks (where a container died without cleaning up). I implemented this, and it added 30 lines of code to handle edge cases like checking if the PID is still running, dealing with PID reuse, and cleaning up stale locks. It worked, but it was fragile and hard to reason about. I removed it and went back to the simple timeout approach—if a lock is held for more than 30 seconds, something is seriously wrong anyway.

The NFS Problem I Hit

Even with mkdir, I ran into issues on NFS. The problem is attribute caching. NFS clients cache directory metadata for performance, which means one container might not immediately see a directory another container just created.

I fixed this by adding noac (no attribute caching) to my NFS mount options in the Docker volume definition:

volumes:
  shared:
    driver: local
    driver_opts:
      type: nfs
      o: addr=192.168.1.10,noac,vers=3
      device: ":/volume1/shared"

This has a performance cost—every file operation now goes to the NFS server—but for my use case (infrequent initialization scripts), it’s fine. If I were doing high-frequency locking, I’d need a different approach entirely, probably not bash and not NFS.

Stale Lock Handling

The timeout approach works for most cases, but I still had to handle stale locks from containers that got killed with SIGKILL or from node crashes. The trap doesn’t run in those cases.

I added a simple age check before the timeout:

if [ -d "$LOCK_DIR" ]; then
    LOCK_AGE=$(($(date +%s) - $(stat -c %Y "$LOCK_DIR")))
    if [ $LOCK_AGE -gt 300 ]; then
        echo "Lock is stale (${LOCK_AGE}s old), removing"
        rmdir "$LOCK_DIR" 2>/dev/null
    fi
fi

If the lock directory is more than 5 minutes old, I assume it’s stale and remove it. This isn’t perfect—there’s still a race condition where two containers might both decide to remove a stale lock—but in practice, it’s good enough. The worst case is both containers proceed, which is the same problem I started with, but now it only happens in rare failure scenarios instead of every time.
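Put together, the acquisition logic ends up looking roughly like this — a sketch that folds the stale check into the wait loop from earlier, not the verbatim production script (the function wrapper and LOCK_DIR default are my framing):

```shell
#!/bin/bash
# mkdir lock acquisition with stale-lock handling folded into the wait
# loop. MAX_STALE_AGE is the 5-minute cutoff; LOCK_DIR defaults to the
# shared path but can be overridden for testing.
LOCK_DIR="${LOCK_DIR:-/shared/.lock}"
MAX_WAIT=30
MAX_STALE_AGE=300

acquire_lock() {
    local waited=0 age
    while ! mkdir "$LOCK_DIR" 2>/dev/null; do
        # Break locks left behind by SIGKILLed containers or node crashes
        if [ -d "$LOCK_DIR" ]; then
            age=$(( $(date +%s) - $(stat -c %Y "$LOCK_DIR") ))
            if [ "$age" -gt "$MAX_STALE_AGE" ]; then
                echo "Lock is stale (${age}s old), removing"
                rmdir "$LOCK_DIR" 2>/dev/null
                continue
            fi
        fi
        sleep 1
        waited=$((waited + 1))
        if [ "$waited" -ge "$MAX_WAIT" ]; then
            echo "Failed to acquire lock after ${MAX_WAIT} seconds" >&2
            return 1
        fi
    done
    # Released on normal exit; SIGKILL still leaks, which the age check covers
    trap 'rmdir "$LOCK_DIR" 2>/dev/null' EXIT
}
```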

Key Takeaways

Use mkdir for locking in bash scripts. It’s atomic on most filesystems and doesn’t require external tools.

Always use trap to clean up locks. Without it, any script failure leaves a lock behind.

NFS makes everything harder. Attribute caching breaks assumptions about atomicity. If you’re doing this on NFS, test thoroughly and consider noac if you can afford the performance hit.

Timeouts are essential. Without them, one failed container can deadlock your entire system.

Stale lock detection is necessary in production. Containers get killed, nodes crash, and your cleanup code won’t always run.

This approach works for my scale (dozens of containers, infrequent scaling events). If I were running hundreds of containers or scaling constantly, I’d use a proper distributed lock service. But for small self-hosted setups, bash and mkdir get the job done.
