Implementing Docker Compose jail-style resource isolation using cgroups v2 and seccomp profiles after FreeBSD security disclosures

Why I Worked on This

I run several Docker Compose stacks on my Proxmox host, and after reading about the jail security improvements in FreeBSD's recent disclosures, I started questioning how isolated my containers actually were. FreeBSD jails use resource limits and capability restrictions by design. Docker does too, but I realized I'd been lazy about it.

Most of my compose files had no memory limits, no CPU caps, and definitely no seccomp profiles beyond Docker's defaults. When one container spiked CPU during a scraping job, it dragged down everything else. That's when I decided to implement jail-style isolation using cgroups v2 and tighter seccomp controls.

My Real Setup

My environment:

  • Proxmox 8.x host running Ubuntu 22.04 LXC containers
  • Docker 24.x with cgroups v2 enabled (verified with stat -fc %T /sys/fs/cgroup returning cgroup2fs)
  • Multiple compose stacks: n8n, Cronicle, monitoring tools, and some experimental scraping containers
  • No Kubernetes, no orchestration layer—just compose files and systemd units

I wasn't starting from scratch. Docker already uses cgroups to enforce container-wide limits, but I needed per-process control inside containers and stricter syscall filtering.

What I Actually Did

Step 1: Verified Cgroups v2 Was Active

First, I confirmed my host was using cgroups v2, not the older split hierarchy:

stat -fc %T /sys/fs/cgroup

Output: cgroup2fs. Good. If this had returned tmpfs, I would've had to enable v2 in the kernel boot parameters or work with the v1 split hierarchies.
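The same check can be scripted. Below is a minimal Python sketch, assuming that a unified v2 hierarchy exposes a cgroup.controllers file at its root while a v1 layout has per-controller directories such as cpu and memory. The function name cgroup_version is hypothetical, and the root path is a parameter so the logic can be tried against a scratch directory.

```python
from pathlib import Path


def cgroup_version(root: Path = Path("/sys/fs/cgroup")) -> str:
    """Return 'v2', 'v1', or 'unknown' for the hierarchy mounted at root."""
    # A unified (v2) hierarchy exposes cgroup.controllers at its root.
    if (root / "cgroup.controllers").is_file():
        return "v2"
    # A v1 layout has one directory per controller, e.g. cpu/ and memory/.
    if (root / "cpu").is_dir() or (root / "memory").is_dir():
        return "v1"
    return "unknown"


print(cgroup_version())
```

On a hybrid host (v1 controllers plus a separate unified mount) this reports v1, which is probably the safer answer for scripts that are about to write v2 control files.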

Step 2: Added Resource Limits to Compose Files

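I went through my compose files and added explicit limits. Here's what I added to my n8n stack (Compose v2 applies deploy.resources outside Swarm; the legacy docker-compose v1 binary needs --compatibility for that):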

services:
  n8n:
    image: n8nio/n8n:latest
    deploy:
      resources:
        limits:
          cpus: '1.0'
          memory: 2G
        reservations:
          memory: 512M

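This puts hard caps on the container's cgroup. With the cgroupfs driver, Docker writes these limits to /sys/fs/cgroup/docker/<container-id>/cpu.max and memory.max; with the systemd cgroup driver, the same files live under /sys/fs/cgroup/system.slice/docker-<container-id>.scope/ instead.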
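To make the mapping concrete, here is a rough Python sketch of how the compose values translate into the strings found in cpu.max and memory.max. It assumes Docker's default CPU period of 100000 microseconds and binary units (1G = 1024³ bytes); the helper names cpu_max and memory_bytes are hypothetical.

```python
CPU_PERIOD_US = 100_000  # assumed default CFS period (100 ms)
UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}


def cpu_max(cpus: str, period_us: int = CPU_PERIOD_US) -> str:
    """Render a compose `cpus` value as a cgroup v2 cpu.max string."""
    quota_us = round(float(cpus) * period_us)
    return f"{quota_us} {period_us}"


def memory_bytes(size: str) -> int:
    """Convert a compose memory string such as '2G' or '512M' to bytes."""
    text = size.strip().lower()
    if text.endswith("b"):  # accept '2gb' as well as '2g'
        text = text[:-1]
    if text and text[-1] in UNITS:
        return int(float(text[:-1]) * UNITS[text[-1]])
    return int(text)


print(cpu_max("1.0"))        # 100000 100000
print(memory_bytes("2G"))    # 2147483648
print(memory_bytes("512M"))  # 536870912
```

Comparing these values with the files under the container's cgroup directory is a quick way to confirm that a limit was actually applied.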

Step 3: Created Sub-Cgroups for Internal Processes

For containers running multiple processes (like my scraping stack with a worker and a scheduler), I needed per-process limits. Docker's compose limits apply to the whole container, not individual PIDs inside it.

I mounted the cgroup filesystem into the container:

volumes:
  - /sys/fs/cgroup:/sys/fs/cgroup:rw

Then inside the container, I wrote a startup script that:

  1. Found the container's cgroup path: cat /proc/self/cgroup showed 0::/docker/<hash>
  2. Created a sub-cgroup: mkdir /sys/fs/cgroup/docker/<hash>/worker
  3. Set CPU limits: echo "50000 100000" > /sys/fs/cgroup/docker/<hash>/worker/cpu.max (50% of one core)
  4. Assigned the worker PID: echo $WORKER_PID > /sys/fs/cgroup/docker/<hash>/worker/cgroup.procs

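This worked, but it required the container to run as root with write access to the cgroup filesystem. On cgroups v2, cpu.max also only appears in the child once the cpu controller is enabled in the parent's cgroup.subtree_control, and the kernel normally refuses that while the parent cgroup still holds processes of its own (the "no internal processes" rule). Not ideal for untrusted workloads, but acceptable for my controlled environment.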
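The four steps can be sketched in Python as follows. This is an illustration under assumptions, not the original startup script: the function names are hypothetical, the cgroup root is a parameter so the code can be exercised against a scratch directory, and on a real host the writes only succeed with root access and the cpu controller enabled for the child.

```python
from pathlib import Path


def parse_cgroup_path(proc_cgroup: str) -> str:
    """Extract the v2 path from /proc/self/cgroup text such as '0::/docker/<hash>'."""
    for line in proc_cgroup.splitlines():
        hierarchy, controllers, path = line.split(":", 2)
        # The unified hierarchy is the entry with id 0 and no controller list.
        if hierarchy == "0" and controllers == "":
            return path
    raise ValueError("no cgroup v2 entry found")


def limit_process(cgroup_root: Path, container_path: str, name: str,
                  pid: int, quota_us: int, period_us: int = 100_000) -> Path:
    """Create a sub-cgroup, cap its CPU, and move one PID into it."""
    child = cgroup_root / container_path.lstrip("/") / name
    child.mkdir(parents=True, exist_ok=True)                     # step 2
    (child / "cpu.max").write_text(f"{quota_us} {period_us}\n")  # step 3
    (child / "cgroup.procs").write_text(f"{pid}\n")              # step 4
    return child
```

A call such as limit_process(Path("/sys/fs/cgroup"), parse_cgroup_path(Path("/proc/self/cgroup").read_text()), "worker", worker_pid, 50_000) would correspond to the 50% cap described above, with worker_pid standing in for the real PID.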

Step 4: Applied Custom Seccomp Profiles

Docker's default seccomp profile blocks around 44 syscalls out of 300+. I wanted tighter control, especially for containers that don't need network or filesystem mounting.

I created a custom profile based on Docker's default, which is already an allowlist (defaultAction SCMP_ACT_ERRNO, so anything not listed fails), and dropped the syscalls my scraping containers didn't need:

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["read", "write", "open", "close", "stat", "fstat", "lstat", "poll", "lseek", "mmap", "mprotect", "munmap", "brk", "rt_sigaction", "rt_sigprocmask", "ioctl", "access", "socket", "connect", "sendto", "recvfrom", "bind", "listen", "accept", "execve", "exit_group", "wait4", "kill", "clone", "fork", "vfork", "getpid", "gettid"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}

I removed mount, umount2, ptrace, and several others. Then I applied it in compose:

security_opt:
  - seccomp=/path/to/custom-seccomp.json

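Testing this broke one container immediately: it needed chown for a log rotation script, so I added it back. Expect more of these on other images, since current glibc builds typically call openat and newfstatat rather than open and stat. This is why you test.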
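Adding a syscall back by hand is error-prone in a long JSON array, so a small helper can do it. The sketch below is a hypothetical example in Python that works on the profile structure shown above (a "syscalls" list of rules with "names" and "action"); the function names are made up for illustration.

```python
import json


def allowed_syscalls(profile: dict) -> set[str]:
    """Collect every syscall name covered by an SCMP_ACT_ALLOW rule."""
    names: set[str] = set()
    for rule in profile.get("syscalls", []):
        if rule.get("action") == "SCMP_ACT_ALLOW":
            names.update(rule.get("names", []))
    return names


def allow(profile: dict, *names: str) -> dict:
    """Return a copy of the profile with the given syscalls allowed."""
    updated = json.loads(json.dumps(profile))  # cheap deep copy
    for rule in updated.setdefault("syscalls", []):
        if rule.get("action") == "SCMP_ACT_ALLOW":
            rule["names"] = sorted(set(rule.get("names", [])) | set(names))
            return updated
    # No allow rule yet: add one.
    updated["syscalls"].append({"names": sorted(names), "action": "SCMP_ACT_ALLOW"})
    return updated


profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [{"names": ["read", "write"], "action": "SCMP_ACT_ALLOW"}],
}
patched = allow(profile, "chown")
print(sorted(allowed_syscalls(patched)))  # ['chown', 'read', 'write']
```

Running allowed_syscalls on both Docker's default profile and the custom one, then taking the set difference, gives a list of everything that was removed.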

What Worked

  • Per-container CPU/memory limits: Stopped runaway processes from affecting other stacks. My scraper now caps at 1 CPU core instead of spiking to 8.
  • Sub-cgroup isolation: Inside multi-process containers, I can now limit the worker separately from the scheduler. The worker gets 50% CPU, the scheduler gets 10%.
  • Seccomp profiles: Reduced the syscall surface area. One container tried to call ptrace during a dependency install—seccomp blocked it, and I caught a suspicious package.

What Didn't Work

Mounting Cgroups Without Root

I tried running containers as a non-root user with cgroup access. Docker's user namespace remapping (--userns-remap) should theoretically allow this, but I couldn't get write permissions to work correctly. The container could read cgroup files but not create sub-cgroups.

I abandoned this for now. My containers run as root, but with seccomp and cgroup limits, the blast radius is smaller.

Cgroups v1 Compatibility

I tested this setup on an older Proxmox host still using cgroups v1. The split hierarchies (/sys/fs/cgroup/cpu, /sys/fs/cgroup/memory) required separate mounts and different control files (cpu.cfs_quota_us instead of cpu.max).

It worked, but the scripts were messy. I migrated that host to v2 instead of maintaining two code paths.
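For anyone who does need both code paths, the translation between the two layouts is mechanical. Here is a minimal Python sketch covering only the CPU and memory hard limits; the function name v2_to_v1 is hypothetical, and it assumes cpu.max is given with both quota and period.

```python
def v2_to_v1(filename: str, value: str) -> dict[str, str]:
    """Map a cgroup v2 control file and value to the v1 file(s) and value(s)."""
    value = value.strip()
    if filename == "cpu.max":
        quota, period = value.split()
        return {
            # v2 spells "no limit" as max, v1 uses -1.
            "cpu.cfs_quota_us": "-1" if quota == "max" else quota,
            "cpu.cfs_period_us": period,
        }
    if filename == "memory.max":
        return {"memory.limit_in_bytes": "-1" if value == "max" else value}
    raise ValueError(f"no v1 mapping defined for {filename}")


print(v2_to_v1("cpu.max", "50000 100000"))
# {'cpu.cfs_quota_us': '50000', 'cpu.cfs_period_us': '100000'}
```

The v1 files also live under separate per-controller mounts, so the target directory differs as well as the file name.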

Seccomp Debugging

When a syscall is blocked, Docker logs nothing by default. I had to enable SCMP_ACT_LOG temporarily to see what was failing:

"defaultAction": "SCMP_ACT_LOG"

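With SCMP_ACT_LOG as the default action, calls outside the allowlist are permitted and recorded rather than blocked, so this is for debugging only. Then I checked dmesg on the host for the logged calls (if auditd is running, the records may land in the audit log instead). This is tedious. I wish Docker had a --seccomp-debug flag.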
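Scanning the kernel log by eye gets old quickly. Below is a small Python sketch that tallies seccomp audit records (type=1326) by command name and syscall number. The sample line is illustrative rather than copied from a real log, the function name is hypothetical, and syscall numbers are architecture-specific (92 is chown on x86_64).

```python
import re
from collections import Counter

# Seccomp audit records carry type=1326, then comm="..." and later syscall=<nr>.
SECCOMP_RE = re.compile(
    r'type=1326\b.*?\bcomm="(?P<comm>[^"]*)".*?\bsyscall=(?P<nr>\d+)'
)


def logged_syscalls(kernel_log: str) -> Counter:
    """Count (command, syscall number) pairs found in kernel log text."""
    hits: Counter = Counter()
    for line in kernel_log.splitlines():
        match = SECCOMP_RE.search(line)
        if match:
            hits[(match["comm"], int(match["nr"]))] += 1
    return hits


# Illustrative record only; field values are invented for the example.
sample = (
    'audit: type=1326 audit(1700000000.123:456): auid=4294967295 uid=0 gid=0 '
    'ses=4294967295 pid=1234 comm="logrotate" exe="/usr/sbin/logrotate" '
    'sig=0 arch=c000003e syscall=92 compat=0 ip=0x7f0000000000 code=0x7ffc0000'
)
print(logged_syscalls(sample))
```

Feeding it the output of dmesg on the host gives a short list of syscalls to consider adding back to the profile.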

Key Takeaways

  • Docker's default isolation is decent, but explicit cgroup limits prevent resource exhaustion across containers.
  • Per-process limits inside containers require manual cgroup management. It's not automatic, and it requires root access to /sys/fs/cgroup.
  • Custom seccomp profiles are worth the effort for containers with known workloads. Start with Docker's default, remove what you don't need, and test thoroughly.
  • Cgroups v2 is simpler than v1. If you're still on v1, consider upgrading.
  • FreeBSD's jail model is conceptually cleaner, but Linux cgroups + seccomp can achieve similar isolation if you're willing to configure them.

What I'm Still Testing

I'm experimenting with --security-opt=no-new-privileges to prevent privilege escalation inside containers. So far, it hasn't broken anything, but I'm watching for edge cases.
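One way to confirm the flag took effect is to read the NoNewPrivs field from /proc/<pid>/status inside the container. A minimal Python sketch, assuming a kernel recent enough to expose that field; the function name is hypothetical.

```python
from typing import Optional


def no_new_privs(status_text: str) -> Optional[bool]:
    """Return the NoNewPrivs flag from /proc/<pid>/status text, or None if absent."""
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key == "NoNewPrivs":
            return value.strip() == "1"
    return None


# Illustrative excerpt of a status file, not captured from a real container.
example = "Name:\tsh\nNoNewPrivs:\t1\nSeccomp:\t2\n"
print(no_new_privs(example))  # True
```

Inside a running container, passing the contents of /proc/self/status to this function should return True when no-new-privileges is active.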

I'm also considering AppArmor profiles for containers that interact with the host filesystem, but I haven't needed them yet.