Why I worked on this
I run a handful of public-facing containers on the same Docker host that also hosts my internal stuff—CI, monitoring, home-automation, the works. Docker’s default “open egress” policy started to feel reckless: every container can reach the whole LAN and the internet. I don’t run Kubernetes here, so network policies aren’t an option. I wanted something that:
- works on a single-node Docker box,
- doesn’t need another overlay network or service mesh,
- can be dropped in without rebuilding images,
- and shows up in the logs so I can tune it.
eBPF can hook syscalls at the kernel level, so I decided to see if I could use it to block outbound connections from specific containers without touching Docker’s network stack.
My real setup or context
- Host: Proxmox VM, Debian 12, 6.1 kernel, cgroup v2 enabled.
- Docker: 24.0.5, rootless mode disabled (I still run the daemon as root—old habits).
- Tooling: I picked bpftool and libbpf instead of BCC—smaller, no Python runtime on the host.
- Target containers: two public nginx containers (label egress=block) and one internal Grafana container (label egress=allow).
What worked (and why)
1. Attach to the connect() syscall inside the container’s cgroup
eBPF programs can be attached to a cgroup. When any process in that cgroup calls connect(), the program runs. I compile a tiny BPF program that:
- reads the destination IP from the socket,
- checks a pinned map of “allowed” prefixes,
- rejects the connect() if the IP isn’t in the map (the caller sees the call fail with a permission error).
// connect_block.c (snippet)
struct lpm_key {
    __u32 prefixlen;   /* significant bits, e.g. 8 for 10.0.0.0/8 */
    __u32 addr;        /* IPv4 address, network byte order */
};

SEC("cgroup/connect4")
int block_egress(struct bpf_sock_addr *ctx)
{
    /* user_ip4 is already in network byte order */
    struct lpm_key key = { .prefixlen = 32, .addr = ctx->user_ip4 };

    /* 1 = allow, 0 = reject the connect() */
    return bpf_map_lookup_elem(&allow_map, &key) ? 1 : 0;
}
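The allow_map the program looks up is a longest-prefix-match trie declared alongside struct lpm_key, plus the usual includes and license tag. A minimal libbpf-style declaration looks roughly like this (the entry count is arbitrary, a sketch rather than a verbatim copy of my file):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* One entry per allowed CIDR block; LPM tries must be declared no-prealloc. */
struct {
    __uint(type, BPF_MAP_TYPE_LPM_TRIE);
    __uint(map_flags, BPF_F_NO_PREALLOC);
    __uint(max_entries, 128);
    __type(key, struct lpm_key);
    __type(value, __u32);
} allow_map SEC(".maps");

char LICENSE[] SEC("license") = "GPL";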
2. Load and pin the program once
bpftool prog load connect_block.o /sys/fs/bpf/connect_block \
type cgroup/connect4 \
pinmaps /sys/fs/bpf/maps/
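A quick sanity check that the load and the pins worked (the map path assumes the pinmaps directory above and the allow_map name from the snippet):

bpftool prog show pinned /sys/fs/bpf/connect_block
bpftool map show pinned /sys/fs/bpf/maps/allow_map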
3. Attach the program to the container’s cgroup
Docker puts every container under /sys/fs/cgroup/system.slice/docker-<container-id>.scope. I wrote a 20-line Go helper that:
- watches the Docker event API for container starts,
- reads the container’s labels,
- if egress=block, opens the cgroup path and attaches the pinned program (a sketch of the helper follows below).
# manual equivalent of what the helper does:
bpftool cgroup attach /sys/fs/cgroup/system.slice/docker-<container-id>.scope connect4 pinned /sys/fs/bpf/connect_block
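Stripped down, the helper looks roughly like this; the Docker SDK and cilium/ebpf calls are representative of the approach, not a verbatim copy of my 20 lines:

// egress-watcher.go (sketch): attach the pinned program to new containers labeled egress=block.
package main

import (
    "context"
    "fmt"
    "log"

    "github.com/cilium/ebpf"
    "github.com/cilium/ebpf/link"
    "github.com/docker/docker/api/types"
    "github.com/docker/docker/client"
)

func main() {
    cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
    if err != nil {
        log.Fatal(err)
    }

    // Re-use the program pinned by bpftool in step 2.
    prog, err := ebpf.LoadPinnedProgram("/sys/fs/bpf/connect_block", nil)
    if err != nil {
        log.Fatal(err)
    }

    links := map[string]link.Link{} // hold the links so the attachments stay alive

    msgs, errs := cli.Events(context.Background(), types.EventsOptions{})
    for {
        select {
        case err := <-errs:
            log.Fatal(err)
        case msg := <-msgs:
            // Container start events carry the container labels in Actor.Attributes.
            if msg.Type != "container" || msg.Action != "start" ||
                msg.Actor.Attributes["egress"] != "block" {
                continue
            }
            // Default systemd cgroup driver; see the takeaways for the cgroupfs caveat.
            cg := fmt.Sprintf("/sys/fs/cgroup/system.slice/docker-%s.scope", msg.Actor.ID)
            l, err := link.AttachCgroup(link.CgroupOptions{
                Path:    cg,
                Attach:  ebpf.AttachCGroupInet4Connect,
                Program: prog,
            })
            if err != nil {
                log.Printf("attach %s: %v", msg.Actor.ID, err)
                continue
            }
            links[msg.Actor.ID] = l
        }
    }
}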
4. Populate the allow-list map
I keep a simple text file:
10.0.0.0/8
172.16.0.0/12
192.168.0.0/16
1.1.1.1/32
A cron job runs every hour and rewrites the pinned map via bpftool map update. No hot-reload drama.
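Concretely, one allow-list entry becomes one bpftool map update call. With the lpm_key layout from the snippet (a 4-byte prefix length in host byte order followed by the address in network byte order), adding 10.0.0.0/8 on a little-endian box looks roughly like this; the pinned path follows from the pinmaps directory in step 2:

bpftool map update pinned /sys/fs/bpf/maps/allow_map \
    key hex 08 00 00 00 0a 00 00 00 \
    value hex 01 00 00 00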
5. Logs you can grep
The kernel prints audit: type=1400 ... denied egress ... every time the program rejects a connect(). I ship those to Loki and alert on spikes.
What didn’t work
- Trying to hook at the Docker network namespace level. You can attach to veth egress tc hooks, but the program runs after NAT, so the source IP is already SNAT-ed to the host. That broke my “per-container” rule idea.
- BCC Python tools. Pulls in 70 MB of dependencies and needs the kernel headers mounted inside the container that runs the tool. Overkill for a single-node box.
- Using iptables with the docker0 bridge. Works until Docker decides to recreate the bridge after an upgrade—learned that the hard way at 3 a.m.
- Attaching the same program to cgroup/connect6. My first compile forgot to zero the IPv6 flowlabel field; the verifier rejected the program with an opaque “invalid mem access” message. Took two evenings of single-stepping with bpftool prog load ... verif to spot it.
Key takeaways
- cgroup-bpf is the smallest choke point I found that still keeps the policy inside the kernel—no extra bridge, no iptables save/restore dance.
- Pinning maps in /sys/fs/bpf survives daemon restarts; that means you can update the allow-list without re-attaching the program.
- Docker’s cgroup path is predictable, but only if you stick to the default systemd cgroup driver. Switch to cgroupfs and the path changes—my helper script now double-checks /proc/<pid>/cgroup instead of guessing.
- IPv6 is easy to forget. If you block v4 only, a container can still tunnel out via v6. I added a second program for connect6 and duplicated the allow-list logic (sketched after this list).
- The overhead is invisible in my tests; even a 2 kB response payload fetch from inside the container still averages sub-millisecond connect times. But I only run ~30 containers—YMMV on a dense multi-tenant box.
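The connect6 program is the same trick with a wider key. A sketch, assuming an allow6_map that mirrors allow_map’s declaration but is keyed on the 16 address bytes:

/* IPv6 twin of lpm_key; allow6_map mirrors allow_map's LPM-trie declaration */
struct lpm_key6 {
    __u32 prefixlen;   /* up to 128 */
    __u32 addr[4];     /* user_ip6 words, network byte order */
};

SEC("cgroup/connect6")
int block_egress6(struct bpf_sock_addr *ctx)
{
    struct lpm_key6 key = { .prefixlen = 128 };

    /* user_ip6 has to be read in 4-byte chunks */
    key.addr[0] = ctx->user_ip6[0];
    key.addr[1] = ctx->user_ip6[1];
    key.addr[2] = ctx->user_ip6[2];
    key.addr[3] = ctx->user_ip6[3];

    /* 1 = allow, 0 = reject, same convention as the v4 program */
    return bpf_map_lookup_elem(&allow6_map, &key) ? 1 : 0;
}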
I still don’t have a slick UI to edit the allow-list—SSH and vim suffice for now. One day I’ll wrap it in a small web form, but only after I catch myself editing the file more than twice a week.