Why I Had to Debug This
I run local LLM inference workloads on my Proxmox homelab using Docker containers. When I started experimenting with higher token throughput—specifically streaming responses from models like Llama 3.1 70B—I noticed something odd: the response latency was inconsistent, sometimes spiking by 200-300ms mid-stream. This wasn't model performance; the GPU utilization stayed steady. Something in the network path was introducing delays.
My setup uses Docker's default bridge network. The LLM inference engine runs in a container, and I access it via the OpenAI-compatible API endpoint from both host processes and other containers. For most tasks, this works fine. But when token throughput increased—say, 50+ tokens per second with streaming enabled—the bridge network started showing its limits.
My Actual Setup
Here's what I was running:
- Proxmox VE 8.2 host with an NVIDIA RTX 4090 passed through to an Ubuntu 24.04 LXC container
- Docker Engine 27.x inside the LXC, using the default bridge network (docker0)
- llama.cpp server running in a container, exposing port 8080 mapped to the host
- Client applications (both containerized and host-based) making requests to localhost:8080
- Models: Llama 3.1 70B quantized (Q4_K_M), running with GPU offloading enabled
The inference engine itself was fast. Token generation stayed around 45-50 tokens/second, which matched my GPU's capabilities. But when I monitored the network path, I saw packet delays that didn't align with the GPU work.
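To see the pattern rather than guess at it, you can timestamp each streamed chunk as it arrives at the client. A minimal sketch of such a check, assuming the OpenAI-compatible /v1/chat/completions endpoint and a throwaway prompt:
curl -sN http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Write a haiku"}],"stream":true}' \
  | while IFS= read -r line; do printf '%s %s\n' "$(date +%s.%3N)" "$line"; done
Each Server-Sent Events line gets a millisecond timestamp, so gaps between chunks stand out immediately.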
What I Found (The Hard Way)
I started by assuming the problem was Docker's bridge network MTU or some misconfiguration in iptables NAT rules. I was wrong on both counts.
First, I checked MTU:
docker network inspect bridge | grep -i mtu
The default MTU was 1500, which matched my host interface. No fragmentation was happening. I tried bumping it to 9000 (jumbo frames) anyway—no change in latency patterns.
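For reference, a user-defined bridge takes its MTU through the standard bridge driver option (the default docker0 bridge needs the daemon-level "mtu" setting instead). A minimal sketch, with a throwaway network name:
docker network create --driver bridge -o com.docker.network.driver.mtu=9000 jumbo-test
docker network inspect jumbo-test | grep -i mtu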
Next, I looked at iptables rules. Docker sets up MASQUERADE rules for outbound container traffic and DNAT rules for published ports. I suspected this NAT machinery might be adding overhead during high-throughput streaming. To test, I created a custom bridge network so I could reach the container by its IP and skip the published-port NAT path:
docker network create --driver bridge --subnet 172.20.0.0/16 llm-net
Then I ran the container attached to this network and accessed it directly via its container IP. Latency improved slightly, but the spikes were still there. This ruled out NAT overhead as the primary cause.
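If you want to confirm what NAT is actually in play, the rules are visible in the nat table, and the bridge driver exposes an option to turn masquerading off entirely on a user-defined network. A sketch (the network name and subnet are just for illustration):
sudo iptables -t nat -L POSTROUTING -n -v | grep -i masquerade
docker network create --driver bridge \
  -o com.docker.network.bridge.enable_ip_masquerade=false \
  --subnet 172.21.0.0/16 llm-net-nonat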
The real issue turned out to be Docker's default bridge network driver itself. The bridge driver uses a virtual Ethernet pair (veth) to connect containers to the docker0 bridge. Each packet has to traverse:
- Container's network namespace
- veth pair
- docker0 bridge
- Host's network namespace
- Loopback interface (for localhost access)
This path introduces context switches and buffer copies. For low-throughput requests, it's negligible. But when streaming 50 tokens/second with small HTTP chunks (each token is a separate chunk in Server-Sent Events), the overhead compounds.
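You can see this plumbing from the host: every container on the bridge has a host-side veth enslaved to docker0, and the container's eth0 reports the interface index of its peer. A quick way to look, with a placeholder container name:
ip link show master docker0                              # host-side veth interfaces on the bridge
docker exec <container> cat /sys/class/net/eth0/iflink   # index of the matching host-side veth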
What Actually Worked
I switched to Docker's host network mode:
docker run -d --gpus all --network host \
  -v /path/to/models:/models \
  ghcr.io/ggerganov/llama.cpp:server \
  --model /models/llama-3.1-70b-q4_k_m.gguf \
  --host 0.0.0.0 --port 8080
With host networking, the container shares the host's network namespace. No veth pairs, no bridge, no NAT. The inference server binds directly to the host's port 8080.
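A quick sanity check that the server really lives in the host's network namespace (container name is a placeholder):
sudo ss -ltnp | grep 8080                                     # the server process appears in the host's socket table
docker inspect -f '{{.HostConfig.NetworkMode}}' <container>   # should print "host"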
Latency spikes disappeared. Token streaming became consistent at 45-50 tokens/second with no mid-stream delays. The difference was measurable: median response time dropped from ~280ms to ~120ms for the same prompts.
The trade-off: I lost port isolation. If I run multiple inference servers, I have to manually assign different ports. But for my use case—one primary LLM service—this is fine.
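With --network host there is no -p remapping (published ports are discarded in host mode), so a second instance just takes a different --port on the server's own command line. A sketch, with a placeholder model path:
docker run -d --gpus all --network host \
  -v /path/to/models:/models \
  ghcr.io/ggerganov/llama.cpp:server \
  --model /models/another-model.gguf \
  --host 0.0.0.0 --port 8081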
What Didn't Work (And Why)
Before landing on host networking, I tried several things that failed:
Custom bridge with larger MTU: As mentioned, this didn't help. The overhead wasn't from packet fragmentation; it was from the bridge traversal itself.
Increasing the Docker daemon's default-address-pools: I thought IP exhaustion or ARP table size might be causing issues. It wasn't. The problem was architectural, not a matter of configuration.
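For completeness, that setting lives in /etc/docker/daemon.json. The pool values below are purely illustrative, and this overwrites any existing daemon.json, so merge by hand in practice:
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "default-address-pools": [
    { "base": "10.201.0.0/16", "size": 24 }
  ]
}
EOF
sudo systemctl restart docker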
Using macvlan network: I tried giving the container its own MAC address on the host's network. This worked for external access but introduced routing complexity for host-to-container communication. Plus, it required promiscuous mode on the host interface, which my Proxmox setup doesn't support cleanly.
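For context, the macvlan attempt looked roughly like this; the parent interface and addressing are whatever your host uplink uses, shown here as placeholders:
docker network create -d macvlan \
  --subnet 192.168.1.0/24 --gateway 192.168.1.1 \
  -o parent=eno1 llm-macvlan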
Tuning kernel network buffers: I increased net.core.rmem_max and net.core.wmem_max, thinking buffer exhaustion might be the issue. No measurable impact. The latency spikes weren't from dropped packets or retransmits.
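The buffer tuning amounted to something like this (values are illustrative; the exact numbers didn't matter, because buffers weren't the bottleneck):
sudo sysctl -w net.core.rmem_max=16777216
sudo sysctl -w net.core.wmem_max=16777216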
When Bridge Networking Is Still Fine
Don't assume host networking is always better. For most Docker workloads, the bridge network is perfectly adequate. I still use it for:
- Web services with moderate request rates
- Batch processing jobs where latency doesn't matter
- Development environments where port isolation is useful
The bridge network overhead only becomes noticeable when you're pushing high packet rates with small payloads—like streaming LLM tokens. If your inference workload is batch-based (send prompt, wait for full response), you won't see this issue.
Key Takeaways
Docker's bridge network adds a non-trivial overhead for high-throughput, low-latency workloads. The overhead isn't from MTU, NAT, or iptables rules—it's from the veth pair and bridge traversal itself.
Host networking eliminates this overhead but removes port isolation. For single-service setups like a dedicated LLM inference server, this trade-off is worth it.
If you're debugging similar issues, start by measuring at the right layer. Don't assume it's a configuration problem. Sometimes the architecture itself is the bottleneck.
I still use bridge networking for most containers. But for workloads where every millisecond counts, host networking is the right choice.