
Setting Up Ollama Model Caching on NFS Shares with CUDA Unified Memory for Multi-Host LLM Inference Clusters

Why I Started Looking at Shared Model Storage

I run multiple Proxmox nodes in my homelab, each with different GPUs. One has an RTX 3060, another has a 1660 Ti, and a third runs CPU-only for lighter workloads. I wanted to run Ollama across all of them without duplicating 40GB+ model files on every machine's local storage.

The problem was simple: downloading the same Llama 3 70B model three times wastes bandwidth, disk space, and time. I needed a way to store models once on my NFS share and have all nodes read from it.

What I didn't expect was how messy this would get when mixing GPU memory, NFS latency, and CUDA's unified memory features.

My Setup

Here's what I actually used:

  • Three Proxmox LXC containers running Ubuntu 22.04
  • Ollama 0.3.11 installed via the official script
  • Synology DS920+ serving NFS shares over 1Gbps network
  • NVIDIA drivers 535.183.01 on GPU nodes
  • Models stored at /mnt/nfs/ollama-models

I configured Ollama to use the NFS path by setting OLLAMA_MODELS=/mnt/nfs/ollama-models in the systemd service file. The NFS mount used default options: rw,sync,hard,intr.
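
For context, the mount on each node looked roughly like this (the Synology hostname and export path below are placeholders, not my actual values):

# /etc/fstab entry on each node -- hostname and export path are illustrative
synology.lan:/volume1/ollama-models  /mnt/nfs/ollama-models  nfs  rw,sync,hard,intr  0  0

# confirm the options actually in effect (note: intr is ignored by modern kernels)
mount | grep ollama-models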

On the surface, this worked. Models loaded. Inference ran. But performance was inconsistent, and I noticed the CPU doing far more work than expected on the GPU nodes.

What Happened with NFS Model Loading

When Ollama loads a model from NFS, it reads the GGUF file in chunks and decides which layers go to GPU memory (VRAM) and which stay in system RAM. For a 70B model quantized to Q4, this means moving roughly 40GB of data.

The first issue I hit: NFS read latency. Even on a gigabit link, reading 40GB takes time. Worse, Ollama doesn't cache the entire model in RAM before starting inference. It streams layers as needed, which means every prompt that touches a layer not yet in memory triggers another NFS read.

I confirmed this by watching nfsstat -c during inference. READ operations spiked constantly, even after the model was supposedly "loaded."
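
If you want to reproduce that check, the client-side counters and per-mount stats are enough to see it (nfsiostat ships with nfs-common on Ubuntu, or nfs-utils elsewhere):

# NFS client call counters; READ climbing during generation means
# inference is still hitting the share
watch -n 2 nfsstat -c

# per-mount read throughput and latency, sampled every 2 seconds
nfsiostat 2 /mnt/nfs/ollama-models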

Layer Splitting Across RAM and VRAM

On my RTX 3060 (12GB VRAM), Ollama fit about 60% of the 70B model into GPU memory. The rest stayed in system RAM. This is normal behavior when a model exceeds VRAM capacity.

But here's what I didn't anticipate: the layers in system RAM were still being read from NFS on every inference pass. Ollama doesn't copy the entire model into local RAM upfront. It memory-maps the file, meaning each access to a non-GPU layer hits the NFS share again.

This caused two problems:

  • High CPU usage (8 cores at 100%) because the CPU was handling inference for the RAM-resident layers
  • Stuttering during generation because some tokens waited on NFS reads
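
A quick way to confirm the split is ollama ps, which lists the loaded model and how it's divided between CPU and GPU (exact output varies by version):

# the PROCESSOR column shows something like "40%/60% CPU/GPU"
ollama ps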

I tried increasing num_gpu manually via the API to force more layers onto the GPU, but this just caused OOM errors. The model genuinely didn't fit.
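
The request looked roughly like this; num_gpu is the number of layers to offload to the GPU, and 60 is just an illustrative value:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3:70b",
  "prompt": "why is the sky blue?",
  "options": { "num_gpu": 60 }
}'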

CUDA Unified Memory Experiment

I came across a GitHub issue where someone mentioned setting GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 to let the GPU access system RAM directly instead of splitting inference between CPU and GPU.

The idea sounded promising: let the GPU handle all layers, even those in system RAM, by using CUDA's unified memory feature. This would eliminate the CPU bottleneck.

I added the environment variable to Ollama's systemd service:

[Service]
Environment="OLLAMA_MODELS=/mnt/nfs/ollama-models"
Environment="GGML_CUDA_ENABLE_UNIFIED_MEMORY=1"

Then I restarted the service and reloaded the model.
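
For anyone following along, the reload is the usual systemd dance (the service name assumes the official install script):

sudo systemctl daemon-reload
sudo systemctl restart ollama
ollama run llama3:70b "hello"   # any short prompt forces the model to load again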

What Actually Happened

Inference became significantly slower. Tokens per second dropped from ~8 to ~2. The GPU utilization stayed low (under 10%), and nvidia-smi showed constant memory transfers between system RAM and VRAM.

What I learned: unified memory doesn't magically make system RAM as fast as VRAM. It just changes who does the work. Instead of the CPU processing RAM-resident layers, the GPU now fetches them over PCIe on every pass. On my system, PCIe bandwidth couldn't keep up, and the GPU spent most of its time waiting.
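
If you want to watch those transfers yourself, nvidia-smi can sample PCIe traffic directly (column layout varies slightly across driver versions):

# rxpci/txpci report host<->GPU transfer rates in MB/s, sampled every second
nvidia-smi dmon -s t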

Adding NFS into the mix made it worse. The GPU was now waiting on network reads and PCIe transfers.

I turned unified memory back off. The hybrid CPU/GPU approach was faster, even if it wasn't elegant.

Attempt at Local Caching

I tried a different approach: copy the model to local storage on first use, then point Ollama at the local copy for subsequent runs.

I wrote a simple script that checked if the model existed locally. If not, it copied from NFS, then updated the OLLAMA_MODELS path to the local directory.

#!/bin/bash
# Copy the Ollama model store from NFS to local NVMe on first run,
# then point the systemd service at the local copy.
LOCAL_PATH="/var/lib/ollama/models"
NFS_PATH="/mnt/nfs/ollama-models"

if [ ! -d "$LOCAL_PATH/manifests" ]; then
    echo "Copying models from NFS..."
    rsync -av "$NFS_PATH/" "$LOCAL_PATH/"
fi

# An exported shell variable never reaches the systemd unit, so set
# OLLAMA_MODELS in a drop-in instead of exporting it here.
mkdir -p /etc/systemd/system/ollama.service.d
cat > /etc/systemd/system/ollama.service.d/local-models.conf <<EOF
[Service]
Environment="OLLAMA_MODELS=$LOCAL_PATH"
EOF

systemctl daemon-reload
systemctl restart ollama

This worked, but defeated the original goal. Each node now had a full local copy again. I was back to duplicated storage.

The only benefit was faster subsequent loads, since the model was on local NVMe instead of NFS.

Why This Setup Didn't Work for Me

The core problem: NFS is not designed for the access pattern Ollama uses.

Ollama memory-maps model files and expects low-latency random access. NFS adds 1-5ms of latency per read, which is negligible for large sequential transfers but terrible for the thousands of small reads Ollama makes during inference.
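
You can see the mapping directly; assuming the runner process matches pgrep -f ollama, its maps file points straight back at the NFS path:

# list every memory mapping that resolves to the NFS share
for pid in $(pgrep -f ollama); do
    grep /mnt/nfs/ollama-models "/proc/$pid/maps" 2>/dev/null
done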

Even on a 10Gbps network, I don't think this would fully solve the problem. The latency, not bandwidth, is the killer.

CUDA unified memory made things worse because it added another layer of indirection. The GPU had to fetch data from system RAM, which was itself fetching from NFS. The result was a three-tier memory hierarchy with compounding latency.

What Would Actually Work

If I were to attempt this again, I'd consider:

  • Pre-loading models into RAM: Use a script to copy the model into /dev/shm (tmpfs) before starting Ollama. This eliminates NFS from the inference path entirely, at the cost of RAM usage. A rough sketch follows this list.
  • Smaller models: A 13B model fits entirely in VRAM on my 3060. No split, no NFS reads during inference.
  • Dedicated model server: Run one Ollama instance with the model fully loaded, then have other nodes make API calls to it instead of running their own instances.
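
Here's that tmpfs sketch, assuming the node has enough free RAM for the quantized model on top of what inference itself needs (paths are illustrative):

#!/bin/bash
# Stage the model store in RAM-backed tmpfs so inference reads never touch NFS.
# Everything in /dev/shm is lost on reboot, so re-run this after each boot.
RAM_PATH="/dev/shm/ollama-models"
NFS_PATH="/mnt/nfs/ollama-models"

mkdir -p "$RAM_PATH"
rsync -av "$NFS_PATH/" "$RAM_PATH/"

# Then point OLLAMA_MODELS at $RAM_PATH (the same drop-in trick as the
# local-cache script above) and restart the service.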

None of these are true "multi-host clusters." They're workarounds that accept the reality of network and memory constraints.

Key Takeaways

NFS model storage only works if you're okay with slow initial loads. Once the model is in RAM or VRAM, inference is fine. But memory-mapped access over NFS during inference is a non-starter.

CUDA unified memory is not a magic fix. It trades CPU processing for GPU memory fetches, which can be slower depending on your PCIe and network setup.

Splitting layers between RAM and VRAM is often the best compromise. The CPU handles some layers, the GPU handles others. It's not elegant, but it's faster than forcing everything onto the GPU when the model doesn't fit.

If you want true multi-host LLM serving, use a proper inference server. Tools like vLLM or Text Generation Inference are designed for distributed setups. Ollama is great for single-node local inference, but it's not built for the kind of shared-storage clustering I tried to force onto it.

I still use Ollama across multiple nodes. I just accept that each one has its own model storage now. The simplicity and reliability are worth the duplicated disk space.