Running Llama 3.3 70B on Consumer Hardware with Ollama Quantization and Multi-GPU Splitting Across PCIe 3.0 Slots

Why I Needed to Run Llama 3.3 70B Locally

I've been running smaller models through Ollama for months—7B and 13B variants that fit comfortably in a single GPU's VRAM. They work fine for basic tasks, but I kept hitting their limits when processing longer documents or maintaining context across complex conversations. I wanted something closer to GPT-4's capability without sending data to external APIs.

Llama 3.3 70B became available, and I knew it wouldn't fit in my single RTX 3090's 24GB. I had two options: rent cloud GPU time (expensive for regular use) or figure out how to split the model across the hardware I already own. I chose the latter because I prefer systems I can control and iterate on without hourly costs.

My Actual Hardware Setup

My main workstation runs Proxmox, but for this experiment I used bare metal Ubuntu 22.04 because GPU passthrough adds unnecessary complexity when you're already pushing hardware limits.

The relevant specs:

  • CPU: AMD Ryzen 9 5950X (16 cores, 32 threads)
  • RAM: 128GB DDR4-3600
  • GPU 1: RTX 3090 (24GB VRAM) in PCIe 4.0 x16 slot
  • GPU 2: RTX 3060 Ti (8GB VRAM) in PCIe 3.0 x8 slot
  • Storage: Samsung 980 Pro 2TB NVMe (for model files)

The mismatched GPUs weren't planned—I bought the 3060 Ti years ago for video encoding work. The PCIe 3.0 limitation on the second slot matters because model layers need to transfer data between GPUs during inference.

Understanding Quantization Before Starting

An unquantized 70B model needs roughly 140GB of VRAM at FP16 (2 bytes per parameter). I have 32GB total across both cards. Quantization reduces the weights to lower bit depths, shrinking memory requirements.

Ollama supports several quantization levels. I tested three:

  • Q4_K_M (4-bit, medium): ~38GB model size
  • Q5_K_M (5-bit, medium): ~47GB model size
  • Q8_0 (8-bit): ~70GB model size

Q8_0 was immediately out: at ~70GB, most of the model would sit in system RAM and inference would crawl. Q5_K_M could theoretically work but left almost no headroom for context. Q4_K_M became the target.
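
The sizes are easy to sanity-check: memory is roughly parameters times bits per weight, divided by eight. Here's the back-of-the-envelope version (my own approximation; the K-quant formats mix bit widths and carry metadata, so real files land a few gigabytes above the pure bit-width floor):

# Rough model-size estimate: parameters * bits per weight / 8.
# K-quants mix bit widths and add metadata, so real files run larger.
def estimate_gb(params_billion, bits_per_weight):
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(estimate_gb(70, 16))  # FP16 baseline: ~140 GB
print(estimate_gb(70, 8))   # 8-bit: ~70 GB
print(estimate_gb(70, 4))   # pure 4-bit floor: ~35 GB; Q4_K_M lands above this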

Installing Ollama with Multi-GPU Support

I already had Ollama installed from previous projects, but multi-GPU support required specific environment variables. The default installation tries to use only the primary GPU.

curl -fsSL https://ollama.com/install.sh | sh

After installation, I verified both GPUs were visible:

nvidia-smi

Both cards showed up. The key configuration happens through environment variables before starting the Ollama service. I edited /etc/systemd/system/ollama.service and added:

[Service]
Environment="CUDA_VISIBLE_DEVICES=0,1"
Environment="OLLAMA_NUM_GPU=2"

Then reloaded and restarted:

sudo systemctl daemon-reload
sudo systemctl restart ollama

This tells Ollama both GPUs exist and should be used for layer distribution.

Pulling and Running the Quantized Model

Ollama's model library includes pre-quantized versions. I pulled the Q4_K_M variant:

ollama pull llama3.3:70b-instruct-q4_K_M

This took about 40 minutes on my connection. The model file landed in ~/.ollama/models at roughly 38GB.

First run attempt:

ollama run llama3.3:70b-instruct-q4_K_M

Ollama automatically split layers across both GPUs. I watched nvidia-smi in another terminal—the 3090 loaded about 24GB, the 3060 Ti took around 8GB, and the remaining layers spilled into system RAM.
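
Watching nvidia-smi by hand works, but a small polling loop is easier to leave running and log. Here's roughly what that monitoring can look like as a script (a sketch; the query flags are standard nvidia-smi options, and Ctrl+C stops it):

import subprocess
import time

# Poll per-GPU memory usage every few seconds while the model loads.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]

while True:
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    for line in out.strip().splitlines():
        idx, name, used, total = [field.strip() for field in line.split(",")]
        print(f"GPU {idx} ({name}): {used} / {total} MiB")
    print("---")
    time.sleep(5)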

What Actually Happened During Inference

The model loaded, but performance wasn't what I expected. Initial token generation took 8-12 seconds. Subsequent tokens in the same conversation averaged 2-3 seconds each. This was slower than I'd hoped but still usable for document processing where I could batch requests.

Monitoring showed the bottleneck: the PCIe 3.0 x8 connection on the second GPU. When layers needed to communicate between cards, data had to traverse that slower bus. The 3090 would finish its computation and wait for the 3060 Ti to catch up.

System RAM usage spiked to about 45GB during inference—not just from model layers but from context management and intermediate computations. With 128GB total, this was fine, but it would've failed on a 64GB system.

Practical Performance Numbers

I tested three scenarios to understand real-world usability:

Short prompt (50 tokens in, 100 tokens out):

  • First token: 9.2 seconds
  • Average token: 2.4 seconds
  • Total time: ~250 seconds

Medium context (500 tokens in, 300 tokens out):

  • First token: 11.8 seconds
  • Average token: 2.7 seconds
  • Total time: ~820 seconds

Long context (2000 tokens in, 500 tokens out):

  • First token: 18.3 seconds
  • Average token: 3.1 seconds
  • Total time: ~1570 seconds

The performance degraded with context length because more data needed to move between GPUs and system RAM. For interactive chat, this was frustrating. For batch processing where I could start a job and walk away, it worked.
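
If you want to reproduce this kind of measurement, a small harness against Ollama's streaming API beats a stopwatch. A sketch (the eval_count and eval_duration fields, the latter in nanoseconds, arrive on the final streamed chunk):

import time
import ollama

MODEL = 'llama3.3:70b-instruct-q4_K_M'

def timed_generate(prompt):
    start = time.time()
    first_token_at = None
    final = None
    # Stream so we can see exactly when the first token arrives.
    for chunk in ollama.generate(model=MODEL, prompt=prompt, stream=True):
        if first_token_at is None and chunk['response']:
            first_token_at = time.time()
        final = chunk  # the last chunk carries the timing stats
    print(f"first token: {first_token_at - start:.1f}s, total: {time.time() - start:.1f}s")
    # eval_duration is in nanoseconds; eval_count is the number of generated tokens
    print(f"avg per token: {final['eval_duration'] / final['eval_count'] / 1e9:.2f}s")

timed_generate("Summarize the trade-offs of 4-bit quantization in two sentences.")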

What Didn't Work

My first attempt used the Q5_K_M quantization, figuring the quality gain would be worth slower speeds. The model loaded but immediately started swapping to disk because system RAM couldn't handle the overflow. Inference became unusable—30+ seconds per token. I killed it after two minutes.

I also tried limiting the model to only the 3090 by setting CUDA_VISIBLE_DEVICES=0. Ollama refused to load the model, correctly detecting insufficient VRAM. There's no partial loading option that I found.

Temperature and power draw became issues during extended runs. The 3090 hit 78°C under sustained load, and my UPS reported the system pulling 520W from the wall. I had to improve case airflow with an additional fan before running longer sessions.

Optimizations That Actually Helped

After the initial disappointing performance, I made several changes:

1. Reduced context window

Ollama defaults to a 4096 token context. I limited it to 2048 for most tasks by setting the num_ctx parameter after loading the model:

ollama run llama3.3:70b-instruct-q4_K_M
>>> /set parameter num_ctx 2048

This cut first-token latency by about 30% and reduced RAM pressure.

2. Batch processing with scripts

Instead of interactive chat, I wrote Python scripts using Ollama's API to process documents in batches. This eliminated the waiting-for-response frustration:

import ollama

def process_batch(prompts):
    """Run each prompt through the local model sequentially and collect the outputs."""
    results = []
    for prompt in prompts:
        response = ollama.generate(
            model='llama3.3:70b-instruct-q4_K_M',
            prompt=prompt,
            options={'num_ctx': 2048}  # keep the reduced context window
        )
        results.append(response['response'])
    return results

I could queue up 20 document summaries, start the script, and come back to completed results.
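
Feeding it is just file handling around that function. The paths here are placeholders for wherever the documents actually live:

from pathlib import Path

# Hypothetical locations; swap in real ones.
docs = sorted(Path("inbox").glob("*.txt"))
prompts = [f"Summarize the following document in five bullet points:\n\n{d.read_text()}"
           for d in docs]

Path("summaries").mkdir(exist_ok=True)
for doc, summary in zip(docs, process_batch(prompts)):
    Path("summaries", doc.stem + ".md").write_text(summary)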

3. Moved model files to NVMe

Initially the model files were on a SATA SSD. Moving them to the NVMe drive reduced load times from 45 seconds to about 18 seconds. Not a huge win, but noticeable when restarting the service.

Real-World Use Cases Where This Works

I wouldn't use this setup for:

  • Real-time chat applications
  • Customer-facing services
  • Anything requiring sub-second responses

It does work well for:

  • Overnight document processing (I summarize research papers)
  • Code review where I can batch requests
  • Extracting structured data from unstructured text
  • Generating training data for smaller models

I run a weekly job that processes my saved articles and generates summaries. The script runs Sunday mornings, takes about 4 hours for ~100 articles, and I review results Monday. The slow per-token speed doesn't matter because I'm not waiting.

Cost Reality Check

If I were starting from scratch, this wouldn't make financial sense. The GPUs alone represent $1200-1500 in current used market prices. For that money, I could buy a lot of API credits from OpenAI or Anthropic.

But I already owned this hardware. The incremental cost was:

  • Time: ~8 hours of setup and testing
  • Power: roughly $15/month additional electricity for weekly batch jobs
  • Cooling: one $25 case fan

The value proposition is privacy and control. My documents never leave my network. I can modify prompts and rerun jobs without worrying about API rate limits or cost spikes.

What I Learned About PCIe Bandwidth

The PCIe 3.0 x8 limitation on the second GPU is a real bottleneck. I measured it using nvidia-smi dmon during inference:

The 3090 in PCIe 4.0 x16 showed PCIe throughput around 18-22 GB/s during heavy layer transfers. The 3060 Ti in PCIe 3.0 x8 maxed out at 6-7 GB/s. This mismatch meant the faster GPU spent significant time idle waiting for data.
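
Those numbers track with what the links can do on paper. PCIe 3.0 runs at 8 GT/s per lane and 4.0 at 16 GT/s, both with 128b/130b encoding, so the ceilings work out like this:

# Theoretical PCIe bandwidth: transfer rate per lane * lanes * encoding efficiency,
# divided by 8 to go from gigabits to gigabytes per second.
def pcie_gb_per_s(gt_per_s, lanes):
    return gt_per_s * lanes * (128 / 130) / 8

print(f"PCIe 4.0 x16: {pcie_gb_per_s(16, 16):.1f} GB/s")  # ~31.5 GB/s
print(f"PCIe 3.0 x8:  {pcie_gb_per_s(8, 8):.1f} GB/s")    # ~7.9 GB/s

The 3060 Ti was essentially saturating its slot, while the 3090 still had headroom, which matches the idle time I saw on the faster card.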

If I were building this intentionally, I'd either use two identical GPUs in PCIe 4.0 x16 slots, or accept that mismatched cards mean performance will be gated by the slower one.

Stability Over Long Runs

I ran a 12-hour batch job to test stability. The system stayed responsive, but I encountered two issues:

Memory leak somewhere: System RAM usage crept from 45GB to 68GB over those 12 hours. Restarting the Ollama service cleared it. I suspect it's in Ollama's context management, but I haven't debugged deeply enough to confirm.
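
Since a restart clears it, a pragmatic guard to run between batches could look like the following. I haven't hardened this; it assumes psutil is installed and the script is allowed to restart the systemd unit:

import subprocess
import psutil

# If system RAM use crosses the threshold between batches, bounce Ollama
# to reclaim whatever its context management is holding on to.
THRESHOLD_GB = 60

def restart_ollama_if_bloated():
    used_gb = psutil.virtual_memory().used / 1e9
    if used_gb > THRESHOLD_GB:
        print(f"RAM at {used_gb:.0f} GB, restarting ollama")
        subprocess.run(["sudo", "systemctl", "restart", "ollama"], check=True)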

GPU driver timeout: Once, the NVIDIA driver reset itself mid-inference, killing the job. This happened during a particularly long generation (1200+ tokens). My mitigation was enabling persistence mode and pinning the 3090's power limit:

sudo nvidia-smi -pm 1         # enable persistence mode
sudo nvidia-smi -i 0 -pl 350  # cap GPU 0 (the 3090) at its 350W power limit

Haven't seen the timeout since, but I also started limiting output length to 800 tokens as a precaution.

Key Takeaways

Running Llama 3.3 70B on consumer hardware is possible but requires accepting significant trade-offs:

  • Quantization to Q4_K_M is mandatory for 32GB total VRAM
  • Multi-GPU splitting works but performance depends heavily on PCIe bandwidth
  • System RAM becomes critical—64GB minimum, 128GB comfortable
  • This setup suits batch processing, not real-time interaction
  • Power and cooling matter more than with smaller models

The experience taught me more about GPU memory architecture and PCIe limitations than any documentation would have. If you already have similar hardware and need private LLM inference for non-time-sensitive tasks, this approach works. If you're buying hardware specifically for this, seriously consider cloud alternatives or waiting for more efficient models.

I still use this setup weekly because it solves my specific problem: processing sensitive documents without external APIs. The slow speed is annoying but acceptable for that use case.