
Running Llama 3.3 70B on consumer hardware using Ollama with 4-bit quantization and CPU offloading for sub-10s response times

Why I Started Running Large Models Locally

I needed a 70B parameter model running on hardware I actually own. Not cloud credits, not API calls with usage limits—a model I could query as many times as needed without watching a billing dashboard.

My setup: a desktop with an RTX 3090 (24GB VRAM), 64GB system RAM, and a Ryzen 9 5950X. Not server hardware. Not a data center. Just what I could afford and fit under my desk.

The problem was simple: Llama 3.3 70B in full precision needs roughly 140GB of VRAM. I had 24GB. The math didn’t work.

What Quantization Actually Does

Quantization reduces the precision of model weights. Instead of storing each parameter as a 16-bit floating point number, you store it as 4 bits, 5 bits, or 8 bits.

This isn’t compression in the traditional sense—you’re permanently reducing numerical precision. A 16-bit weight might represent 65,536 possible values. A 4-bit weight represents only 16.

The trade-off: smaller memory footprint, faster inference, but slightly degraded output quality. How much degradation depends on the quantization level and the task.
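To see what's lost, here's a toy sketch of 4-bit symmetric round-to-nearest quantization. This is not the K-quant scheme Ollama's GGUF files use (those work block-wise with per-block scales); it only illustrates why 16 representable levels cost precision. The scale value is an arbitrary example.

```python
# Toy 4-bit symmetric quantization (round-to-nearest). Real K-quants are
# block-wise with per-block scale metadata; this only shows the precision
# loss from having 16 representable levels instead of 65,536.

def quantize_4bit(weight, scale):
    """Map a float weight to one of 16 integer levels (-8..7)."""
    q = round(weight / scale)
    return max(-8, min(7, q))  # clamp to the 4-bit signed range

def dequantize(q, scale):
    """Recover an approximate float from the 4-bit level."""
    return q * scale

scale = 0.1  # arbitrary example scale
w = 0.237
q = quantize_4bit(w, scale)
print(q, dequantize(q, scale))  # the gap between 0.237 and the result is the quantization error
```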

Why I Use GGUF Format

GGUF (GPT-Generated Unified Format) is what Ollama and llama.cpp use. It replaced the older GGML format and handles metadata better.

I chose GGUF because:

  • Ollama supports it natively
  • It works across CPU and GPU inference
  • The quantization tools are mature and well-documented
  • Pre-quantized models are available for immediate use

Other formats exist (AWQ, GPTQ, BitsAndBytes), but GGUF gave me the most flexibility without fighting with dependencies.

My Real Setup: Ollama with 4-bit Quantization

I run Ollama because it handles model management, API serving, and GPU/CPU offloading without requiring me to write inference code.

Installing Ollama

On my Linux system:

curl -fsSL https://ollama.ai/install.sh | sh

Verification:

ollama --version

That’s it. No virtual environments, no Python dependency hell.

Downloading the Quantized Model

I started with the 4-bit quantized version of Llama 3.3 70B:

ollama pull llama3.3:70b-q4_K_M

This downloads a ~42GB model file. On my connection, it took about 30 minutes.

The naming convention:

  • 70b = 70 billion parameters
  • q4_K_M = 4-bit quantization, K-quant method, medium variant
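The file size follows directly from the bit width. A back-of-the-envelope check, assuming Q4_K_M averages roughly 4.8 bits per weight once per-block scale metadata is counted (that figure is an approximation on my part, not something Ollama reports):

```python
# Rough model file size from parameter count and effective bits per weight.
# 4.8 bits/weight for Q4_K_M is an estimate: 4-bit values plus per-block
# scale/min metadata push the average above a flat 4 bits.

def model_size_gb(params, bits_per_weight):
    return params * bits_per_weight / 8 / 1e9

print(round(model_size_gb(70e9, 16), 1))   # FP16: ~140GB, far beyond 24GB VRAM
print(round(model_size_gb(70e9, 4.8), 1))  # Q4_K_M: ~42GB, matching the download
```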

Configuring GPU and CPU Offloading

With 24GB VRAM, I can’t fit the entire 42GB model in GPU memory. Ollama automatically offloads layers to system RAM when VRAM fills up.

I set this in my ~/.bashrc:

export OLLAMA_NUM_GPU_LAYERS=40

This tells Ollama to load 40 layers onto the GPU. The rest run on CPU. I arrived at 40 through trial and error—more layers caused OOM errors, fewer layers made inference slower.

For CPU thread control:

export OLLAMA_NUM_THREAD=16

I have 32 threads available, but using all of them made the system unresponsive during inference.
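A rough way to get a starting value for the layer count, before the trial-and-error phase: divide the model file size evenly across its 80 transformer layers and see how many fit after reserving VRAM headroom. The 4GB reserve for KV cache and CUDA overhead is my assumption, not a measured figure, and per-layer sizes aren't actually uniform.

```python
# First-guess GPU layer count: spread the model evenly over its layers,
# reserve headroom for KV cache and CUDA overhead, and count what fits.
# The reserve_gb default is an assumption; refine the result by testing.

def gpu_layer_estimate(model_gb, total_layers, vram_gb, reserve_gb=4.0):
    per_layer_gb = model_gb / total_layers
    return int((vram_gb - reserve_gb) / per_layer_gb)

print(gpu_layer_estimate(42, 80, 24))  # lands near the 40 I settled on
```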

What Worked: Sub-10 Second Response Times

Running the model:

ollama run llama3.3:70b-q4_K_M

  • First token latency: ~2 seconds
  • Subsequent tokens: ~8-12 tokens per second
  • Total time for a 100-token response: ~8-10 seconds

This is with mixed GPU/CPU inference. Pure GPU inference (on hardware with enough VRAM) would be faster, but this speed is usable for my workflows.

Real Usage Example

I use this model for code review and technical writing feedback. A typical prompt:

"Review this Python function for edge cases and suggest improvements: [code block]"

Response time: 7-9 seconds for a 150-200 token analysis.

For comparison, the 8B model at 4-bit quantization responds in 2-3 seconds, but the quality difference is noticeable for complex reasoning tasks.

Memory Usage

While running:

  • VRAM usage: 23.8GB (nearly maxed out)
  • System RAM usage: ~18GB (for offloaded layers)
  • CPU usage: 60-80% during generation, near 0% while idle

The system remains responsive. I can browse, write, and run other applications while the model generates.

What Didn’t Work

2-bit Quantization Was Unusable

I tried the Q2_K variant to see if I could fit more of the model in VRAM:

ollama pull llama3.3:70b-q2_K

  • File size: ~28GB
  • Response quality: noticeably degraded

The model would:

  • Repeat phrases unnecessarily
  • Lose coherence in longer responses
  • Occasionally generate nonsensical tokens

I deleted it after two days. The size savings weren’t worth the quality loss.

Running Without CPU Offloading Failed

I attempted to force the entire model onto the GPU by setting OLLAMA_NUM_GPU_LAYERS to 80 (the full layer count).

Result: Immediate out-of-memory error. The process crashed before loading completed.

Lesson: If your VRAM can’t hold the model, don’t try to force it.

8-bit Quantization Was Too Large

The Q8_0 variant (~75GB) required too much offloading to system RAM. Inference slowed to 3-4 tokens per second, making it unusable for interactive work.

I kept the Q4_K_M variant as the best balance of size and quality.

Quantization Levels I Actually Tested

I didn’t test every possible quantization method. Here’s what I ran:

Q4_K_M (4-bit, medium)

  • Size: ~42GB
  • Quality: Good for most tasks
  • Speed: 8-12 tokens/second with my setup
  • Use case: My daily driver

Q5_K_M (5-bit, medium)

  • Size: ~52GB
  • Quality: Slightly better than Q4_K_M
  • Speed: 6-10 tokens/second
  • Use case: Tested but didn’t keep—quality improvement was marginal

Q8_0 (8-bit)

  • Size: ~75GB
  • Quality: Very close to full precision
  • Speed: 3-4 tokens/second
  • Use case: Too slow for my hardware

Q2_K (2-bit)

  • Size: ~28GB
  • Quality: Poor
  • Speed: Fast but irrelevant due to quality issues
  • Use case: Deleted

Running Models via API

Ollama runs a local API server on port 11434. I use this for automation scripts.

Starting the Server

ollama serve

This runs in the background. Models load on first request.

Querying from Python

import requests

def query_model(prompt):
    url = "http://localhost:11434/api/generate"
    payload = {
        "model": "llama3.3:70b-q4_K_M",
        "prompt": prompt,
        "stream": False  # return the full response as one JSON object
    }
    response = requests.post(url, json=payload, timeout=300)
    response.raise_for_status()  # surface HTTP errors instead of a KeyError
    return response.json()["response"]

result = query_model("Explain CPU offloading in LLM inference")
print(result)

Response time: Same as interactive mode (~8-10 seconds for typical responses).

I use this for batch processing code reviews and documentation generation.
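The batch side is just a loop over that API call. A minimal sketch of how I structure it — `review_prompt` and `review_files` are illustrative names of mine, and `query` stands in for the `query_model` function above:

```python
# Batch code review sketch: read each source file, wrap it in the review
# prompt I use interactively, and send it through the model. Function
# names here are illustrative, not part of Ollama's API.
from pathlib import Path

def review_prompt(source: str) -> str:
    """Wrap source code in the review instruction I use interactively."""
    return ("Review this Python function for edge cases and suggest "
            "improvements:\n\n" + source)

def review_files(paths, query):
    """Run the model (via the injected `query` callable) over each file."""
    results = []
    for path in paths:
        source = Path(path).read_text()
        results.append((path, query(review_prompt(source))))
    return results
```

Injecting `query` as a parameter keeps the loop testable without a running server; in practice I pass in `query_model` from the script above.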

Key Takeaways

4-bit quantization is the practical sweet spot for running 70B models on consumer hardware. The quality loss is minimal for most tasks, and the size reduction makes inference feasible.

CPU offloading works but requires tuning. Too many GPU layers cause OOM errors. Too few slow down inference. I found my balance at 40 layers through testing.

2-bit quantization is not worth it unless you have extreme hardware constraints. The quality degradation is too severe for reliable use.

Ollama handles complexity well. I didn’t need to write custom inference code, manage model files manually, or configure GPU drivers beyond basic CUDA setup.

Response times under 10 seconds are usable for interactive work. It’s not instant, but it’s fast enough that I don’t context-switch while waiting.

What I Would Change

If I were starting over with a larger budget, I’d get 48GB VRAM (two 24GB cards or a single high-end GPU). This would eliminate CPU offloading and likely cut response times in half.

But for the hardware I have, 4-bit quantization with Ollama gets the job done.
