Why I Started Benchmarking Local LLM Inference
I run Ollama and LM Studio on my home server because I need AI models that don't send my data to external APIs. But after installing DeepSeek R1 in different quantization formats, I noticed something odd: some versions felt faster than others, but I had no concrete data to prove it.
I needed to know if the performance differences were real or just my perception. More importantly, I wanted to understand the actual cost of quantization—not just file size, but inference speed, VRAM usage, and response quality.
My Testing Setup
I ran these tests on my Proxmox server with:
- NVIDIA RTX 3090 (24GB VRAM)
- AMD Ryzen 9 5950X
- 64GB DDR4 RAM
- NVMe storage for model files
I tested DeepSeek R1 in three formats I actually downloaded and used:
- deepseek-r1:671b-0528-q4_K_M (404GB)
- deepseek-r1:671b-q8_0 (671GB)
- deepseek-r1:671b-fp16 (1.3TB)
I did not test every quantization level. I focused on the ones I could actually fit in my VRAM and storage constraints.
How I Measured Performance
I wrote a simple Python script that sent identical prompts to both Ollama and LM Studio, then recorded:
- Time to first token (latency)
- Tokens per second during generation
- VRAM usage during inference (see the sampling sketch below)
- Total response time
I ran each test five times and averaged the results. I did not measure response quality systematically—that's too subjective and depends on the task.
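A note on the VRAM numbers: on an NVIDIA card, the simplest way to get them is to poll nvidia-smi while a request is in flight. A minimal sketch of that kind of sampling (the query flags are standard nvidia-smi options):

import subprocess

def gpu_memory_used_mib():
    # Ask nvidia-smi for current GPU memory usage in MiB (no header, no units)
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=memory.used",
        "--format=csv,noheader,nounits",
    ])
    return int(out.decode().strip().splitlines()[0])

Polling this every half second or so during a run and keeping the peak value is enough for comparisons like the ones below.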
The Testing Script
Here's what I used (simplified for clarity):
import time
import requests
import subprocess

def test_ollama(model, prompt):
    start = time.time()
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
    )
    end = time.time()
    data = response.json()
    # Word count is a rough stand-in for token count in this simplified version
    words = len(data["response"].split())
    return {
        "total_time": end - start,
        "tokens": words,
        "tokens_per_sec": words / (end - start),
    }

# Similar function for LM Studio
# Run tests, collect data, write to CSV
Nothing fancy. I just needed consistent measurements across runs.
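One thing the snippet above can't see is time to first token, because it uses the non-streaming endpoint. Measuring that takes a streaming request: start the clock, then stop when the first chunk arrives. A sketch of that against Ollama's streaming API (same imports as above):

def time_to_first_token(model, prompt):
    # Stream the response and time how long the first chunk takes to arrive
    start = time.time()
    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
    ) as resp:
        for line in resp.iter_lines():
            if line:  # first non-empty streamed chunk = first token(s)
                return time.time() - start
    return None

Ollama's non-streaming response also includes eval_count and eval_duration fields, which give a cleaner tokens-per-second figure than whitespace splitting if you want it.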
What I Found: Quantization vs Speed
Q4_K_M (404GB)
This is the format I use most often. It's small enough to fit comfortably in my 24GB VRAM with room for other processes.
- Average tokens per second: ~18-22 on Ollama, ~20-24 on LM Studio
- VRAM usage: ~18GB during inference
- Time to first token: 2-3 seconds
LM Studio was slightly faster in most runs, but not by a huge margin. The difference felt more noticeable in interactive use than in the raw numbers.
Q8_0 (671GB)
I tested this to see if doubling the bits per weight made a meaningful difference.
- Average tokens per second: ~12-15 on both platforms
- VRAM usage: ~22GB (pushing my GPU limit)
- Time to first token: 3-4 seconds
The speed drop was real and consistent. Q8 models are more accurate in theory, but I couldn't reliably tell the difference in actual responses for my use cases (code generation, summarization, Q&A).
FP16 (1.3TB)
I could not run this format properly. It exceeded my VRAM even with offloading, and inference was painfully slow when parts of the model spilled to system RAM.
I aborted these tests because they weren't representative of real usage on my hardware.
Ollama vs LM Studio: Real Differences
Both tools ran the same models with similar performance, but I noticed practical differences:
What Worked Better in Ollama
- Simpler API for scripting and automation
- Easier to integrate with other tools (I use it with n8n)
- Lower memory overhead when idle
What Worked Better in LM Studio
- Slightly faster inference on Q4 models (2-3 tokens/sec difference)
- Better UI for interactive testing
- More control over inference parameters in the GUI
The speed difference wasn't dramatic. If you're scripting workflows, Ollama's API is more convenient. If you're testing models interactively, LM Studio's interface is nicer.
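To make the API difference concrete: Ollama's native endpoint takes the flat JSON payload shown earlier, while LM Studio's local server speaks the OpenAI chat-completions format (port 1234 by default). The LM Studio counterpart to test_ollama ends up looking roughly like this sketch:

import time
import requests

def test_lmstudio(model, prompt):
    # LM Studio serves an OpenAI-compatible API; 1234 is its default port
    start = time.time()
    response = requests.post(
        "http://localhost:1234/v1/chat/completions",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,
        },
    )
    end = time.time()
    text = response.json()["choices"][0]["message"]["content"]
    words = len(text.split())
    return {
        "total_time": end - start,
        "tokens": words,
        "tokens_per_sec": words / (end - start),
    }

Functionally the same test, just a different payload shape, which is part of why Ollama felt simpler to script against.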
What Didn't Work
Consistent Benchmarking Was Hard
Small variations in system load affected results. Background processes, Docker containers, and even browser tabs with JavaScript running changed the numbers.
I ended up stopping all non-essential services before each test run, which was tedious.
VRAM Management
Both tools struggled when VRAM was tight. I had to manually unload models between tests to avoid memory fragmentation issues.
Ollama's automatic model unloading helped, but it wasn't perfect. LM Studio required manual intervention more often.
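For what it's worth, Ollama can also be told to unload a model explicitly instead of waiting for its idle timeout: run "ollama stop" for the model on the CLI, or send a generate request with keep_alive set to 0. Something like this between runs frees the VRAM immediately:

import requests

def unload_model(model):
    # A generate request with no prompt and keep_alive=0 tells Ollama to free the model now
    requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "keep_alive": 0},
    )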
Long Context Testing
I wanted to test performance with longer prompts (8K+ tokens), but both tools became unpredictable. Some runs were fast, others timed out. I couldn't isolate the cause.
Key Takeaways
From my testing, here's what I learned:
- Q4_K_M is the sweet spot for 24GB VRAM. It's fast enough for real-time use and fits comfortably with headroom.
- Q8 quantization is slower but not noticeably better for my tasks. The file size and speed penalty aren't worth it unless you need maximum accuracy.
- LM Studio is slightly faster on consumer GPUs, but the difference is small (2-3 tokens/sec on average).
- Ollama is better for automation. If you're building workflows or integrating with other tools, its API is simpler.
- Benchmarking local LLMs is messy. Results vary with system state, and isolating variables is harder than I expected.
What I Use Now
I run Ollama with Q4_K_M models for most tasks. It's fast enough, uses reasonable VRAM, and integrates cleanly with my n8n workflows.
I keep LM Studio installed for interactive testing when I want to experiment with new models or compare outputs side-by-side.
I don't use Q8 or FP16 formats anymore. The performance cost isn't justified for my use cases, and the storage requirements are impractical on my current hardware.