Benchmarking RTX 5090 vs 4090 for Local LLM Inference: Real-World Token/Second Gains with Ollama and LM Studio

Why I Benchmarked the RTX 5090 Against My 4090

I’ve been running local LLMs on my RTX 4090 for over a year now. My setup includes Ollama for quick CLI interactions and LM Studio for more exploratory work with different models. The 4090 has been solid—fast enough for most tasks, handles 70B models reasonably well, and doesn’t bottleneck my workflow.

When NVIDIA announced the RTX 5090, I wasn’t planning to upgrade. The 4090 works. But I kept hitting the same friction point: context window performance. Loading a 32K token context with a 70B model would slow to a crawl. Batch processing with longer prompts meant waiting. I wanted to know if the 5090’s advertised memory bandwidth and architectural changes would actually translate to meaningful speed gains in my real usage—not synthetic benchmarks, but the models and workflows I use daily.

So I got one. This isn’t a comprehensive review. It’s a direct comparison of what changed for me when swapping cards in the same system, running the same models, with the same software.

My Testing Setup

Hardware stayed consistent except for the GPU swap:

CPU: AMD Ryzen 9 7950X
RAM: 64GB DDR5-6000
Storage: Samsung 990 Pro 2TB NVMe (models stored here)
PSU: Corsair HX1500i (the 5090 needs headroom)
Cooling: Custom loop, GPU block swapped between tests

Software versions:

Ollama 0.1.29 (latest stable when I tested)
LM Studio 0.2.19
CUDA 12.3
Ubuntu 22.04 LTS

I tested models I actually use, not every variant available:

Llama 3.1 70B (Q4_K_M quantization)
Mixtral 8x7B (Q5_K_M)
Qwen 2.5 32B (Q6_K)
DeepSeek Coder 33B (Q5_K_M)

Each test ran three times. I recorded tokens per second during generation, not just time to first token. Context lengths varied: 2K, 8K, 16K, and 32K tokens. I used the same prompt structure for each run to keep variables controlled.

What Actually Changed: Token Generation Speed

The 5090 is faster. Not by a revolutionary margin, but consistently faster across every model and context size I tested.

Llama 3.1 70B Results

This is the model where I felt the difference most. With a 2K context:

RTX 4090: 28.3 tokens/second (average across three runs)
RTX 5090: 36.7 tokens/second

That’s about 30% faster. At 32K context, the gap widened:

RTX 4090: 11.2 tokens/second
RTX 5090: 16.8 tokens/second

The 4090 wasn’t unusable at 32K, but you could feel the lag. The 5090 stayed responsive enough that I didn’t mentally context-switch while waiting for output.

Mixtral 8x7B Results

Mixtral is less demanding overall. At 2K context:

RTX 4090: 47.1 tokens/second
RTX 5090: 58.4 tokens/second

At 16K context:

RTX 4090: 31.6 tokens/second
RTX 5090: 39.2 tokens/second

The gains here were smaller in absolute terms but still noticeable during actual use. Mixtral was already fast enough on the 4090 that the 5090 didn’t change my workflow much with this model.

Qwen 2.5 32B Results

Qwen became my go-to for coding tasks. At 8K context:

RTX 4090: 39.4 tokens/second
RTX 5090: 51.8 tokens/second

At 32K context:

RTX 4090: 18.7 tokens/second
RTX 5090: 26.3 tokens/second

This was the sweet spot where the 5090 made coding sessions smoother. I could paste larger codebases into context without the generation speed dropping to the point where I’d switch to a smaller model.

DeepSeek Coder 33B Results

At 2K context:

RTX 4090: 42.8 tokens/second
RTX 5090: 54.1 tokens/second

At 16K context:

RTX 4090: 24.3 tokens/second
RTX 5090: 32.9 tokens/second

Similar pattern to Qwen. The 5090 didn’t unlock new use cases here, but it made existing ones less frustrating.

Memory Bandwidth: Where the 5090 Pulls Ahead

The RTX 5090 has 1792 GB/s memory bandwidth compared to the 4090’s 1008 GB/s. On paper, that’s a huge jump. In practice, it matters most when the model is large and the context is long.

I monitored GPU utilization during generation. With the 4090, longer contexts would show the GPU waiting on memory more often—utilization would dip to 70-80% during generation. The 5090 stayed closer to 95% utilization across the board.

This isn’t about raw compute. It’s about feeding the compute units fast enough. The 4090 has plenty of CUDA cores for these models, but moving weights and activations through memory becomes the bottleneck. The 5090 reduces that bottleneck noticeably.

Power Draw and Thermals

The 5090 pulls more power. Under full load with Llama 70B at 32K context:

RTX 4090: ~420W sustained
RTX 5090: ~510W sustained

That’s about 90W more. My PSU handled it fine, but if you’re running a 750W or 850W unit, you might be pushing limits depending on the rest of your system.

Thermals were similar with my custom loop. The 5090 ran about 3-4°C warmer under the same load, which is negligible. Stock cooling would likely show a bigger gap, but I can’t speak to that directly.

What Didn’t Change

Model loading times were nearly identical. Both cards load a 70B Q4 model from NVMe in about 8-10 seconds. The 5090 didn’t speed this up because it’s limited by storage throughput, not GPU speed.

Quantization still matters more than the GPU upgrade. A Q4 model on the 4090 is faster than a Q6 model on the 5090. If you’re trying to decide between a better GPU and using lower quantization, go with lower quantization first.

Smaller models (7B, 13B) saw minimal gains. The 4090 was already fast enough that the 5090’s advantages didn’t matter. If you’re mostly running models under 20B parameters, the upgrade isn’t worth it.

Ollama vs LM Studio Performance

I tested both because I use both daily. Ollama is my CLI tool for quick queries. LM Studio is what I use when I want to experiment with settings or compare model outputs side by side.

Performance was nearly identical between the two on the same model and settings. Ollama felt slightly faster on initial load, but that’s likely just perception—measured generation speeds were within 1-2 tokens/second of each other.

LM Studio’s UI adds overhead, but not enough to matter. The real difference is workflow, not speed.

Real-World Impact on My Workflow

The 5090 didn’t change what I can do. It changed how long I’m willing to wait.

Before, I’d avoid loading full documentation into context because the generation would slow to 10-12 tokens/second. Now I do it routinely because 16-18 tokens/second feels responsive enough to stay in flow.

I’m using 70B models more often instead of dropping down to 32B models for speed. That’s a quality gain, not just a speed gain. Larger models give better reasoning and more accurate code suggestions.

Batch processing improved the most. I run scripts that send multiple prompts through Ollama for data processing. The 5090 cut total runtime by about 25-30% across typical batches. That’s hours saved per week.

What I Wish I’d Known Before Upgrading

The 5090 is physically larger. I had to adjust my case layout slightly. Not a dealbreaker, but measure your clearance first.

Driver support was solid on Linux, but I had to update to CUDA 12.3. Older CUDA versions caused crashes with certain quantization formats. This took an afternoon to troubleshoot.

The price difference is significant. I paid about $1600 for my 4090 at launch. The 5090 is $2000+ depending on availability. For my use case, the speed gain justified the cost. For most people, it probably doesn’t.

Key Takeaways

The RTX 5090 is 25-35% faster than the 4090 for local LLM inference in my testing. The gap is largest with 70B models and long context windows.

Memory bandwidth is the main differentiator. The 5090 doesn’t bottleneck on memory as quickly, which keeps generation speed higher when context grows.

If you’re running models under 30B parameters, the 4090 is still excellent. The 5090’s advantages don’t justify the cost for smaller models.

If you’re regularly using 70B models with 16K+ context, the 5090 makes a noticeable difference. It’s not a luxury—it’s a workflow improvement.

Power requirements are real. Make sure your PSU can handle sustained 500W+ loads if you’re considering the upgrade.

Quantization still matters more than hardware. A well-quantized model on older hardware beats a poorly-quantized model on new hardware every time.

Tech Expert & Vibe Coder

Why I Benchmarked the RTX 5090 Against My 4090

My Testing Setup