Why I Started Looking Into Speculative Decoding
I run LM Studio on my local machine for testing language models without sending data to external APIs. The problem I kept hitting was speed—or the lack of it. Even with a decent GPU, generating responses felt slow, especially for longer outputs. I’d watch tokens appear one by one, which is fine for quick questions but frustrating when you’re iterating on prompts or testing different models.
I’d heard about speculative decoding as a technique to speed up inference, but most explanations were academic or tied to cloud platforms. I wanted to know if I could actually configure it in LM Studio without needing custom code or server setups. Turns out, I could, and it made a noticeable difference.
What Speculative Decoding Actually Does
The basic idea is simple: instead of generating one token at a time, you use a smaller, faster “draft” model to guess several tokens ahead. Then the main model verifies those guesses in parallel. If the draft model got them right, you’ve just generated multiple tokens in roughly the time it would’ve taken to generate one. If it got them wrong, the main model corrects and continues.
This works because smaller models are much faster at generating tokens, even if they’re less accurate. The verification step by the larger model ensures quality doesn’t drop—you’re not accepting bad outputs, just speeding up the good ones.
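To make the loop concrete, here's a toy sketch of draft-then-verify. This is not LM Studio's actual implementation: the two "models" are stand-in functions that deterministically map a token sequence to a next token, and the verification runs sequentially for clarity (a real engine verifies the draft tokens in one parallel forward pass).

```python
# Toy sketch of speculative decoding's draft-then-verify loop.
# Both "models" are stand-ins; real ones would be neural networks.

def draft_model(tokens):
    # Fast but imperfect: always guesses "previous token + 1".
    return (tokens[-1] + 1) % 10

def main_model(tokens):
    # Slow but authoritative: same rule, except it wraps to 0 after a 4.
    return (tokens[-1] + 1) % 10 if tokens[-1] != 4 else 0

def speculative_step(tokens, k=5):
    """Generate up to k+1 tokens in one main-model verification pass."""
    # 1. Draft model proposes k tokens ahead.
    draft, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)
    # 2. Main model checks each proposal; accept until the first mismatch.
    accepted, ctx = [], list(tokens)
    for t in draft:
        correct = main_model(ctx)
        if t == correct:
            accepted.append(t)
            ctx.append(t)
        else:
            # The main model's correction comes "for free" from the
            # same verification pass, so we still make progress.
            accepted.append(correct)
            return accepted
    # All drafts accepted: the main model contributes one bonus token.
    accepted.append(main_model(ctx))
    return accepted
```

Starting from `[0]` with `k=5`, the draft proposes 1, 2, 3, 4, 5; the main model accepts the first four, rejects the 5, and substitutes 0, so five tokens land in a single verification pass. That mismatch-then-correct path is also why rejections don't hurt quality, only speed.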
My Setup in LM Studio
I was running LM Studio 0.3.x on Windows with an RTX 4070. My main model was a 7B parameter Mistral variant, which I use for general coding and writing tasks. I wanted to keep using that model but generate responses faster.
LM Studio added speculative decoding support in recent versions, but it’s not enabled by default. Here’s what I actually configured:
Choosing a Draft Model
The draft model needs to be significantly smaller than your main model—otherwise you’re not saving compute time. I tried a few options:
- A 1B parameter model from the same model family (Mistral-1B)
- A 3B parameter general-purpose model
- TinyLlama (1.1B parameters)
The 1B Mistral variant worked best for me because it shared the same tokenizer and general behavior as my main model. Using a completely different model family (like switching from Mistral to Llama) caused more rejected drafts, which slowed things down instead of speeding them up.
Configuration Steps
In LM Studio, I went to the model settings panel and found the “Speculative Decoding” section. This wasn’t immediately obvious—it’s under advanced settings, not the main configuration tab.
I set:
- Draft Model: The 1B Mistral variant I’d downloaded
- Draft Tokens: Started at 5, which means the draft model generates 5 tokens ahead before verification
- GPU Layers: Made sure both models had layers offloaded to GPU (I put the draft model fully on GPU since it’s small)
The draft tokens setting matters. Too few and you don’t get much speedup. Too many and the draft model wastes time generating tokens that get rejected. I tested 3, 5, and 8. Five worked best for my use case.
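The diminishing returns from longer drafts fall out of a simple back-of-envelope model. If you assume each draft token is independently accepted with probability p (a simplification; real acceptance rates vary with the text, and 0.7 below is an illustrative value, not my measured rate), one verification pass yields on average a geometric sum of tokens:

```python
# Expected tokens per verification pass with draft length k and
# per-token acceptance probability p (independence assumed):
#   E[tokens] = 1 + p + p^2 + ... + p^k = (1 - p^(k+1)) / (1 - p)
# The leading 1 is the main model's guaranteed correction/bonus token.

def expected_tokens_per_pass(p, k):
    return (1 - p ** (k + 1)) / (1 - p)

for k in (3, 5, 8, 10):
    print(f"k={k}: {expected_tokens_per_pass(0.7, k):.2f} tokens/pass")
```

With p = 0.7 the expected yield climbs quickly up to around k = 5 and then flattens (about 2.53 at k=3, 2.94 at k=5, 3.20 at k=8), which matches what I saw empirically: past five or so, extra draft tokens mostly get rejected.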
Memory Considerations
Running two models simultaneously uses more VRAM. My 7B main model was taking about 5GB, and adding the 1B draft model pushed total usage to around 6.5GB. This was fine for my 12GB card, but if you’re on 8GB or less, you might need to reduce context length or quantization settings.
I didn’t change quantization—both models stayed at Q5_K_M, which is what I normally use.
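For a quick sanity check before enabling this, the budget math is just addition. The numbers below are my weights-only figures; KV cache and runtime overhead add more on top in practice:

```python
# Rough VRAM budget check (Q5_K_M weights only; KV cache and
# framework overhead are not included and will add more).
main_gb = 5.0    # 7B main model as loaded on my machine
draft_gb = 1.5   # 1B draft model
card_gb = 12.0

total_gb = main_gb + draft_gb
print(f"{total_gb} GB of {card_gb} GB used, "
      f"{card_gb - total_gb} GB headroom")
```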
What Actually Improved
I ran the same set of prompts before and after enabling speculative decoding to compare. I didn’t do formal benchmarks, but I tracked tokens per second in LM Studio’s output panel.
- Before: ~22 tokens/second on average
- After: ~38 tokens/second on average
That works out to roughly a 1.7x speedup. Not quite double, but close enough to feel significant when you’re waiting for responses. Longer generations showed bigger improvements, because the fixed per-request overhead (like prompt processing) gets amortized over more tokens.
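In wall-clock terms, here is what those two throughput numbers mean for a generation of a few hundred tokens (500 is an arbitrary illustrative length):

```python
# What the measured throughput change means in wall-clock time.
before_tps, after_tps = 22, 38
speedup = after_tps / before_tps
print(f"speedup: {speedup:.2f}x")  # ~1.73x

n_tokens = 500  # illustrative generation length
print(f"{n_tokens} tokens: {n_tokens / before_tps:.1f}s "
      f"-> {n_tokens / after_tps:.1f}s")
```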
The quality of outputs didn’t change. I compared responses to the same prompts, and they were identical or near-identical. This makes sense: the main model still makes every final decision.
Where It Helped Most
Code generation and structured outputs saw the biggest speedup. When the model is generating predictable patterns (like JSON or function definitions), the draft model guesses correctly more often. Creative writing was faster too, but less dramatically—probably because there’s more variation in what comes next.
Short responses (under 100 tokens) barely improved. The overhead of running two models probably ate into the gains.
What Didn’t Work
My first attempt used TinyLlama as the draft model with Mistral as the main model. Different tokenizers meant the draft tokens often didn’t align properly, and LM Studio showed a lot of rejected drafts in the logs. Speed actually got worse—around 18 tokens/second instead of 22.
I also tried increasing draft tokens to 10, thinking more lookahead would help. It didn’t. The draft model started making worse guesses, and the verification overhead increased. Five or six tokens ahead seemed to be the sweet spot for my setup.
Running both models entirely on CPU was pointless. I tested this out of curiosity, and generation dropped to about 3 tokens/second. Speculative decoding needs GPU acceleration to matter.
Practical Limitations
This isn’t a magic fix for all inference speed problems. If your main model is already maxing out your GPU, adding a draft model might actually slow things down due to memory bandwidth limits. I saw this when I tried it with a 13B model—my VRAM was fine, but token speed dropped slightly.
Also, the draft model needs to be good enough to make reasonable guesses. If it’s too small or too different from the main model, you’re just wasting compute on bad predictions.
Finally, this only helps with generation speed, not prompt processing. If you’re sending large contexts, that initial processing time stays the same.
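This is an Amdahl's-law situation: total latency is prefill time plus generation time, and speculative decoding only shrinks the second term. A sketch with illustrative prefill times (actual prefill depends on hardware and context size) shows how a large prompt blunts the end-to-end speedup:

```python
# Total latency = prefill_time + n_tokens / tokens_per_second.
# Speculative decoding improves only the generation term, so the
# end-to-end speedup shrinks as the prompt (prefill) grows.
# Prefill times below are illustrative, not measured.

def latency(prefill_s, n_tokens, tps):
    return prefill_s + n_tokens / tps

for prefill in (0.5, 10.0):  # small prompt vs. large context
    before = latency(prefill, 300, 22)
    after = latency(prefill, 300, 38)
    print(f"prefill {prefill}s: {before:.1f}s -> {after:.1f}s "
          f"({before / after:.2f}x end-to-end)")
```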
Key Takeaways
- Speculative decoding in LM Studio works if you pick compatible models and have enough VRAM
- Use a draft model from the same family as your main model—shared tokenizers matter
- Start with 5 draft tokens and adjust based on your actual rejection rate
- Bigger speedups come from longer generations and structured outputs
- Don’t expect miracles on short responses or if you’re already GPU-constrained
For my workflow, this made local inference feel noticeably snappier without any quality trade-offs. It’s not something I’d recommend for every setup, but if you’re running models locally and have the VRAM headroom, it’s worth testing.