Why I Hit This Wall
I run Continue.dev in VS Code connected to local Ollama models on my Proxmox server. The setup works well for small tasks—quick refactors, single-file questions, that sort of thing. But when I started asking it to analyze larger sections of my codebase, responses became incoherent or just stopped mid-sentence.
At first I thought the model was struggling with complexity. Then I realized: I was hitting context window limits, and I had no visibility into when or how that was happening.
What I Actually Ran
My setup is straightforward. Ollama runs on a dedicated VM with 32GB RAM and GPU passthrough. I use deepseek-coder:6.7b and codellama:13b depending on the task. Continue.dev connects via Ollama's API endpoint.
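A couple of quick checks confirm Continue.dev is actually talking to that endpoint; from the workstation it looks roughly like this (the address is my VM's):

# Confirm the server is reachable (it answers with "Ollama is running")
curl http://192.168.1.50:11434

# List the models the server can serve
curl http://192.168.1.50:11434/api/tags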
The default context size in Ollama is 2048 tokens. That's tiny when you're feeding it multiple files or a long conversation history. I didn't know this until I started digging through GitHub issues and found that Ollama doesn't expose context limits clearly—and Continue.dev doesn't warn you when you exceed them.
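You can at least see what a model reports from the CLI. On recent Ollama versions, ollama show prints the model's context length; there is also a flag that dumps the PARAMETER lines baked into the model (empty if nothing overrides the defaults). Roughly:

# Recent versions print model info, including context length
ollama show deepseek-coder:6.7b

# Dump only the PARAMETER lines baked into the model
ollama show deepseek-coder:6.7b --parameters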
Setting num_ctx
Ollama lets you increase context size with the num_ctx parameter. I added this to my Continue.dev config:
{
  "models": [{
    "title": "DeepSeek Coder",
    "provider": "ollama",
    "model": "deepseek-coder:6.7b",
    "apiBase": "http://192.168.1.50:11434",
    "requestOptions": {
      "num_ctx": 8192
    }
  }]
}
This worked, but introduced a new problem: every time I changed num_ctx in a request, Ollama unloaded and reloaded the model. That meant 10-15 second delays between queries. Watching models reload constantly was maddening.
The Model Reload Problem
Ollama treats different num_ctx values as different model instances. If you send one request with num_ctx 4096 and another with 8192, it unloads the first and loads the second. This happens even though it's the same underlying model.
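You can reproduce this outside the editor with two direct calls to Ollama's generate API that differ only in num_ctx; run ollama ps (or watch the server log) between them and you can see the model swap out. A minimal sketch against my server:

# First request: the model loads with a 4096-token context
curl http://192.168.1.50:11434/api/generate -d '{
  "model": "deepseek-coder:6.7b",
  "prompt": "hello",
  "stream": false,
  "options": { "num_ctx": 4096 }
}'

# Same model, different num_ctx: Ollama unloads it and loads it again
curl http://192.168.1.50:11434/api/generate -d '{
  "model": "deepseek-coder:6.7b",
  "prompt": "hello again",
  "stream": false,
  "options": { "num_ctx": 8192 }
}'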
I tried setting a fixed num_ctx in a Modelfile and creating a custom model variant:
FROM deepseek-coder:6.7b
PARAMETER num_ctx 8192
Then I built it with ollama create deepseek-coder-8k -f Modelfile. This kept the context size consistent and stopped the constant reloading. But now I had to maintain separate model variants for different context sizes, which felt clunky.
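With the variant in place, the Continue.dev entry gets simpler: point it at the new name and drop the per-request num_ctx. Something like this, mirroring the earlier config:

{
  "models": [{
    "title": "DeepSeek Coder 8K",
    "provider": "ollama",
    "model": "deepseek-coder-8k",
    "apiBase": "http://192.168.1.50:11434"
  }]
}

ollama list should show deepseek-coder-8k alongside the base model once the create command finishes.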
What Didn't Work
I assumed Continue.dev would handle context management intelligently—trimming old messages or warning me when limits were close. It doesn't. It just sends everything to Ollama and lets the model deal with it.
Ollama's behavior when you exceed context limits is unclear. From what I observed, it silently truncates input. The system prompt stays intact, but older conversation history gets dropped. There's no error, no warning—just degraded responses.
I also tried using the embeddings API to count tokens (a workaround mentioned in GitHub issues), but that added latency and didn't integrate cleanly with Continue.dev's workflow.
What Actually Helped
I settled on three changes:
First, I created model variants with fixed num_ctx values: one at 8192 for normal work and one at 16384 for larger analysis tasks (the 16k variant is sketched below). This eliminated the reload problem.
Second, I started being more deliberate about what I fed the model. Instead of selecting entire files, I'd highlight specific functions or sections. Continue.dev's @file and @code commands let you control what gets included in the prompt.
Third, I kept conversations short. Once a thread got long, I'd start a new one rather than letting history pile up. Not elegant, but effective.
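For the first change, the variants are just small Modelfiles that differ only in num_ctx, built the same way as the 8k one above (the 16k file name is my own convention):

FROM deepseek-coder:6.7b
PARAMETER num_ctx 16384

Saved as Modelfile.16k and built with ollama create deepseek-coder-16k -f Modelfile.16k. Continue.dev then gets one entry per variant, and I pick the model in the editor depending on the task.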
Monitoring What's Happening
Ollama logs show token counts if you run it with verbose output:
OLLAMA_DEBUG=1 ollama serve
This prints prompt and response token counts to the console. It's not real-time feedback in the editor, but it helped me understand where limits were being hit.
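If you'd rather not tail the server console, the generate API reports the same numbers in its response: prompt_eval_count is the prompt token count and eval_count is the response token count. A quick check with a non-streaming request:

# The response JSON includes prompt_eval_count and eval_count
curl http://192.168.1.50:11434/api/generate -d '{
  "model": "deepseek-coder-8k",
  "prompt": "Summarize what num_ctx controls in one sentence.",
  "stream": false
}'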
The Bigger Issue
The real problem is that self-hosted code assistants don't handle context limits gracefully. Continue.dev assumes the model will just work. Ollama assumes you know what you're doing. Neither gives you visibility into what's actually happening.
For small projects, 2048 tokens is fine. For anything larger, you're flying blind unless you manually track token usage or create custom model configurations.
RAG (retrieval-augmented generation) is the standard answer here—chunk your codebase, embed it, retrieve relevant pieces. But that's a separate system to build and maintain. For my use case, just increasing num_ctx and being selective about input was enough.
What I Learned
Context limits are a hard constraint, not a soft one. When you hit them, behavior degrades silently. You won't get an error—you'll just get worse results.
Ollama's default of 2048 tokens is too small for most real work. Bump it to at least 8192. If you have the RAM, go higher.
Model reloading happens when context size changes between requests. Lock it down with a Modelfile if you want consistent performance.
Continue.dev is a thin layer over the model API. It doesn't manage context for you. You need to control what goes into each prompt manually.
Self-hosting means you own the debugging process. There's no dashboard, no automatic optimization. You watch logs, adjust parameters, and figure out what works through trial and error.