Why I Started Looking Into This
I run LM Studio on my Proxmox server to handle local AI inference for various automation tasks. The setup worked perfectly for the first few days after loading a model, but I noticed something odd: after about a week of uptime, token generation would slow to a crawl. What started as 40-50 tokens per second would drop to 5-10, sometimes lower.
This wasn’t a gradual decline. The model would work fine, then suddenly tank. Restarting LM Studio fixed it immediately, but that meant interrupting running tasks and losing whatever context was loaded. I needed to understand what was actually happening.
My Setup and Initial Observations
I’m running LM Studio 0.2.x on a Proxmox VM with:
- 32GB RAM allocated
- NVIDIA GPU passed through (RTX 3060)
- Models stored on NFS mount from my Synology
- Ubuntu 22.04 as the guest OS
The model I use most is a 13B parameter quantized version (Q4_K_M). Under normal conditions, it generates text quickly and handles concurrent requests from my n8n workflows without issue.
The slowdown pattern was consistent: after 5-7 days of continuous operation, performance would degrade. It didn’t matter if I was making heavy use of the model or if it sat mostly idle. The uptime itself seemed to be the trigger.
What I Tried First (That Didn’t Work)
My initial assumption was memory pressure. I checked RAM usage, GPU memory, and system logs. Nothing stood out. The VM had plenty of free memory, and the GPU wasn’t showing any thermal throttling or memory exhaustion.
I tried:
- Increasing the VM’s RAM allocation to 48GB
- Adjusting LM Studio’s context window settings
- Monitoring disk I/O on the NFS mount
- Checking for CPU steal time in Proxmox
None of these made a difference. The slowdown still happened on the same timeline, regardless of resource availability.
Finding the Actual Problem
I started digging into LM Studio’s local data directory. On Linux, this lives at ~/.cache/lm-studio/. Inside, there’s a subdirectory structure that includes model caches, conversation history, and various runtime files.
What caught my attention was the size of the cache directory. After a fresh start, it would be a few hundred megabytes. After a week, it had grown to over 8GB. This wasn’t just conversation history—something was accumulating in the model cache itself.
I examined the cache files more closely. LM Studio appears to cache processed model data, likely to speed up repeated inference operations. But these cache files weren’t being cleaned up or rotated. They just kept growing.
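To check this on your own install, a couple of `du` invocations show where the space is going. This is a sketch based on my setup's path; the guard keeps it safe on a fresh host where the directory doesn't exist yet:

```shell
#!/bin/bash
# Quick look at what is growing inside LM Studio's data directory.
# The path matches my Linux install; adjust if yours differs.
CACHE_DIR="${CACHE_DIR:-$HOME/.cache/lm-studio}"

if [ -d "$CACHE_DIR" ]; then
    du -sh "$CACHE_DIR"                                  # total size
    du -h --max-depth=1 "$CACHE_DIR" | sort -rh | head   # largest subdirectories first
else
    echo "no cache directory at $CACHE_DIR"
fi
```

Running this right after a restart and again a few days later makes the growth obvious without any extra tooling.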
When I deleted the cache directory and restarted LM Studio, performance returned to normal immediately. The model reloaded, rebuilt its cache from scratch, and token generation was back to 40+ tokens per second.
Why This Happens
From what I can tell, LM Studio’s caching mechanism doesn’t have aggressive garbage collection. Over time, as you run different prompts and contexts, the cache accumulates stale or fragmented data. The model still tries to reference this cache during inference, which slows down the entire generation process.
This isn’t a memory leak in the traditional sense—the system isn’t running out of RAM. It’s more like cache corruption or bloat. The cached data becomes less useful but still gets checked during token generation, adding overhead.
Implementing Automatic Recovery
I didn’t want to manually restart LM Studio every week, so I built a simple monitoring and recovery system using a bash script and cron.
The Monitoring Script
I created a script that:
- Checks LM Studio’s process uptime
- Monitors the cache directory size
- Tests token generation speed with a standard prompt
- Triggers a restart if performance drops below a threshold
Here’s the core logic:
```bash
#!/bin/bash
CACHE_DIR="$HOME/.cache/lm-studio"
MAX_CACHE_SIZE_GB=5
MIN_TOKENS_PER_SEC=20
TEST_PROMPT="Explain quantum computing in one sentence."

# Check cache size (GNU du -s reports 1K blocks)
cache_size=$(du -s "$CACHE_DIR" | awk '{print $1}')
cache_size_gb=$((cache_size / 1024 / 1024))

# Test token generation speed against the local API
start_time=$(date +%s)
response=$(curl -s -X POST http://localhost:1234/v1/completions \
    -H "Content-Type: application/json" \
    -d "{\"prompt\": \"$TEST_PROMPT\", \"max_tokens\": 50}" \
    | jq -r '.choices[0].text')
end_time=$(date +%s)

# Calculate rough tokens per second (word count as a proxy for tokens)
duration=$((end_time - start_time))
[ "$duration" -eq 0 ] && duration=1   # guard against division by zero
tokens_generated=$(echo "$response" | wc -w)
tokens_per_sec=$((tokens_generated / duration))

# Decide if a restart is needed
if [ "$cache_size_gb" -gt "$MAX_CACHE_SIZE_GB" ] || [ "$tokens_per_sec" -lt "$MIN_TOKENS_PER_SEC" ]; then
    echo "Performance degraded. Restarting LM Studio..."
    pkill -f "lm-studio"
    sleep 5   # let the process exit before clearing its cache
    rm -rf "$CACHE_DIR"/*
    nohup /path/to/lm-studio &
fi
```
This script runs every 6 hours via cron. It’s not perfect—the token-per-second calculation is rough and doesn’t account for prompt complexity—but it’s good enough to catch obvious slowdowns.
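For reference, the crontab entry looks like this. The script path and log location are my choices, not anything LM Studio requires; anchoring the hour list at 3 AM keeps one of the runs in the middle of the night:

```cron
# m h           dom mon dow  command — every 6 hours, anchored at 3 AM
0   3,9,15,21   *   *   *    /home/user/scripts/lm-studio-monitor.sh >> /var/log/lm-studio-monitor.log 2>&1
```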
Improvements I Made
The initial version was too aggressive. It would restart LM Studio even when performance dips were temporary (like when the model was handling a particularly complex prompt). I added a retry mechanism: if the test fails once, wait 10 minutes and test again. Only restart if both tests show degraded performance.
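The retry gate can be sketched as a small wrapper, assuming a `check_performance` function (a stand-in for the curl-based speed test) that returns success while generation is healthy:

```shell
#!/bin/bash
# Only restart after two consecutive failed checks, 10 minutes apart.
# check_performance is a hypothetical stand-in for the curl-based speed
# test: it returns 0 while generation is healthy and 1 when degraded.
RETRY_DELAY="${RETRY_DELAY:-600}"

needs_restart() {
    if check_performance; then
        return 1                  # healthy: nothing to do
    fi
    sleep "$RETRY_DELAY"          # wait out a possible transient dip
    if check_performance; then
        return 1                  # recovered: the dip was temporary
    fi
    return 0                      # degraded twice in a row: restart
}
```

Separating the decision into its own function also makes the logic easy to test with a mocked `check_performance`.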
I also added logging so I could track when restarts happened and correlate them with actual usage patterns:
```bash
echo "$(date): Cache size ${cache_size_gb}GB, Speed ${tokens_per_sec} t/s" >> /var/log/lm-studio-monitor.log
```
Over time, I noticed that cache size alone was a better predictor than token speed. If the cache exceeded 4-5GB, performance was almost always degraded. So I simplified the script to focus primarily on cache size, with token speed as a secondary check.
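The simplified version reduces to a size comparison. A minimal sketch, using the 4GB threshold that held up on my setup (the restart logic itself is the same as in the full script):

```shell
#!/bin/bash
# Cache-size-first check. The threshold is what worked for my setup;
# GNU du -s reports size in 1K blocks.
CACHE_DIR="${CACHE_DIR:-$HOME/.cache/lm-studio}"
MAX_CACHE_SIZE_GB=4

cache_kb=$(du -s "$CACHE_DIR" 2>/dev/null | awk '{print $1}')
cache_gb=$(( ${cache_kb:-0} / 1024 / 1024 ))

if [ "$cache_gb" -gt "$MAX_CACHE_SIZE_GB" ]; then
    echo "cache at ${cache_gb}GB, over the ${MAX_CACHE_SIZE_GB}GB limit: restarting"
    # restart and cache wipe go here, as in the full monitoring script
fi
```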
What I Learned
This problem taught me a few things about running long-lived inference services:
- Cache management matters more than I expected. I assumed LM Studio would handle this internally, but it doesn’t. Many self-hosted AI tools don’t have robust cache cleanup.
- Uptime isn’t always a virtue. For some services, periodic restarts are healthy. Treating every service like it needs 99.9% uptime creates unnecessary complexity.
- Monitoring needs to be specific. Generic system metrics (CPU, RAM, disk) didn’t reveal this issue. I had to monitor the actual behavior I cared about: token generation speed.
- Automation doesn’t have to be perfect. My script is crude, but it solves the problem. I didn’t need a sophisticated monitoring stack.
Current State
The automated recovery system has been running for about three months now. LM Studio restarts roughly once a week, usually triggered by cache size rather than token speed. The restarts happen during low-usage periods (I schedule the cron job to run at 3 AM), so they rarely interrupt active tasks.
I still don’t fully understand why LM Studio’s cache grows the way it does. I’ve looked through the GitHub issues and documentation, but there’s no clear explanation of the caching strategy or cleanup policies. It’s possible this is fixed in newer versions—I haven’t upgraded yet because the current setup works.
If you’re running LM Studio or similar local inference tools over long periods, watch the cache directories. If performance degrades over time, start there before assuming it’s a hardware or configuration issue.