
Building a Local LLM Response Cache with Redis: Reducing Inference Costs and Latency for Repeated Queries

Why I Built a Local LLM Response Cache

I run multiple LLMs locally—Mistral, Llama variants, and sometimes Qwen for specific tasks. These models live on my Proxmox cluster, served through containers running vLLM or Ollama depending on what I’m testing. The problem I kept hitting wasn’t model quality or speed. It was waste.

When I ask the same question twice, or when my automation scripts query the model with similar prompts, the LLM recalculates everything from scratch. Every token gets processed again. Every attention layer fires again. The GPU churns through the same math, burning watts and time for answers I already got five minutes ago.

I needed a way to recognize when a prompt was semantically identical to something I’d already asked, and return the cached response instead of hitting the model again. That’s when I started looking at Redis as a caching layer.

My Setup: Local LLMs + Redis on Proxmox

I run my LLMs in Docker containers on Proxmox VMs. The main serving engine is vLLM, which handles inference for models like Mistral-7B-Instruct. I have a separate Redis instance running in another container on the same host network, so latency between the LLM server and Redis is minimal.

My workflow looks like this:

  • Prompt comes in via API or n8n automation
  • Before sending it to the LLM, I check Redis for a semantically similar cached response
  • If found, return the cached result immediately
  • If not, run inference and store the result in Redis for future use

The key difference from traditional caching is that I’m not matching exact strings. I’m comparing the meaning of prompts using embeddings.
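The four steps above reduce to a thin wrapper around whatever lookup, inference, and store calls you use. A minimal sketch — `check_cache`, `run_inference`, and `cache_response` are placeholders for your own functions, not a fixed API:

```python
# Sketch of the check-then-infer flow described above. The three callables
# are placeholders for whatever lookup, inference, and store functions you use.
def cached_generate(prompt, check_cache, run_inference, cache_response):
    cached = check_cache(prompt)
    if cached is not None:
        return cached, True           # cache hit: skip inference entirely
    response = run_inference(prompt)
    cache_response(prompt, response)  # store for the next similar prompt
    return response, False            # cache miss: we paid for inference
```

In my setup, the lookup and store callables are the Redis-backed functions shown further down, and the inference callable is a request to the vLLM endpoint.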

How Semantic Caching Actually Works

When a prompt arrives, I convert it into a vector embedding using a small local model like all-MiniLM-L6-v2. This embedding is a numerical representation of the prompt’s meaning—not its exact words.

I then compute a similarity score between this new embedding and embeddings of previously cached prompts stored in Redis. If the similarity crosses a threshold (I use 0.85), I treat it as a cache hit and return the stored response.

Here’s what that looks like in practice:

import hashlib
import redis
import numpy as np
from sentence_transformers import SentenceTransformer

# Initialize Redis and embedding model
r = redis.Redis(host='localhost', port=6379, decode_responses=False)
model = SentenceTransformer('all-MiniLM-L6-v2')

def get_embedding(text):
    return model.encode(text)

def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

def check_cache(prompt, threshold=0.85):
    prompt_embedding = get_embedding(prompt)
    
    # Scan all cached prompts (linear scan; see the scaling notes below)
    for key in r.scan_iter("prompt:*"):
        cached_embedding = np.frombuffer(r.hget(key, "embedding"), dtype=np.float32)
        similarity = cosine_similarity(prompt_embedding, cached_embedding)
        
        if similarity >= threshold:
            return r.hget(key, "response").decode('utf-8')
    
    return None

def cache_response(prompt, response):
    prompt_embedding = get_embedding(prompt)
    # Python's built-in hash() is salted per process, so keys would change
    # across restarts; a stable digest keeps the cache usable after reboots.
    key = f"prompt:{hashlib.sha256(prompt.encode()).hexdigest()}"
    
    r.hset(key, mapping={
        "embedding": prompt_embedding.tobytes(),
        "response": response,
        "prompt": prompt
    })
    r.expire(key, 86400)  # 24 hour TTL

This approach works because semantically similar prompts produce similar embeddings, even if the wording differs.

What Worked

The cache hit rate was higher than I expected. For my automation workflows—especially those involving document summarization and repeated queries about system status—I saw around 40% of requests hitting the cache after the first day of use.

Response times for cached queries dropped from 2-3 seconds (inference time) to under 100ms. That’s a meaningful difference when you’re chaining multiple LLM calls together in an n8n workflow.

Using Redis instead of an in-memory dictionary meant the cache persisted across container restarts and was accessible from multiple processes. I could query the LLM from different automation scripts and still benefit from shared cache entries.

The embedding model (all-MiniLM-L6-v2) is small enough to run on CPU without noticeable overhead. Generating embeddings adds maybe 50ms to each request, which is negligible compared to LLM inference time.

What Didn’t Work

The similarity threshold is tricky. Set it too high (0.95+) and you miss legitimate cache hits. Set it too low (0.75) and you get false positives where the cached response doesn’t actually match the intent of the new prompt.

I started with 0.90 and found it too strict. Lowering it to 0.85 improved hit rates without introducing bad matches, but this required testing with real queries from my workflows.
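One way to make that tuning less ad hoc is to log real query pairs from your workflows, hand-label whether each pair should have been a cache hit, and pick the threshold that classifies the most pairs correctly. A minimal sketch — the candidate grid and the labeled data are made up for illustration:

```python
# Pick the threshold that best separates hand-labeled (similarity, should_match)
# pairs collected from real workflow queries. Purely illustrative.
def best_threshold(labeled_pairs, candidates=(0.75, 0.80, 0.85, 0.90, 0.95)):
    def accuracy(t):
        # Fraction of pairs where "similarity >= t" agrees with the label
        return sum((sim >= t) == should_match
                   for sim, should_match in labeled_pairs) / len(labeled_pairs)
    return max(candidates, key=accuracy)
```

Even a few dozen labeled pairs from your own prompts beat guessing, since the right threshold depends heavily on how your queries are phrased.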

Scanning all cached embeddings on every request doesn’t scale. With a few hundred cached prompts it’s fine; with thousands, the linear scan becomes a bottleneck. The plain open-source Redis build I’m running doesn’t support vector similarity search (that ships with Redis Stack’s search module), so I had to either accept this limitation or consider a vector database like Qdrant.
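Short of moving to a real vector index, the per-key Python loop can at least be collapsed into a single NumPy matrix-vector product by keeping the cached embeddings stacked in memory. A sketch, assuming `cached` holds one unit-normalized embedding per row:

```python
import numpy as np

# One matrix-vector product replaces the per-key similarity loop.
# `cached` is assumed to be an (N, d) array of unit-normalized embeddings.
def best_match(query, cached, threshold=0.85):
    q = query / np.linalg.norm(query)
    sims = cached @ q                  # cosine similarity against all N rows
    idx = int(np.argmax(sims))
    if sims[idx] >= threshold:
        return idx, float(sims[idx])
    return None, None
```

This is still O(N), but the constant factor drops sharply; past tens of thousands of entries, an approximate index in a vector database is the real fix.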

For now, I keep the cache size bounded by using short TTLs (24 hours) and limiting cache entries to high-value queries that I know will repeat.

Another issue: the cache doesn’t account for context changes. If I ask “What’s the weather?” twice, but the weather has changed between requests, the cached response is stale. For time-sensitive or stateful queries, semantic caching can return outdated information.
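One partial mitigation is to give time-sensitive prompts a much shorter TTL than static ones. A crude sketch — the keyword list is invented, and real classification would need more care than substring matching:

```python
# Assign a short TTL to prompts that look time-sensitive, so stale answers
# expire quickly. The keyword heuristic here is purely illustrative.
TIME_SENSITIVE = ("weather", "status", "now", "today", "current", "latest")

def ttl_for(prompt, default=86400, short=300):
    p = prompt.lower()
    return short if any(word in p for word in TIME_SENSITIVE) else default
```

The returned value would replace the fixed 86400 passed to r.expire(), so a weather query lives five minutes in the cache while a document summary keeps the full day.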

When This Makes Sense

Semantic caching with Redis works well when:

  • You have repeated queries with slight variations in wording
  • The LLM responses don’t need to be regenerated every time
  • You’re running local models where inference cost is measured in GPU time and power consumption
  • Your queries are bounded and predictable (automation, internal tools, specific use cases)

It’s less useful for:

  • Highly creative or open-ended generation tasks
  • Queries where freshness is critical
  • Scenarios where cache misses are rare (you’re better off optimizing inference directly)

Key Takeaways

Semantic caching is a practical way to reduce redundant LLM inference when you control the infrastructure. Using Redis as the cache backend worked because it’s fast, persistent, and easy to integrate with existing Python workflows.

The biggest gains came from workflows with repetitive queries—summarization tasks, status checks, and Q&A over static documents. For these use cases, caching cut response times by more than 90% and reduced GPU load noticeably.

The main limitation is scalability. Without vector search support, scanning embeddings becomes a problem as the cache grows. For small-scale or bounded use cases, this approach is sufficient. For larger deployments, a purpose-built vector database would be necessary.

If you’re running local LLMs and seeing the same questions come up repeatedly, this is worth implementing. The code is straightforward, the performance improvement is immediate, and Redis is something you’re probably already running.
