
Implementing LLM Response Caching with Redis: Reducing Ollama Inference Costs for Repeated Queries in n8n Workflows

Why I Built LLM Response Caching with Redis

I run several n8n workflows that use Ollama for text processing, summarization, and question answering. The problem became obvious fast: users would ask the same question in slightly different ways, and my workflows would waste 5–10 seconds regenerating identical answers every single time.

My Ollama instance runs on a Proxmox VM with limited GPU resources. Every repeated inference meant wasted compute, slower responses, and frustrated users. I needed a way to recognize semantically similar queries and return cached responses instantly.

That’s when I started experimenting with Redis and semantic caching.

My Setup

Here’s what I’m actually running:

  • Ollama on a Proxmox VM (Llama 3.2 for responses, nomic-embed-text for embeddings)
  • Redis running in a Docker container on the same Proxmox host
  • n8n workflows that call Ollama via HTTP nodes
  • Python script as a middleware layer between n8n and Ollama

The goal was simple: check Redis first for similar queries, only hit Ollama if there’s no match.

How Semantic Caching Actually Works

Traditional caching stores exact matches. If I ask “What is Redis?” and then “Tell me about Redis”, a normal cache misses completely.

Semantic caching works differently. It converts queries into vector embeddings—numerical representations of meaning—and stores them in Redis. When a new query comes in, the system:

  1. Generates an embedding for the new query
  2. Searches Redis for similar embeddings (using vector similarity)
  3. Returns the cached response if the similarity is above a threshold
  4. Calls Ollama and stores the new response if no match exists

This means “What is Redis?” and “Explain Redis to me” can return the same cached result in milliseconds instead of regenerating it.
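The core of step 2 is a vector distance comparison. Here's a toy sketch of the idea with made-up 3-dimensional vectors (real embeddings from nomic-embed-text are 768-dimensional, and redisvl does this search inside Redis):

```python
import math

def cosine_distance(a, b):
    # Cosine distance = 1 - cosine similarity; 0.0 means identical direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy "embeddings" -- similar phrasings land close together in vector space
what_is_redis  = [0.9, 0.1, 0.2]
explain_redis  = [0.85, 0.15, 0.25]  # near-synonym query, nearby vector
what_is_docker = [0.1, 0.9, 0.3]     # different topic, distant vector

threshold = 0.1
print(cosine_distance(what_is_redis, explain_redis) <= threshold)   # cache hit
print(cosine_distance(what_is_redis, what_is_docker) <= threshold)  # cache miss
```

A cached entry counts as a match only when its distance to the new query falls under the threshold; everything else falls through to the LLM.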

Building the Cache Layer

I used the redisvl library because it has built-in support for semantic caching. The tricky part was integrating Ollama’s embedding model.

First, I created a custom vectorizer that wraps Ollama’s nomic-embed-text model:

from langchain_ollama import OllamaEmbeddings
from redisvl.utils.vectorize import CustomTextVectorizer
import asyncio
from typing import List

def create_vectorizer():
    # Local embedding model served by Ollama; nomic-embed-text produces 768-dim vectors
    ollama_embedder = OllamaEmbeddings(model='nomic-embed-text')
    
    def sync_embed(text: str) -> List[float]:
        return ollama_embedder.embed_query(text)
    
    def sync_embed_many(texts: List[str]) -> List[List[float]]:
        return ollama_embedder.embed_documents(texts)
    
    # Async variants run the blocking Ollama calls in a worker thread
    async def async_embed(text: str) -> List[float]:
        return await asyncio.to_thread(sync_embed, text)
    
    async def async_embed_many(texts: List[str]) -> List[List[float]]:
        return await asyncio.to_thread(sync_embed_many, texts)
    
    return CustomTextVectorizer(
        embed=sync_embed,
        aembed=async_embed,
        embed_many=sync_embed_many,
        aembed_many=async_embed_many
    )

This gives Redis a way to generate embeddings using my local Ollama instance instead of calling an external API.

The Main Caching Logic

Next, I built the actual cache check and store logic:

import time
from langchain_ollama import ChatOllama
from redisvl.extensions.llmcache import SemanticCache
from cache_vectorizer import create_vectorizer

# Build the Ollama-backed vectorizer defined above
vectorizer = create_vectorizer()

llmcache = SemanticCache(
    name="OllamaLLMCache",
    redis_url="redis://localhost:6379",
    distance_threshold=0.1,
    vectorizer=vectorizer,
    dimension=768
)

client = ChatOllama(model="llama3.2")

def ask_ollama(question: str) -> str:
    response = client.invoke(input=question)
    return response.content

question = "What is Docker?"

start_time = time.time()
cached_response = llmcache.check(prompt=question)
cache_time = time.time() - start_time

if cached_response:
    print("Cache hit!")
    print(f"Response: {cached_response[0]['response']}")
    print(f"Time taken: {cache_time:.4f} seconds")
else:
    start_time = time.time()
    response = ask_ollama(question)
    llm_time = time.time() - start_time
    
    print("No cache hit. LLM response:")
    print(f"Response: {response}")
    print(f"Time taken: {llm_time:.4f} seconds")
    
    llmcache.store(prompt=question, response=response)
    print("Stored in cache!")

The distance_threshold parameter controls how close a query's embedding must be to a cached one to count as a match. I started with 0.1, which is fairly strict. Lower values require near-identical meaning; higher values accept looser matches.
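To make the trade-off concrete, here's a tiny sketch. The prompts and distance values are made up for illustration, not measured:

```python
def is_cache_hit(distance: float, threshold: float) -> bool:
    # Smaller vector distance means more similar; a cached entry is a
    # hit only when its distance to the new query is at or below the threshold
    return distance <= threshold

# Hypothetical distances between a new query and two cached prompts
distances = {"Explain Docker usage": 0.04, "What is Proxmox?": 0.18}

for prompt, d in distances.items():
    print(prompt, "strict(0.1):", is_cache_hit(d, 0.1))
    print(prompt, "loose(0.2): ", is_cache_hit(d, 0.2))
```

At 0.1 only the near-identical query matches; at 0.2 the unrelated Proxmox question would match too, which is exactly the wrong-answer failure mode described below.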

Integrating with n8n

My n8n workflows call this Python script via an HTTP Request node. The workflow sends a JSON payload with the user’s question, the script checks Redis, and returns either the cached response or a fresh Ollama result.

Here’s what a typical n8n flow looks like:

  1. Webhook receives user input
  2. HTTP Request node calls my Python cache service
  3. If cache hit: return response immediately
  4. If cache miss: script calls Ollama, stores result, returns response
  5. n8n sends the response back to the user

This keeps the caching logic separate from n8n, which makes debugging and updates much easier.
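The cache-aside logic the service implements can be sketched as a single handler function. The names, the payload shape, and the in-memory stand-ins here are illustrative, not the exact service I run:

```python
from typing import Callable, Optional

def handle_request(payload: dict,
                   cache_check: Callable[[str], Optional[str]],
                   cache_store: Callable[[str, str], None],
                   call_llm: Callable[[str], str]) -> dict:
    """Cache-aside handler sitting behind the n8n HTTP Request node."""
    question = payload["question"]

    cached = cache_check(question)
    if cached is not None:
        return {"response": cached, "cache_hit": True}

    # Cache miss: ask the LLM, then store the answer for next time
    answer = call_llm(question)
    cache_store(question, answer)
    return {"response": answer, "cache_hit": False}

# Example with an in-memory dict standing in for Redis and a stub for Ollama
store: dict = {}
result = handle_request(
    {"question": "What is Docker?"},
    cache_check=store.get,
    cache_store=store.__setitem__,
    call_llm=lambda q: "Docker is a container platform.",
)
```

Passing the cache and LLM as callables is what makes the logic easy to test without a live Redis or Ollama, which is most of why I kept it out of n8n.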

What Actually Worked

After running this for a few weeks, here’s what I observed:

  • Response times dropped dramatically: Cache hits return in 50–150ms instead of 5–10 seconds
  • Ollama load decreased: About 70% of queries now hit the cache
  • Semantic matching works surprisingly well: Questions like “How do I use Docker?” and “Explain Docker usage” both hit the same cached result
  • Redis memory usage is minimal: Even with hundreds of cached responses, Redis uses less than 100MB

The biggest win was in my documentation Q&A workflow. Users constantly ask variations of the same questions, and now they get instant answers without hammering Ollama.

What Didn’t Work

Not everything went smoothly:

Initial threshold tuning was annoying: I started with a threshold of 0.2, which matched too many unrelated queries. A user asking “What is Proxmox?” would sometimes get a cached answer about Docker. Lowering it to 0.1 fixed this, but it took trial and error.

Embedding generation adds latency: Even on cache misses, I now generate embeddings twice—once to check the cache, once to store the result. This adds about 200–300ms compared to calling Ollama directly. For my use case, that’s acceptable, but it’s a real trade-off.

Redis persistence isn’t automatic: I initially ran Redis without persistence enabled. After a container restart, all my cached responses disappeared. I fixed this by enabling RDB snapshots, but I should have done that from the start.
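For reference, snapshots can be enabled directly from the container command line; the paths and snapshot intervals below are illustrative, not my exact values:

```shell
# Enable RDB snapshots when starting the Redis container:
# snapshot after 60s if >=1000 keys changed, or after 300s if >=10 changed
docker run -d --name redis-cache \
  -p 6379:6379 \
  -v /opt/redis-data:/data \
  redis:7 redis-server --save 60 1000 --save 300 10
```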

Cache invalidation is manual: If I update my knowledge base or change how Ollama answers certain questions, old cached responses stick around. I haven’t built automatic invalidation yet, so I occasionally flush the cache manually.
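One direction I've considered is time-based expiry. This is a pure-Python sketch of the bookkeeping, not code from my setup; in practice Redis key TTLs (the EXPIRE command) can do the expiring server-side:

```python
import time

class TimestampedCache:
    """Illustrative TTL wrapper: entries older than max_age are treated as misses."""

    def __init__(self, max_age_seconds: float):
        self.max_age = max_age_seconds
        self._entries: dict = {}  # prompt -> (stored_at, response)

    def store(self, prompt: str, response: str) -> None:
        self._entries[prompt] = (time.time(), response)

    def check(self, prompt: str):
        hit = self._entries.get(prompt)
        if hit is None:
            return None
        stored_at, response = hit
        if time.time() - stored_at > self.max_age:
            # Stale: drop it so the next call regenerates the answer
            del self._entries[prompt]
            return None
        return response
```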

Key Takeaways

  • Semantic caching with Redis is practical and effective for local LLM setups
  • Tuning the distance threshold is critical—too high and you get wrong answers, too low and you miss valid matches
  • Embedding generation has a cost, but it’s worth it when cache hit rates are high
  • Redis persistence matters if you don’t want to rebuild your cache after every restart
  • This approach works best for workflows with repetitive queries, not one-off requests

What I’d Change

If I were starting over, I’d add:

  • Cache invalidation logic: Track when responses were cached and expire them after a set period
  • Monitoring: Log cache hit/miss rates to see which queries benefit most
  • Multiple thresholds: Different workflows might need different similarity levels
  • Fallback handling: If Redis goes down, the script should still call Ollama instead of failing
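The fallback idea is a small wrapper. This is a sketch with a stub standing in for Redis; in the real script the check function would be llmcache.check and the caught exceptions would be redis.exceptions.RedisError:

```python
def safe_cache_check(check_fn, question: str, errors=(Exception,)):
    # If the cache backend is unreachable, degrade to a cache miss
    # instead of failing the whole request
    try:
        return check_fn(question)
    except errors:
        return None

# Stub that simulates Redis being down
def broken_check(question):
    raise ConnectionError("Redis unavailable")

result = safe_cache_check(broken_check, "What is Docker?")
# result is None, so the caller falls through to calling Ollama directly
```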

Overall, this setup has made my n8n workflows faster and more efficient. If you’re running local LLMs and dealing with repetitive queries, semantic caching is worth the effort.
