
Implementing LLM Response Caching with Redis: Reducing Ollama Inference Costs for Repeated Queries in n8n Workflows

Why I Built LLM Response Caching with Redis

I run several n8n workflows that use Ollama for text processing, summarization, and question answering. The problem became obvious fast: users would ask the same question in slightly different ways, and my workflows would waste 5–10 seconds regenerating identical answers every single time.

My Ollama instance runs on a Proxmox VM with limited GPU resources. Every repeated inference meant wasted compute, slower responses, and frustrated users. I needed a way to recognize semantically similar queries and return cached responses instantly.

That’s when I started experimenting with Redis and semantic caching.

My Setup

Here’s what I’m actually running:

  • Ollama on a Proxmox VM (Llama 3.2 for responses, nomic-embed-text for embeddings)
  • Redis running in a Docker container on the same Proxmox host
  • n8n workflows that call Ollama via HTTP nodes
  • Python script as a middleware layer between n8n and Ollama

The goal was simple: check Redis first for similar queries, only hit Ollama if there’s no match.

How Semantic Caching Actually Works

Traditional caching stores exact matches. If I ask “What is Redis?” and then “Tell me about Redis”, a normal cache misses completely.

Semantic caching works differently. It converts queries into vector embeddings—numerical representations of meaning—and stores them in Redis. When a new query comes in, the system:

  1. Generates an embedding for the new query
  2. Searches Redis for similar embeddings (using vector similarity)
  3. Returns the cached response if the similarity is above a threshold
  4. Calls Ollama and stores the new response if no match exists

This means “What is Redis?” and “Explain Redis to me” can return the same cached result in milliseconds instead of regenerating it.
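The core of step 2 is a vector distance comparison. Here's a toy sketch of the idea with made-up 3-dimensional vectors (real embeddings from nomic-embed-text are 768-dimensional, and redisvl does this search inside Redis):

```python
import math

def cosine_distance(a, b):
    # Cosine distance = 1 - cosine similarity; 0.0 means identical direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy "embeddings" -- similar phrasings land close together in vector space
what_is_redis  = [0.9, 0.1, 0.2]
explain_redis  = [0.85, 0.15, 0.25]  # near-synonym query, nearby vector
what_is_docker = [0.1, 0.9, 0.3]     # different topic, distant vector

threshold = 0.1
print(cosine_distance(what_is_redis, explain_redis) <= threshold)   # cache hit
print(cosine_distance(what_is_redis, what_is_docker) <= threshold)  # cache miss
```

A cached entry counts as a match only when its distance to the new query falls under the threshold; everything else falls through to the LLM.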

Building the Cache Layer

I used the redisvl library because it has built-in support for semantic caching. The tricky part was integrating Ollama’s embedding model.

First, I created a custom vectorizer that wraps Ollama’s nomic-embed-text model:

from langchain_ollama import OllamaEmbeddings
from redisvl.utils.vectorize import CustomTextVectorizer
import asyncio
from typing import List

def create_vectorizer():
    # Local embedding model served by Ollama; nomic-embed-text produces 768-dim vectors
    ollama_embedder = OllamaEmbeddings(model='nomic-embed-text')
    
    def sync_embed(text: str) -> List[float]:
        return ollama_embedder.embed_query(text)
    
    def sync_embed_many(texts: List[str]) -> List[List[float]]:
        return ollama_embedder.embed_documents(texts)
    
    # Async variants run the blocking Ollama calls in a worker thread
    async def async_embed(text: str) -> List[float]:
        return await asyncio.to_thread(sync_embed, text)
    
    async def async_embed_many(texts: List[str]) -> List[List[float]]:
        return await asyncio.to_thread(sync_embed_many, texts)
    
    return CustomTextVectorizer(
        embed=sync_embed,
        aembed=async_embed,
        embed_many=sync_embed_many,
        aembed_many=async_embed_many
    )

This gives Redis a way to generate embeddings using my local Ollama instance instead of calling an external API.

The Main Caching Logic

Next, I built the actual cache check and store logic:

import time
from langchain_ollama import ChatOllama
from redisvl.extensions.llmcache import SemanticCache
from cache_vectorizer import create_vectorizer

# Build the Ollama-backed vectorizer defined above
vectorizer = create_vectorizer()

llmcache = SemanticCache(
    name="OllamaLLMCache",
    redis_url="redis://localhost:6379",
    distance_threshold=0.1,
    vectorizer=vectorizer,
    dimension=768
)

client = ChatOllama(model="llama3.2")

def ask_ollama(question: str) -> str:
    response = client.invoke(input=question)
    return response.content

question = "What is Docker?"

start_time = time.time()
cached_response = llmcache.check(prompt=question)
cache_time = time.time() - start_time

if cached_response:
    print("Cache hit!")
    print(f"Response: {cached_response[0]['response']}")
    print(f"Time taken: {cache_time:.4f} seconds")
else:
    start_time = time.time()
    response = ask_ollama(question)
    llm_time = time.time() - start_time
    
    print("No cache hit. LLM response:")
    print(f"Response: {response}")
    print(f"Time taken: {llm_time:.4f} seconds")
    
    llmcache.store(prompt=question, response=response)
    print("Stored in cache!")

The distance_threshold parameter controls how close a query's embedding must be to a cached one to count as a match. I started with 0.1, which is fairly strict. Lower values require near-identical meaning; higher values accept looser matches.
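To make the trade-off concrete, here's a tiny sketch. The prompts and distance values are made up for illustration, not measured:

```python
def is_cache_hit(distance: float, threshold: float) -> bool:
    # Smaller vector distance means more similar; a cached entry is a
    # hit only when its distance to the new query is at or below the threshold
    return distance <= threshold

# Hypothetical distances between a new query and two cached prompts
distances = {"Explain Docker usage": 0.04, "What is Proxmox?": 0.18}

for prompt, d in distances.items():
    print(prompt, "strict(0.1):", is_cache_hit(d, 0.1))
    print(prompt, "loose(0.2): ", is_cache_hit(d, 0.2))
```

At 0.1 only the near-identical query matches; at 0.2 the unrelated Proxmox question would match too, which is exactly the wrong-answer failure mode described below.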

Integrating with n8n

My n8n workflows call this Python script via an HTTP Request node. The workflow sends a JSON payload with the user’s question, the script checks Redis, and returns either the cached response or a fresh Ollama result.

Here’s what a typical n8n flow looks like:

  1. Webhook receives user input
  2. HTTP Request node calls my Python cache service
  3. If cache hit: return response immediately
  4. If cache miss: script calls Ollama, stores result, returns response
  5. n8n sends the response back to the user

This keeps the caching logic separate from n8n, which makes debugging and updates much easier.
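The cache-aside logic the service implements can be sketched as a single handler function. The names, the payload shape, and the in-memory stand-ins here are illustrative, not the exact service I run:

```python
from typing import Callable, Optional

def handle_request(payload: dict,
                   cache_check: Callable[[str], Optional[str]],
                   cache_store: Callable[[str, str], None],
                   call_llm: Callable[[str], str]) -> dict:
    """Cache-aside handler sitting behind the n8n HTTP Request node."""
    question = payload["question"]

    cached = cache_check(question)
    if cached is not None:
        return {"response": cached, "cache_hit": True}

    # Cache miss: ask the LLM, then store the answer for next time
    answer = call_llm(question)
    cache_store(question, answer)
    return {"response": answer, "cache_hit": False}

# Example with an in-memory dict standing in for Redis and a stub for Ollama
store: dict = {}
result = handle_request(
    {"question": "What is Docker?"},
    cache_check=store.get,
    cache_store=store.__setitem__,
    call_llm=lambda q: "Docker is a container platform.",
)
```

Passing the cache and LLM as callables is what makes the logic easy to test without a live Redis or Ollama, which is most of why I kept it out of n8n.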

What Actually Worked

After running this for a few weeks, here’s what I observed:

  • Response times dropped dramatically: Cache hits return in 50–150ms instead of 5–10 seconds
  • Ollama load decreased: About 70% of queries now hit the cache
  • Semantic matching works surprisingly well: Questions like “How do I use Docker?” and “Explain Docker usage” both hit the same cached result
  • Redis memory usage is minimal: Even with hundreds of cached responses, Redis uses less than 100MB

The biggest win was in my documentation Q&A workflow. Users constantly ask variations of the same questions, and now they get instant answers without hammering Ollama.

What Didn’t Work

Not everything went smoothly:

Initial threshold tuning was annoying: I started with a threshold of 0.2, which matched too many unrelated queries. A user asking “What is Proxmox?” would sometimes get a cached answer about Docker. Lowering it to 0.1 fixed this, but it took trial and error.

Embedding generation adds latency: Even on cache misses, I now generate embeddings twice—once to check the cache, once to store the result. This adds about 200–300ms compared to calling Ollama directly. For my use case, that’s acceptable, but it’s a real trade-off.

Redis persistence isn’t automatic: I initially ran Redis without persistence enabled. After a container restart, all my cached responses disappeared. I fixed this by enabling RDB snapshots, but I should have done that from the start.
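For reference, snapshots can be enabled directly from the container command line; the paths and snapshot intervals below are illustrative, not my exact values:

```shell
# Enable RDB snapshots when starting the Redis container:
# snapshot after 60s if >=1000 keys changed, or after 300s if >=10 changed
docker run -d --name redis-cache \
  -p 6379:6379 \
  -v /opt/redis-data:/data \
  redis:7 redis-server --save 60 1000 --save 300 10
```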

Cache invalidation is manual: If I update my knowledge base or change how Ollama answers certain questions, old cached responses stick around. I haven’t built automatic invalidation yet, so I occasionally flush the cache manually.
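One direction I've considered is time-based expiry. This is a pure-Python sketch of the bookkeeping, not code from my setup; in practice Redis key TTLs (the EXPIRE command) can do the expiring server-side:

```python
import time

class TimestampedCache:
    """Illustrative TTL wrapper: entries older than max_age are treated as misses."""

    def __init__(self, max_age_seconds: float):
        self.max_age = max_age_seconds
        self._entries: dict = {}  # prompt -> (stored_at, response)

    def store(self, prompt: str, response: str) -> None:
        self._entries[prompt] = (time.time(), response)

    def check(self, prompt: str):
        hit = self._entries.get(prompt)
        if hit is None:
            return None
        stored_at, response = hit
        if time.time() - stored_at > self.max_age:
            # Stale: drop it so the next call regenerates the answer
            del self._entries[prompt]
            return None
        return response
```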

Key Takeaways

  • Semantic caching with Redis is practical and effective for local LLM setups
  • Tuning the distance threshold is critical—too high and you get wrong answers, too low and you miss valid matches
  • Embedding generation has a cost, but it’s worth it when cache hit rates are high
  • Redis persistence matters if you don’t want to rebuild your cache after every restart
  • This approach works best for workflows with repetitive queries, not one-off requests

What I’d Change

If I were starting over, I’d add:

  • Cache invalidation logic: Track when responses were cached and expire them after a set period
  • Monitoring: Log cache hit/miss rates to see which queries benefit most
  • Multiple thresholds: Different workflows might need different similarity levels
  • Fallback handling: If Redis goes down, the script should still call Ollama instead of failing
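The fallback idea is a small wrapper. This is a sketch with a stub standing in for Redis; in the real script the check function would be llmcache.check and the caught exceptions would be redis.exceptions.RedisError:

```python
def safe_cache_check(check_fn, question: str, errors=(Exception,)):
    # If the cache backend is unreachable, degrade to a cache miss
    # instead of failing the whole request
    try:
        return check_fn(question)
    except errors:
        return None

# Stub that simulates Redis being down
def broken_check(question):
    raise ConnectionError("Redis unavailable")

result = safe_cache_check(broken_check, "What is Docker?")
# result is None, so the caller falls through to calling Ollama directly
```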

Overall, this setup has made my n8n workflows faster and more efficient. If you’re running local LLMs and dealing with repetitive queries, semantic caching is worth the effort.
