# Implementing Semantic Caching for LLM APIs: Using Vector Embeddings to Match Similar Queries and Reduce Inference Costs
## Why I Built This
I run several AI-powered workflows on my self-hosted infrastructure. Some use OpenAI’s API, others hit Anthropic or local models through Ollama. The common problem: repeated queries that are semantically identical but worded differently were costing me money and time.
A user asks “What’s the weather like today?” and ten minutes later someone else asks “How’s the weather right now?” — these should return the same cached response, but traditional key-value caching treats them as completely different requests.
I needed semantic caching: a system that understands when two queries mean the same thing, even if the words differ. This would cut my API costs and speed up response times for my n8n workflows and custom tools.
## My Setup and Context
I’m running this on my Proxmox home lab:
- PostgreSQL container with pgvector extension for vector storage
- Python service that sits between my applications and LLM APIs
- OpenAI’s text-embedding-3-small model for generating embeddings (cheap and fast)
- Redis for metadata and quick lookups
The architecture is simple: when a query comes in, I generate its embedding, search for similar cached embeddings, and if I find a match above my similarity threshold, I return the cached response instead of hitting the LLM API.
## How I Actually Implemented It
### Setting Up pgvector
I started with a basic PostgreSQL 16 container and installed the pgvector extension. The schema I settled on looks like this:
```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE semantic_cache (
    id SERIAL PRIMARY KEY,
    query_text TEXT NOT NULL,
    query_embedding vector(1536),
    response_text TEXT NOT NULL,
    model_name TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT NOW(),
    hit_count INTEGER DEFAULT 0,
    last_accessed TIMESTAMP DEFAULT NOW()
);

CREATE INDEX ON semantic_cache
USING ivfflat (query_embedding vector_cosine_ops)
WITH (lists = 100);
```
The vector dimension is 1536 because that’s what OpenAI’s text-embedding-3-small outputs. I use cosine similarity for matching because it works well for semantic similarity regardless of vector magnitude.
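In pgvector, the `<=>` operator computes cosine distance, so similarity is `1 - distance`. A minimal pure-Python sketch of the metric itself, to make the threshold numbers later concrete:

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|): compares direction only, so magnitude cancels out.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

A vector and any positive scaling of it score 1.0, which is why cosine similarity works regardless of embedding magnitude.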
### The Caching Logic
My Python service does this for every incoming query:
1. Generate an embedding for the incoming query using OpenAI’s embedding API
2. Search pgvector for similar cached embeddings using cosine similarity
3. If the similarity score is above 0.92, return the cached response
4. If not, call the actual LLM API and cache the result with its embedding
The threshold of 0.92 came from testing. I tried 0.85 first but got too many false positives where semantically different queries matched. At 0.95, I missed obvious matches. 0.92 feels right for my use cases.
```python
import openai
import psycopg2
from pgvector.psycopg2 import register_vector

# DATABASE_URL is defined elsewhere in the service configuration.

def get_embedding(text):
    response = openai.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def search_cache(query_text, model_name, threshold=0.92):
    embedding = get_embedding(query_text)
    with psycopg2.connect(DATABASE_URL) as conn:
        register_vector(conn)
        with conn.cursor() as cur:
            # <=> is pgvector's cosine distance operator;
            # 1 - distance gives cosine similarity.
            cur.execute("""
                SELECT
                    id,
                    response_text,
                    1 - (query_embedding <=> %s::vector) AS similarity
                FROM semantic_cache
                WHERE model_name = %s
                ORDER BY query_embedding <=> %s::vector
                LIMIT 1
            """, (embedding, model_name, embedding))
            result = cur.fetchone()
            if result and result[2] >= threshold:
                # Update hit count and last accessed
                cur.execute("""
                    UPDATE semantic_cache
                    SET hit_count = hit_count + 1,
                        last_accessed = NOW()
                    WHERE id = %s
                """, (result[0],))
                conn.commit()
                return result[1]
    return None
```
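On a miss, the service calls the LLM, stores the response with its embedding, and returns it. A sketch of that cache-aside flow with the collaborators passed in as functions, so it can be exercised without a live database or API — the function names here are illustrative, not my exact service code:

```python
def get_response(query_text, model_name, llm_call, cache_search, cache_store):
    # Cache-aside flow: serve a sufficiently similar cached answer if one
    # exists, otherwise call the LLM and cache the fresh response.
    cached = cache_search(query_text, model_name)
    if cached is not None:
        return cached, True    # cache hit
    response = llm_call(query_text, model_name)
    cache_store(query_text, model_name, response)
    return response, False     # cache miss, response now cached
```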
## What I Learned About Similarity Thresholds
Different types of queries need different thresholds. For factual questions like “What’s the capital of France?”, I can use a higher threshold (0.94) because the semantic space is narrow. For open-ended queries like “Tell me about machine learning”, I need a lower threshold (0.88) because there’s more variation in how people phrase things.
I ended up adding a query_type field to my cache table and adjusting thresholds dynamically based on detected query patterns. This required some manual categorization at first, but it improved hit rates significantly.
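A simplified sketch of the dynamic thresholds; the categories and keyword heuristics here are illustrative stand-ins for what began as manual categorization:

```python
# Per-type thresholds from testing: narrow semantic space allows a higher
# bar, open-ended phrasing needs a lower one.
THRESHOLDS = {"factual": 0.94, "open_ended": 0.88, "default": 0.92}

def detect_query_type(query_text):
    # Crude prefix matching; real classification needs more signal than this.
    q = query_text.lower().strip()
    if q.startswith(("what is", "what's", "who is", "when ", "where ")):
        return "factual"
    if q.startswith(("tell me about", "explain", "describe")):
        return "open_ended"
    return "default"

def threshold_for(query_text):
    return THRESHOLDS[detect_query_type(query_text)]
```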
## Real Performance Numbers
Over two weeks of running this in production with my n8n workflows:
- Cache hit rate: 34% (meaning one-third of queries were served from cache)
- Average response time for cache hits: 180ms vs 2.8s for API calls
- Cost savings: approximately $47 in API fees (I process around 3,000 queries per week)
- Embedding generation cost: about $2 for the same period
The 34% hit rate surprised me. I expected higher, but many of my workflows involve time-sensitive or context-specific queries that legitimately shouldn’t match previous ones.
## Problems I Hit
### Stale Cache Data
The biggest issue: responses that were correct when cached became outdated. A query about “current Bitcoin price” cached at 9 AM shouldn’t be returned at 3 PM.
I added a TTL system based on query patterns:
```python
def get_ttl_for_query(query_text):
    time_sensitive_keywords = ['current', 'now', 'today', 'latest', 'recent']
    if any(keyword in query_text.lower() for keyword in time_sensitive_keywords):
        return 3600  # 1 hour
    return 86400 * 7  # 7 days for general queries
```
I run a cleanup job every hour that removes expired entries based on their created_at timestamp and calculated TTL.
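The expiry check at the heart of that job is simple enough to isolate. This sketch assumes the hourly job recomputes each row's TTL from its stored `query_text` and deletes rows where the check holds:

```python
from datetime import datetime, timedelta

def is_expired(created_at, ttl_seconds, now=None):
    # An entry expires once its age exceeds the TTL for its query pattern.
    now = now or datetime.utcnow()
    return now - created_at > timedelta(seconds=ttl_seconds)
```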
### Context-Dependent Queries
Some queries need conversation context to make sense. “What about Python?” means nothing without knowing the previous question was about programming languages.
My solution: I hash the last 3 messages in a conversation thread and include that hash in the cache key. This way, semantically similar queries only match if they appear in similar conversation contexts.
```python
import hashlib

def get_context_hash(conversation_history):
    if not conversation_history:
        return "no_context"
    recent_messages = conversation_history[-3:]
    context_string = " ".join([msg['content'] for msg in recent_messages])
    return hashlib.md5(context_string.encode()).hexdigest()[:8]
```
This reduced false positives by about 60% for conversational workflows.
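Concretely, the hash becomes part of an exact-match key alongside the model name, and the similarity search only runs against rows sharing that key. A sketch of the composition — the key format is my illustration, and `get_context_hash` is repeated so the example is self-contained:

```python
import hashlib

def get_context_hash(conversation_history):
    # Same helper as above, repeated for a self-contained example.
    if not conversation_history:
        return "no_context"
    recent = conversation_history[-3:]
    context_string = " ".join(msg["content"] for msg in recent)
    return hashlib.md5(context_string.encode()).hexdigest()[:8]

def cache_partition_key(model_name, conversation_history):
    # Exact-match part of the cache key; embedding similarity is only
    # compared within one partition.
    return f"{model_name}:{get_context_hash(conversation_history)}"
```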
### Embedding API Latency
Generating embeddings for every query adds latency. OpenAI’s embedding endpoint typically responds in 100-200ms, but that’s still overhead.
I tried caching embeddings themselves in Redis with the query text as the key, but the hit rate was too low (only 8%) because even small wording changes meant new embeddings. The complexity wasn’t worth it.
What did work: batching embedding requests when possible. For workflows that generate multiple queries at once, I send them all in a single embedding API call. This cut embedding costs by about 40% for those specific workflows.
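The batching itself is trivial; the win comes from the OpenAI embeddings endpoint accepting a list input and returning embeddings in order, so N queries cost one round trip per batch instead of N. A sketch with the embedding call injected (the batch size of 100 is my conservative choice, not a documented limit):

```python
def get_embeddings_batch(texts, embed_fn, batch_size=100):
    # embed_fn takes a list of strings and returns their embeddings in
    # order, e.g. a thin wrapper around openai.embeddings.create.
    embeddings = []
    for i in range(0, len(texts), batch_size):
        embeddings.extend(embed_fn(texts[i:i + batch_size]))
    return embeddings
```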
## When This Doesn’t Work
Semantic caching isn’t useful for:
- Highly personalized queries where user context matters
- Creative generation tasks where variety is the goal
- Queries requiring real-time data
- Low-volume applications where the infrastructure overhead exceeds savings
I disabled semantic caching for my creative writing workflows because users explicitly want different outputs for similar prompts. The cache was counterproductive there.
## Monitoring and Maintenance
I track these metrics in Grafana:
- Cache hit rate by hour and query type
- Average similarity scores for hits vs misses
- Response time distribution (cache vs API)
- Database size and query performance
The pgvector index needs occasional maintenance. I run VACUUM and REINDEX weekly because the ivfflat index can degrade with frequent updates. This keeps query times under 50ms even with 100k+ cached entries.
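The weekly job boils down to two statements. The index name below is the Postgres auto-generated default for the `CREATE INDEX` statement above; verify yours with `\di` before scripting this:

```sql
-- Weekly maintenance: reclaim dead tuples, refresh planner statistics,
-- and rebuild the ivfflat index after heavy insert/update churn.
VACUUM ANALYZE semantic_cache;
REINDEX INDEX semantic_cache_query_embedding_idx;
```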
## Cost-Benefit Reality Check
Building and maintaining this system took about 12 hours of my time. At my current usage levels, I’m saving roughly $200/month in API costs. The embedding costs add about $8/month.
If you’re processing fewer than 10,000 queries per month, traditional key-value caching with exact string matching is probably enough. The complexity of semantic caching only pays off at scale or when query variation is high.
For my use case — multiple users asking similar questions through different interfaces — it’s been worth it. The speed improvement alone justifies the effort, even ignoring cost savings.
## Key Takeaways
- Semantic caching works best for high-volume, repetitive query patterns with natural language variation
- Similarity thresholds need tuning based on your specific query types
- Context matters: conversational queries need conversation-aware cache keys
- TTL strategies must match query semantics (time-sensitive vs evergreen)
- Monitoring is critical because cache effectiveness changes as usage patterns evolve
- The infrastructure cost and complexity only make sense above a certain query volume
I’m still tuning threshold values and TTL rules based on observed patterns. This isn’t a set-it-and-forget-it system — it requires ongoing adjustment as my workflows and query patterns change.