# Implementing Semantic Caching for LLM APIs: Using Vector Embeddings to Match Similar Queries and Reduce Inference Costs

Why I Built This

I run several AI-powered workflows on my self-hosted infrastructure. Some use OpenAI’s API, others hit Anthropic or local models through Ollama. The common problem: repeated queries that are semantically identical but worded differently were costing me money and time.

A user asks “What’s the weather like today?” and ten minutes later someone else asks “How’s the weather right now?” — these should return the same cached response, but traditional key-value caching treats them as completely different requests.

I needed semantic caching: a system that understands when two queries mean the same thing, even if the words differ. This would cut my API costs and speed up response times for my n8n workflows and custom tools.

My Setup and Context

I’m running this on my Proxmox home lab:

  • PostgreSQL container with pgvector extension for vector storage
  • Python service that sits between my applications and LLM APIs
  • OpenAI’s text-embedding-3-small model for generating embeddings (cheap and fast)
  • Redis for metadata and quick lookups

The architecture is simple: when a query comes in, I generate its embedding, search for similar cached embeddings, and if I find a match above my similarity threshold, I return the cached response instead of hitting the LLM API.
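Stripped of the database, that flow is easy to see in a toy in-memory version. This is a sketch for illustration only — the stub embed function and class names are mine, and the real service uses pgvector:

```python
import math

def cosine_similarity(a, b):
    # cosine similarity = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

class InMemorySemanticCache:
    """Toy version of the lookup flow; the production path uses pgvector."""

    def __init__(self, embed_fn, threshold=0.92):
        self.embed_fn = embed_fn      # maps text -> embedding vector
        self.threshold = threshold    # minimum cosine similarity for a hit
        self.entries = []             # list of (embedding, response) pairs

    def get(self, query):
        emb = self.embed_fn(query)
        best = max(self.entries,
                   key=lambda e: cosine_similarity(emb, e[0]),
                   default=None)
        if best and cosine_similarity(emb, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query, response):
        self.entries.append((self.embed_fn(query), response))
```

With a real embedding model, the two weather questions from earlier land close together in vector space and the second one comes back as a cache hit.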

How I Actually Implemented It

Setting Up pgvector

I started with a basic PostgreSQL 16 container and installed the pgvector extension. The schema I settled on looks like this:

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE semantic_cache (
    id SERIAL PRIMARY KEY,
    query_text TEXT NOT NULL,
    query_embedding vector(1536),
    response_text TEXT NOT NULL,
    model_name TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT NOW(),
    hit_count INTEGER DEFAULT 0,
    last_accessed TIMESTAMP DEFAULT NOW()
);

CREATE INDEX ON semantic_cache 
USING ivfflat (query_embedding vector_cosine_ops) 
WITH (lists = 100);

The vector dimension is 1536 because that’s what OpenAI’s text-embedding-3-small outputs. I use cosine similarity for matching because it works well for semantic similarity regardless of vector magnitude.

The Caching Logic

My Python service does this for every incoming query:

  1. Generate embedding for the incoming query using OpenAI’s embedding API
  2. Search pgvector for similar embeddings using cosine similarity
  3. If similarity score is above 0.92, return cached response
  4. If not, call the actual LLM API, cache the result with its embedding

The threshold of 0.92 came from testing. I tried 0.85 first but got too many false positives where semantically different queries matched. At 0.95, I missed obvious matches. 0.92 feels right for my use cases.

import os

import openai
import psycopg2
from pgvector.psycopg2 import register_vector

# Connection string for the pgvector database
DATABASE_URL = os.environ["DATABASE_URL"]

def get_embedding(text):
    response = openai.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def search_cache(query_text, model_name, threshold=0.92):
    embedding = get_embedding(query_text)
    
    with psycopg2.connect(DATABASE_URL) as conn:
        register_vector(conn)
        with conn.cursor() as cur:
            cur.execute("""
                SELECT
                    id,
                    response_text,
                    1 - (query_embedding <=> %s::vector) AS similarity
                FROM semantic_cache
                WHERE model_name = %s
                ORDER BY query_embedding <=> %s::vector
                LIMIT 1
            """, (embedding, model_name, embedding))
            
            result = cur.fetchone()
            
            if result and result[2] >= threshold:
                # Update hit count and last accessed
                cur.execute("""
                    UPDATE semantic_cache 
                    SET hit_count = hit_count + 1,
                        last_accessed = NOW()
                    WHERE id = %s
                """, (result[0],))
                conn.commit()
                return result[1]
    
    return None
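On a miss, the service calls the real API and writes the result back. A dependency-injected sketch of that wrapper (the function names are mine; in my service, cache_lookup would be search_cache above and cache_store would do the INSERT with the query's embedding):

```python
def cached_completion(query, model_name, llm_call, cache_lookup, cache_store,
                      threshold=0.92):
    """Return (response, was_cache_hit). On a miss, call the LLM and write back."""
    cached = cache_lookup(query, model_name, threshold)
    if cached is not None:
        return cached, True
    response = llm_call(query)
    cache_store(query, model_name, response)  # persists query, embedding, response
    return response, False
```

Keeping the lookup and store functions injectable also makes the hit/miss logic trivial to test without a database.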

What I Learned About Similarity Thresholds

Different types of queries need different thresholds. For factual questions like “What’s the capital of France?”, I can use a higher threshold (0.94) because the semantic space is narrow. For open-ended queries like “Tell me about machine learning”, I need a lower threshold (0.88) because there’s more variation in how people phrase things.

I ended up adding a query_type field to my cache table and adjusting thresholds dynamically based on detected query patterns. This required some manual categorization at first, but it improved hit rates significantly.
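A minimal version of that dynamic selection — the keyword lists here are illustrative stand-ins, not my actual categorization rules:

```python
def threshold_for_query(query_text):
    # Crude pattern-based categorization; real rules came from manual labeling
    q = query_text.lower()
    factual_markers = ("what is", "what's the", "who is", "when did", "capital of")
    open_ended_markers = ("tell me about", "explain", "describe", "thoughts on")
    if any(m in q for m in factual_markers):
        return 0.94   # narrow semantic space: demand close matches
    if any(m in q for m in open_ended_markers):
        return 0.88   # high phrasing variance: be more permissive
    return 0.92       # default that testing settled on
```

The returned value then replaces the fixed threshold argument in the cache lookup.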

Real Performance Numbers

Over two weeks of running this in production with my n8n workflows:

  • Cache hit rate: 34% (meaning one-third of queries were served from cache)
  • Average response time for cache hits: 180ms vs 2.8s for API calls
  • Cost savings: approximately $47 in API fees (I process around 3,000 queries per week)
  • Embedding generation cost: about $2 for the same period

The 34% hit rate surprised me. I expected higher, but many of my workflows involve time-sensitive or context-specific queries that legitimately shouldn’t match previous ones.

Problems I Hit

Stale Cache Data

The biggest issue: responses that were correct when cached became outdated. A query about “current Bitcoin price” cached at 9 AM shouldn’t be returned at 3 PM.

I added a TTL system based on query patterns:

def get_ttl_for_query(query_text):
    time_sensitive_keywords = ['current', 'now', 'today', 'latest', 'recent']
    
    if any(keyword in query_text.lower() for keyword in time_sensitive_keywords):
        return 3600  # 1 hour
    
    return 86400 * 7  # 7 days for general queries

I run a cleanup job every hour that removes expired entries based on their created_at timestamp and calculated TTL.
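The cleanup itself is short. A sketch with the TTL rule kept injectable (ttl_fn would be get_ttl_for_query above) so the pure expiry check is easy to test:

```python
from datetime import datetime, timedelta

def is_expired(created_at, ttl_seconds, now=None):
    # Pure check: has more than ttl_seconds elapsed since created_at?
    now = now or datetime.utcnow()
    return now - created_at > timedelta(seconds=ttl_seconds)

def cleanup_cache(conn, ttl_fn):
    """Delete cache rows whose per-query TTL has elapsed.

    ttl_fn maps query text to a TTL in seconds (e.g. get_ttl_for_query above).
    """
    with conn.cursor() as cur:
        cur.execute("SELECT id, query_text, created_at FROM semantic_cache")
        expired = [row_id for row_id, query_text, created_at in cur.fetchall()
                   if is_expired(created_at, ttl_fn(query_text))]
        if expired:
            cur.execute("DELETE FROM semantic_cache WHERE id = ANY(%s)", (expired,))
    conn.commit()
```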

Context-Dependent Queries

Some queries need conversation context to make sense. “What about Python?” means nothing without knowing the previous question was about programming languages.

My solution: I hash the last 3 messages in a conversation thread and include that hash in the cache key. This way, semantically similar queries only match if they appear in similar conversation contexts.

import hashlib

def get_context_hash(conversation_history):
    if not conversation_history:
        return "no_context"
    
    recent_messages = conversation_history[-3:]
    context_string = " ".join([msg['content'] for msg in recent_messages])
    return hashlib.md5(context_string.encode()).hexdigest()[:8]

This reduced false positives by about 60% for conversational workflows.
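Wiring the hash in means adding a context_hash column to the cache table (the column and function names here are mine) and scoping the nearest-neighbour search to it:

```python
def search_cache_with_context(cur, embedding, model_name, context_hash):
    # Same nearest-neighbour lookup, but scoped to the conversation context
    cur.execute("""
        SELECT id, response_text,
               1 - (query_embedding <=> %s::vector) AS similarity
        FROM semantic_cache
        WHERE model_name = %s
          AND context_hash = %s
        ORDER BY query_embedding <=> %s::vector
        LIMIT 1
    """, (embedding, model_name, context_hash, embedding))
    return cur.fetchone()
```

Queries from unrelated conversations then never match each other, no matter how similar their embeddings are.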

Embedding API Latency

Generating embeddings for every query adds latency. OpenAI’s embedding endpoint typically responds in 100-200ms, but that’s still overhead.

I tried caching embeddings themselves in Redis with the query text as the key, but the hit rate was too low (only 8%) because even small wording changes meant new embeddings. The complexity wasn’t worth it.

What did work: batching embedding requests when possible. For workflows that generate multiple queries at once, I send them all in a single embedding API call. This cut embedding costs by about 40% for those specific workflows.
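Batching is just passing a list as the input — the embeddings endpoint accepts multiple texts per request and returns one embedding per item. A sketch (the chunk size of 100 is my own conservative cap, not an API limit):

```python
def chunked(items, size=100):
    # Split a list into consecutive chunks of at most `size` items
    return [items[i:i + size] for i in range(0, len(items), size)]

def get_embeddings_batch(texts):
    import openai  # deferred so the chunking helper works without the SDK installed
    embeddings = []
    for batch in chunked(texts):
        response = openai.embeddings.create(
            model="text-embedding-3-small",
            input=batch,  # list input: one embedding per item
        )
        # Sort by index so output order matches input order
        embeddings.extend(d.embedding
                          for d in sorted(response.data, key=lambda d: d.index))
    return embeddings
```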

When This Doesn’t Work

Semantic caching isn’t useful for:

  • Highly personalized queries where user context matters
  • Creative generation tasks where variety is the goal
  • Queries requiring real-time data
  • Low-volume applications where the infrastructure overhead exceeds savings

I disabled semantic caching for my creative writing workflows because users explicitly want different outputs for similar prompts. The cache was counterproductive there.

Monitoring and Maintenance

I track these metrics in Grafana:

  • Cache hit rate by hour and query type
  • Average similarity scores for hits vs misses
  • Response time distribution (cache vs API)
  • Database size and query performance

The pgvector index needs occasional maintenance. I run VACUUM and REINDEX weekly because the ivfflat index can degrade with frequent updates. This keeps query times under 50ms even with 100k+ cached entries.
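A sketch of that weekly job, run from cron (the statement list and function name are mine; REINDEX TABLE rebuilds every index on the table, including the ivfflat one):

```python
MAINTENANCE_STATEMENTS = [
    "VACUUM ANALYZE semantic_cache",
    "REINDEX TABLE semantic_cache",
]

def weekly_maintenance(database_url):
    import psycopg2  # deferred import; the statement list is usable without it
    conn = psycopg2.connect(database_url)
    conn.autocommit = True  # VACUUM cannot run inside a transaction block
    try:
        with conn.cursor() as cur:
            for statement in MAINTENANCE_STATEMENTS:
                cur.execute(statement)
    finally:
        conn.close()
```

The autocommit line matters: VACUUM refuses to run inside the transaction that psycopg2 otherwise opens implicitly.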

Cost-Benefit Reality Check

Building and maintaining this system took about 12 hours of my time. Extrapolating the two-week numbers above, I’m saving roughly $100/month in API fees, with embedding generation adding about $4/month.

If you’re processing fewer than 10,000 queries per month, traditional key-value caching with exact string matching is probably enough. The complexity of semantic caching only pays off at scale or when query variation is high.

For my use case — multiple users asking similar questions through different interfaces — it’s been worth it. The speed improvement alone justifies the effort, even ignoring cost savings.

Key Takeaways

  • Semantic caching works best for high-volume, repetitive query patterns with natural language variation
  • Similarity thresholds need tuning based on your specific query types
  • Context matters: conversational queries need conversation-aware cache keys
  • TTL strategies must match query semantics (time-sensitive vs evergreen)
  • Monitoring is critical because cache effectiveness changes as usage patterns evolve
  • The infrastructure cost and complexity only make sense above a certain query volume

I’m still tuning threshold values and TTL rules based on observed patterns. This isn’t a set-it-and-forget-it system — it requires ongoing adjustment as my workflows and query patterns change.
