Why I Built This
I run several local LLMs on my Proxmox cluster—Llama models, Mistral variants, and a few experimental ones I test periodically. The problem wasn't inference speed or model quality. It was cost and redundancy.
Every time I queried the same prompt across different tools or sessions, the model reprocessed everything from scratch. No memory. No reuse. Just wasted compute cycles and API costs when I occasionally used cloud-hosted models for comparison.
I needed a caching layer that sat between my applications and the models—something lightweight, fast, and cheap to run. Cloudflare Workers became the obvious choice because they're serverless, globally distributed, and I was already using Cloudflare for DNS and routing.
My Setup
The gateway runs on Cloudflare Workers with R2 for object storage. Here's what I actually deployed:
- Cloudflare Workers: Handles incoming requests, checks cache, forwards to LLM if needed
- R2 bucket: Stores cached responses keyed by prompt hash
- Wrangler CLI: For local development and deployment
- TypeScript: Because I wanted type safety and cleaner error handling
I don't use Cloudflare's Workers AI binding because my models run locally or through external APIs (OpenAI, Anthropic for testing). The Worker acts purely as a proxy and cache layer.
The Core Logic
When a request comes in:
- Hash the prompt and model parameters to generate a cache key
- Check R2 for an existing response
- If found, return immediately (cache hit)
- If not, forward the request to the actual LLM endpoint
- Store the response in R2 before returning it
- Set a TTL (time-to-live) on cached entries to prevent stale data
The hash includes the full prompt, temperature, max tokens, and model name. If any of those change, it's treated as a new request.
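For context, the generateHash helper used in the Worker code below is just SHA-256 over a JSON serialization of those fields (see "Hashing proved reliable" further down). A minimal sketch of what that helper can look like with the Web Crypto API available in Workers; the exact serialization and hex encoding shown here are one reasonable choice, not necessarily what I shipped:

async function generateHash(params: {
  prompt: string;
  model: string;
  temperature: number;
  max_tokens: number;
}): Promise<string> {
  // JSON key order must stay consistent between calls, or identical
  // requests would hash to different cache keys.
  const canonical = JSON.stringify(params);
  const digest = await crypto.subtle.digest(
    'SHA-256',
    new TextEncoder().encode(canonical)
  );
  // Hex-encode the digest so it can be used directly as an R2 object key.
  return [...new Uint8Array(digest)]
    .map((b) => b.toString(16).padStart(2, '0'))
    .join('');
}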
Configuration I Used
My wrangler.toml looks like this:
name = "llm-gateway" main = "src/index.ts" compatibility_date = "2024-01-15" [[r2_buckets]] binding = "CACHE_BUCKET" bucket_name = "llm-responses" preview_bucket_name = "llm-responses-preview" [vars] CACHE_TTL = "86400" MAX_CACHE_SIZE = "5242880"
I set the TTL to 24 hours because most of my queries are exploratory and I don't mind day-old responses for repeated questions. MAX_CACHE_SIZE limits individual cached responses to 5MB—larger responses don't get cached.
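Those bindings and vars surface on the Worker's env object. A minimal typing sketch for them (the Env name and comments are illustrative; note that [vars] values always arrive as strings):

// Typing sketch for the bindings declared in wrangler.toml.
interface Env {
  CACHE_BUCKET: R2Bucket;   // R2 binding for cached responses
  CACHE_TTL: string;        // "86400" — seconds, parsed with Number() where used
  MAX_CACHE_SIZE: string;   // "5242880" — bytes, parsed with Number() where used
  API_KEY: string;          // Wrangler secret for the upstream LLM API
}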
The Worker Code (Simplified)
This is the actual structure I deployed, stripped of error handling and logging for clarity:
export default {
  async fetch(request, env) {
    const { prompt, model, temperature, max_tokens } = await request.json();

    const cacheKey = await generateHash({
      prompt,
      model,
      temperature,
      max_tokens
    });

    // R2's put() has no per-object TTL option, so the expiry timestamp is
    // stored as custom metadata and checked on read.
    const cached = await env.CACHE_BUCKET.get(cacheKey);
    if (cached) {
      const expiresAt = Number(cached.customMetadata?.expiresAt ?? 0);
      if (expiresAt > Date.now()) {
        return new Response(await cached.text(), {
          headers: {
            'Content-Type': 'application/json',
            'X-Cache': 'HIT'
          }
        });
      }
      // Expired entry: drop it and fall through to a fresh inference call.
      await env.CACHE_BUCKET.delete(cacheKey);
    }

    const response = await fetch(getLLMEndpoint(model), {
      method: 'POST',
      headers: { 'Authorization': `Bearer ${env.API_KEY}` },
      body: JSON.stringify({ prompt, temperature, max_tokens })
    });
    const data = await response.text();

    // [vars] arrive as strings, so the size limit and TTL get parsed before use.
    if (data.length < Number(env.MAX_CACHE_SIZE)) {
      await env.CACHE_BUCKET.put(cacheKey, data, {
        customMetadata: {
          expiresAt: String(Date.now() + Number(env.CACHE_TTL) * 1000)
        }
      });
    }

    return new Response(data, {
      headers: {
        'Content-Type': 'application/json',
        'X-Cache': 'MISS'
      }
    });
  }
};
The X-Cache header tells me whether a response came from R2 or required a fresh inference call.
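A quick way to verify the behavior is to send the same request twice and compare that header. A small smoke-test sketch; the gateway URL, prompt, and model name are placeholders:

// Hypothetical smoke test: the second identical request should report a HIT.
const body = JSON.stringify({
  prompt: 'Explain R2 lifecycle rules',
  model: 'llama-3-8b',
  temperature: 0.2,
  max_tokens: 256
});
for (const attempt of [1, 2]) {
  const res = await fetch('https://llm-gateway.example.workers.dev', {
    method: 'POST',
    body
  });
  console.log(`attempt ${attempt}: X-Cache=${res.headers.get('X-Cache')}`);
}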
What Worked
Cache hit rate exceeded expectations. Within the first week, I saw a 60% hit rate on repeated queries. That number climbed to 75% after a month because I tend to iterate on prompts with small variations.
Latency dropped significantly. Cache hits return in under 50ms. Fresh inference calls to my local models take 2-4 seconds depending on prompt complexity. The difference is noticeable when testing multiple variations of the same query.
R2 costs are negligible. I'm storing about 2GB of cached responses and paying less than $0.50/month. Cloudflare's free tier covers most of my Worker invocations (100k requests/day).
Hashing proved reliable. I use SHA-256 on a JSON string of the request parameters. No collisions so far, and lookups are instant.
Monitoring and Debugging
I added simple logging to track cache performance:
console.log({
  timestamp: Date.now(),
  cacheKey,
  hit: cached !== null,
  model,
  promptLength: prompt.length
});
This gets piped to Cloudflare's real-time logs, which I pull into my local monitoring stack (Prometheus + Grafana). The dashboard shows hit rate, average response time, and cache size over time.
What Didn't Work
Streaming responses broke caching. My first implementation tried to cache streaming LLM responses (SSE format). It failed because Workers can't easily intercept and store streamed data without buffering the entire response in memory first. I disabled caching for streaming requests entirely.
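The bypass is simple: detect a streaming request up front and proxy the upstream body straight through without touching R2. A sketch of that, assuming an OpenAI-style stream flag in the request body; the body variable name and the BYPASS header value are illustrative:

// Sketch of the streaming bypass inside the fetch handler. If the client
// asked for SSE, pipe the upstream body through untouched and skip the cache.
const body = await request.json();
if (body.stream) {
  const upstream = await fetch(getLLMEndpoint(body.model), {
    method: 'POST',
    headers: { 'Authorization': `Bearer ${env.API_KEY}` },
    body: JSON.stringify(body)
  });
  return new Response(upstream.body, {
    headers: {
      'Content-Type': 'text/event-stream',
      'X-Cache': 'BYPASS'
    }
  });
}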
Cache invalidation is manual. There's no automatic way to detect when a cached response becomes outdated. If I update a model or change its behavior, stale responses linger until the TTL expires. I've resorted to manually purging the R2 bucket when I make significant model changes.
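One way to script that purge, using the R2 binding's paginated list() and batch delete() (a sketch of the idea, not necessarily how I purge today):

// Sketch of a full-bucket purge. list() is paginated, so the loop follows
// the cursor until every key has been deleted.
async function purgeCache(bucket: R2Bucket): Promise<number> {
  let deleted = 0;
  let cursor: string | undefined;
  do {
    const page = await bucket.list({ cursor });
    const keys = page.objects.map((obj) => obj.key);
    if (keys.length > 0) {
      await bucket.delete(keys);   // accepts a single key or an array of keys
      deleted += keys.length;
    }
    cursor = page.truncated ? page.cursor : undefined;
  } while (cursor);
  return deleted;
}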
Large responses don't cache well. Responses over 5MB (my arbitrary limit) bypass the cache entirely. This happens with long-context prompts or multi-turn conversations. I haven't found a clean solution yet—splitting responses into chunks adds complexity I don't want.
API key management is clunky. I store LLM API keys as Wrangler secrets, but rotating them requires redeployment. I should move to a proper secrets manager, but I haven't prioritized it.
Debugging Cold Starts
Workers have virtually no cold start penalty, but I did notice occasional 200-300ms delays on the first request after idle periods. This turned out to be R2 connection overhead. Adding a simple warmup request on deployment mitigated it:
await env.CACHE_BUCKET.head('warmup-key');
Not elegant, but it works.
Key Takeaways
Caching is worth it for repetitive workloads. If you're testing prompts, iterating on outputs, or running the same queries across sessions, a cache layer pays for itself immediately.
Workers + R2 is a good fit for this use case. The combination is cheap, fast, and requires minimal maintenance. I've had zero downtime since deploying.
Hashing is simple and reliable. Don't overthink cache key generation. A good hash function and consistent serialization are enough.
Streaming and caching don't mix easily. If you need streaming, accept that those requests won't benefit from caching.
Monitor cache performance. Without metrics, you won't know if the cache is helping or just adding latency. Log hits, misses, and response times.
Trade-offs I Accept
This setup prioritizes speed and cost over flexibility. It works for my use case—testing and iterating on prompts—but wouldn't suit production systems that need:
- Real-time model updates
- Complex invalidation logic
- Multi-user access control
- Streaming support
I'm okay with those limitations because they don't affect what I'm building.
Current State
The gateway has been running for three months. It handles about 2,000 requests per day (mostly from my own testing) and maintains a 70% cache hit rate. R2 storage sits at 2.3GB and costs are under $1/month total.
I've extended it to support multiple LLM backends (local models, OpenAI, Anthropic) by routing based on model name. The core caching logic hasn't changed.
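The routing lives in getLLMEndpoint, which the handler already calls. A sketch of the shape it has now; the local URL is a placeholder and the prefix checks are illustrative:

// Sketch of model-name routing. The local URL is a placeholder for the
// reverse proxy in front of my Proxmox-hosted models.
function getLLMEndpoint(model: string): string {
  if (model.startsWith('gpt-')) {
    return 'https://api.openai.com/v1/chat/completions';
  }
  if (model.startsWith('claude-')) {
    return 'https://api.anthropic.com/v1/messages';
  }
  // Everything else (Llama, Mistral, experimental models) stays local.
  return 'https://llm.internal.example/v1/completions';
}

The cloud endpoints also expect their own request formats and auth headers, which the simplified handler above glosses over.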
It's not perfect, but it solves the problem I had: reducing redundant inference costs while keeping latency low.