Why I Built This
I run several local LLMs on my Proxmox cluster—Llama models, Mistral variants, and a few experimental ones I test periodically. The problem wasn't inference speed or model quality. It was cost and redundancy.
Every time I queried the same prompt across different tools or sessions, the model reprocessed everything from scratch. No memory. No reuse. Just wasted compute cycles and API costs when I occasionally used cloud-hosted models for comparison.
I needed a caching layer that sat between my applications and the models—something lightweight, fast, and cheap to run. Cloudflare Workers became the obvious choice because they're serverless, globally distributed, and I was already using Cloudflare for DNS and routing.
My Setup
The gateway runs on Cloudflare Workers with R2 for object storage. Here's what I actually deployed:
- Cloudflare Workers: Handles incoming requests, checks cache, forwards to LLM if needed
- R2 bucket: Stores cached responses keyed by prompt hash
- Wrangler CLI: For local development and deployment
- TypeScript: Because I wanted type safety and cleaner error handling
I don't use Cloudflare's Workers AI binding because my models run locally or through external APIs (OpenAI, Anthropic for testing). The Worker acts purely as a proxy and cache layer.
The Core Logic
When a request comes in:
- Hash the prompt and model parameters to generate a cache key
- Check R2 for an existing response
- If found, return immediately (cache hit)
- If not, forward the request to the actual LLM endpoint
- Store the response in R2 before returning it
- Set a TTL (time-to-live) on cached entries to prevent stale data
The hash includes the full prompt, temperature, max tokens, and model name. If any of those change, it's treated as a new request.
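For context, the generateHash helper used in the Worker code below is just SHA-256 over a JSON serialization of those fields (see "Hashing proved reliable" further down). A minimal sketch of what that helper can look like with the Web Crypto API available in Workers; the exact serialization and hex encoding shown here are one reasonable choice, not necessarily what I shipped:

async function generateHash(params: {
  prompt: string;
  model: string;
  temperature: number;
  max_tokens: number;
}): Promise<string> {
  // JSON key order must stay consistent between calls, or identical
  // requests would hash to different cache keys.
  const canonical = JSON.stringify(params);
  const digest = await crypto.subtle.digest(
    'SHA-256',
    new TextEncoder().encode(canonical)
  );
  // Hex-encode the digest so it can be used directly as an R2 object key.
  return [...new Uint8Array(digest)]
    .map((b) => b.toString(16).padStart(2, '0'))
    .join('');
}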
Configuration I Used
My wrangler.toml looks like this:
name = "llm-gateway" main = "src/index.ts" compatibility_date = "2024-01-15" [[r2_buckets]] binding = "CACHE_BUCKET" bucket_name = "llm-responses" preview_bucket_name = "llm-responses-preview" [vars] CACHE_TTL = "86400" MAX_CACHE_SIZE = "5242880"
I set the TTL to 24 hours because most of my queries are exploratory and I don't mind day-old responses for repeated questions. MAX_CACHE_SIZE limits individual cached responses to 5MB—larger responses don't get cached.
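Those bindings and vars surface on the Worker's env object. A minimal typing sketch for them (the Env name and comments are illustrative; note that [vars] values always arrive as strings):

// Typing sketch for the bindings declared in wrangler.toml.
interface Env {
  CACHE_BUCKET: R2Bucket;   // R2 binding for cached responses
  CACHE_TTL: string;        // "86400" — seconds, parsed with Number() where used
  MAX_CACHE_SIZE: string;   // "5242880" — bytes, parsed with Number() where used
  API_KEY: string;          // Wrangler secret for the upstream LLM API
}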
The Worker Code (Simplified)
This is the actual structure I deployed, stripped of error handling and logging for clarity:
export default {
  async fetch(request, env) {
    const { prompt, model, temperature, max_tokens } = await request.json();

    const cacheKey = await generateHash({
      prompt,
      model,
      temperature,
      max_tokens
    });

    // R2's put() has no per-object TTL option, so the expiry timestamp is
    // stored as custom metadata and checked on read.
    const cached = await env.CACHE_BUCKET.get(cacheKey);
    if (cached) {
      const expiresAt = Number(cached.customMetadata?.expiresAt ?? 0);
      if (expiresAt > Date.now()) {
        return new Response(await cached.text(), {
          headers: {
            'Content-Type': 'application/json',
            'X-Cache': 'HIT'
          }
        });
      }
      // Expired entry: drop it and fall through to a fresh inference call.
      await env.CACHE_BUCKET.delete(cacheKey);
    }

    const response = await fetch(getLLMEndpoint(model), {
      method: 'POST',
      headers: { 'Authorization': `Bearer ${env.API_KEY}` },
      body: JSON.stringify({ prompt, temperature, max_tokens })
    });
    const data = await response.text();

    // [vars] arrive as strings, so the size limit and TTL get parsed before use.
    if (data.length < Number(env.MAX_CACHE_SIZE)) {
      await env.CACHE_BUCKET.put(cacheKey, data, {
        customMetadata: {
          expiresAt: String(Date.now() + Number(env.CACHE_TTL) * 1000)
        }
      });
    }

    return new Response(data, {
      headers: {
        'Content-Type': 'application/json',
        'X-Cache': 'MISS'
      }
    });
  }
};
The X-Cache header tells me whether a response came from R2 or required a fresh inference call.
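A quick way to verify the behavior is to send the same request twice and compare that header. A small smoke-test sketch; the gateway URL, prompt, and model name are placeholders:

// Hypothetical smoke test: the second identical request should report a HIT.
const body = JSON.stringify({
  prompt: 'Explain R2 lifecycle rules',
  model: 'llama-3-8b',
  temperature: 0.2,
  max_tokens: 256
});
for (const attempt of [1, 2]) {
  const res = await fetch('https://llm-gateway.example.workers.dev', {
    method: 'POST',
    body
  });
  console.log(`attempt ${attempt}: X-Cache=${res.headers.get('X-Cache')}`);
}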
What Worked
Cache hit rate exceeded expectations. Within the first week, I saw a 60% hit rate on repeated queries. That number climbed to 75% after a month because I tend to iterate on prompts with small variations.
Latency dropped significantly. Cache hits return in under 50ms. Fresh inference calls to my local models take 2-4 seconds depending on prompt complexity. The difference is noticeable when testing multiple variations of the same query.
R2 costs are negligible. I'm storing about 2GB of cached responses and paying less than $0.50/month. Cloudflare's free tier covers most of my Worker invocations (100k requests/day).
Hashing proved reliable. I use SHA-256 on a JSON string of the request parameters. No collisions so far, and lookups are instant.
Monitoring and Debugging
I added simple logging to track cache performance:
console.log({
  timestamp: Date.now(),
  cacheKey,
  hit: cached !== null,
  model,
  promptLength: prompt.length
});
This gets piped to Cloudflare's real-time logs, which I pull into my local monitoring stack (Prometheus + Grafana). The dashboard shows hit rate, average response time, and cache size over time.
What Didn't Work
Streaming responses broke caching. My first implementation tried to cache streaming LLM responses (SSE format). It failed because Workers can't easily intercept and store streamed data without buffering the entire response in memory first. I disabled caching for streaming requests entirely.
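The bypass is simple: detect a streaming request up front and proxy the upstream body straight through without touching R2. A sketch of that, assuming an OpenAI-style stream flag in the request body; the body variable name and the BYPASS header value are illustrative:

// Sketch of the streaming bypass inside the fetch handler. If the client
// asked for SSE, pipe the upstream body through untouched and skip the cache.
const body = await request.json();
if (body.stream) {
  const upstream = await fetch(getLLMEndpoint(body.model), {
    method: 'POST',
    headers: { 'Authorization': `Bearer ${env.API_KEY}` },
    body: JSON.stringify(body)
  });
  return new Response(upstream.body, {
    headers: {
      'Content-Type': 'text/event-stream',
      'X-Cache': 'BYPASS'
    }
  });
}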
Cache invalidation is manual. There's no automatic way to detect when a cached response becomes outdated. If I update a model or change its behavior, stale responses linger until the TTL expires. I've resorted to manually purging the R2 bucket when I make significant model changes.
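One way to script that purge, using the R2 binding's paginated list() and batch delete() (a sketch of the idea, not necessarily how I purge today):

// Sketch of a full-bucket purge. list() is paginated, so the loop follows
// the cursor until every key has been deleted.
async function purgeCache(bucket: R2Bucket): Promise<number> {
  let deleted = 0;
  let cursor: string | undefined;
  do {
    const page = await bucket.list({ cursor });
    const keys = page.objects.map((obj) => obj.key);
    if (keys.length > 0) {
      await bucket.delete(keys);   // accepts a single key or an array of keys
      deleted += keys.length;
    }
    cursor = page.truncated ? page.cursor : undefined;
  } while (cursor);
  return deleted;
}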
Large responses don't cache well. Responses over 5MB (my arbitrary limit) bypass the cache entirely. This happens with long-context prompts or multi-turn conversations. I haven't found a clean solution yet—splitting responses into chunks adds complexity I don't want.
API key management is clunky. I store LLM API keys as Wrangler secrets, but rotating them requires redeployment. I should move to a proper secrets manager, but I haven't prioritized it.
Debugging Cold Starts
Workers have virtually no cold start penalty, but I did notice occasional 200-300ms delays on the first request after idle periods. This turned out to be R2 connection overhead. Adding a simple warmup request on deployment mitigated it:
await env.CACHE_BUCKET.head('warmup-key');
Not elegant, but it works.
Key Takeaways
Caching is worth it for repetitive workloads. If you're testing prompts, iterating on outputs, or running the same queries across sessions, a cache layer pays for itself immediately.
Workers + R2 is a good fit for this use case. The combination is cheap, fast, and requires minimal maintenance. I've had zero downtime since deploying.
Hashing is simple and reliable. Don't overthink cache key generation. A good hash function and consistent serialization are enough.
Streaming and caching don't mix easily. If you need streaming, accept that those requests won't benefit from caching.
Monitor cache performance. Without metrics, you won't know if the cache is helping or just adding latency. Log hits, misses, and response times.
Trade-offs I Accept
This setup prioritizes speed and cost over flexibility. It works for my use case—testing and iterating on prompts—but wouldn't suit production systems that need:
- Real-time model updates
- Complex invalidation logic
- Multi-user access control
- Streaming support
I'm okay with those limitations because they don't affect what I'm building.
Current State
The gateway has been running for three months. It handles about 2,000 requests per day (mostly from my own testing) and maintains a 70% cache hit rate. R2 storage sits at 2.3GB and costs are under $1/month total.
I've extended it to support multiple LLM backends (local models, OpenAI, Anthropic) by routing based on model name. The core caching logic hasn't changed.
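The routing lives in getLLMEndpoint, which the handler already calls. A sketch of the shape it has now; the local URL is a placeholder and the prefix checks are illustrative:

// Sketch of model-name routing. The local URL is a placeholder for the
// reverse proxy in front of my Proxmox-hosted models.
function getLLMEndpoint(model: string): string {
  if (model.startsWith('gpt-')) {
    return 'https://api.openai.com/v1/chat/completions';
  }
  if (model.startsWith('claude-')) {
    return 'https://api.anthropic.com/v1/messages';
  }
  // Everything else (Llama, Mistral, experimental models) stays local.
  return 'https://llm.internal.example/v1/completions';
}

The cloud endpoints also expect their own request formats and auth headers, which the simplified handler above glosses over.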
It's not perfect, but it solves the problem I had: reducing redundant inference costs while keeping latency low.