Setting Up Prometheus Metrics for LM Studio API Endpoints: Tracking Token Usage and Response Times with Custom Exporters

Why I Built This

I run LM Studio locally on my home server to handle various AI tasks—text processing, summarization, coding assistance. It works well, but I had no visibility into what was actually happening. How many tokens was I burning through? Which models were efficient? Were some requests timing out? I had no data.

I needed metrics, but LM Studio doesn’t expose Prometheus endpoints. I could have logged everything manually in my scripts, but that’s messy and doesn’t scale. I wanted something that sat between my applications and LM Studio, tracked everything transparently, and gave me clean metrics without changing any client code.

So I built a proxy that does exactly that.

My Setup

I run LM Studio on a Proxmox VM with GPU passthrough. It listens on port 1234 and exposes an OpenAI-compatible API. My automation workflows (n8n, Python scripts, custom tools) all hit that endpoint.

The proxy I built runs in a Docker container on the same network. It listens on port 8080 and forwards everything to LM Studio on 1234. Every request passes through it, gets logged to a SQLite database, and the response is sent back to the client unchanged.

I wrote it in Rust because I wanted something fast and lightweight. The proxy doesn't parse or modify request bodies at all; it passes bytes through and only inspects responses far enough to extract token counts (for streaming responses, that means collecting the chunks as they pass).

The Flow

Client (n8n, Python script, etc.)
    ↓
Proxy (port 8080) → logs request metadata
    ↓
LM Studio (port 1234) → processes request
    ↓
Proxy → extracts token counts from response
    ↓
Client receives response

The proxy stores:

  • Endpoint hit (/v1/chat/completions, /v1/completions, etc.)
  • Model used
  • Input and output token counts
  • Request duration
  • Timestamp
  • Whether it was streamed or standard
  • Whether it failed
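In the Rust code, that record maps onto a plain struct. A sketch (field names mirror the SQLite schema shown later, but aren't necessarily the repo's exact ones, and the model name below is made up):

```rust
// Sketch of the per-request record the proxy persists.
// Field names are illustrative, modeled on the SQLite schema.
#[derive(Debug, Clone)]
struct RequestRecord {
    endpoint: String,   // e.g. "/v1/chat/completions"
    model: String,      // model name reported by LM Studio
    input_tokens: i64,
    output_tokens: i64,
    duration_ms: i64,
    start_time: String, // timestamp stored as TEXT
    was_streamed: bool,
    is_error: bool,
}

impl RequestRecord {
    // total_tokens is stored as its own column, but it's derivable:
    fn total_tokens(&self) -> i64 {
        self.input_tokens + self.output_tokens
    }
}

fn main() {
    let rec = RequestRecord {
        endpoint: "/v1/chat/completions".into(),
        model: "example-7b".into(), // hypothetical model name
        input_tokens: 120,
        output_tokens: 380,
        duration_ms: 950,
        start_time: "2024-01-01T00:00:00Z".into(),
        was_streamed: false,
        is_error: false,
    };
    println!("{}", rec.total_tokens());
}
```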

What Worked

Transparent Proxying

I didn’t want to change any client code. The proxy just sits in the middle. I updated my environment variables to point to http://proxy:8080 instead of http://lmstudio:1234, and everything kept working.

The proxy forwards all /v1/* routes to LM Studio. It handles GET, POST, and DELETE. It doesn’t care what the request is—it just passes it through.
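The forwarding rule itself is trivial. A std-only sketch of the decision described above (the real proxy presumably wires this into its router rather than a standalone function):

```rust
// Decide whether a request should be forwarded to LM Studio:
// any /v1/* path, for the three methods the proxy handles.
// Everything else (e.g. /stats/*) is served by the proxy itself.
fn should_forward(path: &str, method: &str) -> bool {
    let method_ok = matches!(method, "GET" | "POST" | "DELETE");
    method_ok && path.starts_with("/v1/")
}

fn main() {
    assert!(should_forward("/v1/chat/completions", "POST"));
    assert!(should_forward("/v1/models", "GET"));
    assert!(!should_forward("/stats/summary", "GET")); // handled locally
    assert!(!should_forward("/v1/chat/completions", "PUT"));
    println!("ok");
}
```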

Streaming Support

LM Studio supports streaming responses (Server-Sent Events). The proxy had to handle this without breaking the stream. I used reqwest's streaming body support to forward chunks as they arrive.

For streaming responses, I collect the chunks as they pass through, parse the token usage out of the final chunk that arrives before the data: [DONE] terminator, and log it. The client still gets the stream in real time.
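The shape of that data: each SSE line carries a JSON chunk, and the last one before [DONE] includes a usage object. A deliberately naive sketch that pulls the counts out with string matching (the real code presumably uses a proper JSON parser):

```rust
// Scan buffered SSE lines for an OpenAI-style "usage" object and
// extract (prompt_tokens, completion_tokens). Illustrative only.
fn extract_usage(sse: &str) -> Option<(i64, i64)> {
    for line in sse.lines() {
        let Some(payload) = line.strip_prefix("data: ") else { continue };
        if payload == "[DONE]" {
            continue;
        }
        if let Some(idx) = payload.find("\"usage\"") {
            let tail = &payload[idx..];
            let prompt = find_int(tail, "\"prompt_tokens\":")?;
            let completion = find_int(tail, "\"completion_tokens\":")?;
            return Some((prompt, completion));
        }
    }
    None // client disconnected early, or no usage chunk was sent
}

// Read the integer that follows `key` in `haystack`.
fn find_int(haystack: &str, key: &str) -> Option<i64> {
    let start = haystack.find(key)? + key.len();
    let digits: String = haystack[start..]
        .chars()
        .skip_while(|c| c.is_whitespace())
        .take_while(|c| c.is_ascii_digit())
        .collect();
    digits.parse().ok()
}

fn main() {
    let stream = "data: {\"choices\":[{\"delta\":{\"content\":\"hi\"}}]}\n\
                  data: {\"choices\":[],\"usage\":{\"prompt_tokens\":12,\"completion_tokens\":34}}\n\
                  data: [DONE]\n";
    assert_eq!(extract_usage(stream), Some((12, 34)));
    println!("ok");
}
```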

SQLite for Storage

I used SQLite because I didn’t want to run Postgres for this. The database file lives in a Docker volume, so it persists across restarts. The schema is simple:

CREATE TABLE requests (
    id INTEGER PRIMARY KEY,
    endpoint TEXT,
    model TEXT,
    input_tokens INTEGER,
    output_tokens INTEGER,
    total_tokens INTEGER,
    duration_ms INTEGER,
    start_time TEXT,
    is_error BOOLEAN,
    was_streamed BOOLEAN
);

I query it with basic SQL. No ORM, no migrations. Just raw queries using sqlx.
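The queries are plain aggregations over that table. A sketch of the kind of SQL a per-model breakdown could run (illustrative, not the repo's exact query):

```sql
-- Per-model rollup over the requests table above.
SELECT
    model,
    COUNT(*)          AS requests,
    SUM(total_tokens) AS total_tokens,
    AVG(duration_ms)  AS avg_duration_ms,
    SUM(is_error)     AS errors
FROM requests
GROUP BY model
ORDER BY total_tokens DESC;
```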

Statistics Endpoints

I added three endpoints to query usage:

  • GET /stats/summary — total requests, tokens, average durations
  • GET /stats/by-model — per-model breakdown
  • GET /stats/recent?limit=N — last N requests

I curl these from my monitoring scripts and dump the JSON into Grafana (via a simple exporter I wrote). Now I can see token usage trends over time.

Low Overhead

Rust made this fast. The proxy adds about 2-5ms of latency per request. Most of that is SQLite writes. I don’t parse JSON unless I need to extract token counts, and I don’t buffer entire responses in memory.

What Didn’t Work

Prometheus Metrics (Initially)

I wanted native Prometheus metrics from the start, but I hit a problem: Prometheus doesn’t handle high-cardinality labels well. If I exposed a metric like lmstudio_tokens_total{model="...", endpoint="..."}, the cardinality exploded as I tested different models.

I ended up storing everything in SQLite and querying it on demand instead. I wrote a separate Prometheus exporter that reads from the database and exposes aggregated metrics. It’s not real-time, but it’s good enough for my use case.
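The exporter's output is just the Prometheus text exposition format, so rendering aggregated rows is a few lines. A sketch with an illustrative metric name (not necessarily the one my exporter uses):

```rust
// Render aggregated (model, token count) rows in the Prometheus
// text exposition format. Metric and label names are illustrative.
fn render_metrics(rows: &[(&str, i64)]) -> String {
    let mut out = String::from(
        "# HELP lmstudio_tokens_total Total tokens per model\n\
         # TYPE lmstudio_tokens_total counter\n",
    );
    for (model, tokens) in rows {
        out.push_str(&format!(
            "lmstudio_tokens_total{{model=\"{}\"}} {}\n",
            model, tokens
        ));
    }
    out
}

fn main() {
    // Hypothetical aggregates, as if read from the SQLite database.
    let rows = [("llama-3-8b", 15234), ("mistral-7b", 9876)];
    print!("{}", render_metrics(&rows));
}
```

Because the exporter reads pre-aggregated rows, the label set stays small: one series per model, not one per request.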

Token Extraction from Streaming Responses

Streaming responses don’t always include token counts in every chunk. LM Studio sends a final chunk with usage data, but only if the client doesn’t disconnect early.

If a client cancels mid-stream, I don’t get token counts. I log the request as incomplete and estimate tokens based on chunk sizes. It’s not perfect, but it’s better than nothing.
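The estimate is necessarily rough. A sketch of one way to do it, assuming the common rule of thumb of roughly four characters per token (the proxy's actual estimator may differ):

```rust
// Rough output-token estimate from the streamed chunks seen so far,
// using the ~4 characters per token rule of thumb. This heuristic is
// an assumption for illustration, not the repo's exact formula.
fn estimate_tokens(chunks: &[&str]) -> i64 {
    let chars: usize = chunks.iter().map(|c| c.chars().count()).sum();
    ((chars + 3) / 4) as i64 // round up
}

fn main() {
    let chunks = ["Hello, ", "world", "!"]; // 13 chars seen before disconnect
    println!("{}", estimate_tokens(&chunks));
}
```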

Error Handling for Non-200 Responses

LM Studio sometimes returns 500 errors if the model crashes or runs out of VRAM. The proxy logs these as errors, but I don’t retry or handle them specially. The client sees the error and deals with it.

I thought about adding retry logic, but that would change behavior. The proxy is supposed to be transparent, so I left it out.

Database Locking Under Load

SQLite uses file-based locking. If I hammer the proxy with concurrent requests, writes can block. I added a connection pool and set busy_timeout to 5 seconds, which helped.
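The timeout is just a PRAGMA applied on each pooled connection (shown here as plain SQL; sqlx lets you set it via its SQLite connect options):

```sql
-- Wait up to 5000 ms for a lock before failing with SQLITE_BUSY.
PRAGMA busy_timeout = 5000;
```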

For higher concurrency, I’d switch to Postgres, but I’m not at that scale yet.

How I Use It

Tracking Token Usage in n8n Workflows

I have n8n workflows that call LM Studio for text processing. I pointed them at the proxy instead of LM Studio directly. Now I can query /stats/by-model and see which workflows are using the most tokens.

Monitoring Response Times

I added a Grafana dashboard that pulls from the SQLite database. It shows:

  • Average response time per model
  • Token usage over time
  • Request success rate

I can spot when a model is slow or when token usage spikes.

Debugging Failed Requests

When a request fails, I query /stats/recent and filter for errors. The logs show the endpoint, model, and duration, which helps narrow down the problem.

Key Takeaways

  • A transparent proxy is cleaner than modifying client code.
  • SQLite is fine for this kind of logging if you’re not at high scale.
  • Streaming responses need special handling—don’t assume you’ll always get token counts.
  • Prometheus metrics work better when aggregated, not per-request.
  • Rust’s async runtime (Tokio) handles concurrent requests well without much tuning.

If you’re running local AI models and want visibility into what’s happening, this approach works. It’s simple, doesn’t require changes to your existing tools, and gives you the data you need to optimize usage.

The Code

I open-sourced this as lms-metrics-proxy. The core logic is about 500 lines of Rust. It uses:

  • axum for the HTTP server
  • reqwest for proxying requests
  • sqlx for SQLite
  • tokio for async runtime

The Docker image is self-contained. You mount a volume for the database, set LM_STUDIO_URL in the environment, and it runs.

Running It

docker run -d \
  -p 8080:8080 \
  -e LM_STUDIO_URL=http://lmstudio:1234 \
  -v $(pwd)/metrics.db:/app/metrics.db \
  lms-metrics-proxy:latest

Then point your clients to http://localhost:8080 instead of http://localhost:1234.

Querying Stats

# Summary
curl http://localhost:8080/stats/summary

# Per-model breakdown
curl http://localhost:8080/stats/by-model

# Last 50 requests
curl "http://localhost:8080/stats/recent?limit=50"

The responses are JSON. I pipe them into jq for filtering or dump them into Grafana.

What I’d Change

If I were building this again, I’d consider:

  • Adding a Prometheus exporter directly into the proxy instead of querying SQLite
  • Using a ring buffer for recent requests instead of querying the database
  • Supporting multiple backends (not just LM Studio)
  • Adding request/response body logging (optional, for debugging)

But for my use case, this works. It’s been running for a few months without issues, and I finally have the visibility I needed.
