Why I Started Tracking Token Costs
I run several self-hosted LLM instances on my Proxmox cluster—mostly for internal automation, content processing, and some experimental tools. What started as "just spin up a container and use it" quickly became a problem: I had no idea what was actually consuming resources or costing me compute time.
My setup uses a mix of local models (Llama variants via Ollama) and occasionally a small OpenAI-compatible API for specific tasks. Without proper tracking, I was blind to usage patterns. Was my n8n automation hammering the API? Were certain workflows inefficient? I didn't know.
That's when I decided to implement token-based cost tracking using tools I already had: Prometheus for metrics collection and Grafana for visualization.
My Actual Setup
Here's what I'm working with:
- Proxmox host running multiple LXC containers
- Ollama instance serving local models (Llama 3.2, Mistral)
- A small Docker container running a custom FastAPI wrapper for LLM calls
- Prometheus already deployed for monitoring other services
- Grafana connected to Prometheus for dashboards
The goal was simple: every time an LLM API call happens, log the token count (input + output) and expose it as a metric Prometheus could scrape.
Building the Token Counter
I wrote a Python middleware layer that wraps around my LLM API calls. It intercepts requests, counts tokens using the tiktoken library (for OpenAI-compatible models) or estimates them for local models, then exposes the data via a /metrics endpoint.
Here's the core logic I used:
from prometheus_client import Counter, Histogram, generate_latest
import tiktoken

# Cumulative token counts labelled by model and direction, plus request latency
token_counter = Counter('llm_tokens_total', 'Total tokens processed', ['model', 'type'])
request_duration = Histogram('llm_request_duration_seconds', 'Request duration', ['model'])

def count_tokens(text, model="gpt-3.5-turbo"):
    # Exact counts via tiktoken for OpenAI-compatible models; local models
    # go through a rougher character-based estimate instead (more on that below)
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def process_request(prompt, response, model):
    # Called once per completed request: count both sides of the exchange
    # and bump the Prometheus counters
    input_tokens = count_tokens(prompt, model)
    output_tokens = count_tokens(response, model)
    token_counter.labels(model=model, type='input').inc(input_tokens)
    token_counter.labels(model=model, type='output').inc(output_tokens)
This runs inside my FastAPI container. Every request increments the counters, and Prometheus scrapes http://api-container:8000/metrics every 15 seconds.
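The only other piece inside the container is the endpoint Prometheus scrapes. Here's a minimal sketch of how that route can be wired up; prometheus_client's generate_latest does the serialization, and the route itself is boilerplate rather than a copy of my exact app:

from fastapi import FastAPI, Response
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest

app = FastAPI()

@app.get("/metrics")
def metrics():
    # Serialize every registered metric in the Prometheus text exposition
    # format so the scraper can pick up the counters defined above
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)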
Prometheus Configuration
I added a scrape target in my prometheus.yml:
scrape_configs:
  - job_name: 'llm_api'
    static_configs:
      - targets: ['192.168.1.50:8000']
    scrape_interval: 15s
Nothing fancy. Prometheus just pulls the metrics and stores them in its time-series database.
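Before building dashboards, it's worth eyeballing what the exporter actually serves. The lines below only illustrate the prometheus_client text format; the label values and numbers are placeholders, not measurements from my setup:

# HELP llm_tokens_total Total tokens processed
# TYPE llm_tokens_total counter
llm_tokens_total{model="llama3.2",type="input"} 1523.0
llm_tokens_total{model="llama3.2",type="output"} 4876.0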
Grafana Dashboard
I created a simple dashboard with these panels:
- Total tokens processed (input vs output) over time
- Tokens per model (to see which models I use most)
- Request rate and duration
- Estimated cost (calculated based on token counts and hypothetical pricing)
For the cost calculation, I used a variable in Grafana to set a "cost per 1M tokens" value. Even though I'm self-hosting, this helps me understand what the equivalent cloud cost would be.
The PromQL query for total tokens looked like this:
sum(rate(llm_tokens_total[5m])) by (model, type)
This shows the rate of token consumption, broken down by model and whether it's input or output.
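The estimated-cost panel builds on the same counter. Here's a sketch of the kind of query I use, assuming a Grafana constant variable named $cost_per_million that holds the price per 1M tokens (the variable name is mine; pick whatever you like):

# rolling daily spend estimate per model, priced at $cost_per_million per 1M tokens
sum(increase(llm_tokens_total[24h])) by (model) * $cost_per_million / 1000000

Using increase() over 24h gives a rolling daily figure per model, which reads more naturally as "spend" than an instantaneous rate.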
What Worked
The setup was surprisingly straightforward once I had the middleware in place. Prometheus and Grafana are tools I already use for monitoring other services, so adding LLM metrics felt natural.
Key wins:
- I immediately saw that one of my n8n workflows was making redundant API calls—fixed that and cut token usage by 40%
- Output tokens were consistently higher than input tokens, which meant my prompts were efficient but responses were verbose (I adjusted temperature settings)
- I could compare local model performance vs cloud APIs in terms of throughput
The Grafana dashboard gave me visibility I didn't have before. I could see spikes in usage and correlate them with specific automation runs.
What Didn't Work
Token estimation for local models was inconsistent. Ollama doesn't expose token counts directly, so I had to estimate based on character count and a rough tokens-per-character ratio. This worked for ballpark numbers but wasn't precise.
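For reference, the fallback is nothing more sophisticated than this sketch; the ~4 characters per token ratio is a common rule of thumb for English text, not something Ollama reports, and it's only good for ballpark figures:

def estimate_tokens(text, chars_per_token=4):
    # Rough fallback for local models: assume ~4 characters per token.
    # Fine for trend lines, not for exact accounting.
    return max(1, len(text) // chars_per_token)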
I also initially tried to track tokens at the application level (inside n8n workflows), but that quickly turned messy. Centralizing it in the API layer made more sense, though it meant routing every LLM call through my custom endpoint.
Another issue: Prometheus retention. I'm storing metrics for 30 days, which is fine for trends, but if I want long-term cost analysis, I'd need to export data to something like InfluxDB or just increase retention (at the cost of disk space).
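If you do want more history in Prometheus itself, retention is just a startup flag. In a docker-compose setup it lives in the command section; the 90d below is only an example, and longer retention costs proportionally more disk:

command:
  - '--config.file=/etc/prometheus/prometheus.yml'
  - '--storage.tsdb.retention.time=90d'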
Key Takeaways
If you're self-hosting LLMs and want to understand usage patterns, token tracking is essential. You don't need a fancy observability stack—Prometheus and Grafana are enough.
Here's what I learned:
- Wrap your LLM API calls in a layer that exposes metrics. Don't try to track usage in multiple places.
- Use Prometheus counters for cumulative token counts and histograms for request durations.
- Grafana dashboards make it easy to spot inefficiencies—I found several workflows that were wasteful.
- Token estimation for local models is imperfect. If you need exact counts, you'll need to modify the model server itself.
- Cost tracking (even hypothetical) helps justify self-hosting decisions. I can now compare my setup to cloud pricing.
This setup isn't perfect, but it gives me the visibility I need. I know what's running, how much it's costing (in compute terms), and where I can optimize. That's enough for my use case.