Creating a self-hosted AI hallucination detector for your documentation wiki using Ollama and Python

Why I Built a Hallucination Detector for My Documentation Wiki

I run a self-hosted documentation wiki on Proxmox. It stores configuration notes, troubleshooting steps, and network diagrams for everything I maintain. Over time, I started using LLMs to generate summaries and fill in gaps where my notes were incomplete.

Then I caught a problem. An AI-generated network diagram included a VLAN I'd never configured. The model had invented it based on patterns it saw elsewhere in my documentation. If I hadn't cross-checked manually, that fake VLAN would have stayed in my wiki, confusing me months later when I actually needed to reference it.

That's when I decided to build a simple hallucination detector. Not for production use, not for clients—just for my own documentation system. The goal was to catch fabricated details before they became part of my permanent reference material.

My Setup: Ollama, Python, and a Local Wiki

My documentation wiki runs as a Docker container on Proxmox. It's nothing fancy—just a markdown-based system I can edit through a web interface. I already had Ollama running locally for other experiments, so I didn't need to add new infrastructure.

The detection script runs on the same Proxmox host. It's a Python script that:

  • Pulls markdown files from the wiki's storage volume
  • Sends each document to Ollama with a specific prompt
  • Logs any sections flagged as potentially fabricated

I use llama3.2:3b for detection because it runs fast enough on my hardware and doesn't require GPU acceleration. The script outputs a simple JSON file with flagged sections and confidence scores.
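
An entry in that report looks roughly like this; the file name, flagged text, and confidence value below are made up for illustration, not copied from my actual log:

    [
      {
        "file": "services/reverse-proxy.md",
        "section": "TLS termination",
        "flagged_text": "The proxy listens on port 8443 for internal clients.",
        "explanation": "No other part of the document mentions port 8443 or an internal listener.",
        "confidence": 0.62
      }
    ]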

What the Script Actually Does

The Python script reads each markdown file, splits it into logical sections (by heading), and sends each section to Ollama with this prompt:

You are reviewing technical documentation. Identify any claims, configurations, or details that appear fabricated, inconsistent with standard practices, or unsupported by context. Return only the specific text that seems invented, with a brief explanation.

Ollama returns a response, which I parse and log. If the model flags something, I manually review it. The script doesn't auto-delete anything—it just highlights what looks suspicious.
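
Stripped of logging and error handling, the core call looks something like this. Treat it as a sketch rather than the exact script: the paths and names are simplified, and I'm hitting Ollama's local HTTP API directly with the requests library.

    import requests

    OLLAMA_URL = "http://localhost:11434/api/generate"   # Ollama's default local endpoint
    MODEL = "llama3.2:3b"

    REVIEW_PROMPT = (
        "You are reviewing technical documentation. Identify any claims, configurations, "
        "or details that appear fabricated, inconsistent with standard practices, or "
        "unsupported by context. Return only the specific text that seems invented, "
        "with a brief explanation.\n\n"
        "Section to review:\n{section}"
    )

    def review_section(section_text: str) -> str:
        """Send one wiki section to the local model and return its raw assessment."""
        response = requests.post(
            OLLAMA_URL,
            json={
                "model": MODEL,
                "prompt": REVIEW_PROMPT.format(section=section_text),
                "stream": False,   # return the full answer as a single JSON payload
            },
            timeout=300,
        )
        response.raise_for_status()
        return response.json()["response"].strip()

    if __name__ == "__main__":
        sample = "The reverse proxy terminates TLS on port 8443 and forwards to the wiki container."
        print(review_section(sample))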

What Worked

The detector caught three types of problems I didn't expect:

  • Invented configuration values: Port numbers, IP ranges, or service names that don't exist in my actual setup
  • Contradictory statements: One section saying a service runs on Docker, another saying it's a VM
  • Overly confident claims: Phrases like "this always works" or "guaranteed uptime" in areas where I'd actually noted caveats

The most useful catch was a DNS configuration note that referenced a resolver I'd tested but never deployed. The AI had merged two separate experiments into one fictional setup. Without the detector, I would have wasted time trying to debug why "my configuration" didn't match reality.

Why Local Models Matter Here

I chose Ollama specifically because I didn't want to send my internal documentation to external APIs. My notes include IP addresses, service credentials, and network topology details. Running detection locally means nothing leaves my Proxmox host.

The trade-off is accuracy. Smaller local models miss nuances that GPT-4 would catch. But for my use case, I'd rather have 70% accuracy with full privacy than 95% accuracy with data sent to OpenAI.

What Didn't Work

The first version of the script tried to validate every sentence individually. This produced too many false positives. The model flagged normal technical statements as "unsupported" because it lacked surrounding context.

I fixed this by processing entire sections instead of individual sentences. This gave the model enough context to distinguish between legitimate technical details and fabricated ones.
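
The section splitting itself is just a regex over markdown headings. Something like this does the job (simplified from the actual helper):

    import re

    def split_by_heading(markdown: str) -> list[tuple[str, str]]:
        """Split a markdown document into (heading, body) pairs, one per section."""
        # The capturing group makes re.split() keep the matched heading lines.
        parts = re.split(r"^(#{1,6}\s.*)$", markdown, flags=re.MULTILINE)
        sections = []
        if parts[0].strip():
            sections.append(("(no heading)", parts[0]))
        for i in range(1, len(parts) - 1, 2):
            sections.append((parts[i].lstrip("# ").strip(), parts[i + 1]))
        return sections

    if __name__ == "__main__":
        doc = "# DNS\nUnbound runs on the Pi.\n\n## Upstream\nForwards to 1.1.1.1.\n"
        for heading, body in split_by_heading(doc):
            print(heading, "->", body.strip())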

Another failure: I initially tried using confidence scores to auto-filter results. Anything below 70% confidence was ignored. This backfired because some of the most obvious hallucinations came back with low confidence scores. The model was uncertain, but still correct about the problem.

Now I review everything the script flags, regardless of confidence level. It takes more time, but I don't miss real issues.
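
The review step is now deliberately dumb: read the report, print everything, and sort by confidence only so the shakiest flags are easy to spot. A minimal version, assuming the field names from the example report above:

    import json
    from pathlib import Path

    report = json.loads(Path("detector_report.json").read_text())

    # First version: flagged = [e for e in report if e.get("confidence", 0) >= 0.7]
    # That threshold hid some of the most obvious hallucinations, so now nothing is filtered out.
    for entry in sorted(report, key=lambda e: e.get("confidence", 0.0)):
        conf = entry.get("confidence", "n/a")
        print(f'{entry["file"]} :: {entry["section"]} (confidence {conf})')
        print(f'  flagged: {entry["flagged_text"]}')
        print(f'  reason:  {entry["explanation"]}')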

Performance Limitations

Running detection on my entire wiki (about 200 documents) takes roughly 15 minutes. That's acceptable for a weekly batch job, but too slow for real-time validation. If I wanted to check documents as I wrote them, I'd need faster hardware or a smaller model.

I also learned that the detector struggles with highly technical jargon. If a section uses obscure Proxmox or Docker terminology, the model sometimes flags it as suspicious even when it's correct. This is a known limitation of smaller models—they haven't seen enough specialized technical content during training.

How I Actually Use This

I run the detection script weekly via a cron job. The output goes to a log file I review every Sunday. If something is flagged, I open the original document and manually verify the claim against my actual configuration.
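
The cron entry itself is nothing special; the schedule and paths below are examples rather than my exact setup:

    # min hour day month weekday  command
    0 6 * * 0  /usr/bin/python3 /opt/wiki-tools/detect_hallucinations.py >> /var/log/wiki-detector.log 2>&1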

Most weeks, the script finds nothing. But when it does flag something, it's usually worth investigating. Even false positives force me to clarify ambiguous notes or add missing context.

The script has also changed how I write new documentation. I'm more careful about stating things as fact versus noting them as experiments or plans. If I'm not sure about a configuration detail, I explicitly mark it as uncertain rather than letting an LLM fill in the gap later.

Key Takeaways

This project taught me that hallucination detection doesn't need to be complex to be useful. A simple Python script and a local LLM caught real problems in my documentation that I would have missed otherwise.

The most important lesson: don't trust AI-generated content in your reference materials without validation. Even small fabrications compound over time, especially in technical documentation where precision matters.

If you run a self-hosted wiki or documentation system and use LLMs to help maintain it, consider adding a basic detection layer. It doesn't need to be perfect. It just needs to flag the obvious problems before they become permanent.

What I'd Change Next

If I rebuild this, I'd add source tracking. Right now, the script tells me what's wrong but not where the fabricated content came from. Was it AI-generated? Did I copy it from an old note? Knowing the source would help me fix the underlying process, not just the symptoms.

I'd also experiment with running two models in parallel—one for detection, one for verification. If both models flag the same section, confidence goes up. If they disagree, I know the issue is ambiguous and needs manual review.
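
A rough sketch of what that cross-check could look like. I haven't built it, so treat the second model (qwen2.5:3b here) and the "did it flag anything" heuristic as placeholders:

    import requests

    OLLAMA_URL = "http://localhost:11434/api/generate"
    REVIEW_PROMPT = (
        "You are reviewing technical documentation. Identify anything that appears "
        "fabricated or unsupported by context, or answer 'no issues found'.\n\n{section}"
    )  # abridged version of the review prompt used earlier

    def review_with(model: str, section: str) -> str:
        """Ask one local model to review a section and return its raw answer."""
        resp = requests.post(
            OLLAMA_URL,
            json={"model": model, "prompt": REVIEW_PROMPT.format(section=section), "stream": False},
            timeout=300,
        )
        resp.raise_for_status()
        return resp.json()["response"].strip()

    def looks_like_a_flag(answer: str) -> bool:
        """Crude heuristic: treat 'nothing found' style answers as clean."""
        lowered = answer.lower()
        return not any(p in lowered for p in ("no issues", "no fabricated", "nothing appears"))

    def cross_check(section: str) -> str:
        flags = [looks_like_a_flag(review_with(m, section)) for m in ("llama3.2:3b", "qwen2.5:3b")]
        if all(flags):
            return "both models flagged this: probably a real problem"
        if any(flags):
            return "models disagree: ambiguous, review manually"
        return "clean"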

But for now, the current system works. It's not elegant, but it catches real problems in my actual documentation. That's all I needed it to do.