Why I Built a Local Math Proof Assistant
I’ve been running local LLMs for about a year now, mostly for text processing and code generation. But I kept wondering: could these models actually help verify mathematical reasoning, not just generate plausible-sounding math that’s often wrong?
The trigger was debugging a graph algorithm where I needed to prove a loop invariant held. I spent two hours convincing myself the logic was sound. That’s when I thought: what if I could offload some of this verification work to a local system that could check my reasoning step-by-step?
I wasn’t trying to build a full theorem prover like Lean or Coq. I wanted something practical: a tool that could take my informal mathematical arguments, break them down, and flag potential gaps or errors. Something that runs on my hardware, with no API costs or data leaving my network.
My Setup
I run Proxmox with several VMs dedicated to AI workloads. For this project, I allocated:
- Ubuntu 22.04 VM with 32GB RAM
- NVIDIA RTX 4090 passed through via PCIe
- Local storage on NVMe for model files and proof databases
The core stack:
- Ollama for serving local LLMs (I tested Llama 3.1 70B and Qwen2.5-Math models)
- Python for orchestration and proof workflow logic
- SQLite for storing proof attempts, intermediate steps, and verification results
- n8n for automating the submission and review cycle
I chose Ollama because it handles model loading and inference without me managing CUDA libraries directly. The Qwen2.5-Math models caught my attention because they’re specifically trained on mathematical reasoning, not just general text.
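The orchestration layer talks to Ollama over its local REST API. A minimal sketch of that call, assuming Ollama's default port and the documented `/api/generate` endpoint (the model name here is a placeholder, not necessarily what you'd run):

```python
import json
import urllib.request

# Ollama's default non-streaming generate endpoint.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str) -> dict:
    """Build a non-streaming generate request for Ollama."""
    return {"model": model, "prompt": prompt, "stream": False}


def ask(model: str, prompt: str) -> str:
    """Send a prompt to the local Ollama server and return the response text."""
    data = json.dumps(build_payload(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        # With stream=False, Ollama returns one JSON object with a "response" field.
        return json.loads(resp.read())["response"]
```

Everything downstream (decomposition, validation, counterexamples) is just different prompts routed through a helper like `ask`.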
The Workflow I Built
The system works in stages, inspired by how I actually verify proofs manually:
1. Problem Decomposition
I feed the LLM a mathematical statement or proof outline. The first pass asks it to break down the argument into discrete logical steps. Each step should be a single claim or deduction.
For example, if I’m proving “the sum of two even numbers is even,” the decomposition might look like:
- Assume a and b are even integers
- Then a = 2m for some integer m
- And b = 2n for some integer n
- Therefore a + b = 2m + 2n
- Factor: a + b = 2(m + n)
- Since m + n is an integer, a + b is even
The LLM generates this breakdown, and I store each step with an ID in SQLite.
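Storage for the decomposed steps can be a single table keyed by proof and step number. This is an illustrative sketch (the schema and column names are my simplification, not necessarily the production layout):

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS steps (
    proof_id INTEGER NOT NULL,
    step_no  INTEGER NOT NULL,
    claim    TEXT    NOT NULL,
    status   TEXT    DEFAULT 'pending',  -- pending / valid / flagged
    PRIMARY KEY (proof_id, step_no)
);
"""


def store_decomposition(conn: sqlite3.Connection, proof_id: int, steps: list[str]) -> None:
    """Insert each decomposed claim with a sequential step number."""
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO steps (proof_id, step_no, claim) VALUES (?, ?, ?)",
        [(proof_id, i + 1, claim) for i, claim in enumerate(steps)],
    )
    conn.commit()
```

With the even-sum example above, each bullet becomes one row, and later validation passes update `status` per step.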
2. Step Validation
For each step, I query the LLM again with a focused prompt: “Is this deduction valid given the previous steps? What assumptions does it rely on?”
This is where local models struggle. They’ll often say “yes, this is valid” without actually checking. To work around this, I implemented a verification pattern:
- Ask the model to explain why the step follows
- Check if the explanation references the correct prior steps
- Look for hedging language (“probably”, “might”, “could be”)
If the explanation is weak or contradictory, I flag the step for manual review.
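The checks above reduce to a couple of heuristics over the model's explanation. A sketch, with the caveat that the hedge-word list and the "step N" citation pattern are simplifications of what a real validator would need:

```python
import re

# Hedge words that suggest the model is guessing rather than checking.
HEDGES = ("probably", "might", "could be", "perhaps", "seems")


def references_prior_steps(explanation: str, prior_steps: set[int]) -> bool:
    """Check that every 'step N' the model cites is an earlier, existing step."""
    cited = {int(n) for n in re.findall(r"step\s+(\d+)", explanation.lower())}
    return cited.issubset(prior_steps)


def is_weak(explanation: str) -> bool:
    """Flag explanations that hedge instead of committing to the deduction."""
    text = explanation.lower()
    return any(h in text for h in HEDGES)


def needs_review(explanation: str, prior_steps: set[int]) -> bool:
    """A step goes to manual review if the explanation hedges or cites a bad step."""
    return is_weak(explanation) or not references_prior_steps(explanation, prior_steps)
```

Crude as it is, substring matching on hedge words catches a surprising share of the non-answers these models produce.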
3. Counterexample Generation
This was the most useful addition. For each claim, I ask the LLM: “Can you construct a counterexample that would make this step false?”
If the model generates a plausible counterexample, that’s a red flag. I found this catches errors the validation step misses, especially in edge cases.
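The counterexample pass is just a prompt template plus a triage rule on the reply. The sentinel phrase and the everything-else-is-a-red-flag rule are my simplified sketch; anything that isn't an explicit refusal gets routed to review:

```python
COUNTEREXAMPLE_PROMPT = (
    "Claim: {claim}\n"
    "Try to construct a specific counterexample that makes this claim false. "
    "If none exists, reply exactly: NO COUNTEREXAMPLE."
)


def counterexample_prompt(claim: str) -> str:
    """Fill the counterexample prompt for a single claim."""
    return COUNTEREXAMPLE_PROMPT.format(claim=claim)


def looks_like_counterexample(reply: str) -> bool:
    """Treat anything other than the explicit refusal phrase as a red flag."""
    return "NO COUNTEREXAMPLE" not in reply.upper()
```

Biasing toward false alarms is deliberate here: a spurious flag costs a minute of review, while a missed counterexample costs much more.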
4. Proof Reconstruction
Once all steps are validated (or flagged), I have the LLM attempt to reconstruct the full proof from the verified steps. If it can’t, or if the reconstruction introduces new steps, I know there’s a gap in the original argument.
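Detecting introduced steps can be as simple as a set difference over normalized claims. In practice the matching would have to be fuzzy, since the model rarely repeats a step verbatim, but the exact-match version shows the idea:

```python
def reconstruction_gaps(verified: list[str], reconstructed: list[str]) -> list[str]:
    """Return steps in the reconstruction that aren't among the verified steps.

    Any such step signals a gap: the model needed extra reasoning to get
    from the stated premises to the conclusion.
    """
    verified_set = {s.strip().lower() for s in verified}
    return [s for s in reconstructed if s.strip().lower() not in verified_set]
```

A non-empty result means the original argument was incomplete, even if every individual step passed validation.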
What Worked
The system is genuinely useful for catching basic logical errors. I tested it on about 30 proofs from undergraduate discrete math and algorithm correctness arguments. It successfully flagged:
- Missing base cases in induction proofs
- Unjustified assumptions (e.g., assuming commutativity without stating it)
- Steps that reference undefined variables
- Circular reasoning where a step assumes what it’s trying to prove
The counterexample generation is surprisingly good. For a claim like “all prime numbers are odd,” the model immediately suggested 2 as a counterexample. It doesn’t always work, but when it does, it saves me from embarrassing mistakes.
Running everything locally means I can iterate quickly. I make a change to the proof, re-run the workflow, and get feedback in under a minute for simple proofs.
What Didn’t Work
The biggest limitation: local LLMs are not formal proof checkers. They don’t understand mathematical rigor the way systems like Lean or Isabelle do. They pattern-match on what “looks like” valid reasoning.
Specific failures I hit:
False Positives on Validation
The model will often approve invalid steps if they’re phrased confidently. For instance, I intentionally included a step that claimed “since f(x) is continuous, it must be differentiable.” The LLM validated it without question, even though that’s false: |x| is continuous everywhere but not differentiable at 0.
I partially mitigated this by adding a “challenge mode” prompt that explicitly asks the model to find flaws. But it’s not reliable.
Inconsistent Counterexample Quality
Sometimes the model generates counterexamples that are themselves incorrect. I had a proof about graph connectivity where the LLM suggested a counterexample graph that didn’t satisfy the problem’s constraints. I had to manually verify every counterexample, which defeats the purpose.
Performance on Complex Proofs
For anything beyond undergraduate-level math, the system falls apart. I tested it on a proof involving measure theory, and the LLM just started making up definitions and theorems. It couldn’t track the dependencies between lemmas or understand when a result required a previous theorem.
Resource Usage
Running Llama 3.1 70B locally is slow. Each validation step takes 5-10 seconds on my RTX 4090. For a proof with 20 steps, that’s several minutes. The smaller Qwen2.5-Math 7B model is faster but less accurate.
I tried running multiple validation passes in parallel, but that maxed out VRAM and caused OOM errors. I ended up processing steps one at a time, which is slow but stays within memory.
The n8n Automation Layer
I use n8n to orchestrate the full workflow. The setup:
- A webhook receives proof submissions (plain text or LaTeX)
- n8n triggers the decomposition script
- Each step is queued for validation
- Results are stored in SQLite
- A summary report is generated and emailed to me
This lets me submit proofs from anywhere on my network and review results later. I also set up a simple web interface (just a form that posts to the webhook) so I don’t have to SSH in every time.
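Submitting from a script might look like the sketch below. The webhook URL and payload fields are hypothetical; they'd match whatever path and body the n8n Webhook node is configured to accept:

```python
import json
import urllib.request

# Hypothetical webhook path; set this to match your n8n Webhook node.
WEBHOOK_URL = "http://n8n.local:5678/webhook/proof-submit"


def build_submission(title: str, body: str, fmt: str = "text") -> dict:
    """Package a proof for the webhook; fmt is 'text' or 'latex'."""
    return {"title": title, "body": body, "format": fmt}


def submit(title: str, body: str, fmt: str = "text") -> int:
    """POST a proof to the n8n webhook and return the HTTP status code."""
    data = json.dumps(build_submission(title, body, fmt)).encode("utf-8")
    req = urllib.request.Request(
        WEBHOOK_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

The web form does the same POST from a browser; both paths converge on the same n8n workflow.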
The automation helps because I can batch-process multiple proofs overnight. I submit 5-6 proofs before bed, and by morning, I have validation reports waiting.
Real-World Use Case
I used this system while working on a distributed consensus algorithm. I needed to prove that a certain safety property held under specific network conditions. The proof involved several lemmas about message ordering and state transitions.
The LLM caught two issues:
- I assumed message delivery was FIFO without stating it as a precondition
- One lemma’s proof skipped a case where two nodes could be in the same state simultaneously
Neither was obvious to me on first read. The system didn’t solve the problems, but it pointed me to the gaps, which saved hours of debugging later.
Key Takeaways
This is not a replacement for formal verification tools. If you need real mathematical rigor, use Lean, Coq, or Isabelle. Those systems actually prove correctness.
What I built is more like a “proof linter.” It catches common mistakes and forces me to think through each step carefully. The act of decomposing a proof into LLM-checkable steps is itself valuable—it makes me explicit about assumptions I’d otherwise gloss over.
Local LLMs are good enough for this use case, but barely. The quality gap between local models and GPT-4 is noticeable. I tested the same proofs with OpenAI’s API, and it caught errors my local setup missed. But I’m not willing to send my work to external APIs, so I accept the trade-off.
The system works best for:
- Algorithm correctness arguments
- Discrete math proofs (graph theory, combinatorics)
- Basic induction and recursion proofs
It fails for:
- Advanced pure mathematics
- Proofs requiring deep domain knowledge
- Anything where formal semantics matter
If you’re considering building something similar, start small. Pick a narrow domain you work in regularly and test whether the LLM can actually provide useful feedback. Don’t expect it to understand math the way a human does—it’s pattern-matching, not reasoning.
I still use this tool regularly, but I treat its output as suggestions, not verdicts. It’s another layer of review, not a source of truth.