# Building AI-Assisted Code Review Pipelines: Using Local LLMs to Validate Pull Requests Before Human Review
## Why I Built This
I maintain several projects where pull requests sit for days before anyone reviews them. Not because the code is complex, but because the initial scan for obvious issues takes mental energy. Things like inconsistent naming, missing error handling, or undocumented breaking changes.
I wanted a first-pass filter that could catch the mechanical stuff before I spent time on the actual logic review. The constraint: it had to run locally. I don’t send my code or my team’s code to external APIs. Privacy matters, and I don’t trust third-party services with source code.
This isn’t about replacing human review. It’s about reducing the cognitive load of the first pass so I can focus on what actually matters: architecture decisions, edge cases, and whether the code solves the right problem.
## My Setup
I run this on a Proxmox VM with a dedicated GPU passthrough. The VM has 16GB RAM and 6 CPU cores allocated. For the LLM, I use Ollama running locally with the CodeLlama 13B model. I chose CodeLlama because it’s specifically trained on code and runs reasonably fast on my hardware.
The pipeline itself is a Python script triggered by Gitea webhooks. When someone opens a pull request, Gitea sends a webhook to my n8n instance, which triggers the review script. The script pulls the diff, sends it to the local LLM, and posts the results as a comment on the pull request.
I host my Git repositories on Gitea running in Docker on the same Proxmox host. This keeps everything local and avoids external dependencies.
## What the Pipeline Actually Does
The script extracts the git diff from the pull request. It doesn’t send the entire codebase—just the changed lines and some surrounding context. This keeps the token count manageable and the response time under a minute.
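That extraction step can be sketched in a few lines. This is illustrative rather than the post's actual code: the function names, refs, and context width are my assumptions.

```python
import subprocess

def build_diff_command(repo_path, base_ref, head_ref, context_lines=5):
    # Only the changed hunks plus a few lines of context -- not the whole
    # codebase -- which keeps the token count manageable.
    return ["git", "-C", repo_path, "diff", f"-U{context_lines}",
            f"{base_ref}...{head_ref}"]

def get_pr_diff(repo_path, base_ref, head_ref, context_lines=5):
    """Return the diff between the PR's base and head with surrounding context."""
    result = subprocess.run(
        build_diff_command(repo_path, base_ref, head_ref, context_lines),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

The triple-dot range (`base...head`) diffs against the merge base, so the output reflects only what the pull request itself changed.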
I use a specific prompt that asks the LLM to look for:
- Obvious bugs like null pointer dereferences or unhandled exceptions
- Inconsistent naming compared to surrounding code
- Missing error handling in new functions
- Undocumented breaking changes to public APIs
- Basic security issues like SQL injection patterns or hardcoded credentials
The prompt explicitly tells the model to skip style nitpicks and focus on functional problems. I don’t want it flagging every missing semicolon or indentation issue. That’s what linters are for.
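The post doesn't reproduce the final prompt, but a sketch covering the same checklist might look like this (the exact wording is my assumption, not the author's tuned version):

```python
# Hypothetical reconstruction of the review prompt described above.
REVIEW_PROMPT = """You are reviewing a pull request diff. Report only
functional problems:
- obvious bugs (null dereferences, unhandled exceptions)
- naming inconsistent with the surrounding code
- missing error handling in new functions
- undocumented breaking changes to public APIs
- security issues (SQL injection patterns, hardcoded credentials)

Do NOT comment on style, formatting, or indentation; linters handle those.
If you find nothing, reply "No issues found."

Diff:
{diff}
"""

def build_prompt(diff: str) -> str:
    # Insert the extracted diff into the instruction template.
    return REVIEW_PROMPT.format(diff=diff)
```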
The LLM’s response gets formatted as a markdown comment and posted directly to the pull request. The comment includes a disclaimer at the top that says “AI-assisted review – verify all findings before acting.”
## What Worked
The biggest win is catching the stuff I’d otherwise miss until much later. Things like:
- A function that returned null in one branch but threw an exception in another, with no documentation explaining why
- A database query that wasn't parameterized in a new endpoint
- A breaking change to a public method signature with no deprecation notice
These are the kinds of issues that slip through when you’re reviewing code at the end of a long day. The LLM doesn’t get tired or distracted.
Response time is acceptable. Most reviews complete in 30-60 seconds for diffs under 500 lines. That’s fast enough that developers get feedback before they context-switch to something else.
The false positive rate is lower than I expected. Maybe 20% of the flagged issues are non-issues when you understand the full context. That’s annoying but manageable. The alternative—missing real bugs—is worse.
Running everything locally means I don’t worry about rate limits, API costs, or data leaving my network. The setup cost was a few hours of configuration, but the ongoing cost is just electricity.
## What Didn’t Work
The LLM struggles with context that spans multiple files. If a change references a function defined elsewhere, the model often can’t see that relationship. I tried including more context in the prompt, but that blew up the token count and made responses slow.
I also can’t use this for large refactors. Anything over 1000 lines of diff either takes too long or produces generic, useless feedback like “this is a lot of changes, review carefully.” No kidding.
The model occasionally hallucinates issues that don’t exist. It’ll claim a variable isn’t defined when it clearly is, or suggest a function doesn’t exist when it’s right there in the diff. This happens maybe 5% of the time, but it’s frustrating enough that I had to add that disclaimer to every comment.
Prompt engineering took longer than I expected. My first attempts produced either overly verbose responses or useless one-liners. Getting the right balance of detail and brevity required a dozen iterations.
The pipeline doesn’t integrate with GitHub or GitLab. I use Gitea, which has decent webhook support but fewer third-party integrations. If you’re on GitHub, you’d need to adapt this to work with their API and comment format.
## The Actual Implementation
The core script is about 200 lines of Python. It uses the `requests` library to interact with Gitea’s API and the `ollama` Python package to talk to the local LLM.
When a webhook arrives, the script:
1. Fetches the pull request details from Gitea
2. Extracts the diff using `git diff` via subprocess
3. Constructs the prompt with the diff and instructions
4. Sends it to Ollama’s API endpoint (running on localhost:11434)
5. Parses the response and formats it as markdown
6. Posts the comment back to Gitea
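Steps 4 through 6 might look roughly like this. The endpoint paths follow Ollama's HTTP API and Gitea's issue-comment API, but the URLs, token, and model tag are placeholders, not values from the post:

```python
import requests

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local endpoint
GITEA_URL = "http://localhost:3000"    # assumption: Gitea on the same host
GITEA_TOKEN = "gitea-api-token"        # hypothetical; use a real API token
DISCLAIMER = "AI-assisted review – verify all findings before acting."

def run_inference(prompt, model="codellama:13b"):
    """Step 4: send the prompt to the local Ollama server."""
    resp = requests.post(f"{OLLAMA_URL}/api/generate",
                         json={"model": model, "prompt": prompt, "stream": False},
                         timeout=120)
    resp.raise_for_status()
    return resp.json()["response"]

def format_comment(findings):
    """Step 5: prepend the disclaimer to the model's findings."""
    return f"**{DISCLAIMER}**\n\n{findings}"

def post_comment(owner, repo, pr_index, body):
    """Step 6: post the review as a PR comment. Gitea exposes PR comments
    through its issues API, since PRs are issues under the hood."""
    resp = requests.post(
        f"{GITEA_URL}/api/v1/repos/{owner}/{repo}/issues/{pr_index}/comments",
        headers={"Authorization": f"token {GITEA_TOKEN}"},
        json={"body": body},
        timeout=30)
    resp.raise_for_status()
```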
I run this script as a systemd service so it starts automatically if the VM reboots. The service file is simple—just points to the Python script and sets the working directory.
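A unit file in that spirit (paths, script name, and user are hypothetical):

```ini
[Unit]
Description=AI-assisted PR review listener
After=network.target

[Service]
WorkingDirectory=/opt/pr-review
ExecStart=/usr/bin/python3 /opt/pr-review/review.py
Restart=on-failure
User=review

[Install]
WantedBy=multi-user.target
```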
The n8n workflow that triggers this is equally straightforward. It listens for Gitea webhooks, filters for pull request events, and makes an HTTP POST to the script’s endpoint. I could have skipped n8n and had Gitea call the script directly, but n8n gives me better logging and error handling.
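The event filtering n8n performs can also live in the script as a safety net. A sketch of parsing the incoming payload, assuming Gitea's GitHub-style webhook fields (`action`, `pull_request`, `repository.full_name`); verify the exact shape against your Gitea version:

```python
import json

def parse_gitea_webhook(payload_bytes):
    """Extract what the review script needs from a pull_request webhook.
    Returns None for events we don't review (closed, labeled, etc.)."""
    event = json.loads(payload_bytes)
    if event.get("action") != "opened":
        return None
    owner, name = event["repository"]["full_name"].split("/", 1)
    return {
        "owner": owner,
        "repo": name,
        "index": event["pull_request"]["number"],
    }
```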
## Resource Usage
The VM idles at about 2GB RAM usage with Ollama loaded. During a review, it spikes to 8-10GB depending on the diff size. CPU usage jumps to 80-90% for the duration of the inference, which is why I gave it dedicated cores.
GPU passthrough helps a lot. Without it, inference times were 3-5 minutes, which is too slow to be useful. With the GPU, most reviews finish in under a minute.
Power consumption is negligible—maybe 50 watts extra when the GPU is active. That’s less than $5/month at my electricity rates.
## Limitations I Accept
This doesn’t replace human review. It catches mechanical issues, but it can’t evaluate whether the code is solving the right problem or if the approach makes sense. Those require understanding the broader system and the business requirements.
It also doesn’t understand project-specific conventions unless I encode them in the prompt. Things like “we always use dependency injection for database connections” or “all API responses must include a request ID.” I could add these to the prompt, but that makes it longer and slower.
The model sometimes misses issues that a human would catch immediately. For example, it rarely flags performance problems like O(n²) loops or unnecessary database queries. It’s better at catching correctness issues than efficiency issues.
## Key Takeaways
Running LLMs locally for code review is practical if you have the hardware. You don’t need a datacenter—a consumer GPU and 16GB RAM are enough for useful results.
The value isn’t in perfect accuracy. It’s in catching the obvious stuff so you can focus on the hard parts of review. If it flags 10 things and 2 are real issues, that’s still 2 issues I didn’t have to find manually.
Prompt engineering matters more than model size. A well-crafted prompt with a 13B model beats a generic prompt with a 70B model. I spent more time tuning the prompt than I did setting up the infrastructure.
Don’t expect this to work out of the box. Every codebase has its own patterns and conventions. You’ll need to iterate on the prompt and adjust the filters until it produces useful feedback for your specific projects.
If you send code to external APIs, you’re trusting someone else with your intellectual property. Running locally costs more upfront but eliminates that risk entirely.