Running Qwen 2.5 30B on Raspberry Pi 5 with n8n: Building a Local AI Assistant Workflow Under 8GB RAM

Why I Decided to Run a Local AI Assistant on a Raspberry Pi 5

I've been running n8n workflows on my Raspberry Pi 5 for months now, mostly for task automation and data processing. The Pi 5 with 8GB RAM has been surprisingly capable—fast enough for development work and stable enough to leave running 24/7. When I started experimenting with local AI models, I wanted to see if I could combine both: run a reasonably powerful language model alongside n8n without needing a separate machine or cloud service.

The goal was simple: build a workflow where n8n could send prompts to a local AI model, get responses back, and use those responses in automated tasks. No API keys, no external dependencies, no latency from cloud services. Just my Pi, running everything locally on my home network.

I chose Qwen 2.5 30B because it's one of the larger models that still fit within tight memory constraints when quantized. I wasn't sure whether it would actually work under 8GB of RAM, but I wanted to find out.

My Setup: Hardware and Software Stack

Here's exactly what I used:

  • Raspberry Pi 5 with 8GB RAM
  • 128GB microSD card (class 10, though I'd recommend an NVMe SSD for better performance)
  • Raspberry Pi OS Lite (64-bit, Debian-based)
  • Docker and Docker Compose for containerized services
  • n8n running in a Docker container
  • Ollama for running the Qwen 2.5 30B model locally

I already had n8n running on the Pi from earlier automation projects. Adding Ollama was the new part. Ollama is a tool for running large language models locally: it handles model downloads, serves pre-quantized model variants, and exposes a simple HTTP API that n8n can talk to.

Installing Ollama on the Raspberry Pi 5

I installed Ollama directly on the Pi, not in a container. This was partly because I wanted to avoid container-to-container networking complications and partly because Ollama's installation is straightforward.
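
One prerequisite worth double-checking: Ollama needs a 64-bit OS, and on Raspberry Pi OS Lite (64-bit) the following should report aarch64:

uname -m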

The installation command from Ollama's official site worked without issues:

curl -fsSL https://ollama.com/install.sh | sh

After installation, Ollama runs as a service and listens on localhost:11434 by default. I verified it was running with:

systemctl status ollama
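
To confirm the HTTP API itself and not just the service, Ollama's /api/tags endpoint lists the locally installed models on the default port:

curl http://localhost:11434/api/tags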

Everything looked good, so I moved on to pulling the model.

Downloading and Running Qwen 2.5 30B

This is where memory constraints became real. The full-precision Qwen 2.5 30B model weighs in at roughly 60GB in 16-bit weights, far too large for 8GB of RAM. I needed a heavily quantized version: specifically a 4-bit quantization (Q4_K_M), which cuts the model size dramatically while keeping reasonable quality.
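
As a rough rule of thumb, 4-bit K-quants like Q4_K_M average around 4.5-5 bits per parameter once the mixed-precision layers are counted, so a 30B-parameter model lands at roughly 17-19GB on disk before any runtime overhead.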

I pulled the model using:

ollama pull qwen2.5:30b-instruct-q4_K_M

The download took about 20 minutes over my home network. The model file itself is around 17GB, which fits on the microSD card but would be faster on an SSD.
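
You can confirm the pull completed and check the size on disk with:

ollama list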

Once downloaded, I tested it directly from the command line:

ollama run qwen2.5:30b-instruct-q4_K_M

It worked. The model loaded into memory, and I could interact with it via the terminal. Response times were slower than I'd like—around 10-15 seconds per response—but it was functional. Memory usage hovered around 6-7GB during inference, leaving just enough headroom for the OS and n8n.
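
If you want to watch memory pressure yourself while the model answers, a second SSH session running something like this is enough:

watch -n 2 free -h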

Connecting n8n to Ollama

n8n doesn't have a dedicated Ollama node, but it does have an HTTP Request node, which is all I needed. Ollama exposes a REST API that accepts JSON payloads with prompts and returns generated text.

I created a simple workflow in n8n:

  1. Webhook node to receive incoming prompts (for testing)
  2. HTTP Request node configured to POST to http://localhost:11434/api/generate (see the networking note after this list)
  3. Set node to extract and format the response
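
One note on the URL in step 2: http://localhost:11434 only reaches the host's Ollama if the n8n container shares the host network (or if n8n runs outside Docker). If your n8n container sits on Docker's default bridge network instead, point the node at the Pi's LAN IP and tell Ollama to listen on all interfaces via its documented OLLAMA_HOST setting. A sketch, with 192.168.1.50 standing in for your Pi's address:

sudo systemctl edit ollama
# add under [Service]:  Environment="OLLAMA_HOST=0.0.0.0"
sudo systemctl daemon-reload && sudo systemctl restart ollama
# then use http://192.168.1.50:11434/api/generate in the HTTP Request node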

The HTTP Request node payload looked like this:

{
  "model": "qwen2.5:30b-instruct-q4_K_M",
  "prompt": "{{ $json.prompt }}",
  "stream": false
}

I set stream to false because I wanted complete responses, not token-by-token streaming. n8n's HTTP Request node doesn't handle streaming well anyway.

The first test worked. I sent a prompt via the webhook, and n8n forwarded it to Ollama, which returned a response. The workflow took about 20 seconds end-to-end, most of that being model inference time.
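
When a workflow misbehaves, it helps to reproduce the same request with curl on the Pi itself, which separates n8n problems from Ollama problems. A sketch using the same payload (the generated text comes back in the JSON "response" field):

curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:30b-instruct-q4_K_M",
  "prompt": "Summarize what n8n does in one sentence.",
  "stream": false
}'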

What Worked (and What I Learned)

Running Qwen 2.5 30B on a Raspberry Pi 5 with 8GB RAM is possible, but it has clear limitations:

  • Memory usage is tight. With the model loaded, I had less than 1GB of free RAM. Running other heavy processes at the same time caused slowdowns.
  • Inference is slow. 10-15 seconds per response is manageable for background automation tasks but not for interactive chat.
  • Model quality is acceptable. The 4-bit quantization doesn't destroy the model's usefulness. It's still coherent and capable of following instructions, though I noticed occasional repetition and less nuanced phrasing compared to higher-precision versions.
  • n8n integration is straightforward. Once Ollama is running, connecting it to n8n is just a matter of HTTP requests. No special nodes or plugins needed.

I also learned that running the model from an SSD instead of a microSD card would help. The Pi 5 exposes a single-lane PCIe connector (an NVMe drive needs an M.2 HAT or similar adapter), and loading the roughly 17GB model file from faster storage would cut startup time and improve responsiveness.

What Didn't Work

I tried running the model in a Docker container alongside n8n, thinking it would be cleaner. It didn't work well. Docker added overhead, and networking between containers introduced latency. Installing Ollama directly on the host was simpler and more reliable.

I also tried using a larger model—Qwen 2.5 72B—just to see what would happen. It didn't fit in memory, even with aggressive quantization. The Pi 5 maxed out and became unresponsive. I had to reboot and stick with the 30B version.

Streaming responses didn't work as I'd hoped. Ollama supports streaming, but n8n's HTTP Request node buffers the entire response before passing it along. For interactive use cases, this kills the benefit of streaming.
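
For reference, this is what streaming looks like outside n8n: with "stream": true, Ollama returns one small JSON object per generated chunk, each carrying a partial "response" and the last one marked "done": true, and it's exactly this sequence that the HTTP Request node collapses into a single buffered blob:

curl http://localhost:11434/api/generate -d '{"model": "qwen2.5:30b-instruct-q4_K_M", "prompt": "Say hello.", "stream": true}'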

Real-World Use Case: Automated Content Summarization

I built a workflow that monitors an RSS feed, extracts article text, sends it to the local AI model for summarization, and stores the summary in a database. The whole process runs automatically every hour.

Here's the workflow structure:

  1. Cron node triggers every hour
  2. RSS Read node fetches new articles
  3. HTML Extract node pulls article content
  4. HTTP Request node sends content to Ollama with a summarization prompt (payload sketch below)
  5. Postgres node stores the summary
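
The payload for step 4 follows the same pattern as the earlier one; the content field name here is a placeholder for whatever your HTML Extract node actually outputs:

{
  "model": "qwen2.5:30b-instruct-q4_K_M",
  "prompt": "Summarize the following article in three sentences:\n\n{{ $json.content }}",
  "stream": false
}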

It works reliably. The Pi handles the workload without issues, and I get summaries of technical articles without depending on external APIs or paying per request.

Key Takeaways

  • Running a 30B parameter model on a Raspberry Pi 5 with 8GB RAM is possible with heavy quantization.
  • Inference is slow but usable for background automation tasks.
  • n8n integrates easily with Ollama using standard HTTP requests.
  • Memory constraints are real—don't expect to run multiple large models or heavy workloads simultaneously.
  • An SSD instead of a microSD card would improve performance noticeably.
  • This setup is local, private, and doesn't rely on cloud services or API keys.

If you already have a Raspberry Pi 5 and want to experiment with local AI models in your automation workflows, this approach works. It's not fast, but it's functional, and it runs entirely on hardware you control.