Tech Expert & Vibe Coder

With 15+ years of experience, I specialize in self-hosting, AI automation, and Vibe Coding – building applications using AI-powered tools like Google Antigravity, Dyad, and Cline. From homelabs to enterprise solutions.

Building a Local RAG Pipeline with Ollama, ChromaDB, and Automatic Document Ingestion

Automation 7 min read Published Mar 19, 2026 Updated Mar 19, 2026

Why I Built This Pipeline

I needed a way to query my own documentation—installation notes, config files, past troubleshooting logs—without uploading anything to external AI services. I run systems that handle internal network details, client data snippets, and infrastructure notes that shouldn’t leave my environment. Cloud-based RAG tools work well, but they require sending your data out. I wanted the same capability, but fully local.

I also wanted automatic ingestion. Dropping a PDF into a folder and having it indexed without manual steps meant I could treat this like any other monitoring or backup workflow—set it once, let it run.

My Setup

I ran this on a Proxmox VM with 16GB RAM and 6 CPU cores allocated. The host has no GPU, so everything runs on CPU. I use Docker Compose to manage services because it keeps dependencies isolated and makes teardown clean.

The stack consists of:

  • Ollama – Hosts the LLM and embedding models locally via HTTP on port 11434
  • ChromaDB – Stores vector embeddings on port 8000
  • Python scripts – Handle document ingestion and query orchestration
  • inotify-tools – Watches a folder for new files and triggers re-indexing

All data stays on the VM. No outbound requests to AI APIs.

Docker Compose Configuration

I created a docker-compose.yml file in /opt/rag-pipeline/:

version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    restart: unless-stopped

  chromadb:
    image: chromadb/chroma:latest
    container_name: chromadb
    ports:
      - "8000:8000"
    volumes:
      - chroma-data:/chroma/chroma
    environment:
      - IS_PERSISTENT=TRUE
      - ANONYMIZED_TELEMETRY=FALSE
    restart: unless-stopped

volumes:
  ollama-data:
  chroma-data:

After running docker compose up -d, I pulled the models:

docker exec -it ollama ollama pull nomic-embed-text
docker exec -it ollama ollama pull llama3.1:8b

nomic-embed-text is a 137M parameter model designed for embeddings. llama3.1:8b handles text generation. Both run entirely on CPU in my setup, though generation is slower than I’d like—around 8-12 tokens per second depending on prompt size.

Document Ingestion Script

I wrote a Python script (ingest.py) that:

  1. Scans /opt/rag-pipeline/docs/ for PDFs, markdown files, and plain text
  2. Extracts text from each file
  3. Splits text into chunks of 1024 characters with 200-character overlap
  4. Sends each chunk to Ollama’s embedding endpoint
  5. Stores the resulting vectors in ChromaDB

Dependencies I installed:

pip install chromadb ollama pypdf sentence-transformers

The core logic looks like this:

import os
import chromadb
from chromadb.config import Settings
from pypdf import PdfReader
import ollama

# Connect to ChromaDB
client = chromadb.HttpClient(host="localhost", port=8000)
collection = client.get_or_create_collection(name="docs")

# Function to chunk text
def chunk_text(text, chunk_size=1024, overlap=200):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap
    return chunks

# Process a single PDF
def ingest_pdf(filepath):
    reader = PdfReader(filepath)
    text = ""
    for page in reader.pages:
        text += page.extract_text()
    
    chunks = chunk_text(text)
    
    for i, chunk in enumerate(chunks):
        # Generate embedding
        response = ollama.embeddings(model="nomic-embed-text", prompt=chunk)
        embedding = response["embedding"]
        
        # Store in ChromaDB
        collection.add(
            ids=[f"{os.path.basename(filepath)}_chunk_{i}"],
            embeddings=[embedding],
            documents=[chunk],
            metadatas=[{"source": filepath, "chunk_index": i}]
        )

# Scan docs folder
docs_dir = "/opt/rag-pipeline/docs/"
for filename in os.listdir(docs_dir):
    if filename.endswith(".pdf"):
        ingest_pdf(os.path.join(docs_dir, filename))

This worked, but the first run took about 45 minutes to process 30 PDFs totaling around 800 pages. The bottleneck was embedding generation on CPU.

Query Script

For querying, I wrote query.py:

import chromadb
import ollama

client = chromadb.HttpClient(host="localhost", port=8000)
collection = client.get_collection(name="docs")

def query_rag(question):
    # Embed the question
    response = ollama.embeddings(model="nomic-embed-text", prompt=question)
    query_embedding = response["embedding"]
    
    # Retrieve top 3 most similar chunks
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=3
    )
    
    context = "\n\n".join(results["documents"][0])
    
    # Build prompt
    prompt = f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"
    
    # Generate answer
    response = ollama.generate(model="llama3.1:8b", prompt=prompt)
    return response["response"]

# Example usage
answer = query_rag("How do I configure DNS for Synology?")
print(answer)

This retrieves the three most relevant chunks based on vector similarity, then feeds them to the LLM as context. The model generates an answer grounded in those chunks.

Automatic Ingestion with Folder Watching

I wanted new documents to be indexed automatically when dropped into the docs/ folder. I used inotifywait from inotify-tools:

#!/bin/bash

WATCH_DIR="/opt/rag-pipeline/docs/"
INGEST_SCRIPT="/opt/rag-pipeline/ingest.py"

inotifywait -m -e create -e moved_to "$WATCH_DIR" |
while read path action file; do
    echo "New file detected: $file"
    python3 "$INGEST_SCRIPT"
done

I saved this as watch_docs.sh, made it executable, and ran it in a tmux session. Now any PDF, markdown, or text file added to docs/ triggers a re-run of ingest.py.

This isn’t the most efficient approach—it re-processes everything instead of just the new file. I modified ingest.py later to check if a file ID already exists in ChromaDB before embedding it again. That reduced redundant work.

What Worked

The pipeline runs entirely offline. I verified this by blocking outbound traffic from the VM temporarily—everything continued to function. Responses are slower than cloud-based LLMs, but they’re accurate when the context chunks are relevant.

Chunking at 1024 characters with 200-character overlap worked well for most documents. Technical documentation with code blocks occasionally had issues when a code snippet was split across chunks, but this was rare.

ChromaDB’s persistence works reliably. I can restart the container without losing indexed data. The volume mounts in Docker Compose handle this correctly.

What Didn’t Work

CPU-only inference is slow. Generating a 200-word answer takes 20-30 seconds. I don’t have a GPU on this VM, and moving this workload to a machine with GPU acceleration isn’t practical for me right now. For interactive use, this latency is noticeable.

The first version of my ingestion script had no duplicate detection. Every time watch_docs.sh triggered, it re-embedded every file. This created duplicate chunks in ChromaDB with identical content but different IDs. I added a check to skip files that were already indexed by comparing filenames in metadata, but this still doesn’t handle updated files cleanly—if I edit a PDF, the old chunks remain unless I manually purge them.

Retrieval quality depends heavily on chunk size. I experimented with 512, 1024, and 2048 character chunks. Smaller chunks returned very narrow context that missed broader explanations. Larger chunks diluted relevance because unrelated content ended up in the same chunk. 1024 was a compromise, not a perfect solution.

I also tried using llama3.2:3b instead of the 8B model to speed up generation. The smaller model generated responses faster but frequently hallucinated details not present in the context. It was unusable for anything requiring accuracy.

Key Takeaways

A local RAG pipeline is practical for querying private documentation without sending data to external services. The setup is straightforward with Docker, but performance on CPU is limited.

Automatic ingestion using folder watchers works, but requires careful handling of duplicates and updates. A proper production version would track file modification times and re-chunk only changed files.

Chunk size affects retrieval quality more than I expected. Too small and you lose context. Too large and you retrieve irrelevant content. Testing with your actual documents is necessary.

Running this without GPU acceleration is slow but functional. If speed matters, GPU support or smaller models with lower quality trade-offs are required.

The entire system runs in two Docker containers and a few hundred lines of Python. Maintenance is minimal once it’s stable.

Previous article

Self-Hosting Open WebUI Behind Nginx Proxy Manager with SSO Authentication

Next article

Creating Custom Ollama Modelfiles for Domain-Specific Fine-Tuned Models

Leave a Comment

Your email address will not be published. Required fields are marked *

Search Articles

Jump to another topic without leaving the reading flow.

Categories

Browse more posts grouped by topic.

About the Author

Vipin PG

Vipin PG

Expert Tech Support & Services

Vipin PG is a software professional with 15+ years of hands-on experience in system infrastructure, browser performance, and AI-powered development. Holding an MCA from Kerala University, he has worked across enterprises in Dubai and Kochi before running his independent tech consultancy. He has written 180+ tutorials on Docker, networking, and system troubleshooting - and he actually runs the setups he writes about.

Stay Updated

Get new posts and practical tech notes in your inbox.

Short, high-signal updates covering self-hosting, automation, AI tooling, and infrastructure fixes.