Why I Built an Automated Quantization Pipeline
I run several AI workloads on my local Proxmox cluster and Synology NAS. When a promising new model drops on Hugging Face, I want to test it quickly without waiting hours for manual quantization or burning through API credits. The problem: most new releases come as full-precision PyTorch checkpoints that don't fit in my available RAM, and manually converting them with llama.cpp is tedious work I found myself repeating for every release.
I needed a way to automatically detect new model releases, convert them to GGUF format, quantize to multiple bit depths, and store them where my containers could access them. This isn't about running the absolute latest models—it's about having a reliable process that turns Hugging Face releases into usable local deployments without manual intervention.
My Setup and Constraints
My quantization pipeline runs on a Proxmox VM with 32GB RAM and no GPU. I deliberately avoid GPU-dependent conversion steps because I want this to work on any machine in my network, including my Synology NAS where I sometimes offload batch jobs.
The pipeline uses:
- A Python script that polls Hugging Face's API for new model releases
- llama.cpp's conversion and quantization tools
- Docker containers to isolate dependencies
- Cronicle for scheduling (I covered this in my automation setup)
- NFS mounts to a shared storage volume where quantized models land
I don't use any cloud services for this. Everything happens locally, and the converted models stay on my network.
How the Conversion Process Actually Works
When I first tried to automate this, I assumed llama.cpp's convert script would handle any model format thrown at it. That was wrong. The script expects specific file structures, and many Hugging Face models use safetensors or split checkpoint files that need preprocessing.
Here's the actual sequence I ended up with:
1. Detecting New Releases
I wrote a Python script that queries Hugging Face's API every 6 hours. It checks a predefined list of model repos I care about (mostly Mistral, Llama variants, and Qwen models). The script compares release timestamps against a local JSON file that tracks what I've already processed.
This isn't sophisticated—it's just a cron job hitting an API endpoint and comparing dates. But it works reliably, and I don't have to manually check for updates.
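For anyone curious what that looks like, here's a stripped-down sketch. The repo list and the seen_models.json file name are placeholders, and I'm relying on Hugging Face's public model API, which reports a lastModified timestamp per repo.

```python
# Minimal sketch of the release check. WATCHED_REPOS and seen_models.json
# are illustrative names, not part of any standard tooling.
import json
import pathlib
import requests

WATCHED_REPOS = [
    "mistralai/Mistral-7B-Instruct-v0.3",
    "Qwen/Qwen2.5-7B-Instruct",
]
STATE_FILE = pathlib.Path("seen_models.json")

def load_state() -> dict:
    return json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}

def check_for_updates() -> list[str]:
    state = load_state()
    updated = []
    for repo in WATCHED_REPOS:
        resp = requests.get(f"https://huggingface.co/api/models/{repo}", timeout=30)
        resp.raise_for_status()
        last_modified = resp.json().get("lastModified", "")
        # only queue repos whose timestamp changed since the last run
        if last_modified and last_modified != state.get(repo):
            updated.append(repo)
            state[repo] = last_modified
    STATE_FILE.write_text(json.dumps(state, indent=2))
    return updated

if __name__ == "__main__":
    for repo in check_for_updates():
        print(f"new or updated release: {repo}")
```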
2. Downloading and Preparing Models
Once a new release is detected, the script downloads the model using Hugging Face's CLI tool. I specifically download only the model weights and tokenizer files—no training configs or extra metadata that bloats storage.
The download happens to a temporary directory that gets wiped after conversion. I learned this the hard way after filling up a 500GB volume with unconverted checkpoints.
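The filtered download is roughly the sketch below, assuming huggingface_hub's snapshot_download. The allow_patterns list is my approximation of "weights plus tokenizer" and needs tweaking for repos that ship .bin shards instead of safetensors.

```python
# Sketch of the filtered download into a throwaway working directory.
import tempfile
from huggingface_hub import snapshot_download

def download_model(repo_id: str) -> str:
    # temp dir gets wiped after conversion so checkpoints never pile up
    workdir = tempfile.mkdtemp(prefix="hf-download-")
    snapshot_download(
        repo_id=repo_id,
        local_dir=workdir,
        allow_patterns=[
            "*.safetensors", "*.safetensors.index.json",
            "config.json", "tokenizer*", "*.model",
        ],
    )
    return workdir
```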
3. Converting to GGUF
This is where things get messy. llama.cpp's convert_hf_to_gguf.py script works well for standard Llama-style models, but it chokes on models with non-standard architectures or custom tokenizers.
I handle this by:
- Running the conversion inside a Docker container with llama.cpp pre-built
- Catching conversion failures and logging them instead of crashing the entire pipeline
- Skipping models that don't convert cleanly (I revisit these manually if they're important)
The conversion step produces a single GGUF file in FP16 format. This file is usually large (13GB+ for a 7B model), but it's the base I need for quantization.
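Conceptually, the containerized conversion boils down to something like this sketch. The image tag and script path reflect my own build and won't match other setups; the point is the subprocess boundary and the fact that a failure gets logged instead of raised.

```python
# Rough sketch of the containerized conversion step.
# "llamacpp-tools" is a hypothetical local image tag, and the script path
# inside the container is an assumption about how the image was built.
import logging
import pathlib
import subprocess

log = logging.getLogger("quant-pipeline")

def convert_to_gguf(model_dir: str, out_file: str) -> bool:
    """Run convert_hf_to_gguf.py in a container; return True on success."""
    out_dir = str(pathlib.Path(out_file).parent)
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{model_dir}:/model:ro",
        "-v", f"{out_dir}:/out",
        "llamacpp-tools",
        "python3", "/llama.cpp/convert_hf_to_gguf.py", "/model",
        "--outfile", f"/out/{pathlib.Path(out_file).name}",
        "--outtype", "f16",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # log and move on; the rest of the pipeline keeps running
        log.error("conversion failed for %s: %s", model_dir, result.stderr[-2000:])
        return False
    return True
```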
4. Quantizing to Multiple Formats
I quantize each model to Q4_K_M, Q5_K_M, and Q8_0 formats. These cover my typical use cases:
- Q4_K_M for quick testing and low-memory scenarios
- Q5_K_M as a balanced default for most tasks
- Q8_0 when I need better accuracy and have the RAM to spare
I don't bother with Q2 or Q3 variants because the quality drop is too steep for my use cases. I also skip keeping F16 and F32 copies, since the whole point of the pipeline is to avoid storing unquantized models.
Quantization happens with llama.cpp's llama-quantize binary. Each quantization level is a separate process, and I run them sequentially (not in parallel) to avoid memory pressure on the VM.
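The quantization loop itself is short. This is a sketch; the llama-quantize path is wherever your build puts the binary, and the output naming matches the directory layout shown in the next step.

```python
# Sketch of the sequential quantization loop.
# llama-quantize takes <input.gguf> <output.gguf> <type>; the binary path
# here is an assumption about my container/VM layout.
import subprocess

QUANT_TYPES = ["Q4_K_M", "Q5_K_M", "Q8_0"]

def quantize_all(fp16_gguf: str, out_dir: str) -> list[str]:
    produced = []
    for qtype in QUANT_TYPES:
        out_file = f"{out_dir}/{qtype}.gguf"
        # one job at a time: parallel runs blow past the VM's 32GB of RAM
        subprocess.run(
            ["/opt/llama.cpp/llama-quantize", fp16_gguf, out_file, qtype],
            check=True,
        )
        produced.append(out_file)
    return produced
```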
5. Storing and Cataloging Models
Quantized models get moved to an NFS share mounted across my network. The directory structure looks like this:
```
/models/
  mistral-7b-instruct-v0.3/
    Q4_K_M.gguf
    Q5_K_M.gguf
    Q8_0.gguf
    metadata.json
```
The metadata.json file stores conversion timestamps, source URLs, and any notes about issues I encountered. This helps me track which models are current and which need re-processing when llama.cpp's conversion tools improve.
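Cataloging is just a file move plus a JSON dump. The sketch below uses field names I happen to track and assumes the /models root from the layout above is the NFS mount point on the VM.

```python
# Sketch of the cataloging step; paths and metadata fields are specific
# to my setup, not any standard schema.
import json
import pathlib
import shutil
from datetime import datetime, timezone

NFS_ROOT = pathlib.Path("/models")  # NFS share as mounted on the VM

def catalog_model(model_name: str, repo_id: str,
                  quant_files: list[str], notes: str = "") -> None:
    dest = NFS_ROOT / model_name
    dest.mkdir(parents=True, exist_ok=True)
    for f in quant_files:
        shutil.move(f, dest / pathlib.Path(f).name)
    metadata = {
        "source": f"https://huggingface.co/{repo_id}",
        "converted_at": datetime.now(timezone.utc).isoformat(),
        "quantizations": [pathlib.Path(f).name for f in quant_files],
        "notes": notes,
    }
    (dest / "metadata.json").write_text(json.dumps(metadata, indent=2))
```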
What Worked
The pipeline has been running for about four months. It's processed 30+ models without manual intervention, and I haven't had storage issues since implementing the temp directory cleanup.
Key wins:
- I can test new models within hours of their Hugging Face release
- Having multiple quantization levels pre-built saves time when switching between projects
- Containerizing the conversion process means I don't have llama.cpp build dependencies polluting my host system
- The NFS setup lets me access models from any machine without duplicating storage
The biggest unexpected benefit: I stopped worrying about whether a model would "just work" locally. If it's on Hugging Face and follows standard formats, it gets converted and is ready to use.
What Didn't Work
The first version of this pipeline tried to handle every possible model architecture. That was a mistake. Some models use custom tokenizers or non-standard tensor layouts that llama.cpp's converter doesn't support. I wasted days trying to write preprocessing scripts before I realized it wasn't worth the effort.
Now I just skip those models. If a conversion fails, it logs the error and moves on. I manually check the logs weekly to see if anything important was skipped.
Another issue: quantization is slow on CPU. A 7B model takes about 20 minutes to quantize to all three formats on my VM. For larger models (13B+), this can stretch to over an hour. I tried optimizing this with multi-threading flags, but the gains were minimal. The solution was just accepting that quantization is a background task and not worrying about speed.
Storage also became a problem faster than expected. Three quantization levels for a single 7B model use about 10-12GB combined. After processing 30 models, I was using over 300GB. I now have a cleanup script that deletes models I haven't accessed in 60 days, but I had to add that after running out of space twice.
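The cleanup script is nothing fancy. This sketch keys off file access times, which only works if the NFS export isn't mounted with noatime; the 60-day threshold is just what I settled on.

```python
# Sketch of the 60-day cleanup. Relies on atime being updated on the
# NFS share, which is an assumption about the mount options.
import shutil
import time
from pathlib import Path

MODELS_ROOT = Path("/models")
MAX_AGE_DAYS = 60

def prune_stale_models() -> None:
    cutoff = time.time() - MAX_AGE_DAYS * 86400
    for model_dir in MODELS_ROOT.iterdir():
        if not model_dir.is_dir():
            continue
        ggufs = list(model_dir.glob("*.gguf"))
        # delete only if every quantized file has gone untouched past the cutoff
        if ggufs and all(f.stat().st_atime < cutoff for f in ggufs):
            print(f"removing stale model: {model_dir.name}")
            shutil.rmtree(model_dir)
```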
Handling Failures and Edge Cases
Not every model converts cleanly. Some Hugging Face repos have incomplete checkpoints, missing tokenizer files, or use architectures llama.cpp doesn't recognize yet.
My pipeline handles this by:
- Wrapping each conversion step in try/except blocks
- Logging failures with full error messages to a dedicated log file
- Sending a summary email (via a simple SMTP script) if more than 3 conversions fail in a single run
This means I don't have to babysit the process, but I'm still aware when something breaks.
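The failure handling is about as simple as it sounds. Here's a hedged sketch; the SMTP host and addresses are placeholders for my local relay, and the per-repo wrapper just shows the try/except pattern around the steps sketched earlier.

```python
# Sketch of the failure accounting. Host names and addresses are
# placeholders, not real infrastructure.
import smtplib
from email.message import EmailMessage

FAILURE_THRESHOLD = 3

def process_repo(repo_id: str, failures: list[str]) -> None:
    try:
        # download -> convert -> quantize -> catalog (steps sketched above)
        ...
    except Exception as exc:
        failures.append(f"{repo_id}: {exc}")

def notify_if_needed(failures: list[str]) -> None:
    # only email when more than FAILURE_THRESHOLD conversions failed this run
    if len(failures) <= FAILURE_THRESHOLD:
        return
    msg = EmailMessage()
    msg["Subject"] = f"quant pipeline: {len(failures)} conversions failed"
    msg["From"] = "pipeline@homelab.local"
    msg["To"] = "me@homelab.local"
    msg.set_content("\n".join(failures))
    with smtplib.SMTP("smtp.homelab.local") as smtp:
        smtp.send_message(msg)
```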
One specific failure mode: models with non-standard tensor names. llama.cpp's converter expects specific naming conventions, and some fine-tuned models rename tensors during training. When this happens, the conversion crashes with a cryptic error about missing keys. I haven't found a general solution—I just skip these models and note them in my tracking spreadsheet.
Integration with My Workflow
Once models are quantized and stored, I use them in two main ways:
- Direct inference via llama.cpp's CLI tools for quick tests
- Serving through Ollama or text-generation-webui for longer sessions
The quantized GGUF files work seamlessly with both. I don't have to maintain separate model formats or conversion pipelines for different tools.
I also use the quantized models in n8n workflows where I need local LLM inference. The automation setup I wrote about previously pulls models from the NFS share as needed.
Key Takeaways
Building this pipeline taught me that automation doesn't have to be perfect—it just has to handle the common cases reliably and fail gracefully on the edge cases.
Specific lessons:
- Don't try to support every model architecture. Focus on what actually works with your tools.
- Quantization is slow on CPU, but it's still faster than doing it manually every time.
- Storage adds up quickly. Plan for cleanup from the start.
- Logging failures is more useful than preventing them. You can't predict every edge case.
- Containerizing conversion tools keeps your system clean and makes the pipeline portable.
The pipeline isn't fancy, but it's been running reliably for months and has saved me hours of manual work. That's what matters.