Tech Expert & Vibe Coder

With 15+ years of experience, I specialize in self-hosting, AI automation, and Vibe Coding – building applications using AI-powered tools like Google Antigravity, Dyad, and Cline. From homelabs to enterprise solutions.

Implementing Automatic Model Quantization Pipelines: Converting HuggingFace Models to GGUF for Ollama with llama.cpp Scripts

Why I Built a Model Quantization Pipeline

I run several AI models locally on my Proxmox cluster. Some run in Docker containers, others in VMs. The problem I kept hitting was storage and memory. A 13B parameter model in full 16-bit precision takes roughly 26GB of disk space and similar amounts of RAM to run. When you’re managing multiple models, testing different versions, or just trying to keep things responsive on consumer hardware, this becomes unsustainable fast.

I needed a way to take models from HuggingFace, convert them to a format Ollama could use (GGUF), and quantize them down to sizes that actually fit my setup without completely destroying quality. After manually doing this process a few times, I automated it.

My Setup and Tools

I use Ollama to serve models locally. It’s clean, simple, and works well with my existing infrastructure. Ollama expects models in GGUF format, which is what llama.cpp uses. Most models on HuggingFace are distributed as PyTorch checkpoints or SafeTensors, so conversion is necessary.

The core tools I rely on:

  • llama.cpp — Provides the conversion script (convert.py, renamed convert_hf_to_gguf.py in recent versions) for HF to GGUF conversion and the quantize binary for further compression
  • huggingface_hub Python library — More reliable than git cloning large models
  • Proxmox LXC containers — Where I run the conversion jobs, isolated from my main workloads

I do not use cloud services for this. Everything runs on my own hardware.

The Conversion Process I Actually Use

Downloading the Model

I tried git clone initially. It worked for smaller models but consistently failed with OOM errors on anything over 10GB. The huggingface_hub library handles this better.

I install it in a Python virtual environment:

python3 -m venv venv
source venv/bin/activate
pip install huggingface_hub

Then I use a simple script to pull the model:

from huggingface_hub import snapshot_download

model_id = "lmsys/vicuna-13b-v1.5"
snapshot_download(
    repo_id=model_id,
    local_dir="vicuna-hf",
    local_dir_use_symlinks=False,  # deprecated (and ignored) in newer huggingface_hub releases
    revision="main"
)

This downloads the full model to a local directory. No symlinks, no partial files. I verify it completed by checking the directory size and presence of config.json and weight files.
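That completeness check is easy to script. A minimal sketch of what I mean by "verify it completed" — the file patterns are my assumption about what a complete snapshot contains, not an official check:

```python
import glob
import os

def verify_snapshot(model_dir: str) -> bool:
    """Check that a downloaded HF snapshot looks complete:
    config.json is present and at least one weight file exists."""
    if not os.path.isfile(os.path.join(model_dir, "config.json")):
        return False
    weights = (glob.glob(os.path.join(model_dir, "*.safetensors"))
               + glob.glob(os.path.join(model_dir, "*.bin")))
    return len(weights) > 0

# Usage: verify_snapshot("vicuna-hf")
```

This catches the common failure mode where the download was interrupted and only some shards landed on disk.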

Converting to GGUF

I clone llama.cpp and install its Python dependencies:

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
pip install -r requirements.txt

The convert.py script does the initial conversion. I run it like this:

python convert.py ../vicuna-hf \
  --outfile ../vicuna-13b-v1.5-f16.gguf \
  --outtype f16

I use f16 (16-bit float) as the intermediate format, not q8_0. This matters because I often quantize further to smaller formats like q4_k_m or q5_k_m. Quantizing from f16 preserves slightly better quality than going from q8_0 to q4_k_m.

The conversion takes 5-15 minutes depending on model size. The output is a single GGUF file.

Quantizing Further

The f16 GGUF is still too large for my use case. A 13B model in f16 is around 26GB. I quantize it down using the quantize binary from llama.cpp.

First, I compile it (newer llama.cpp releases build with CMake and name the resulting binary llama-quantize, so adjust for your checkout):

cd llama.cpp
make

Then I run quantization:

./quantize ../vicuna-13b-v1.5-f16.gguf ../vicuna-13b-v1.5-q4-k-m.gguf q4_k_m

This produces a q4_k_m quantized model, which is roughly 7-8GB for a 13B parameter model. The quality loss is noticeable but acceptable for most tasks. I’ve tested q3_k_s and q5_k_m as well. The former is too degraded for anything serious. The latter is a good middle ground if I have the memory.
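The size figures above follow directly from bits-per-weight arithmetic. A rough estimator — the bits-per-weight values are approximations I use for planning, since actual GGUF files vary because different tensors get different quantization types:

```python
# Approximate bits per weight for common GGUF formats.
# Ballpark figures only; real files deviate somewhat.
BITS_PER_WEIGHT = {
    "f16": 16.0,
    "q8_0": 8.5,
    "q5_k_m": 5.7,
    "q4_k_m": 4.8,
    "q3_k_s": 3.5,
}

def estimate_size_gb(params_billion: float, quant: str) -> float:
    """Estimate GGUF file size in decimal gigabytes."""
    total_bits = BITS_PER_WEIGHT[quant] * params_billion * 1e9
    return total_bits / 8 / 1e9

# A 13B model: ~26 GB at f16, ~7.8 GB at q4_k_m
```

This is how I decide up front which quantization levels will fit a given container's RAM budget.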

Loading Into Ollama

Ollama has a simple import process. I create a Modelfile that points to the GGUF:

FROM ./vicuna-13b-v1.5-q4-k-m.gguf

Then I run:

ollama create vicuna-13b-q4 -f Modelfile

The model is now available locally. I can test it immediately:

ollama run vicuna-13b-q4
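Ollama also exposes an HTTP API on localhost:11434, which is handy for scripted smoke tests after an import. A minimal sketch using only the standard library — the model name matches the one created above:

```python
import json
import urllib.request

def build_generate_request(model: str, prompt: str,
                           host: str = "http://localhost:11434"):
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )

# Usage (requires a running Ollama instance):
# req = build_generate_request("vicuna-13b-q4", "Say hello in one word.")
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["response"])
```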

What Worked

Using f16 as the intermediate format — I initially went straight to q8_0 to save disk space. When I later quantized to q4_k_m, I noticed the model performed slightly worse than when I quantized from f16. The difference is small but measurable in coherence and instruction-following.

Running conversions in isolated containers — I spin up a Proxmox LXC with 16GB RAM and mount a shared NFS volume for model storage. This keeps the conversion process from interfering with other workloads. If something crashes, I just destroy the container.

Keeping the f16 version around — Disk space is cheap compared to re-downloading and re-converting. I keep the f16 GGUF as a “master” and quantize from it as needed. This lets me test different quantization levels without starting over.

What Didn’t Work

Using git clone for large models — It consistently failed on models over 10GB. The huggingface_hub library is more reliable.

Quantizing directly to q3_k_s — I tried this to save maximum space. The model became nearly unusable. Responses were incoherent and repetitive. q4_k_m is the practical lower limit for anything I expect to work well.

Skipping the f16 step — As mentioned, going straight to q8_0 and then further to q4_k_m resulted in measurably worse output. The extra disk space for f16 is worth it.

Running conversions on the Proxmox host directly — I did this once and it locked up other VMs when memory usage spiked. Containers are the right tool here.

Automating the Pipeline

I wrote a bash script that handles the full workflow:

  1. Download model from HuggingFace
  2. Convert to f16 GGUF
  3. Quantize to q4_k_m and q5_k_m
  4. Import into Ollama
  5. Clean up intermediate files
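The steps above can be sketched as command construction. This builds the command lines rather than running them, and the paths, script names, and per-quant Modelfile layout are assumptions mirroring the manual commands earlier in the post:

```python
from pathlib import Path

def build_pipeline_commands(model_id: str, workdir: str = "."):
    """Translate the five pipeline steps into shell command lists.
    Binary names and paths mirror the manual workflow above;
    adjust them for your llama.cpp checkout."""
    name = model_id.split("/")[-1]
    wd = Path(workdir)
    hf_dir = wd / f"{name}-hf"
    f16 = wd / f"{name}-f16.gguf"
    cmds = [
        # 1. Download (huggingface-cli ships with huggingface_hub)
        ["huggingface-cli", "download", model_id, "--local-dir", str(hf_dir)],
        # 2. Convert to f16 GGUF
        ["python", "convert.py", str(hf_dir),
         "--outfile", str(f16), "--outtype", "f16"],
    ]
    for quant in ("q4_k_m", "q5_k_m"):
        out = wd / f"{name}-{quant}.gguf"
        # 3. Quantize to each target level from the f16 master
        cmds.append(["./quantize", str(f16), str(out), quant])
        # 4. Import into Ollama (assumes one Modelfile per quant level)
        cmds.append(["ollama", "create", f"{name}-{quant}",
                     "-f", f"Modelfile.{quant}"])
    # 5. Clean up the HF checkpoint; the f16 master is kept on purpose
    cmds.append(["rm", "-rf", str(hf_dir)])
    return cmds
```

In the real script each command runs with error handling and a notification on failure, but the ordering is exactly this.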

The script is triggered by a webhook from n8n. When I add a model ID to a list in my automation system, the pipeline kicks off automatically. I get a notification when it completes or fails.

I do not use this for every model. Only ones I know I’ll use repeatedly. For one-off tests, I still do it manually.

Key Takeaways

Quantization is not free — Every step down in bit depth costs quality. Test the output before committing to a quantization level.

Disk space matters less than you think — Keeping the f16 version around is worth it. Re-downloading and re-converting is more annoying than buying another drive.

Containers isolate risk — Running heavy conversion jobs in a dedicated LXC container prevents them from impacting other services.

q4_k_m is the sweet spot for me — It balances size and quality well enough for local inference on consumer hardware.

Automation only makes sense if you do this often — I quantize 2-3 models a week. The script saves time. If you’re doing this once a month, just do it manually.
