Tech Expert & Vibe Coder

With 15+ years of experience, I specialize in self-hosting, AI automation, and Vibe Coding – building applications using AI-powered tools like Google Antigravity, Dyad, and Cline. From homelabs to enterprise solutions.

Running DeepSeek-V3 on Consumer Hardware: Quantization Strategies and VRAM Optimization for 685B-Parameter Models

Why I Started Running DeepSeek-V3 Locally

I needed a model that could handle complex reasoning tasks without sending data to external APIs. DeepSeek-V3’s 685 billion parameters promised strong performance, but the challenge was clear: how could I run something this large on hardware I actually own?

My setup includes a workstation with 128GB RAM and an RTX 4090 (24GB VRAM). I also run Proxmox hosts with varying GPU configurations. The goal was to make DeepSeek-V3 usable for real work—not just load it once for a screenshot.

Understanding the Memory Problem

A 685B parameter model in FP16 precision requires roughly 1.37TB of memory. Even at FP8, you’re looking at 685GB. Consumer GPUs top out around 24GB. The math doesn’t work unless you quantize aggressively or split the load across CPU and GPU.
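The arithmetic is simple enough to sanity-check yourself. A quick back-of-the-envelope calculator (decimal gigabytes, weights only; KV cache and activations come on top):

```python
# Rough weight-memory footprint per precision (weights only,
# no KV cache or activation tensors).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(n_params: float, precision: str) -> float:
    """Memory for model weights alone, in decimal gigabytes."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9

n = 685e9  # DeepSeek-V3 parameter count
for p in ("fp16", "fp8", "int4"):
    print(f"{p}: {weight_memory_gb(n, p):,.0f} GB")
```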

I started with the KTransformers framework because it supports heterogeneous CPU-GPU inference. The idea: keep critical layers on the GPU for speed, offload the rest to system RAM. This is not theoretical—I configured it, ran it, and measured what happened.

What VRAM Actually Gets Used For

VRAM holds three things during inference:

  • Model weights for GPU-resident layers
  • KV cache for active requests
  • Activation tensors during forward passes

With 24GB VRAM, I could fit approximately 8-12 billion parameters at FP16, or 32-48 billion at INT4. The rest had to live in system RAM and get swapped in as needed. This creates latency, but it’s the only way to run the full model without a cluster.
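The fit estimate follows from reserving a slice of VRAM for cache and activations before counting weights. A rough sketch (the 6 GB reservation is my assumption, not a measurement; the resulting ranges shift a lot depending on what you reserve):

```python
def params_that_fit_billions(vram_gb: float, bytes_per_param: float,
                             reserved_gb: float = 6.0) -> float:
    """Billions of parameters that fit in VRAM once space for the
    KV cache and activations is reserved (reserved_gb is a guess)."""
    return (vram_gb - reserved_gb) / bytes_per_param

fp16_fit = params_that_fit_billions(24, 2.0)  # FP16 on a 24GB card
int4_fit = params_that_fit_billions(24, 0.5)  # INT4 on a 24GB card
```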

Quantization Strategies I Actually Used

INT4 Quantization for GPU Layers

I converted the most computationally expensive layers to INT4 using KTransformers’ weight conversion tools. The process:

python ktransformers/optimize/convert.py \
  --model_path deepseek-ai/DeepSeek-V3 \
  --output_path ./deepseek-v3-int4 \
  --format int4 \
  --device cuda

This reduced memory requirements by roughly 4x compared to FP16. Quality degradation was noticeable but acceptable for most tasks—I saw slightly more repetition in long outputs, but reasoning chains remained coherent.

FP8 for Mixed Precision

For layers where INT4 caused too much quality loss (attention mechanisms, final projection layers), I used FP8. The conversion command was similar:

python ktransformers/optimize/convert.py \
  --model_path deepseek-ai/DeepSeek-V3 \
  --output_path ./deepseek-v3-fp8 \
  --format fp8 \
  --device cuda

FP8 gave me 2x compression with minimal quality impact. The trade-off: more VRAM usage than INT4, but better output quality on complex prompts.

CPU Offloading Configuration

I used YAML-based optimization rules to specify which layers stayed on GPU and which moved to CPU. My working configuration for the RTX 4090:

optimization_rules:
  - match: "model.layers.0-15"  # Early layers on GPU
    device: "cuda:0"
    precision: "int4"
  
  - match: "model.layers.16-55"  # Middle layers on CPU
    device: "cpu"
    precision: "int8"
  
  - match: "model.layers.56-60"  # Final layers on GPU
    device: "cuda:0"
    precision: "fp8"

This split kept the most critical layers (early feature extraction and final output) on the GPU, while the bulk of parameters lived in system RAM. Inference speed dropped to about 3-5 tokens/second, but the model actually worked.
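KTransformers consumes these YAML rules directly, but the placement logic is easy to illustrate. A minimal sketch (the helper is mine, not part of the KTransformers API) that expands the range patterns into a per-layer device/precision map:

```python
import re

def expand_rules(rules):
    """Expand 'model.layers.A-B' match patterns into a per-layer map
    of (device, precision). Illustrative only, not KTransformers code."""
    placement = {}
    for rule in rules:
        m = re.fullmatch(r"model\.layers\.(\d+)-(\d+)", rule["match"])
        start, end = int(m.group(1)), int(m.group(2))
        for i in range(start, end + 1):  # ranges are inclusive
            placement[i] = (rule["device"], rule["precision"])
    return placement

rules = [
    {"match": "model.layers.0-15", "device": "cuda:0", "precision": "int4"},
    {"match": "model.layers.16-55", "device": "cpu", "precision": "int8"},
    {"match": "model.layers.56-60", "device": "cuda:0", "precision": "fp8"},
]
placement = expand_rules(rules)
```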

Real Performance Numbers

I ran the same prompt (a 2000-token code review task) across different quantization strategies and measured throughput:

  • Full INT4 on GPU (impossible with 24GB): N/A
  • Mixed INT4/FP8 with CPU offload: 3.2 tokens/sec
  • Full INT8 on CPU: 0.8 tokens/sec
  • FP8 only (layers that fit): 7.1 tokens/sec, but only 40% of model loaded

The mixed strategy was the only viable option. Pure CPU inference was unusable for interactive work. Trying to fit everything on the GPU meant either using extreme quantization (which broke reasoning) or running a partial model.
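To put those rates in perspective, here is what they mean in wall-clock time for a typical 1024-token response (rates copied from the runs above):

```python
RATES_TOK_PER_SEC = {  # measured throughput from the benchmark runs
    "mixed int4/fp8 + cpu offload": 3.2,
    "full int8 on cpu": 0.8,
    "fp8 partial model (40% loaded)": 7.1,
}

def generation_minutes(n_tokens: int, rate: float) -> float:
    """Wall-clock minutes to generate n_tokens at a given tokens/sec."""
    return n_tokens / rate / 60

for name, rate in RATES_TOK_PER_SEC.items():
    print(f"{name}: {generation_minutes(1024, rate):.1f} min per 1024 tokens")
```

At 0.8 tokens/sec a single long answer takes over twenty minutes, which is why pure CPU inference was a non-starter for interactive work.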

What Broke Along the Way

KV Cache Overflow

My first configuration set cache_lens too low. With max_batch_size=4 and max_new_tokens=1024, I needed at least 4096 tokens of KV cache space. I initially set it to 2048. The scheduler silently dropped requests instead of queuing them. Symptoms: requests would hang indefinitely with no error message.

Fix: I increased cache_lens to 8192 and added monitoring to track cache utilization.
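The sizing rule I should have applied from the start is just multiplication. A sketch (whether prompt tokens count against cache_lens may depend on the KTransformers version, so treat prompt_tokens as optional headroom):

```python
def min_cache_lens(max_batch_size: int, max_new_tokens: int,
                   prompt_tokens: int = 0) -> int:
    """Worst-case KV cache size in tokens: every concurrent slot holds
    its full generation (plus its prompt, if prompts share the cache)."""
    return max_batch_size * (max_new_tokens + prompt_tokens)

print(min_cache_lens(4, 1024))        # the bare minimum I violated
print(min_cache_lens(4, 1024, 1024))  # with 1k-token prompts as headroom
```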

Chunked Prefill Latency

Long prompts (over 4000 tokens) caused noticeable stalls during the prefill phase. The chunked prefill mechanism broke the prompt into 512-token chunks, but each chunk required a full forward pass through CPU-resident layers.

I reduced chunk_size from 512 to 256, which smoothed out latency spikes but increased total prefill time by about 15%. This was an acceptable trade-off for interactive use.
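The trade-off is easy to quantify: halving the chunk size doubles the number of forward passes through the CPU-resident layers, and each pass carries fixed overhead even though it does half the work. A small sketch:

```python
import math

def prefill_chunks(prompt_tokens: int, chunk_size: int) -> int:
    """Number of forward passes needed to prefill a prompt."""
    return math.ceil(prompt_tokens / chunk_size)

print(prefill_chunks(4000, 512))  # chunks at the default size
print(prefill_chunks(4000, 256))  # chunks after my change
```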

Quantization Artifacts

INT4 quantization on attention layers caused the model to occasionally loop on the same phrase. Example output:

“The function should validate input parameters. The function should validate input parameters. The function should validate…”

I moved attention layers to FP8, which eliminated the looping but increased VRAM usage by 2GB. This was the limit of what fit on my 4090.

Practical Deployment Setup

I run DeepSeek-V3 as a containerized service on one of my Proxmox hosts. The Docker container mounts the quantized model weights and exposes an OpenAI-compatible API endpoint.

Launch Command

python ktransformers/server/main.py \
  --model_path ./deepseek-v3-mixed \
  --backend_type balance_serve \
  --max_batch_size 4 \
  --cache_lens 8192 \
  --chunk_size 256 \
  --max_new_tokens 1024 \
  --port 10002

This configuration supports up to 4 concurrent requests. In practice, I rarely exceed 2 simultaneous users, so the batch size is conservative.
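For completeness, a minimal client against the endpoint. This is a sketch: the URL and port match my launch command, but the model name is an assumption, and you should check the response fields against what your KTransformers version actually returns in its OpenAI-compatible API:

```python
import json
import urllib.request

API_URL = "http://localhost:10002/v1/chat/completions"

def build_payload(prompt: str, max_tokens: int = 512) -> bytes:
    """JSON body in the OpenAI chat-completions shape."""
    return json.dumps({
        "model": "deepseek-v3",  # model name is an assumption
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()

def ask(prompt: str) -> str:
    """Send one request and return the generated text."""
    req = urllib.request.Request(
        API_URL, data=build_payload(prompt),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```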

Resource Monitoring

I added Prometheus exporters to track:

  • VRAM utilization per layer
  • System RAM usage for CPU-offloaded layers
  • Token generation throughput
  • KV cache occupancy

The most useful metric: KV cache occupancy. If it stays above 90%, I know the system is close to dropping requests.
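That 90% threshold translates directly into an alert rule. A sketch in Prometheus alerting syntax (the metric names are illustrative stand-ins for whatever your exporter emits; substitute your own):

```yaml
groups:
  - name: deepseek-v3-serving
    rules:
      - alert: KVCacheNearFull
        # Metric names are hypothetical; use your exporter's actual names.
        expr: kv_cache_tokens_used / kv_cache_tokens_total > 0.90
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "KV cache above 90%; requests may be silently dropped"
```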

Trade-Offs and Limitations

Running a 685B model on consumer hardware means accepting significant compromises:

  • Speed: 3-5 tokens/second is usable for batch processing, painful for chat.
  • Quality: INT4 quantization introduces subtle reasoning errors on complex tasks.
  • Concurrency: More than 2 simultaneous users causes noticeable slowdown.
  • Memory pressure: System RAM usage spikes to 80GB+ during heavy loads.

This setup works for my use case (code review, documentation generation, batch analysis) but would not work for customer-facing chat or real-time applications.

Key Takeaways

Quantization is not optional for running DeepSeek-V3 on consumer hardware. INT4 is aggressive but necessary. FP8 is a better middle ground where VRAM allows.

CPU offloading makes the model usable but introduces latency. The split between GPU and CPU layers matters more than the quantization format.

KV cache sizing is critical. Undersizing causes silent request drops. Oversizing wastes RAM but provides headroom for burst traffic.

Monitoring is essential. Without visibility into VRAM, RAM, and cache usage, you’re guessing about why performance degrades.

This is not a production-grade setup. It’s a way to run a massive model locally for development and experimentation. If you need consistent sub-second latency, rent cloud GPUs or use a hosted API.
