Qwen3-30B takes 60 GB of GPU memory in BF16. That needs an A100 80GB at $1.29/hr. After applying TurboQuant weight compression at load time, it takes 17 GB. That fits on a 24 GB consumer card.
| Stage | GPU memory |
|---|---|
| Original BF16 from HuggingFace | 59.7 GB |
| After TQ4-g128 weight compression | 16.8 GB |
| Peak during 200-token generation | 26.3 GB |
Compression takes 4 seconds. The model generates coherent output on spot-check prompts. On Qwen3-30B, measured perplexity degradation is modest at +3.4%, though smaller dense models degrade significantly more (see results below). No calibration data needed, no separate quantization step. Any tested BF16 checkpoint compatible with the current integration path can be compressed at load time.
This extends the KV cache compression work using the same turboquant-vllm library. Same math, different target: model weights instead of the KV cache.
Why this exists
When a new model drops on HuggingFace, the BF16 checkpoint is usually the first thing available. Pre-quantized versions (AWQ, GPTQ) typically appear days to weeks after the initial BF16 release. During that window, running the model requires enough GPU memory for the full BF16 weights.
AWQ and GPTQ produce excellent results but are offline post-training methods. GPTQ depends on calibration data, and AWQ uses offline activation statistics, though AWQ is notably more data-efficient. Either way, someone has to run that process and upload the result. TurboQuant weight compression skips both: load BF16, compress at startup, serve.
```python
from turboquant_vllm import enable_weight_quantization

enable_weight_quantization(bits=4, group_size=128)
# Now load any BF16 model. Weights are compressed at load time.
```
How it works
The underlying quantization framework is TurboQuant (Zandieh et al., 2025), a data-oblivious vector quantization method with published results on KV cache compression and vector search. Applying it to model weights is our extension. After a random rotation (implemented here as a Walsh-Hadamard Transform for efficiency), coordinates follow a known distribution on the sphere that is well-approximated by a normal distribution in high dimensions, enabling precomputed scalar codebooks (Lloyd-Max) instead of learned quantization grids.
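The rotate-then-quantize step can be sketched in a few lines of NumPy. This is an illustrative toy, not the library's implementation: it builds the orthonormal Hadamard matrix explicitly (a real implementation would use a fast in-place WHT), and it uses Gaussian quantile midpoints as a stand-in for the precomputed Lloyd-Max codebook.

```python
import numpy as np

d = 128
rng = np.random.default_rng(0)
w = rng.standard_normal(d).astype(np.float32)  # one weight group

# Walsh-Hadamard matrix via Sylvester's construction, scaled orthonormal.
H = np.array([[1.0]], dtype=np.float32)
while H.shape[0] < d:
    H = np.block([[H, H], [H, -H]])
H /= np.sqrt(d)

norm = np.linalg.norm(w)        # stored alongside the codes
rotated = H @ (w / norm)        # coordinates approximately N(0, 1/d)

# Stand-in 4-bit codebook: 16 Gaussian quantile midpoints matched to the
# coordinate distribution (the real method precomputes a Lloyd-Max codebook).
samples = rng.standard_normal(100_000) / np.sqrt(d)
levels = np.quantile(samples, (np.arange(16) + 0.5) / 16).astype(np.float32)

# Quantize: nearest codebook entry per coordinate -> 4-bit indices.
codes = np.abs(rotated[:, None] - levels[None, :]).argmin(axis=1)

# Decompress: look up, inverse-rotate, rescale by the stored norm.
recon = norm * (H.T @ levels[codes])
rel_err = np.linalg.norm(recon - w) / norm
```

Because the rotation is data-oblivious and the codebook is fixed, nothing in this pipeline depends on calibration data.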
The key difference from KV cache compression: weight rows must be split into groups. Each group of 128 elements gets its own L2 norm and rotation. Without grouping, quantization error accumulates across the 48 transformer layers and the model produces nonsense. With group_size=128, the error stays local.
This is the same principle behind GPTQ's group_size parameter and AWQ's per-group scales. The difference: TurboQuant's codebook is precomputed from the spherical distribution, not learned from calibration data. No data dependency means instant compression at load time.
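The grouping itself can be sketched as follows, with the rotation omitted for brevity. The function names and the uniform stand-in codebook are illustrative, not the library's API; the point is that each run of 128 elements carries its own norm, so error stays local to its group.

```python
import numpy as np

def quantize_row(row, group_size=128, bits=4):
    # Uniform stand-in codebook over ~[-3, 3] standard deviations (the real
    # method uses a Lloyd-Max codebook after rotation).
    levels = np.linspace(-3.0, 3.0, 2 ** bits, dtype=np.float32)
    groups = row.reshape(-1, group_size)
    norms = np.linalg.norm(groups, axis=1, keepdims=True)   # one norm per group
    scaled = groups / norms * np.sqrt(group_size)           # coords ~ unit variance
    codes = np.abs(scaled[..., None] - levels).argmin(axis=-1).astype(np.uint8)
    return codes, norms

def dequantize_row(codes, norms, bits=4):
    levels = np.linspace(-3.0, 3.0, 2 ** bits, dtype=np.float32)
    group_size = codes.shape[1]
    return (levels[codes] * norms / np.sqrt(group_size)).reshape(-1)

row = np.random.default_rng(1).standard_normal(4096).astype(np.float32)
codes, norms = quantize_row(row)
recon = dequantize_row(codes, norms)
rel_err = np.linalg.norm(recon - row) / np.linalg.norm(row)
```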
The MoE expert problem
Qwen3-30B is a Mixture of Experts model with 128 experts per layer. Standard linear layers account for only 3% of the model's weight memory. The other 97% is in expert weight tensors stored as 3D arrays: (128, 1536, 2048) for each layer's gate+up projection.
Compressing only the standard linear layers saves 2% of memory. To reach meaningful savings, the expert weights must be compressed too.
Expert weights in HuggingFace models are stored differently from standard layers. They are nn.Parameter tensors on custom MoE modules, not nn.Linear layers. The compression code detects them by their 3D shape, quantizes each expert's rows using the same group quantization, stores the packed result, and registers forward hooks to decompress one layer at a time during inference.
The memory flow during inference: all expert weights stay compressed. When a layer runs, its experts are decompressed into a temporary buffer, the forward pass executes, then the buffer is freed. Peak memory is the compressed model plus one layer's decompressed experts.
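The decompress-compute-free flow can be sketched like this. Class and attribute names are hypothetical, and the expert computation is a stand-in; the real integration registers forward hooks on the MoE modules rather than wrapping them.

```python
import numpy as np

class CompressedExpertLayer:
    def __init__(self, codes, norms, codebook, shape):
        self.codes = codes        # 4-bit indices (stored here as uint8)
        self.norms = norms        # per-group L2 norms
        self.codebook = codebook  # precomputed scalar codebook
        self.shape = shape        # e.g. (num_experts, out_dim, in_dim)

    def forward(self, x):
        # Decompress into a temporary buffer, use it, then drop it, so peak
        # memory is the compressed model plus one layer's dense experts.
        dense = (self.codebook[self.codes] * self.norms).reshape(self.shape)
        try:
            # Stand-in for the routed expert computation: apply every expert.
            return np.stack([x @ dense[e].T for e in range(self.shape[0])])
        finally:
            del dense  # buffer freed before the next layer runs

# Example with a tiny layer: 4 experts of shape (8, 16), flat 4-bit codes.
codebook = np.linspace(-1.0, 1.0, 16, dtype=np.float32)
codes = np.random.default_rng(0).integers(0, 16, size=4 * 8 * 16).astype(np.uint8)
norms = np.ones(4 * 8 * 16, dtype=np.float32)
layer = CompressedExpertLayer(codes, norms, codebook, (4, 8, 16))
out = layer.forward(np.ones((2, 16), dtype=np.float32))   # shape (4, 2, 8)
```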
Results
Tested on H100 80GB on Verda GPU cloud in Helsinki.
Memory
| Model | BF16 | TQ4-g128 | Reduction |
|---|---|---|---|
| Qwen3-30B (MoE, 128 experts) | 59.7 GB | 16.8 GB | 3.6x |
| Qwen3-0.6B (dense) | 1.2 GB | 0.6 GB | 2.0x |
Quality
| Model | Config | Perplexity | Delta |
|---|---|---|---|
| Qwen3-30B | BF16 (baseline) | 4.19 | |
| Qwen3-30B | TQ4-g128 | 4.33 | +3.4% |
| Qwen3-0.6B | BF16 (baseline) | 11.43 | |
| Qwen3-0.6B | TQ4-g128 | 17.93 | +56.8% |
Larger models tolerate weight compression much better. +3.4% perplexity on 30B versus +57% on 0.6B. This matches findings from @coffeecup2020's TQ3_1S implementation for llama.cpp, which showed near-Q4_0 quality on 27B models.
Spot-check outputs (Qwen3-30B TQ4-g128)
| Prompt | Baseline | Compressed |
|---|---|---|
| Capital of Finland | Helsinki | Helsinki |
| Explain attention | Accurate explanation | Accurate, different wording |
| Python prime check | Correct function | Correct function |
| Good morning in Finnish | Correct | "Hyvää huomenta" |
| Steel vs feathers | Correct reasoning | Correct reasoning |
Current limitations
Speed. Inference runs at 0.3 tok/s versus 12.5 tok/s uncompressed on Qwen3-30B. Each forward pass decompresses expert weight tensors, runs the computation, then frees the temporary buffer. This is pure PyTorch overhead. Production speed requires fused dequant-GEMM kernels (like Marlin for AWQ/GPTQ).
The current speed makes weight compression useful for memory-constrained batch processing and evaluation where you would otherwise not be able to run the model at all. Real-time serving needs the fused kernels.
3-bit quality. TQ3 (3-bit) works on 30B models but degrades on smaller ones. TQ4 (4-bit) is the safe default. Optimal bit allocation across layer types is future work.
How it fits with KV cache compression
This is the second feature in turboquant-vllm. The first, KV cache compression, reduces the per-conversation memory cost during inference. Weight compression reduces the fixed model memory cost. They are complementary:
```python
from turboquant_vllm import enable_weight_quantization, patch_vllm_attention

enable_weight_quantization(bits=4, group_size=128)  # 59.7 GB → 16.8 GB model
patch_vllm_attention(k_bits=4, v_bits=3)            # 3.7x smaller KV cache
```
The same rotation and codebook math, the same compression primitives. The high-performance fused attention kernels from the KV cache work are separate; weight compression currently lacks equivalent fused dequant-GEMM kernels, which is why it is slower. Weight compression reduces what you need to fit the model. KV cache compression increases how many users you can serve on whatever hardware you have.
The path from garbage to 3.6x
Four iterations, $1.73 in GPU time.
| Attempt | Approach | Result |
|---|---|---|
| 1 | One norm per weight row (same as KV cache) | Garbage: "resource resource resource..." |
| 2 | Group quantization (group_size=128) | 4-bit works on 0.6B. 3-bit fails. |
| 3 | Test on 30B model | Both 4-bit and 3-bit coherent |
| 4 | Compress MoE expert layers | 59.7 GB to 16.8 GB. Quality preserved. |
The key insight: KV cache vectors are independent across layers, so per-vector norms work fine. Weight matrices form a chain where each layer's output feeds the next. Quantization error accumulates multiplicatively. Group quantization keeps the error local.
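A toy NumPy experiment makes the accumulation visible. Here noise_scale stands in for per-element quantization error, and the matrix chain stands in for a 48-layer transformer; the numbers are illustrative only.

```python
import numpy as np

def chain_relative_error(noise_scale, layers=48, d=64, seed=0):
    # Push a vector through a chain of random matrices, once with exact
    # weights and once with per-element relative noise (the stand-in for
    # quantization error), and compare the final outputs.
    rng = np.random.default_rng(seed)
    x = x_q = rng.standard_normal(d)
    for _ in range(layers):
        W = rng.standard_normal((d, d)) / np.sqrt(d)   # roughly norm-preserving
        W_q = W * (1.0 + noise_scale * rng.standard_normal((d, d)))
        x, x_q = W @ x, W_q @ x_q
    return np.linalg.norm(x_q - x) / np.linalg.norm(x)

coarse = chain_relative_error(0.10)  # large per-layer error compounds badly
fine = chain_relative_error(0.01)    # small, local error stays small
```

The same per-matrix error that is harmless in one isolated layer compounds across the chain, which is why grouping (smaller, local error) matters for weights but not for independent KV cache vectors.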
What is next
- Fused dequant-GEMM kernels. Eliminate the decompression overhead for production-speed inference. This is the main gap between the current implementation and practical serving.
- Combined weight + KV cache benchmark. Run the full 20-scenario eval harness with both compressions enabled.
- Dense model validation. Test on Llama and Mistral where all layers are compressible.
- Upstream contribution. Register as a vLLM quantization method (--quantization turboquant) alongside GPTQ and AWQ.
Update
A fused CUDA dequant kernel is now shipped in turboquant-vllm. Weight decompression runs 6.3x faster than the PyTorch path (0.36ms vs 2.28ms per 4096x4096 layer on H100). The kernel performs index unpacking, codebook lookup, and inverse WHT rotation in a single GPU launch using shared memory. Still slower than AWQ/GPTQ's Marlin kernels (which fuse decompression into the matrix multiply), but practical for batch processing and models without pre-quantized checkpoints. Code: csrc/tq_weight_dequant.cu
Infrastructure
- GPU cloud: Verda. Helsinki, Finland. H100 80GB at $2.29/hr.
- Total GPU cost for all weight compression testing: ~$4