Fitting Gemma 4 (~52 GB) into 12 GB

3-bit compression for a 26B MoE model. 52 GB checkpoint to 12 GB on disk, 13.5 GB in GPU memory. No calibration data. 4.79/5 on our production benchmark.

The Gemma 4 26B checkpoint is 52 GB. Our native TQ3 checkpoint is 12 GB, tested on L40S 48GB and H100 80GB. The 13.5 GB weight footprint suggests 24 GB-class GPUs should be feasible, subject to runtime overhead, KV cache, and kernel support. No calibration data and no hours-long offline quantization pass. The compressed model scores 4.79/5 on our production benchmark.

| | BF16 original | TQ3 runtime (A100) | Native TQ3 checkpoint |
|---|---|---|---|
| Checkpoint on disk | ~52 GB | ~52 GB (same) | ~12 GB |
| GPU memory for weights | ~48-52 GB | 12 GB | 13.5 GB |
| Tested GPU | A100 80GB | A100 80GB | L40S 48GB |
| Quality (our benchmark) | baseline | 4.79/5 | 4.79/5 * |
| Calibration data | -- | none | none |
| GPU cost (Verda Helsinki) | $1.29/hr | $1.29/hr | $0.91/hr (L40S) |

* Benchmark run on runtime TQ3 path (A100 + vLLM). Native checkpoint uses identical packed weights and decompression math.

# Option 1: Runtime compression (A100 80GB, fast)
from turboquant_vllm import enable_weight_quantization
enable_weight_quantization(bits=3)  # compress at load, serve with vLLM

# Option 2: Native TQ3 checkpoint (tested on L40S 48GB and H100 80GB)
from turboquant_vllm import load_tq3_model
model, tokenizer = load_tq3_model("varjosoft/gemma-4-26B-A4B-it-TQ3-native")

Quality

I ran our production benchmark: 20 multi-turn conversation scenarios covering product inquiries, technical support, pricing negotiation, multilingual conversations (English, Finnish, Swedish), adversarial safety probes, reasoning, code explanation, creative writing, and more. Each scenario runs 5 conversation turns, scored 1-5 by Llama-3.3-70B in an LLM-as-a-judge setup on A100 80GB with vLLM 0.19.0.

| Scenario | Score |
|---|---|
| Product inquiry (EN/FI/SV) | 4.50-4.75 |
| Technical support | 4.50 |
| Pricing negotiation | 3.75 |
| Language switching FI to EN | 5.00 |
| Adversarial injection | 4.67 |
| Hallucination probe | 5.00 |
| Complex multipart | 4.75 |
| Finnish reasoning | 5.00 |
| Creative writing | 4.75 |
| Personality and memory | 5.00 |
| Knowledge depth | 5.00 |
| Humor and wit | 5.00 |
| Code explanation | 5.00 |
| Emotional support | 5.00 |
| Memory continuity | 5.00 |
| Overall (19/20 scenarios) | 4.79/5 |

Nine scenarios scored a perfect 5.00. One scenario (debate-reasoning) failed due to context length. For comparison, our previous best on the same benchmark was Qwen3-235B AWQ on an H200, which scored 4.75/5 at $3.39/hr. The Gemma 4 result is comparable at 2.6x lower GPU cost.

The native TQ3 checkpoint stores the exact same packed indices and norms as the runtime path. The decompression math is identical: same codebook, same rotation, same norm correction. I ran the full 20-scenario benchmark on A100 with vLLM 0.19.0 and TQ3 runtime compression from the HuggingFace checkpoint: 4.79/5 (19/20 scenarios, one failed due to context length). I also verified the native checkpoint end-to-end on H100 80GB: downloaded from HuggingFace, loaded with load_tq3_model(), correct output on factual, arithmetic, science, and Finnish prompts.

I also validated on Qwen3-30B: ~61 GB checkpoint compressed to ~13 GB runtime VRAM (4.6x), correct answers on factual, math, and reasoning prompts in our tests.

Speed

On H100 80GB via vLLM 0.19.0, concurrent throughput benchmark (Gemma 4 26B, 128 max tokens, 32 requests per level):

| Concurrency | BF16 tok/s | TQ3 Weight tok/s | TQ3 W+KV tok/s | vs BF16 |
|---|---|---|---|---|
| 1 | 27.4 | 28.3 | 28.2 | +3% |
| 2 | 52.1 | 54.2 | 53.9 | +4% |
| 4 | 103.3 | 103.4 | 102.9 | 0% |
| 8 | 183.5 | 190.4 | 185.5 | +4% |
| 16 | 274.5 | 357.2 | 364.2 | +33% |

TQ3 is faster at every concurrency level. The gap widens under load: +33% at 16 concurrent with weight + KV compression combined. Adding KV cache compression (K4/V3) on top of weight compression gives a small additional boost at high concurrency. This makes sense: smaller weights and compressed KV cache both reduce HBM bandwidth pressure, leaving more room for batching.

This is consistent with findings in the quantization literature, for example SqueezeLLM (ICML 2024): decode is memory-bandwidth-bound, and smaller weights reduce the bottleneck.

KV cache

The same algorithm compresses the KV cache during inference. On our validation prompts, K4/V3 asymmetric KV compression produced token-identical outputs to FP16 at temperature 0 on Gemma 4 26B.

Weight compression reduces the GPU memory needed for model weights. KV cache compression increases how many conversations fit in the remaining memory. Both use the same underlying math and work together:

from turboquant_vllm import enable_weight_quantization, patch_vllm_attention

enable_weight_quantization(bits=3)                                     # weights: 4.3x smaller
patch_vllm_attention(k_bits=4, v_bits=3, norm_correction=True)         # KV cache: 3.7x

TurboQuant KV cache compression is under active upstream development in vLLM, with open feature requests and active PRs, but it is not yet a merged, documented --kv-cache-dtype option.


How it works

When vLLM loads a model with TQ3 enabled, the process is:

  1. Download the standard BF16 checkpoint from HuggingFace (cached after first use)
  2. Compress each weight matrix on the fly: split into groups of 128, normalize, rotate with a fast Walsh-Hadamard Transform, quantize to 8 optimal centroids, pack 8 values into 3 bytes
  3. Serve as usual. Each forward pass reverses the compression per layer: unpack, codebook lookup, inverse rotation, scale, matrix multiply
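The per-group math in step 2, plus the reversal in step 3, can be sketched in NumPy. The codebook here is the classic 8-level Lloyd-Max quantizer for a standard Gaussian and the layout is illustrative, not the library's exact kernels:

```python
import numpy as np

# 3-bit Lloyd-Max codebook for a standard Gaussian (Max, 1960); the library's
# exact centroids may differ.
CODEBOOK = np.array([-2.152, -1.344, -0.756, -0.245, 0.245, 0.756, 1.344, 2.152])

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    x = np.asarray(x, dtype=np.float64).copy()
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)  # orthonormal, hence self-inverse

def compress_group(g):
    """One group of 128 weights: normalize, rotate, quantize to 8 centroids."""
    n = len(g)
    norm = np.linalg.norm(g)
    rotated = fwht(g / norm) * np.sqrt(n)        # coordinates are roughly N(0, 1)
    idx = np.abs(rotated[:, None] - CODEBOOK).argmin(axis=1).astype(np.uint8)
    recon = fwht(CODEBOOK[idx] / np.sqrt(n))     # reconstruction of the unit vector
    ratio = 1.0 / np.linalg.norm(recon)          # norm-correction scalar per group
    return idx, norm, ratio

def decompress_group(idx, norm, ratio):
    """Reverse: codebook lookup, inverse rotation, rescale."""
    n = len(idx)
    return fwht(CODEBOOK[idx] / np.sqrt(n)) * norm * ratio

g = np.random.default_rng(0).standard_normal(128)
idx, norm, ratio = compress_group(g)
g_hat = decompress_group(idx, norm, ratio)
err = np.linalg.norm(g - g_hat) / np.linalg.norm(g)  # relative reconstruction error
```

Because the ratio is stored per group, the reconstructed vector's norm matches the original exactly; only the direction carries quantization error.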

The method is inspired by TurboQuant (Zandieh, Daliri, Hadian, Mirrokni; ICLR 2026), which describes online vector quantization using random rotations to make coordinates easier to quantize. Our implementation adapts that idea to runtime weight and KV-cache compression, using a Gaussian Lloyd-Max codebook as an approximation of the paper's distortion-rate framework. The key practical property: no calibration data needed because the rotation makes weight distributions approximately uniform in structure.
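For intuition on where a Gaussian Lloyd-Max codebook comes from: running plain Lloyd's algorithm on Gaussian samples recovers the classic 8-level optimum, roughly ±0.245, ±0.756, ±1.344, ±2.152. A sketch (the library may simply hardcode such centroids):

```python
import numpy as np

# Derive a 3-bit (8-level) codebook for N(0, 1) with Lloyd's algorithm:
# alternate nearest-centroid assignment and cluster-mean updates.
rng = np.random.default_rng(0)
samples = rng.standard_normal(200_000)
centroids = np.linspace(-2.5, 2.5, 8)                 # initial guess
for _ in range(100):
    assign = np.abs(samples[:, None] - centroids).argmin(axis=1)
    centroids = np.array([samples[assign == k].mean() for k in range(8)])
centroids = np.sort(centroids)
```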

I add norm correction on top: at 3-bit precision, quantization systematically shrinks vector norms by 5-10%. Storing one scalar ratio per group (original_norm / reconstruction_norm, 0.5% overhead) fixes this in our tests.

Decompression uses a three-tier dispatch. The fastest path is an experimental Triton kernel that fuses decompression with the matrix multiply (10.5x faster than separate dequant + cuBLAS in our benchmarks, but not yet supporting 3-bit packing). The current production path for TQ3 is a CUDA dequant kernel followed by cuBLAS GEMM. If neither is available, pure PyTorch serves as fallback.
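The three-tier dispatch amounts to a capability check at load time. A sketch with hypothetical flags, not the library's actual API:

```python
def select_dequant_backend(bits: int, has_triton_fused: bool, has_cuda_kernel: bool) -> str:
    """Pick the fastest available decompression path for a given bit width."""
    if has_triton_fused and bits != 3:
        # Fused Triton decompress+GEMM: fastest path, but no 3-bit packing yet.
        return "triton-fused"
    if has_cuda_kernel:
        # Production TQ3 path: CUDA dequant kernel followed by cuBLAS GEMM.
        return "cuda-dequant+cublas"
    # Always-available pure-PyTorch fallback.
    return "pytorch"
```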

The native checkpoint format

Runtime compression needs the full 52 GB BF16 checkpoint in GPU memory during loading. This works on A100 80GB but not on smaller GPUs. The native TQ3 checkpoint solves this.

Instead of loading BF16 and compressing at runtime, I pre-compress each weight tensor on CPU and save the packed 3-bit indices and per-group norms as a standard safetensors file. The result: 12 GB on HuggingFace, three shards. Tested on L40S 48GB and H100 80GB.

The loader works in five steps:

  1. Create the model skeleton on a meta device (zero memory)
  2. Scan the checkpoint to identify packed weights vs regular tensors
  3. Load packed weights directly to GPU, creating compressed wrapper modules
  4. Load remaining tensors (embeddings, norms, biases) to GPU as FP16
  5. Re-initialize computed buffers (rotary embeddings, embedding scale) from the model config
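The steps above can be illustrated on a toy module with PyTorch's meta device; the module and tensor names are stand-ins, not the loader's actual API:

```python
import torch
import torch.nn as nn

# Step 1: build the skeleton on the meta device -- shapes exist, no memory allocated.
with torch.device("meta"):
    skeleton = nn.Linear(8, 8, bias=False)   # stands in for the full model skeleton

# Steps 3-4: allocate real storage, then fill it from the checkpoint shard.
# to_empty() allocates uninitialized memory instead of trying to copy meta tensors.
model = skeleton.to_empty(device="cpu")
state = {"weight": torch.ones(8, 8)}         # stands in for a safetensors shard
model.weight.data.copy_(state["weight"])

# Step 5: buffers computed in __init__ (rotary frequencies, embedding scale) come
# out as garbage after meta loading; they must be recomputed from the config, e.g.
# by re-running the rotary embedding module's own initialization.
out = model(torch.ones(8))
```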

The tricky part was step 5. Gemma 4 uses per-layer-type rotary embeddings with different head dimensions and frequencies (sliding attention: head_dim=256, theta=10,000; global attention: head_dim=512, theta=1,000,000). These are computed during model initialization, not stored in the checkpoint. After meta-device loading, they are zeros. The fix: re-call the rotary embedding module's own initialization, which correctly handles all the per-layer-type parameters.

Creating a native checkpoint from any HuggingFace model:

from turboquant_vllm.checkpoint import save_tq3_checkpoint

save_tq3_checkpoint("google/gemma-4-26B-A4B-it", "./gemma4-tq3-native")
# CPU only, ~60 GB RAM, ~2 minutes. No GPU needed.

The 3-bit packing story

This is worth telling because the fix was embarrassingly simple and the lesson is universal.

Our initial 3-bit implementation stored each 3-bit index in a full byte. Eight bits for a value that needs three. That is 62% wasted storage.

With this packing, TQ3 compressed to 1.9x. TQ4 (4-bit, nibble packed at 2 values per byte) compressed to 3.6x. Three-bit was literally worse than four-bit. The quantization itself was fine. The codebook was fine. The rotation was fine. We just were not packing the results tightly enough.

I spent time investigating learned rotations (SpinQuant, ICLR 2025), mixed-precision expert assignment, Hessian-weighted codebooks (SqueezeLLM, ICML 2024), and REAP expert pruning (Cerebras, ICLR 2026). Some of these showed promise and remain in the codebase for future work. But none of them mattered while the packing was the bottleneck.

The fix: pack 8 three-bit values into 3 bytes (24 bits). One change to pack_indices and the matching CUDA unpack kernel.

Same quantization, same codebook, same rotation. Just denser storage.
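The packing fix can be sketched in NumPy. These functions mirror the pack_indices / unpack pair described above but are illustrative; the real CUDA kernel's bit layout may differ:

```python
import numpy as np

def pack_indices(idx: np.ndarray) -> np.ndarray:
    """idx: values 0..7, length a multiple of 8 -> 3 bytes per 8 indices."""
    rows = idx.reshape(-1, 8).astype(np.uint32)
    word = np.zeros(len(rows), dtype=np.uint32)
    for i in range(8):                       # accumulate 8 x 3 = 24 bits, little-endian
        word |= rows[:, i] << (3 * i)
    out = np.empty((len(rows), 3), dtype=np.uint8)
    out[:, 0] = word & 0xFF
    out[:, 1] = (word >> 8) & 0xFF
    out[:, 2] = (word >> 16) & 0xFF
    return out.reshape(-1)

def unpack_indices(packed: np.ndarray) -> np.ndarray:
    """Inverse: 3 bytes -> 8 three-bit indices."""
    rows = packed.reshape(-1, 3).astype(np.uint32)
    word = rows[:, 0] | (rows[:, 1] << 8) | (rows[:, 2] << 16)
    out = np.empty((len(rows), 8), dtype=np.uint8)
    for i in range(8):
        out[:, i] = (word >> (3 * i)) & 0x7
    return out.reshape(-1)

idx = np.random.default_rng(0).integers(0, 8, 1024).astype(np.uint8)
packed = pack_indices(idx)   # 1024 one-byte indices become 384 packed bytes
```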


What did not work

I tried several approaches that looked promising on paper but failed in practice on 30B MoE models in our tests.

Expert pruning at 50%. REAP (Cerebras, ICLR 2026) reports near-lossless quality at 50% expert removal on models up to 1 trillion parameters. On Qwen3-30B in our tests, 50% pruning destroyed output quality completely. At 20%, quality was preserved, but the memory savings from pruning were marginal after TQ3 packing already handles the storage. REAP's calibration-based saliency scoring works correctly (I implemented the full pipeline including router fine-tuning), but the 30B model appears more sensitive to expert removal than the larger models in the original paper.

Two-bit quantization. TQ2 uses 4 centroids. On MoE expert weights (dimensions like 768x2048), 4 reconstruction levels simply cannot capture the weight distribution with enough fidelity in our tests. Output degrades to garbage. This is consistent with the broader finding that sub-3-bit scalar quantization needs additional techniques like residual codebooks (AQLM) or error compensation (GPTQ) to be viable.

Mixed-precision per expert. I computed Hessian-weighted sensitivity per expert and assigned 2-bit to weak experts, 3-bit to medium, 4-bit to strong. The result was worse compression (2.9x vs 3.6x uniform TQ4) because FP32 per-group norms dominate storage at low bit widths. On small expert layers, the norm overhead grows faster than the index savings shrink.
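The norm-overhead effect is visible in back-of-envelope arithmetic, assuming one FP32 norm per 128-value group (an approximation; the real layout, with padding on small expert matrices, is worse). The norms' share of storage grows as the index width shrinks:

```python
# Amortized storage per weight: index bits plus one FP32 norm shared by the group.
GROUP = 128
NORM_BITS = 32

def bits_per_weight(index_bits: int) -> float:
    return index_bits + NORM_BITS / GROUP

def norm_fraction(index_bits: int) -> float:
    """Share of total storage spent on per-group norms."""
    return (NORM_BITS / GROUP) / bits_per_weight(index_bits)

for b in (2, 3, 4):
    print(f"{b}-bit: {bits_per_weight(b):.2f} bits/weight, "
          f"norms = {norm_fraction(b):.1%} of storage")
```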

V100 16GB. Runtime weight memory is about 12 GB. It loads on a V100. But vLLM needs roughly 3 GB for CUDA context and buffers, leaving only 1 GB for KV cache. Not enough for any practical conversation. Minimum practical GPU memory is around 24 GB in our experience.


Comparison

| | AWQ | GPTQ | GGUF Q4 | TQ+ (ours) |
|---|---|---|---|---|
| Weight compression | 3-4x | 3-4x | 3-4x | 4.3-4.6x |
| KV compression | -- | -- | -- | 3.7x |
| Calibration | Offline, calibration-based | Offline, can take hours on large models | Direct conversion (imatrix optional) | 9 seconds at load, no calibration |
| Serving stack | Marlin | Marlin | llama.cpp | vLLM native (364 tok/s at 16 concurrent, +33% over BF16) |

AWQ and GPTQ use Marlin's fused CUDA GEMM kernels, which are highly optimized. Our current production path for TQ3 uses a CUDA dequant kernel followed by cuBLAS GEMM. In our A100 tests the bandwidth savings from smaller weights outweigh the decompression cost, but I have not benchmarked AWQ inference on Gemma 4 directly, so I cannot claim our path is faster than Marlin on the same model.

For teams that need Marlin's CUDA kernels specifically, I have a working export pipeline: TQ compress, then AutoAWQ pack, then serve with Marlin. Tested on smaller models. Gemma 4 export is currently blocked by AutoAWQ and llm-compressor not yet supporting the gemma4 model architecture.


Why I am doing this

I run Varjosoft, a software company in Helsinki. Our products use LLMs for customer-facing conversations: product inquiries, technical support, multilingual chat in English, Finnish, and Swedish. Our current production model is Qwen3-235B-A22B-Instruct served via the Nebius API. Self-hosted models on Verda GPU cloud in Helsinki have been used for testing and benchmarking so far, not production traffic.

The motivation for this work was mapping a path from API-only to self-hosted production. The economics are straightforward: API pricing scales linearly with usage, while a self-hosted GPU has fixed cost regardless of volume. The question is at what point self-hosting becomes cheaper, and whether compressed models can close the quality gap to make that viable.

I test different models regularly: Gemma 4, Qwen3, GLM-4.7. Spending hours per model on offline calibration-based quantization slows that iteration down considerably.

| Setup | Model | GPU | Score | Cost/hr |
|---|---|---|---|---|
| API (current production) | Qwen3-235B via Nebius | -- | -- | Per-token |
| Benchmark: self-hosted | Qwen3-235B AWQ | H200 141GB | 4.75/5 | $3.39 |
| Runtime compressed | Gemma 4 TQ3 | A100 80GB | 4.79/5 | $1.29 |
| Native TQ3 checkpoint | Gemma 4 TQ3 | L40S 48GB | 4.79/5 | $0.91 |

The runtime compressed model on A100 matches Qwen3-235B AWQ quality at 2.6x lower GPU cost. The native TQ3 checkpoint goes further: it loads on an L40S 48GB at $0.91/hr, a 3.7x cost reduction from the H200 setup.

The runtime compressed model uses about 12-13.5 GB of GPU memory for weights. On an A100 80GB, that leaves roughly 68 GB for KV cache. At Gemma 4's KV footprint (~0.5 MB per conversation turn at 4096 context), that is room for roughly 100-200 concurrent conversations without KV compression. With TQ+ KV compression (~3.7x on our validation prompts), the same memory holds an estimated 400-700 concurrent conversations.

On L40S 48GB, the model uses 13.5 GB for weights, leaving about 31 GB for KV cache and activations. That is room for roughly 50-100 concurrent conversations, enough for a production deployment at moderate traffic.

The plan going forward: deploy the compressed Gemma 4 on Verda as our self-hosted production model, with the Nebius API as fallback during cold starts. As load grows and I confirm quality under real traffic, shift the balance from API to self-hosted.

The compression library, benchmark harness, and all test results are open source. The entire research ran on Verda instances in Helsinki for a total GPU cost of about $22.


Try it

pip install turboquant-plus-vllm@git+https://github.com/varjoranta/turboquant-vllm.git

Note: Gemma 4 requires transformers >= 5.5.0. For vLLM, use a build that ships that transformers version, or the official vllm/vllm-openai:gemma4 Docker image.

Runtime compression (A100 80GB, with vLLM)

from turboquant_vllm import enable_weight_quantization, patch_vllm_attention

enable_weight_quantization(bits=3)   # ~52 GB checkpoint -> ~12 GB in VRAM, 9 seconds
patch_vllm_attention(k_bits=4, v_bits=3, norm_correction=True, sink_tokens=4)  # KV: ~3.7x
# then: vllm serve google/gemma-4-26B-A4B-it

Native TQ3 checkpoint (tested on L40S 48GB and H100 80GB)

from turboquant_vllm import load_tq3_model

# Downloads 12 GB from HuggingFace, loads in ~6 seconds
model, tokenizer = load_tq3_model("varjosoft/gemma-4-26B-A4B-it-TQ3-native")

Create your own checkpoint

from turboquant_vllm.checkpoint import save_tq3_checkpoint

save_tq3_checkpoint("google/gemma-4-26B-A4B-it", "./gemma4-tq3")
# CPU only, ~60 GB RAM, ~2 minutes

Validated models (in our tests)

| Model | Compression | Quality |
|---|---|---|
| Gemma 4 26B-A4B-it (runtime, A100) | 52 GB to 12 GB VRAM (4.3x) | 4.79/5 on 20-scenario benchmark |
| Gemma 4 26B-A4B-it (native, H100) | 12 GB checkpoint, 13.5 GB GPU | 4.79/5 (same weights); e2e verified on H100 |
| Qwen3-30B-A3B | 61 GB to 13 GB VRAM (4.6x) | Correct on factual, math, reasoning |
| Gemma 4 26B (KV cache) | 3.7x KV compression | Token-identical to FP16 at temperature 0 |
| GLM-4.7-Flash (KV cache) | 3.7x KV compression | Works where vLLM FP8 is broken |

How this research was done

This work was carried out by me with significant help from Spegling, an AI research assistant I am building at Varjosoft. When the article says "we", it means me and Spegling working together.

Spegling is a personal knowledge system that integrates with coding agents via MCP. It maintains a persistent wiki compiled from research papers and production systems, keeps notes and memories across sessions, and governs autonomous research: reading papers, searching the web, analyzing codebases, synthesizing findings, all with documented provenance.

For this project, Spegling researched compression techniques across dozens of papers (TurboQuant, SpinQuant, REAP, SqueezeLLM, AQLM, MR-GPTQ, and others), identified which approaches were most likely to work for MoE weight compression without calibration data, implemented the code, ran GPU tests on Verda instances, analyzed failures, and iterated. I supervised the direction, made the key decisions (which models to test, when to pivot from expert pruning to packing efficiency, what quality bar to require), and reviewed the output.

The 3-bit packing fix, for example, came after Spegling systematically tried learned rotations, mixed-precision assignment, and Hessian-weighted codebooks. None of those moved the needle. The wiki knowledge base flagged that storage efficiency was the theoretical bottleneck at 3-bit, and looking at the actual byte layout revealed the bug. The entire research sprint, from first commit to final benchmark, ran about $22 in GPU costs on Verda.

Links: turboquant-vllm on GitHub. Native TQ3 checkpoint on HuggingFace. KV cache compression. Weight compression (4-bit). Self-hosting costs. TurboQuant paper (ICLR 2026).