Why this exists
Modern LLM inference has a memory problem. A 70B parameter model in FP16 needs 140 GB of weights. Even at FP8 it needs 70 GB. Consumer GPUs top out at 24-32 GB. The latest server GPUs reach roughly 140-180 GB+ depending on SKU, but most production fleets still run on 80-96 GB cards. The math does not work for the biggest models unless you compress.
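The capacity arithmetic, written out (weights only, ignoring KV cache, activations, and quantization metadata):

PARAMS = 70e9  # a 70B-parameter model

for name, bits in [("FP16", 16), ("FP8", 8), ("INT4 / NVFP4", 4), ("3-bit", 3)]:
    print(f"{name:>13}: {PARAMS * bits / 8 / 1e9:6.1f} GB of weights")

# FP16: 140.0 GB, FP8: 70.0 GB, 4-bit: 35.0 GB, 3-bit: 26.2 GB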
That compression is what "quantization" means in practice: take a 16-bit float weight, represent it with fewer bits, accept some loss of precision, and try to keep model behavior close enough to the original. The whole stack below the model layer (kernels, formats, hardware support, distribution channels) exists to make this trade-off survive in production.
Most senior engineers I know have heard of quantization but don't have the layered picture: which formats matter, which kernels run them, which hardware accelerates them, and where the competition actually is. This essay is that layered picture, written from inside one specific plugin (turboquant-vllm) that has had to navigate the whole stack to ship.
The piece ends with a strategic inflection point I am sitting on right now. I will tell you what the question is and what I think the answer is.
The short version, up front: after six weeks inside this stack, my conclusion is that the 3-bit weight scheme is not the product. The MoE infrastructure around it might be. The rest of the essay shows the work behind that conclusion.
Three families compete, plus a fringe fourth
Quantization techniques cluster into three live production families, with a fourth sub-1-bit research bucket on the edge. Each family occupies a different point on the cost/performance/precision curve. They are not strictly ranked. They live in parallel and serve different needs.
Family 1: Float-format quantization (FP8, MXFP4, NVFP4)
These formats are hardware-native. A weight stored in FP8 has 8 bits arranged as one sign bit, four or five exponent bits, and the rest mantissa. The hardware (Hopper for FP8, Blackwell for NVFP4) has tensor cores that read these formats directly. No software dequantization step. The matrix-multiply units consume the format as their input precision.
This is the commodity bucket on supported hardware. Zero calibration data needed. Convert the weights once, point the inference engine at them, get fast inference. Critical nuance: NVFP4 is commodity only on Blackwell-class native paths. On older NVIDIA cards it exists as a storage format, not a speed path. The tensor-core acceleration that makes NVFP4 win lives in Blackwell silicon.
The current king at 4-bit is NVFP4. NVIDIA's own format, 16-element blocks (smaller than MXFP4's 32-element blocks), with E4M3 FP8 scale factors at fine granularity. Per NVIDIA's published numbers: <1% accuracy degradation in standard LLM evals, roughly 3.5× less weight memory than FP16 and 1.8× less than FP8. End-to-end inference speedups depend on the specific model and serving stack; NVIDIA cites material throughput gains on Blackwell-class hardware in the developer-blog post (linked in the Sources appendix).
The catch: NVFP4 is Blackwell-only for the native tensor-core fast path. Older NVIDIA hardware (Hopper, Ampere, consumer cards before RTX 5090) does not get that acceleration. Older hardware can store and convert such checkpoints through framework-specific fallback paths, but not with Blackwell's native tensor-core acceleration.
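To make the block structure concrete, here is a rough sketch of NVFP4-style quantization in PyTorch: 16-element blocks, one scale per block, every value snapped to the E2M1 grid. This shows the structure only; real NVFP4 stores the block scale in FP8 E4M3 and adds a tensor-level scale on top.

import torch

# The eight non-negative magnitudes an E2M1 (FP4) element can represent.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_like_quant(w, block=16):
    """16-element blocks, one scale per block, values snapped to the E2M1 grid.
    Structure only; not NVIDIA's exact recipe."""
    wb = w.reshape(-1, block)
    scale = (wb.abs().amax(dim=1, keepdim=True) / 6.0).clamp_min(1e-12)
    s = wb / scale
    idx = (s.abs().unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)   # nearest grid point
    return FP4_GRID[idx] * s.sign(), scale                          # dequant = q * scale

w = torch.randn(1024 * 1024)
q, scale = nvfp4_like_quant(w)
print((q * scale - w.reshape(-1, 16)).pow(2).mean().sqrt())         # RMS round-trip error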
Family 2: Calibrated integer (GPTQ, AWQ, SmoothQuant)
These are the mature production techniques from 2023-2024. INT4 weights with per-group or per-channel scale factors fitted to calibration data. Run a few hundred examples through the model, learn which weight clusters are important, scale them to minimize accuracy loss, freeze the scales.
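Stripped to a toy version, the calibrated-integer idea looks like this: per-group symmetric INT4, with a clipping ratio chosen to minimize output error on calibration activations. Real GPTQ adds per-column error compensation and AWQ scales salient channels first; the sketch below only shows the skeleton.

import torch

def int4_per_group_calibrated(w, x_calib, group=128, clip_grid=(1.0, 0.9, 0.8, 0.7)):
    """Per-group symmetric INT4 with a calibration-chosen clip. Toy version of the idea."""
    out_f, in_f = w.shape
    wg = w.reshape(out_f, in_f // group, group)
    ref = x_calib @ w.T                                   # what the FP16 layer would produce
    best_err, best_q = float("inf"), None
    for clip in clip_grid:                                # tiny search over clipping ratios
        scale = (wg.abs().amax(dim=-1, keepdim=True) * clip / 7.0).clamp_min(1e-8)
        q = (wg / scale).round().clamp(-8, 7) * scale     # quantize-dequantize per group
        err = (x_calib @ q.reshape(out_f, in_f).T - ref).pow(2).mean()
        if err < best_err:
            best_err, best_q = err, q
    return best_q.reshape(out_f, in_f)

w = torch.randn(1024, 1024)
x_calib = torch.randn(256, 1024)                          # a few hundred calibration examples
w_q = int4_per_group_calibrated(w, x_calib)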
The kernels are mature. Marlin (W4A16, Ampere and later) and Machete (mixed-precision, Hopper and later) live inside vLLM and serve heavy production traffic. Most major model families ship an AWQ or GPTQ variant shortly after release, often within days for the most popular models.
This bucket has crystallized. The papers are old. The kernels are stable. The hardware is supported broadly. If you want INT4 inference today, you pick one of these and move on.
Family 3: Rotation plus codebook (HIGGS, QuIP#, TurboQuant, ITQ3_S)
The newest family, and the one I have spent the most time inside. The conceptual move: multiply weights by a Hadamard matrix or random orthogonal rotation before quantization. The rotation has a magical property called "incoherence" that tightens the weight distribution. Once tight, a small Lloyd-Max codebook of 8 centroids does very well at 3 bits.
Zero calibration data needed. The mathematics says incoherence is enough; you do not need to learn from examples.
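A minimal sketch of the two-step move, with a random-sign Walsh-Hadamard rotation followed by a few Lloyd-Max iterations to fit the 8-centroid (3-bit) codebook. This is the concept, not the plugin's kernel path, and per-group norms are omitted.

import torch

def fwht(x):
    """Fast Walsh-Hadamard transform of a 1-D tensor (power-of-two length)."""
    n = x.shape[0]
    y = x.clone()
    h = 1
    while h < n:
        y = y.reshape(n // (2 * h), 2, h)
        a, b = y[:, 0, :], y[:, 1, :]
        y = torch.stack((a + b, a - b), dim=1).reshape(n)
        h *= 2
    return y / n**0.5

def lloyd_max_3bit(v, iters=20):
    """Fit 8 scalar centroids (3 bits) by alternating assign / re-average."""
    c = torch.quantile(v, torch.linspace(0.05, 0.95, 8))        # init spread over the data
    for _ in range(iters):
        idx = (v[:, None] - c[None, :]).abs().argmin(dim=1)
        for k in range(8):
            if (idx == k).any():
                c[k] = v[idx == k].mean()
    return c

w = torch.randn(4096)                       # one row of a weight matrix
signs = torch.randint(0, 2, w.shape) * 2.0 - 1.0
rotated = fwht(w * signs)                   # incoherence step: tighten the distribution
codebook = lloyd_max_3bit(rotated)
idx = (rotated[:, None] - codebook[None, :]).abs().argmin(dim=1)
w_hat = codebook[idx]                       # 3 bits per weight: store idx plus the 8-entry codebook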
The members of this family differ in details:
- HIGGS (the Linearity Theorem paper, Cornell + Yandex 2024-2025): scalar quantization on Hadamard-rotated weights with Gaussian-MSE-optimal grids. Reference implementation in Hugging Face Transformers. The vLLM form is my own PR #39970 (still open with maintainers; API name kept as --quantization turboquant for plugin-package compatibility).
- QuIP# (Cornell 2024): uses an E₈ lattice codebook (8-dimensional vector quantization) instead of scalar. Theoretically better at low bits. Llama-2-70B fits in under 20 GB at 2 bits.
- TurboQuant (Google DeepMind, ICLR 2026): the original paper turboquant-vllm builds on. Uses PolarQuant rotation plus a 1-bit QJL residual for KV cache. Achieves 3-bit KV with zero accuracy loss, 6× memory savings, 8× attention speedup on H100. Also has a weight-quantization arm.
- ITQ3_S (Yoon, March 2026 arXiv preprint): FWHT rotation + interleaved ternary fused into MMQ kernel. Targets llama.cpp's IQ3_S format on consumer Blackwell. 70B model in 32 GB on RTX 5090, 1.5× throughput.
- TurboQuant+ (my plugin): production vLLM integration of the same rotation-plus-codebook family. 3-bit WHT-rotated weights, Lloyd-Max codebook, per-group norms. Adds MoE-specific tooling on top.
The whole family is alive. New variants land every month. Sometimes from labs, sometimes from individual researchers, sometimes from production teams who needed something the academic papers did not provide.
Family 4 (fringe): Sub-1-bit
BTC-LLM, BiLLM, NanoQuant. Sub-1-bit weights via learnable transformations or binary clustering. The accuracy gap is real (3-5% on Llama-2-13B at 0.8 bits), but the memory wins are extreme. BTC-LLM was withdrawn from ICLR 2026 review but the technique itself works. This bucket is interesting for research, not for production yet.
What changed recently
The list is short but consequential:
- NVIDIA published the Vera Rubin platform as the successor to Blackwell, with partner availability slated for H2 2026 (not yet shipping). NVIDIA's published platform claims include 288 GB HBM4 per GPU and a sizeable NVFP4 FLOPS uplift, plus inference and cost-per-token improvements over the Blackwell generation. Specific numbers vary across NVIDIA's own materials; see the developer-blog reference in the Sources appendix. Vendor claims, not independently benchmarked.
- ARCQuant (Augmented Residual Channels, January 2026) and Four Over Six (adaptive block scaling, December 2025) are NVFP4-targeted improvements that narrow the small remaining accuracy gap.
- DeepSeek-V4-Pro (April 24, 2026): 1.6 trillion parameters total, 49 billion active, hybrid attention (CSA + HCA). Per the DeepSeek model card and API docs: 27% of V3.2's FLOPs and 10% of its KV cache at 1M context. The KV-focused quantization value proposition shrinks for this family because they already compressed KV at the architecture level.
- Qwen 3.6 family (Alibaba, late March 2026):
  - Qwen3.6-35B-A3B (35B total / 3B active, MoE, partial-rotary attention): the open-weight self-hostable variant, my test target. HF model card reports 73.4 on SWE-bench Verified, 49.5 on SWE-bench Pro, 51.5 on Terminal-Bench 2.0. This is the model behind every Qwen 3.6 number I cite in the war stories.
  - Qwen 3.6 Plus / Max-Preview: separate, larger, hosted/proprietary variants on a different track from the open-weight family member above. Parameter counts not officially published. Not the same model as Qwen3.6-35B-A3B, and not open-weight in the same sense.
- GLM-5.1 (April 7, 2026): 754B MoE, MIT license, 58.4% SWE-Bench Pro per public comparisons.
- Kimi K2.6 (Moonshot AI, April 2026): 1T total, 32B active, 384 experts (8 routed + 1 shared), MLA, 256K context per the HF model card. Adopts a native INT4 quantization method (same as Kimi-K2-Thinking) alongside the primary BF16 distribution. This native-INT4 path is the signal that matters.
- Llama 4 Scout (Meta, April 5, 2025): 17B active / 109B total / 16 experts MoE, multimodal, ~10M context. Released under the Llama 4 Community License (a custom commercial license, not Apache 2.0 or another OSI-approved license).
The pattern: the major open-weight releases most relevant to this work are MoE. Several of them now ship a native-quantized variant alongside BF16 as a supported first-class format. The "convert from BF16 to your favorite quant scheme" workflow still works for most models, but the assumption that BF16 is the only sensible starting point is weakening.
War stories from the quantization trenches
The 50× slowdown that tl.gather closed
This was PR #39970 in vLLM, my upstream contribution for HIGGS-scalar. Memory compression was working (3× at load time, 4.96 GB instead of 15.27 GB for Qwen3-8B in BF16). The kernel produced correct output end-to-end, verified bit-equivalent to the reference dequant on CPU.
The problem: at batch size 1, decode was 50× slower than BF16. The first vLLM reviewer's question was simple: "Why would I use this over AWQ?" There was no answer until the gap closed.
Stage 1 was hypothesis listing without measurement. Four candidate causes: the inline dequant-plus-GEMM, broken CUDA graph capture, an eager _rotate_input running per layer, weak tensor-core utilization at BLOCK_M=1. All plausible. Only way to know was to measure.
Stage 2 was the first cheap win. The kernel's 3-bit decode had this load pattern:
bi0 = g8 * 3 + fb
b0 = tl.load(packed_ptr + packed_row[:, None] * stride_cn + bi0[None, :] * stride_ck, ...)
b1 = tl.load(packed_ptr + packed_row[:, None] * stride_cn + (bi0+1)[None, :] * stride_ck, ...)
Two 2D scatter loads per (k, n) pair. Non-unit-stride index pattern. Triton cannot vectorize this. Each byte becomes its own memory transaction.
The fix was one coalesced bulk load of all 48 packed bytes per row (padded to 64 for Triton's power-of-two tile constraint), then two in-register tl.gather operations:
byte_offs = tl.arange(0, 64)    # 48 packed bytes per row, padded to 64 for the power-of-two tile
valid_byte = byte_offs < 48
bulk = tl.load(packed_ptr + packed_row[:, None] * stride_cn + byte_offs[None, :] * stride_ck,
               mask=mask_n[:, None] & valid_byte[None, :], other=0).to(tl.int32)  # one coalesced load per row
b0 = tl.gather(bulk, bi0_bc, axis=1)    # in-register gathers replace the two strided scatter loads
b1 = tl.gather(bulk, bi1_bc, axis=1)
5× speedup at batch size 1 on Qwen3-8B. From ~2 tok/s to 8.35 tok/s on an A100. Cost of the run that proved it: about €0.60.
This is the rhythm of quantization work. You think the bottleneck is one place. The actual bottleneck is somewhere else (specifically: the data layout of your weight loads). You measure, you fix, you re-measure, you ship.
The MoE 500× gap
About a week after the dense-model fix landed, I extended the kernel path to MoE. Qwen3-30B-A3B-Instruct-2507 with TQ3 on an RTX PRO 6000 Blackwell. The measurement run that produced the numbers below dates to 2026-04-23. Active parameters per token: 3 billion. At TQ3 = 3 bits per param: 1.125 GB of packed weight reads per token. Memory bandwidth on that card: 1.8 TB/s.
Theoretical memory-bound ceiling at that bandwidth: 1.125 GB / 1.8 TB/s = 0.625 ms per token = ~1,600 tok/s. (Newer HBM4 parts like the upcoming Rubin generation push the ceiling higher; older cards with less bandwidth drop it.)
Measured: 2 tok/s = 500 ms per token = 800× slower than memory-bound.
This number reframed the entire roadmap. Every optimization I had been considering targeted the wrong level. QuantSpec for KV bandwidth: KV is 1% of per-token time on MoE, so a perfect QuantSpec would save 5 ms out of 500. Custom dense GEMV kernels: at 672 kernel launches per token in MoE decode, even a free kernel saves 28 ms total. The 500× gap was not in the arithmetic. It was in the pipeline around it.
Five hypotheses for where the time was going, ranked by likelihood:
- CUDA graphs break for MoE routing. Graphs require structurally identical forward passes; each token picks a different top-8 experts out of 128, so dispatch-to-different-kernels cannot be captured. Without graphs, every kernel launch pays full Python + CUDA launch overhead.
- Triton's fused_moe at batch size 1 pads activations to tile size. Same pathology as dense Triton GEMV at bs=1, except worse, because each expert is its own tile.
- Sequential per-expert dequant. Our TurboQuantFusedMoEMethod.apply() looped over active experts and called decompress_into(scratch_pool) one at a time. 8 sequential dequant launches per layer.
- Python-side routing overhead through PyTorch autograd-style ops.
- vLLM's scheduler / paged attention misbehaving for dynamic MoE routing.
Profiling with nsys (one ~€0.60 run) showed (3) dominated. The fix was the sparse dequant kernel that only decompresses active experts per forward instead of all 128. Measured on H100 with Qwen3-30B-A3B-TQ3, 1k context, eager mode: 1.22 → 10.23 tok/s = 8.41× (PR #33, validated A/B with enforce_eager=True both runs).
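The conceptual shape of that fix, at the PyTorch level. The real thing is a fused Triton/CUDA launch; the names here are illustrative, and routing weights are omitted.

import torch

def moe_forward_sketch(x, topk_ids, packed, dequant_active_batch):
    """Active-only expert dequant (conceptual sketch, names illustrative).

    x:          [tokens, hidden] activations
    topk_ids:   [tokens, top_k] router choices, e.g. 8 of 128 experts
    packed:     quantized expert weights for all experts, any format
    dequant_active_batch(packed, ids) -> [len(ids), hidden, hidden] BF16 weights,
                one launch for the whole active set (not 128 dequants, not 8 sequential ones)
    """
    active = torch.unique(topk_ids)                      # experts this batch actually routed to
    w_active = dequant_active_batch(packed, active)      # the single sparse dequant launch
    slot = torch.searchsorted(active, topk_ids)          # expert id -> row in w_active
    out = torch.zeros_like(x)
    for k in range(topk_ids.shape[1]):                   # top_k is small (e.g. 8)
        out += torch.einsum("th,tho->to", x, w_active[slot[:, k]])
    return out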
A separate fix landed the same week for a different model. Qwen3.6-35B-A3B has partial-rotary attention (head_dim=256, partial_rotary_factor=0.25 → rotary_dim=64), which fell through the CUDA dequant path and into a Python fallback for every q_proj / k_proj on full-attention layers. Adding block-diagonal Walsh-Hadamard combos (128, 3, 64) etc. to the supported CUDA kernel routes those layers through the fast path. Measured on RTX PRO 6000 Blackwell with Qwen3.6-35B-A3B-TQ3, eager mode: 1.84 → 18.2 tok/s = 10.1× at 1k context, 9.7× at 4k, 9.1× at 8k (PR #36).
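The block-diagonal trick itself is simple, even if the exact block layout for these head shapes is the plugin's own detail: instead of one Walsh-Hadamard transform across the full dimension, apply independent normalized Hadamard blocks, so a rotary/non-rotary split still gets rotated without mixing across the boundary. A sketch with an illustrative block split:

import torch
from scipy.linalg import hadamard

def block_diag_wht(dim, block_sizes):
    """Orthonormal block-diagonal Walsh-Hadamard rotation.

    block_sizes must be powers of two summing to dim; each block gets its own
    normalized Hadamard matrix, so the rotation never mixes across block boundaries.
    """
    assert sum(block_sizes) == dim
    blocks = [torch.from_numpy(hadamard(b)).float() / b**0.5 for b in block_sizes]
    return torch.block_diag(*blocks)

# Illustrative split for a 256-wide head dim where only the first 64 dims carry rotary embeddings.
R = block_diag_wht(256, [64, 128, 64])
w = torch.randn(1024, 256)
w_rot = w @ R.T          # rotate before quantization; invert with R (orthonormal) at dequant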
Two separate stories. Two different models, two different GPUs, two different bottlenecks, two PRs, two big speedups in two weeks. Conflating them sounds tidier but it is not what happened. The honest summary: 8.41× from sparse dequant on dense-routing 30B MoE, plus 10.1× from block-diagonal WHT on partial-rotary 35B MoE. Both are MoE-infrastructure wins, not quantization wins. The quantization scheme inside is one valid implementation; the wrapper around it is what produces the speedups.
The graphs-on regression
After the eager-mode wins, I tried turning CUDA graphs back on, expecting another speedup. Instead: 18.2 tok/s eager became 5.5 tok/s with graphs on. The same plugin, with graphs, was 3.3× slower than without.
The investigation ran from mid-April (commits around 2026-04-12) through the refactor that landed 2026-04-28. The pattern is called "graph fragmentation": when a plugin inserts custom ops into a vLLM forward pass, the graph capture splits at every custom-op boundary into many smaller graphs. The replay overhead of dozens of tiny graphs is worse than running everything eager.
A candidate fix landed as PR #42 in turboquant-vllm this morning (2026-05-12): wrap the CUDA dequant in torch.library.custom_op, which tells PyTorch's graph capture to treat the whole dequant as a single opaque operation. Single boundary instead of many.
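The mechanism, in the shape PyTorch documents for torch.library.custom_op. The op name, signature, and the stand-in dequant below are illustrative, not the plugin's actual registration.

import torch

def _dequant_weights(packed, norms):
    # Stand-in for the real CUDA/Triton dequant; just a cast so the sketch runs.
    return packed.float() * norms

@torch.library.custom_op("turboquant::dequant_matmul", mutates_args=())
def dequant_matmul(x: torch.Tensor, packed: torch.Tensor, norms: torch.Tensor) -> torch.Tensor:
    # Everything inside here is one opaque node to graph capture:
    # a single boundary instead of one per internal kernel call.
    return x @ _dequant_weights(packed, norms).to(x.dtype).T

@dequant_matmul.register_fake
def _(x, packed, norms):
    # Shape/dtype propagation when the op is traced, without running the kernel.
    return x.new_empty(x.shape[0], packed.shape[0])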
The fix has not yet been verified on GPU. Validation infrastructure for the GPU run is in place; the result will land in a follow-up post.
Why the hardware layer matters: CUDA, Triton, MLX, Metal
Quantization is not just a math problem. It is a kernel problem. The math chooses the format. The kernel decides whether the format wins or loses in practice.
CUDA and the FLUTE story
FLUTE is the most widely cited fast kernel implementation for HIGGS-style quantization on Ampere+ NVIDIA hardware. It is the reason HIGGS achieves 2-4× speedup at batch sizes under 32. Three ingredients carry the speedup:
- mma.sync: Ampere's tensor core instruction. Takes 16×8×16 FP16 fragments per warp and does the matrix multiply at peak throughput.
- cp.async: asynchronous memory copy, lets weight loads overlap with compute.
- LUT-in-shared-memory: the 8-entry codebook is loaded once per CTA into shared memory, reused thousands of times.
Plus the small but critical detail: static bit-shift unpacking via qmap (2-bit lookup) and qmap2 (4-bit vectorized lookup) so dequant is single-cycle per pair of indices.
This whole stack is Ampere-and-later only. On A100, RTX 4090, H100, RTX 5090, RTX PRO 6000: full speed. On older NVIDIA: degraded. On AMD: needs HIP port. On Apple Silicon: nothing ports directly.
Triton: the kernel-writing escape hatch
Triton (the language, OpenAI's project) is how most quantization plugins get written. Python-like syntax, compiles to PTX for NVIDIA, ROCm for AMD. The cost: not as fast as hand-written CUDA. The benefit: writeable in a week instead of a month.
My PR #39970 to vLLM is Triton. The fix for the 50× slowdown was a Triton optimization. The fused MoE backend in vLLM is Triton. The graphs-on regression is partly because Triton kernels don't always cooperate cleanly with PyTorch's graph capture.
Triton is where new ideas land first because it is fast to write. Production ideas eventually get hand-written CUDA versions for the last 10-30% of performance. Marlin and Machete in vLLM are hand-CUDA. FLUTE is hand-CUDA via CUTLASS abstractions. The Triton versions usually keep existing for breadth of model coverage.
MLX and Metal: the Apple Silicon side
MLX is Apple's machine learning framework for M-series chips. It uses Metal shaders for the actual compute. Most quantization on Apple ML stacks goes through MLX's built-in quantized_matmul, which uses a different scheme (linear affine: scale + bias) from HIGGS or TQ3.
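For contrast, the built-in affine path looks like this (assuming the current mlx.core quantize / quantized_matmul API; group size and bit width are the knobs):

import mlx.core as mx

# MLX's built-in scheme: per-group linear affine (scale + bias), no codebook lookup.
w = mx.random.normal((4096, 4096))
w_q, scales, biases = mx.quantize(w, group_size=64, bits=4)

x = mx.random.normal((1, 4096))
y = mx.quantized_matmul(x, w_q, scales, biases, transpose=True, group_size=64, bits=4)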
For HIGGS-style quantization on a Mac, you have to write a custom Metal shader. MLX exposes a custom-kernel API (mx.fast.metal_kernel in Python, metal_kernel() in C++) for exactly this. The hardware story differs from CUDA in important ways:
- No mma.sync analog. Metal's simdgroup_matrix_multiply_accumulate is 8×8 FP16 only, significantly lower throughput per cycle than Ampere's mma.sync.
- No cp.async. Metal's load/compute concurrency must be expressed via the threadgroup memory model, not async pipelining primitives.
- No CUTLASS. Hand-written MSL only.
What does port: shared-memory LUT patterns, static bit-shift unpacking, warp-shuffle reductions (Metal's simd_shuffle_* maps 1:1 with CUDA's __shfl_sync family).
llama.cpp's Q3_K Metal shader is the production proof that sub-byte GEMV on Metal works. It runs at 40-60 tok/s on Llama-8B Q3_K on M2 Ultra, 20-30 on M2 Pro / M4 Pro. No tensor cores. Just careful register allocation, simd_sum() reductions, and a clever mask-add trick that avoids needing a real LUT.
The plugin has a real Mac story. The TQ3 MLX port ships, loads published HF checkpoints, and serves real models end-to-end. The flagship result on M4 Pro 48 GB is the varjosoft/Qwen3.6-35B-A3B-TQ-apex3 checkpoint: 29 tok/s steady-state decode at 96.5% GSM8K-200, in 18 GB on disk — 2 percentage points better accuracy than mlx-community/Qwen3.6-35B-A3B-4bit and 1 GB smaller. First TurboQuant checkpoint to beat the MLX-community 4-bit reference on accuracy while shipping smaller.
So why isn't it faster? Because the 29 tok/s ceiling on M4 Pro is set by Lloyd-Max codebook lookup cost — the floor measured at ~54 μs per call at async steady state. MLX's built-in affine quantization (scale + bias, no codebook lookup) runs roughly 2× faster per kernel at the same bit budget, but ships at lower quality. The trade-off is fundamental to the codebook choice, not the kernel implementation. The Lloyd-Max grid is what gives TQ3 its quality edge; that same lookup is what caps speed.
This is the honest framing of the Mac story: a working Mac story that wins on quality at 3-bit, trails MLX-affine on speed because of an inherent property of codebook quantization. The earlier 1 tok/s number on Qwen3.5-35B-A3B was a pre-optimization measurement before the mx.compile path and active-only expert dequant landed. The current steady-state on the apex3 checkpoint is 29 tok/s. The remaining work is bs=N prefill, larger-MoE scaling, and the codebook-lookup cost itself (if a clever Metal-side scheme can reduce the 54 μs floor).
Why this matters for strategy
The hardware/kernel layer is where most of the genuine moats live. You can clone a paper in a weekend. You cannot clone a CUDA kernel that took six months to tune. Mature production plugins (vLLM, llama.cpp, TensorRT-LLM) compete on kernel quality, not on which paper they implemented.
Every quantization plugin that survives long enough has to make peace with this. The plugin's first version implements the paper. The plugin's second version writes a custom kernel. The plugin's third version targets a specific hardware generation and accepts that older hardware is degraded. By the fourth version you have a real codebase.
The MoE story: why mid-2026 changed everything
The major open-weight releases most relevant to this work are Mixture-of-Experts. DeepSeek-V4-Pro (1.6T total, 49B active per the DeepSeek model card). Llama 4 Scout (109B total, 17B active across 16 experts). Qwen3.6-35B-A3B (35B total, 3B active) is the open-weight partial-rotary variant my benchmarks target. GLM-5.1 (754B total). Kimi K2.6 (1T total, 32B active across 384 experts). Closed-weight frontier labs (Anthropic, OpenAI, Mistral large) keep dense flagships in their lineups; what I see directly on the open-weight side is dominated by MoE.
The MoE move is not new but became dominant in 2026. Why: dense models hit a wall around 70B-100B parameters where training cost grows faster than capability. MoE keeps total parameter count growing (more capacity) while active parameter count per token stays flat (manageable inference cost).
What changes for quantization:
The active-parameter-per-token number sets the memory bandwidth ceiling. For Qwen3-30B-A3B that is 3B active. At 3-bit, that is 1.125 GB of packed weights per token. At a memory bandwidth of 1.8 TB/s, the memory-bound ceiling is ~1,600 tok/s. The math is much friendlier than for a dense 30B model, which would need to touch all 30B parameters every token and, at the same 3 bits, would top out around ~160 tok/s memory-bound.
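The arithmetic behind both ceilings, written out (an idealized roofline; real decode never reaches it, but the ratio is the point):

def memory_bound_ceiling(active_params, bits, bandwidth_tb_s):
    """Ideal tokens/s if decode only had to stream the quantized weights once per token."""
    bytes_per_token = active_params * bits / 8
    return bandwidth_tb_s * 1e12 / bytes_per_token

print(memory_bound_ceiling(3e9, 3, 1.8))    # MoE, 3B active at 3-bit   -> ~1600 tok/s
print(memory_bound_ceiling(30e9, 3, 1.8))   # dense 30B at the same 3-bit -> ~160 tok/s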
Dispatch becomes dominant. Each token's forward pass picks a different subset of experts. Modern MoEs (Qwen3, Kimi K2.6) route to 8 of 128-384 experts. The dispatch logic (router → softmax → top-k → gather-to-experts → scatter-back) has overhead that dense models do not. CUDA graphs break for it. Triton kernels need to handle dynamic shapes. Custom kernels need an ids buffer pattern.
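What that dispatch looks like at the PyTorch level, unfused (production kernels batch and fuse these steps, but each one exists in some form):

import torch

def moe_dispatch(x, router_w, expert_w, top_k=8):
    """router -> softmax -> top-k -> gather-to-experts -> scatter-back, unfused.

    x:        [tokens, hidden]
    router_w: [hidden, num_experts]
    expert_w: [num_experts, hidden, hidden]  (stand-in for each expert's FFN)
    """
    probs = (x @ router_w).softmax(dim=-1)
    weights, expert_ids = probs.topk(top_k, dim=-1)              # [tokens, top_k]
    weights = weights / weights.sum(dim=-1, keepdim=True)
    out = torch.zeros_like(x)
    for e in expert_ids.unique().tolist():                       # data-dependent control flow:
        tok, slot = (expert_ids == e).nonzero(as_tuple=True)     # this is what CUDA graphs cannot capture
        y = x[tok] @ expert_w[e]                                 # gather-to-expert, run expert
        out.index_add_(0, tok, y * weights[tok, slot, None])     # scatter-back, weighted
    return out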
Most quantization papers assume dense. HIGGS, QuIP#, TurboQuant (the original Google paper) all benchmark on dense Llama-7B / Llama-2-70B / Mistral. The math works the same for MoE weights (an MLP is an MLP, whether it lives in a dense layer or an MoE expert). The kernels do not. Every quant-on-MoE plugin has to write its own dispatch and per-expert handling.
This is the gap the plugin has been filling. The MoE-specific tooling (sparse expert dequant kernel, native-packed checkpoint loader, partial-rotary block-diagonal WHT for MiniMax M2.5 / M2.7 and the Qwen3.6-A3B family, REAP expert pruning integration) is in my plugin but not in HIGGS, not in QuIP#, not in stock vLLM. Most of it landed in two intense weeks: sparse dequant 2026-04-23, partial-rotary block-diag WHT 2026-04-22 to 2026-04-25, native-packed MoE loader 2026-05-11 and 2026-05-12.
Both the 8.41× and 10.1× decode wins from earlier in this essay are MoE-infrastructure wins, not weight-quantization wins. The quantization scheme inside is one valid implementation; other schemes (INT4 from upstream Marlin, eventually NVFP4 from Blackwell tensor cores) could plug into the same MoE plumbing if the plumbing were made scheme-agnostic.
Where TurboQuant+ actually sits
Honest assessment of what I have built, against the rest of the landscape:
What is not unique:
- The TQ3 weight scheme itself. HIGGS-scalar is the same conceptual approach (Hadamard rotation + Lloyd-Max codebook). Reference implementation in HF Transformers. The vLLM form is my own PR #39970 (currently open). My PR is literally HIGGS-scalar with the turboquant API name kept for plugin-package compatibility; I am one implementation, not the unique source of the idea.
- The custom dequant CUDA kernels. Marlin is faster for INT4. NVFP4 native tensor cores will beat anything I write at FP4. My CUDA work is good but not differentiated.
- The 3-bit math itself. ITQ3_S (Yoon, March 2026) independently reaches a similar design point targeting llama.cpp's MMQ kernel. Two parallel implementations from independent researchers converging on the same family.
What is genuinely unique:
- The sparse expert dequant kernel (only touches active experts in a single launch). No equivalent in HIGGS, QuIP#, or stock vLLM.
- The native-packed MoE loader and per-expert-to-fused regroup logic (PR #44). Addresses a real vLLM gap. The regroup pattern is scheme-agnostic and portable.
- The block-diagonal WHT for partial-rotary attention models (MiniMax M2.5 / M2.7, Qwen3.6-A3B family). Discovered while debugging the 500× gap; PR #36 measured 10.1× decode on Qwen3.6-35B-A3B.
- REAP expert pruning integration. Pruning lives upstream in separate libs; the integration with quantization-aware loading is mine.
The pattern: my unique value is MoE infrastructure wrapped around a TQ3 scheme. The wrapper is the moat. The scheme inside is commodity now.
The market signal: native-quantized as a first-class format
The piece of context that ties everything together is the Kimi K2.6 release. Moonshot AI ships K2.6 in BF16 as the primary distribution, but with a native INT4 quantization method (the same one they shipped with Kimi-K2-Thinking) baked into the architecture as a supported first-class deployment path. INT4 is not a community afterthought; it is an officially-supported native serving format.
This is the emerging pattern. Native INT4 / FP8 paths are increasingly shipped alongside BF16 distributions, sometimes with calibration that the model authors did themselves. Several open releases in the last six months have moved in this direction (Kimi K2 family, parts of the DeepSeek family). The trend is not yet universal, and BF16 is still the canonical starting point for most models, but the gap between "official BF16" and "official native quantized" is narrowing.
For my plugin, this means: the workflow of "load BF16 → save_tq3_checkpoint() → serve TQ3" still works for most models, but for the largest releases with author-blessed native quantization it is increasingly an extra hop with diminishing returns. For Kimi K2.6 specifically, you could still start from the BF16 source (which exists), but it is hard to justify the detour to TQ3 when Moonshot's own INT4 calibration is competitive and vLLM's existing W4A16 Marlin kernel already serves it well.
A reviewer raised the practical point on a recent issue: downloading FP8 models is typically faster and requires less storage than BF16. The plugin's input assumption is increasingly friction for the very largest models.
The strategic inflection
The uncomfortable conclusion is that TQ3 itself is no longer the scarce part.
Rotation plus codebook quantization is now a family, not a secret. HIGGS, QuIP#, TurboQuant, ITQ3_S, and nearby variants are converging on the same design space. The math is in arXiv, the kernels are in upstream repos, and the next variant lands every few weeks.
What still looks underbuilt is the MoE serving layer around those formats: sparse expert dequant, native-packed loading, partial-rotary handling, graph behavior, per-expert dispatch. That is the part worth extracting.
The decision in front of me, then:
Option A: Compete on the weight scheme. Keep chasing the frontier with TQ3. Try to close the gap to NVFP4 on Blackwell. Try to add FP4 support, INT4-native loading, sub-4-bit variants. The arms race has been steady (ParoQuant arXiv Nov 2025, ARCQuant Jan 2026, ITQ3_S March 2026, BTC-LLM withdrawn from ICLR 2026 review but technique active). I cannot outrun this as one person on a side project.
Option B: Reposition as MoE infrastructure. The MoE tooling I have built is the genuinely unique value. The sparse dequant kernel works because it is scheme-agnostic in principle (it touches active experts; the dequant itself could be any format). The native-packed loader works because it is scheme-agnostic at the regroup level. If I extract this layer into a separate plugin (call it moe-vllm-tools or whatever name lands), it can sit on top of the existing FP8/INT4/NVFP4 weight quantization paths in vLLM, not compete with them.
Option C: Publish and sunset. Write up the two MoE decode improvements (8.41× from sparse dequant, 10.1× from partial-rotary block-diagonal WHT) as a blog post or short paper. Let TQ3 fade to maintenance for existing users (the Gemma-4 deployment is documented working; the Qwen3.6 checkpoint will serve correctly once the multimodal-config fix re-uploads). Move on to whatever the next project is.
My current read: B and C in parallel.
The MoE infrastructure deserves to be extracted into its own track. The next 4-6 weeks should be: refactor the sparse expert dequant kernel to be scheme-agnostic (callback per expert that takes any quantized representation and produces FP16 / BF16), ship that as a separate small focused vLLM PR sequence, write up the findings.
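The refactor target, expressed as an interface (names hypothetical; the point is that the MoE plumbing only ever sees the callback, never the format):

from typing import Protocol
import torch

class ExpertDequant(Protocol):
    """One callback per expert: any packed format in, dense FP16/BF16 weight out."""
    def __call__(self, packed_state: dict, expert_id: int,
                 out: torch.Tensor) -> torch.Tensor: ...

# The sparse dequant path then takes an ExpertDequant instead of calling TQ3-specific
# decompress code directly; TQ3, INT4 (Marlin-style), or NVFP4 implementations could
# all satisfy it without the MoE layer knowing the difference.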
Meanwhile turboquant-vllm stays at its current scope. The existing TQ3-native checkpoint users keep working. v0.13.4 ships cleanly (validation in progress). The plugin enters a maintenance phase where I respond to bug reports and dependency updates but do not chase new quantization research. The TQ3 layer is what it is.
The reason for the split: the MoE infrastructure is genuinely useful to anyone running MoE models on vLLM, regardless of what weight format they use. The TQ3 weight scheme is one valid choice among many that compete on a curve I cannot win.
The reason against keeping them bundled: every time I update the TQ3 quantization layer (kernel fix, dequant tweak, format change), I have to verify nothing broke in the MoE layer above. Bundle means coupling. Unbundling decouples the MoE infrastructure from the specific weight format, which is what makes it potentially reusable.
There is a question I have to answer first, before extracting: do other quantization formats actually need the same MoE tooling, or is the tooling so tightly coupled to TQ3 specifics that it would not be reusable? My current read is the coupling is mostly at the data layout level (TQ3's .tq_packed + .tq_norms format), which the regroup logic translates into vLLM's per-expert convention. The kernel itself (sparse expert iteration) is largely format-agnostic. But this needs to be verified by trying to plug it onto INT4 W4A16, not assumed.
If that verification works, the extraction is the right move. If it does not work (the coupling turns out deeper than I think), the right move is C alone: write up what was learned, let the plugin fade.
Summary
Quantization in mid-2026 splits into three production families plus a research fringe. FP-format (FP8, NVFP4) is hardware-native and owns the production commodity slot, especially on Blackwell. Calibrated integer (GPTQ, AWQ) is the mature stable bucket via Marlin and Machete kernels in vLLM. Rotation-plus-codebook (HIGGS, QuIP#, TurboQuant, ITQ3_S) is the active frontier at 3-bit and below, with no calibration data needed. The sub-1-bit fringe (BTC-LLM and siblings) is interesting for research, not yet production.
The hardware/kernel layer is where the actual moats live: CUDA + Triton for NVIDIA, MLX + Metal for Apple, with mma.sync, cp.async, and CUTLASS forming a stack that does not port directly.
For the open-weight models most relevant to self-hosted agentic coding and long-context inference, MoE is now the dominant shape (DeepSeek-V4, Llama 4, Qwen 3.6 family, GLM-5.1, Kimi K2.6); closed-weight frontier labs still ship dense flagships in parallel. Quantization papers mostly assumed dense models; the MoE infrastructure layer (sparse expert dequant, native-packed loading, dispatch handling) is under-served by the literature and under-served by stock vLLM.
The ecosystem signal that ties it together: native-quantized paths are increasingly shipped as first-class formats alongside BF16 (Kimi K2 family's native INT4 method, parts of DeepSeek family with FP8). The "BF16 source → quantize yourself with my favorite scheme" workflow still works for most models, but the value proposition narrows for the very largest where author-blessed native quantization exists.
My turboquant-vllm plugin sits in the rotation-plus-codebook family. The TQ3 weight scheme is commodity now (HIGGS-scalar reference is in HF Transformers; the vLLM form is my own PR #39970, currently in review). The MoE infrastructure wrapped around it is genuinely unique: 8.41× decode on Qwen3-30B-A3B from sparse expert dequant (PR #33), 10.1× decode on Qwen3.6-35B-A3B from partial-rotary block-diagonal WHT (PR #36), native-packed MoE loader (PR #44).
The decision in front of me is to extract the MoE infrastructure into a separate track, keep TQ3 in maintenance for existing users, and publish the learnings. The current plan is to spend the next 4-6 weeks doing exactly that, with a decision gate at week 6 to either formalize a new project or graceful sunset.
If you took anything from this essay, it should be this: in mature infrastructure work, the durable value is rarely the algorithm. It is the operational discipline around the algorithm, the kernel work that makes it competitive, and the integration layer that lets it survive ecosystem shifts. The math is open; the engineering is not.
Sources
Papers and primary research:
- TurboQuant (Zandieh et al., ICLR 2026): openreview.net/pdf?id=tO3ASKZlok and research.google blog
- HIGGS / Linearity Theorem (Malinovskii et al., NAACL 2025): arxiv.org/abs/2411.17525 and reference implementation in HF Transformers at huggingface.co/docs/transformers/main/en/quantization/higgs
- QuIP# (Cornell): arxiv.org/abs/2402.04396 and github.com/Cornell-RelaxML/quip-sharp
- PolarQuant: arxiv.org/abs/2603.29078
- ITQ3_S (Yoon, March 2026): arxiv.org/abs/2603.27914
- ParoQuant (ICLR 2026): arxiv.org/abs/2511.10645
- ARCQuant: arxiv.org/html/2601.07475v1
- Four Over Six (NVFP4 adaptive block scaling): arxiv.org/pdf/2512.02010
- BTC-LLM: arxiv.org/abs/2506.12040 (also openreview.net/forum?id=yBDBCpEzsO for the ICLR 2026 withdrawn submission)
Hardware and platform references:
- NVFP4 (NVIDIA developer blog): developer.nvidia.com/blog/introducing-nvfp4
- Vera Rubin platform (NVIDIA developer blog): developer.nvidia.com/blog/inside-the-nvidia-rubin-platform
- vLLM Machete kernel (Red Hat developer blog): developers.redhat.com
- FLUTE (Guo et al., 2024) and repo: github.com/HanGuo97/flute
Model releases referenced:
- DeepSeek-V4-Pro: huggingface.co/deepseek-ai/DeepSeek-V4-Pro and api-docs.deepseek.com
- Qwen3.6-35B-A3B: huggingface.co/Qwen/Qwen3.6-35B-A3B
- GLM-5.1: huggingface.co/zai-org/GLM-5.1
- Kimi K2.6: huggingface.co/moonshotai/Kimi-K2.6
- Llama 4 Scout: huggingface.co/meta-llama/Llama-4-Scout-17B-16E
Plugin-side references (turboquant-vllm):
- Upstream vLLM PR #39970 (TurboQuant Linear-only, scalar form of HIGGS): github.com/vllm-project/vllm/pull/39970
- Plugin repo and README (Mac benchmark numbers, kernel details, full PR list): github.com/varjosoft/turboquant-vllm
Comparative landscape coverage referenced in the war stories:
- ITQ3_S consumer Blackwell numbers and Q3_K Metal shader: llama.cpp upstream at github.com/ggerganov/llama.cpp