When you self-host large language models, the KV cache, not the model weights, becomes the memory bottleneck at production context lengths. Each concurrent conversation holds its own KV cache in GPU memory. Compress the cache and you serve more users on the same hardware.
I built turboquant-vllm, a vLLM integration for TurboQuant+ KV cache compression. It compresses the KV cache to 26% of its original size with no observed quality degradation. The library includes fused CUDA kernels, installs from pip, and works as a drop-in patch for any vLLM model, including both standard FlashAttention (Qwen3, Llama, Mistral) and Multi-head Latent Attention models (GLM-4.7-Flash, DeepSeek-V3).
Tested across 10 model configurations on Verda GPU cloud in Helsinki. TQ+ matched or beat the baseline on every config. Asymmetric K4/V3 scored highest overall. On MLA models, TQ+ works correctly where vLLM's built-in FP8 KV cache does not.
Why self-host in Europe
I build conversational AI products at Varjosoft. The models I want to run, GLM-4.7 (top-ranked open model on Chatbot Arena), DeepSeek-V3, and Qwen3-235B, are not available through the API providers I use. That leaves self-hosting, and I want to do it on European infrastructure for data residency.
Verda (formerly DataCrunch) is a GPU cloud provider based in Finland. A100 80GB at $1.29/hr, standard CUDA, standard vLLM, OpenAI-compatible API. No vendor lock-in. The same Docker container runs on any GPU provider.
The constraint: these models are enormous. At production context lengths, the KV cache becomes the memory bottleneck. At 32K context with 32 layers and 32 KV heads (illustrative; models like GLM-4.7 with 96 KV heads have proportionally larger caches), the FP16 KV cache alone is ~17 GB per user. Every gigabyte holding one user's cache is a gigabyte that cannot hold another concurrent conversation.
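The ~17 GB figure is straightforward arithmetic. A minimal sketch of the calculation, using the illustrative configuration above (the function name is mine, not from any library):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context, bytes_per_elem=2):
    # Factor of 2 covers the separate K and V tensors; FP16 is 2 bytes/element.
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem

size = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, context=32 * 1024)
print(f"{size / 1e9:.1f} GB per user")  # 17.2 GB
```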
The benchmark
15 model configurations (10 with results so far), 20 multi-turn conversation scenarios, scored by an independent judge LLM (Llama-3.3-70B on Nebius).
Models under test
| Model | Size | GPUs | $/hr |
|---|---|---|---|
| GLM-4.7 (355B MoE) | 196GB Q4 | 4x A100 | $5.16 |
| GLM-4.7-Flash | 58GB BF16 | 1x A100 | $1.29 |
| Qwen3-235B (235B MoE) | 115GB AWQ | 2x A100 | $2.58 |
| Qwen3-30B (30B MoE) | 16GB AWQ | 1x A100 | $1.29 |
| DeepSeek-V3 (671B MoE) | ~320GB FP8 | 4x A100 | $5.16 |
Each model tested in multiple KV cache configurations: FP16 (baseline), FP8, TQ+ turbo4 (3.8x compression), TQ+ turbo3 (4.6x), and asymmetric K4/V3.
Scenarios: Product inquiries, technical support, personality consistency, reasoning depth, prompt injection resistance, creative communication, and multilingual (English, Finnish, Swedish with mid-conversation language switching). Each conversation is 3-4 turns with a simulated visitor, scored 1-5 on criteria specific to each scenario.
Results: quality preserved across 10 configs
10 configurations tested on H100 80GB and A100 80GB on Verda (Helsinki). 20 multi-turn conversation scenarios scored by Llama-3.3-70B judge:
| Model | KV Cache | Avg Score | Latency |
|---|---|---|---|
| Qwen3-235B AWQ | TQ+ asymmetric K4/V3 | 4.75 | 28537ms |
| Qwen3-235B AWQ | TQ+ turbo4 | 4.74 | 29063ms |
| Qwen3-235B AWQ | FP16 (baseline) | 4.74 | 29415ms |
| Qwen3-235B AWQ | FP8 | 4.71 | 29971ms |
| Qwen3-30B FP16 | FP16 (baseline) | 4.73 | 4396ms |
| Qwen3-30B AWQ | FP16 | 4.67 | 3721ms |
| GLM-4.7-Flash BF16 | TQ+ turbo3 | 4.63 | 5998ms |
| GLM-4.7-Flash BF16 | FP16 (baseline) | 4.61 | 6042ms |
| GLM-4.7-Flash BF16 | TQ+ turbo4 | 4.58 | 5998ms |
| GLM-4.7-Flash BF16 | FP8 | 1.07 | 6299ms |
TQ+ matched or beat the baseline on every model. Asymmetric K4/V3 scored highest on Qwen3-235B (4.75) with better compression than symmetric turbo4. This confirms the turboquant_plus research finding that K precision dominates quality.
The practical impact: the KV cache shrinks to 26% of FP16 size. That is ~3.8x more concurrent conversations at the same context length, or the same conversations with ~3.8x longer context, on the same GPU, at the same cost.
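To make the capacity gain concrete, here is a rough sketch of the arithmetic. The 60 GB cache budget is an assumption for illustration (VRAM left after model weights), not a measured figure:

```python
fp16_per_user_gb = 17.2                        # illustrative figure from the text
turbo4_per_user_gb = fp16_per_user_gb * 0.26   # cache at 26% of FP16 size
cache_budget_gb = 60.0                         # assumed free VRAM for KV cache

# Concurrent 32K-context users that fit in the same budget.
print(int(cache_budget_gb // fp16_per_user_gb))    # 3 users at FP16
print(int(cache_budget_gb // turbo4_per_user_gb))  # 13 users with turbo4
```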
GLM-4.7 355B and DeepSeek-V3 671B benchmarks pending (require larger disk provisioning).
What is TurboQuant+
TurboQuant (Zandieh et al., 2025) is a data-oblivious vector quantization algorithm. After a random rotation, vector coordinates follow a known Gaussian distribution, so you can use precomputed optimal centroids instead of learning them from data. No calibration dataset, no per-model tuning, works instantly on any model.
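The rotate-then-quantize idea can be sketched in a few lines of NumPy. This illustrates the principle only, not the library's kernels: the dense QR rotation stands in for the fast transform the real implementation uses, and the 2-bit centroids are the classic Lloyd-Max levels for a standard Gaussian:

```python
import numpy as np

# Precomputed optimal 4-level (2-bit) quantizer levels for N(0, 1).
CENTROIDS_2BIT = np.array([-1.510, -0.4528, 0.4528, 1.510])

rng = np.random.default_rng(0)
d = 128
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal rotation

x = rng.standard_normal(d)
scale = np.linalg.norm(x) / np.sqrt(d)            # per-vector scale, stored alongside
r = (Q @ x) / scale                               # rotated coords are roughly N(0, 1)

# Data-oblivious step: snap each coordinate to the nearest fixed centroid.
idx = np.abs(r[:, None] - CENTROIDS_2BIT[None, :]).argmin(axis=1)
x_hat = Q.T @ (CENTROIDS_2BIT[idx] * scale)       # reconstruct

rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(f"relative error: {rel_err:.2f}")
```

No calibration data appears anywhere: the centroids are fixed ahead of time, which is what "data-oblivious" buys you.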
TurboQuant+ extends this for KV cache compression with two key ideas:
Separate algorithms for K and V. The K cache controls attention routing (Q @ K^T), so it needs inner product preservation. The V cache only feeds a weighted sum, so MSE preservation suffices.
- K cache: PolarQuant at (b-1) bits + QJL (1 bit) = b bits total
- V cache: PolarQuant MSE-only at full b bits
Asymmetric K/V bit widths. K precision dominates quality because it controls which tokens attend to which; V only affects how much each contributes. On weight-quantized models, symmetric turbo3 (3-bit K + 3-bit V) gives catastrophic perplexity (PPL 3556), while asymmetric K4/V3 works fine. This is not an optimization; it is a correctness requirement.
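Ignoring per-vector scale metadata (which adds a small constant overhead on top of every scheme), the bit accounting behind these trade-offs is simple arithmetic:

```python
# Average bits per cached element for each scheme, relative to FP16.
FP16_BITS = 16
schemes = {
    "turbo4 (K4/V4)": (4 + 4) / 2,
    "turbo3 (K3/V3)": (3 + 3) / 2,
    "asymmetric K4/V3": (4 + 3) / 2,
}
for name, bits in schemes.items():
    print(f"{name}: {bits:.1f} bits/elem, {100 * bits / FP16_BITS:.0f}% of FP16")
```

K4/V3 stores fewer bits than symmetric turbo4 while keeping the quality-critical K cache at 4 bits, which is why it can compress harder without the turbo3 collapse.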
I previously published turbo-quant-lite for embedding compression in PostgreSQL. Same mathematical foundation, different algorithm and codebook. Stored embeddings have different statistical properties than per-layer KV cache vectors.
The CUDA integration
vLLM's KV cache pipeline runs through CUDA kernels. Injecting Python adds 20-40% latency overhead. I built four fused CUDA kernels in turboquant-vllm that do the compression at the CUDA level:
| Kernel | Purpose |
|---|---|
| reshape_and_cache_kernel | Write path: normalize, WHT rotate, quantize, 4-bit pack, write to paged cache |
| dequant_paged_kernel | Read path: unpack, centroid lookup, inverse WHT, rescale to fp16 |
| qjl_quantize_residual_kernel | K cache: PolarQuant residual, QJL projection, sign bits |
| qjl_dequantize_and_add_kernel | K cache: QJL reconstruction added to PolarQuant output |
The Walsh-Hadamard Transform rotation runs in O(d log d). 896 FLOPs vs 16,384 for dense rotation at d=128. Fits entirely in shared memory with one thread per element per stage.
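For reference, the same butterfly structure in plain NumPy (illustration only; the kernel runs this in shared memory with one thread per element per stage):

```python
import numpy as np

def fwht(x):
    # In-place fast Walsh-Hadamard transform: log2(d) stages of
    # add/subtract butterflies, O(d log d) operations total.
    x = np.asarray(x, dtype=np.float64).copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(len(x))  # orthonormal scaling

x = np.random.default_rng(0).standard_normal(128)
assert np.allclose(fwht(fwht(x)), x)  # the orthonormal WHT is its own inverse
```

At d=128 that is 128 × log2(128) = 896 additions per vector, versus 16,384 multiply-adds for a dense rotation, and the self-inverse property means the read path reuses the same butterflies.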
The kernels use separate constant memory for K and V codebooks (required for asymmetric bit widths), and pack two 4-bit indices per byte to halve cache bandwidth.
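Nibble packing itself is a two-line bit trick. A sketch of the idea (the actual kernel's byte layout may differ; here the low nibble holds the even-indexed element):

```python
import numpy as np

def pack4(idx):
    # Two 4-bit indices per byte: even element in the low nibble,
    # odd element in the high nibble.
    idx = np.asarray(idx, dtype=np.uint8)
    return (idx[0::2] | (idx[1::2] << 4)).astype(np.uint8)

def unpack4(packed):
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = packed >> 4
    return out

codes = np.array([3, 15, 0, 7], dtype=np.uint8)
assert (unpack4(pack4(codes)) == codes).all()  # lossless round trip
```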
Bandwidth math at 32K context
| | FP16 | Turbo4 |
|---|---|---|
| KV cache size | 17.2 GB | 4.6 GB |
| Read time at 2TB/s (A100) | 8.6 ms | 2.3 ms |
| Dequant overhead | 0 | ~0.2 ms |
| Net per decode step | 8.6 ms | 2.5 ms |
Assumes 32 layers, 32 KV heads, head_dim=128 as an illustrative example. Note that Qwen3-235B uses only 4 KV heads (GQA), so its per-user cache is much smaller than this table suggests. Models with fewer KV heads have proportionally smaller caches, but the 3.8x compression ratio holds regardless. Actual gains depend on paged attention memory access patterns.
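The table rows follow directly from size divided by bandwidth. A quick sketch reproducing them:

```python
fp16_gb, turbo4_gb = 17.2, 4.6   # cache sizes from the table
bandwidth_gbps = 2000.0          # ~2 TB/s A100 HBM bandwidth

fp16_ms = fp16_gb / bandwidth_gbps * 1000       # read time per decode step
turbo4_ms = turbo4_gb / bandwidth_gbps * 1000
net_turbo4_ms = turbo4_ms + 0.2                 # plus dequant overhead
print(f"{fp16_ms:.1f} ms vs {net_turbo4_ms:.1f} ms per decode step")
```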
What I discovered
TQ+ works on MLA, FP8 does not
GLM-4.7-Flash and DeepSeek-V3 use Multi-head Latent Attention (MLA) instead of standard FlashAttention. Rather than storing separate K and V vectors, MLA stores a single compressed latent vector (kv_c_normed) plus a small positional encoding (k_pe). The model reconstructs K and V on the fly from the latent vector when needed.
TurboQuant+ compresses this latent vector the same way it compresses standard K/V: normalize, rotate, quantize. The patch targets MLACommonImpl, which covers all MLA backends (FLASHMLA, TRITON_MLA, FLASH_ATTN_MLA). Validated on GLM-4.7-Flash across 20 scenarios: TQ+ turbo3 scored 4.63 vs 4.61 baseline.
vLLM's built-in FP8 KV cache is broken on MLA models. Single-turn responses are coherent, but multi-turn conversations degrade to garbage. The root cause: the FLASHMLA backend applies FP8 without proper per-tensor scaling, and quantization error compounds with context length. TQ+ does not have this problem because PolarQuant normalizes each vector independently. Filed as vLLM issue #38652.
MoE memory trap
GLM-4.7 is 355B total parameters with a fraction active per token (typical for MoE). My initial VRAM estimate: ~100GB at Q4. The actual requirement: ~178GB. vLLM loads all expert weights into VRAM even though only a subset activates per token. The ~100GB estimate would be correct for the active parameter count, not the full model. This pushed GLM-4.7 from 2x to 4x A100.
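The corrected estimate is just total parameters times bytes per parameter. A sketch of the arithmetic (runtime overhead like activations and CUDA graphs pushes the observed figure slightly above this):

```python
total_params = 355e9        # all experts, since vLLM loads every one
bytes_per_param_q4 = 0.5    # 4-bit quantization ~ 0.5 bytes/param
weights_gb = total_params * bytes_per_param_q4 / 1e9
print(f"{weights_gb:.1f} GB")  # 177.5 GB, close to the observed ~178 GB
```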
Pre-quantized model availability
Newer open models often lack official GPTQ/AWQ variants:
- GLM-4.7: BF16 only (~710GB). Community GPTQ from QuantTrio at 196GB.
- Qwen3-30B-AWQ: Not from the Qwen org. Community version from QuixiAI.
- DeepSeek-V3: No widely adopted or officially supported GPTQ releases. Used vLLM's native FP8 quantization.
Community quantizations fill gaps but may have different quality characteristics than official releases.
What this means in production
For conversational AI products, the bottleneck is concurrent conversations. A customer service chatbot might handle 50 simultaneous conversations during peak hours. Each conversation holds its KV cache in GPU memory for the duration.
The per-user memory cost drops 3.8x. For models with many KV heads, like GLM-4.7 (96 heads), the KV cache dominates VRAM at long context, and this compression directly translates to more concurrent users. For efficient GQA models like Qwen3-30B (4 KV heads) or Qwen3-235B (4 KV heads), the per-user cache is already small, but the savings still free up VRAM for longer context windows or more concurrent sequences.
This shifts the economics. Instead of scaling by adding GPUs ($1.29/hr each), you scale by compressing the cache. For bursty workloads (a customer support bot busy during business hours, idle overnight), the savings compound.
Serverless GPU: the next step
The current setup runs on dedicated Verda instances. You pay whether the GPU is busy or idle. For production, Verda also offers serverless containers that scale to zero when not in use. For a conversational AI product with business-hours traffic, this could cut costs dramatically: no overnight idle charges, automatic scaling during peaks.
The trade-off is cold start time. Loading a 16GB model into VRAM takes 1-2 minutes. For a customer service bot where the first message of the day can wait, this is acceptable. For latency-critical applications, a warm instance is still better.
I am planning to test the serverless setup next. Same vLLM + TQ+ configuration, packaged as a container, deployed on Verda's serverless GPU infrastructure. If the cold start is reasonable, this becomes the production target.
What is next
- Remaining model benchmarks. GLM-4.7 355B and DeepSeek-V3 671B on larger instances.
- Verda serverless containers. Scale-to-zero GPU serving for production workloads.
- Native vLLM backend. Contributing to PR #38479 to add `--kv-cache-dtype turbo4` without monkey-patching. Posted benchmark results and MLA findings on the feature request thread.
- FP8 MLA fix. Investigating the root cause of FP8 KV cache failure on MLA models.