Validation engagements · open

Quantize, serve, prove it.

Third-party validation of large-model serving on vLLM. Real hardware, reproducible numbers, publishable artifacts. The deliverable is the artifact, not a summary slide.

qwen3.6-35b-a3b · validation log
verda.bench qwen3.6-35b-a3b --quant tq3 --gpu a100-80g
[12:42] gate A · smoke ✓ pass · 2m
[12:55] gate B · partial ✓ pass · 14m
[13:18] gate C · full ✓ pass · 40m
        review per gate: codex + claude-code → hannu approves
[14:01] gate D · serving ✗ regression
        eager 9.81 tok/s · graphs ~3 tok/s · 3.3× slowdown vs expected
        diagnose: dequant fused outside CUDAGraph
        patch: torch.library.custom_op wrap
        ship: turboquant-vllm v0.13.5
[15:30] gate D · serving (retry) ✓ pass · 60m
        memory 16.2 GB vs 70 GB BF16 · throughput 16.0 tok/s (graphs on, A100 80GB)
        speedup 1.6× eager · v0.13.5 · reproduced 4 of 4 runs
→ artifact: huggingface.co/varjosoft/Qwen3.6-35B-A3B-TQ3-native
The track record
6 production-grade TQ3 checkpoints across Gemma 4, GLM, and Qwen 3.6 families  → HF org
3 upstream vLLM PRs — #38479 KV cache compression (contributor), #39970 turboquant weight path, #40542 A100 MoE tuning
4 hardware classes covered: A100, H100, RTX PRO Blackwell, M4 Pro Mac
artifacts public, reproducible, and audit-ready by design  → writing
Who this is for

Numbers people can actually trust.

Three audiences usually walk in:

You shipped an open-weight model with native quant — your team's busy with the next release, but customers are asking for vLLM serving numbers on real hardware. You need a third party to take the release, validate it, and publish the artifact under your or your vendor's name.

You run a GPU cloud or own a hardware platform and need a public benchmark series: six to twelve open-weight models on your SKU, full attribution, MIT-licensed scripts. Sales material that holds up to engineering scrutiny.

You author a quantization method and want independent third-party numbers on more models than you had GPU budget for, before camera-ready.

In each case the deliverable is the artifact, not a slide. Every claim resolves to a checkpoint, an upstream PR, or a writeup you can audit.

How it runs

Four gates. Cheap first, expensive last.

Each engagement runs through the same four gates. The technical pipeline is roughly a week of work; calendar end-to-end (with scoping and report production) is what the engagement shapes describe. Effort and compute both go up at each gate, so most failure modes get caught when GPU spend is still small. You see results at every gate and can stop cleanly if any of them tells you to.

A · SMOKE      Hours      trivial compute
B · PARTIAL    ~1 day     single-layer compute
C · FULL       1–2 days   full-model conversion
D · SERVING    3–5 days   benchmark + report
GATE A
Hours · trivial compute

Smoke test

Take ~1% of a single layer, run it through the conversion code, compare to the reference value. Proves the converter understands your release format before any real compute burns.
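
A minimal sketch of that check; fake_quant_int3 below is only a stand-in for the real TQ3 converter, and the shard and tensor names are illustrative.

```python
import torch
from safetensors import safe_open

def fake_quant_int3(w: torch.Tensor) -> torch.Tensor:
    # Stand-in for the converter under test: symmetric 3-bit round-trip per output row.
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 3.0
    return (w / scale).round().clamp(-4, 3) * scale

# Illustrative shard / tensor names; the point is slicing ~1% of one layer.
with safe_open("model-00001-of-00015.safetensors", framework="pt") as f:
    w = f.get_tensor("model.layers.0.mlp.gate_proj.weight").float()

rows = max(1, w.shape[0] // 100)            # ~1% of the layer's output rows
sample = w[:rows]
recon = fake_quant_int3(sample)

# Compare against the reference before any real compute burns.
print(f"max abs round-trip error on {rows} rows: {(sample - recon).abs().max().item():.3e}")
```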

What you do
Hand over the model URL + source-format spec (HF link, paper, README, or whatever you have).
What you get
A one-page log: parse-tree of the source format, conversion attempt, pass / fail.
What you decide
If Gate A fails, we replan together before any real money is spent. If it passes, we proceed to B.
From the field · DeepSeek-V4-Flash: Gate A surfaced five source-format facts before any real compute burned — FP4+FP8 mixed-precision experts, untagged MTP heads, CSA/HCA infrastructure tensors, renamed MLA projections, per-expert layout already supported. Saved 1–3 days of converter rabbit holes.
GATE B
~1 day · single-layer compute

Partial conversion

Convert one full layer end-to-end, load into vLLM, run a 32-token generation, diff the output against a reference run. First real quality signal.
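
Roughly, the diff step looks like this; the one-layer checkpoint name and the stored baseline completions are placeholders, and greedy sampling keeps runs comparable.

```python
from vllm import LLM, SamplingParams

# Placeholder: completions captured earlier from the run you trust (FP16 baseline or existing release).
REFERENCE_OUTPUTS = ["<text captured from the baseline run>"]

prompts = ["Explain what a KV cache does, in one sentence."]

llm = LLM(model="varjosoft/partial-one-layer-tq3")        # placeholder checkpoint name
params = SamplingParams(max_tokens=32, temperature=0.0)   # greedy, so outputs are diffable

for out, ref in zip(llm.generate(prompts, params), REFERENCE_OUTPUTS):
    got = out.outputs[0].text
    print("match" if got == ref else f"DIFF\n  got: {got!r}\n  ref: {ref!r}")
```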

What you do
Confirm the target hardware class and a reference run for comparison (your FP16 baseline, an existing release, or whatever you trust).
What you get
A one-layer vLLM-loadable checkpoint, a diff report against the reference, and any scope-change recommendations (custom kernel work, attention shape, dispatch path).
What you decide
Whether the layer math holds. If it needs custom work, we agree on extra scope or stop. If clean, we proceed to C.
From the field · Qwen3.6-A3B partial-rotary attention: Gate B revealed a block-diagonal WHT requirement; we added the scope and filed the upstream vLLM patch alongside the engagement.
GATE C
1–2 days · full-model conversion

Full conversion

The whole model converted. Checkpoint published to Hugging Face under your org (or under varjosoft/ with attribution). File integrity verified, weight stats sane, loadable as a vLLM checkpoint.
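
A sketch of what 'file integrity verified, weight stats sane' means against a local checkout; the directory and the scales tensor-name convention are assumptions.

```python
import hashlib
from pathlib import Path
from safetensors import safe_open

ckpt_dir = Path("Qwen3.6-35B-A3B-TQ3-native")        # assumed local checkout of the HF repo

for shard in sorted(ckpt_dir.glob("*.safetensors")):
    # File integrity: stream-hash each shard so the model card can carry checksums.
    h = hashlib.sha256()
    with shard.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 24), b""):
            h.update(chunk)
    print(f"{shard.name}  sha256={h.hexdigest()[:16]}")

    # Weight sanity: quantization scales should be finite and non-degenerate.
    with safe_open(str(shard), framework="pt") as f:
        for name in f.keys():
            if name.endswith("scales"):               # assumed tensor-name convention
                t = f.get_tensor(name).float()
                assert t.isfinite().all(), f"non-finite scales in {name}"
                print(f"  {name}: min={t.min().item():.3g} max={t.max().item():.3g}")
```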

What you do
Approve the HF target (private or public) and the model-card draft.
What you get
A working HF model card with downloadable weights, an integrity report, weight-distribution summary, and the conversion repository.
What you decide
Some engagements end here — a clean converted checkpoint is the whole deliverable. Others continue to Gate D for the full serving report.
From the field · GLM-5.1-Open: Gate C produced a 309 GB checkpoint that fits on 2×H200 instead of the 8×H200 that BF16 needed. The customer's procurement story changed at this gate.
GATE D
3–5 days · benchmark + report

Serving validation

vLLM load, batch generate, full measurement suite. Memory, throughput (tok/s, TTFT, ITL), quality (PPL, GSM8K, optional 20-scenario judge eval). Honest comparison against FP16 and against publicly available competitors.
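
The throughput and memory half of that suite, in simplified form; TTFT and ITL need per-token timestamps and aren't shown, the batch size and output length are arbitrary, and loading the TQ3 checkpoint assumes the turboquant-vllm plugin is installed.

```python
import time
import torch
from vllm import LLM, SamplingParams

llm = LLM(model="varjosoft/Qwen3.6-35B-A3B-TQ3-native")       # checkpoint from the validation log
params = SamplingParams(max_tokens=256, temperature=0.0)
prompts = ["Summarise the attention mechanism in transformers."] * 8   # small fixed batch

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"throughput:      {generated / elapsed:.1f} tok/s over {elapsed:.1f}s")
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")
```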

What you do
Pick the eval set, the comparison axes that matter for your audience, and any extra scenarios you want stress-tested.
What you get
The actual report — a written artifact with comparison tables, latency / quality / cost graphs, every failure mode encountered, and raw result.json for anyone who wants to re-run.
What you decide
This is the deliverable. From here you publish, you ship, or you make decisions about hardware procurement.
From the field · Qwen3.6-35B-A3B: Gate D first revealed a CUDA-graphs-on regression; the fix shipped as turboquant-vllm v0.13.5 and the corrected numbers were re-published with the failure mode named.
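
For the curious, the general shape of the torch.library.custom_op wrap named in the log at the top (PyTorch 2.4+), not the actual turboquant-vllm patch; unpack_tq3 and the op name are illustrative.

```python
import torch

def unpack_tq3(packed: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    # Hypothetical dequant helper standing in for the real TQ3 unpacking kernel.
    return packed.to(scales.dtype) * scales

@torch.library.custom_op("turboquant::dequant_gemm", mutates_args=())
def dequant_gemm(x: torch.Tensor, packed: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    # Dequantize-then-matmul as one opaque op, so the dequant stays inside the captured
    # CUDA graph instead of running eagerly outside it.
    return x @ unpack_tq3(packed, scales).t()

@dequant_gemm.register_fake
def _(x, packed, scales):
    # Shape/dtype propagation so the op can be traced without running the real kernel.
    return x.new_empty(*x.shape[:-1], packed.shape[0])
```
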
Engagement shapes

Three rough scopes. Each negotiated individually.

Durations are calendar end-to-end, including scoping at the start and report production at the end. The technical pipeline of four gates above sits inside each of these. Pricing depends on model count, hardware classes, and whether the engagement produces a new checkpoint or only a report. The proposal arrives after a short scoping conversation.

1–2 weeks · from €2k

Single-model serving report

The fastest path to a publishable artifact. For model labs and inference teams who need one solid third-party number.

Models: 1
Hardware: 1 class
Methods: 1
You get
  • Hugging Face model card with converted weights (your org or varjosoft/)
  • Reproduction repository, MIT-licensed
  • Written report: per-gate findings, comparison table, failure modes
  • Public blog post or model-card prose (optional)
3–4 weeks · from €8k

Method validation

For quantization-method authors who want independent third-party numbers on more models than their GPU budget covered, before camera-ready.

Models: 3–4 representative
Hardware: 2 classes
Methods: your method
You get
  • One HF model card per (model × hardware) pair
  • Cross-model comparison: where the method wins, where it doesn't
  • Reproduction repository covering every run
  • Strengths / weaknesses writeup grounded in measurement
  • Pre-print contribution or paper appendix (optional)
5–7 weeks · from €15k

Vendor benchmark series

For GPU clouds and hardware vendors who need an ongoing benchmark surface with credible third-party attribution. Often paired with a compute sponsorship.

Models: 6–8 open-weight
Hardware: your platform
Methods: several, compared
You get
  • HF model cards published under attribution
  • Sales-ready summary report with comparison tables and graphs
  • Per-model writeup framed for your audience
  • Reproduction repository
  • Keynote-grade graphs and slide assets (optional)
After the engagement

If you need someone to keep it running — that's a separate conversation.

Validation gives you the artifact. Operating the model in production is a different shape of work and a different commercial conversation. I can take it on as a follow-on engagement, but the terms depend heavily on three things that vary case by case:

Cloud
Your choice
Verda, Lambda, CoreWeave, Crusoe, AWS, GCP, or your own hardware. Pricing follows.
Geolocation
EU, US, APAC
Latency targets, data-residency requirements, regulatory shape all sit here.
SLA
Best-effort → 99.9%
Availability targets, response windows, on-call coverage. Scoped per engagement.

No rate card here on purpose — these dimensions matter too much for a fixed price to make sense. If you want to talk about it, mention "ongoing serving" in your scoping note.

Honest limits

What I will not do.

The lab loses credibility fast if it sells what it can't honestly deliver. Out of scope by design:

Compression record-setting

If raw size is the metric, llama.cpp Q2 wins. I publish the honest comparison table.

Quality guarantees ahead of measurement

The report carries measured PPL, GSM8K, and judge-eval numbers. I don't commit to a quality threshold I haven't measured.

Beating purpose-built kernels

Marlin, Machete, and FLUTE win on raw throughput. I earn ground on irregular shapes — MoE, MLA, hybrid attention.

Hidden findings

Every failure mode I encounter ends up in the report. The lab's credibility is the honest comparison table.

White-label or anonymous validation

The work is published under varjosoft attribution. That's the entire credibility model.

Hosted inference as a default product

Not on the rate card — see After the engagement above. It's a separate scoping conversation, not a checkbox.

How to reach me

Send a paragraph. I'll read it the same day.

Use the contact form below. A reply lands within two business days; if the work is a fit, the next step is a 20-minute call.

Cover these four so the first reply isn't five questions:
  • Model — HF link, or the release you want validated
  • Hardware — A100 80GB, H100, Blackwell, Mac, or your own
  • Deliverable — public report, internal benchmark, or both
  • Budget range — even rough; saves a back-and-forth on fit
Yours,
Hannu Varjoranta
Varjosoft Oy · Helsinki, Finland