Validation engagements · open

Quantize, serve, prove it.

Third-party validation of large-model serving on vLLM. Real hardware, reproducible numbers, publishable artifacts. The deliverable is the artifact, not a summary slide.

qwen3.6-35b-a3b · validation log
verda.bench qwen3.6-35b-a3b --quant tq3 --gpu a100-80g
[12:42] gate A · smoke ✓ pass · 2m
[12:55] gate B · partial ✓ pass · 14m
[13:18] gate C · full ✓ pass · 40m
        review per gate: codex + claude-code → hannu approves
[14:01] gate D · serving ✗ regression
        eager 9.81 tok/s · graphs ~3 tok/s · 3.3× slowdown vs expected
        diagnose: dequant fused outside CUDAGraph
        patch: torch.library.custom_op wrap
        ship: turboquant-vllm v0.13.5
[15:30] gate D · serving (retry) ✓ pass · 60m
        memory 16.2 GB vs 70 GB BF16 · throughput 16.0 tok/s (graphs on, A100 80GB)
        speedup 1.6× eager · v0.13.5 · reproduced 4 of 4 runs
→ artifact: huggingface.co/varjosoft/Qwen3.6-35B-A3B-TQ3-native
The track record
6 production-grade TQ3 checkpoints across Gemma 4, GLM, and Qwen 3.6 families  → HF org
3 upstream vLLM PRs — #38479 KV cache compression (contributor), #39970 turboquant weight path, #40542 A100 MoE tuning
4 hardware classes covered: A100, H100, RTX PRO Blackwell, M4 Pro Mac
artifacts public, reproducible, and audit-ready by design  → writing
Who this is for

Numbers people can actually trust.

Three audiences usually walk in:

You shipped an open-weight model with native quant — your team's busy with the next release, but customers are asking for vLLM serving numbers on real hardware. You need a third party to take the release, validate it, and publish the artifact under your or your vendor's name.

You run a GPU cloud or own a hardware platform and need a public benchmark series: six to twelve open-weight models on your SKU, full attribution, MIT-licensed scripts. Sales material that holds up to engineering scrutiny.

You author a quantization method and want independent third-party numbers on more models than you had GPU budget for, before camera-ready.

In each case the deliverable is the artifact, not a slide. Every claim resolves to a checkpoint, an upstream PR, or a writeup you can audit.

How it runs

Four gates. Cheap first, expensive last.

Each engagement runs through the same four gates. The technical pipeline is roughly a week of work; calendar end-to-end (with scoping and report production) is what the engagement shapes describe. Effort and compute both go up at each gate, so most failure modes get caught when GPU spend is still small. You see results at every gate and can stop cleanly if any of them tells you to.

A · SMOKE      Hours      trivial compute
B · PARTIAL    ~1 day     single-layer compute
C · FULL       1–2 days   full-model conversion
D · SERVING    3–5 days   benchmark + report
GATE A
Hours · trivial compute

Smoke test

Take ~1% of a single layer, run it through the conversion code, compare to the reference value. Proves the converter understands your release format before any real compute burns.
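
A minimal sketch of that check; fake_quant_int3 below is only a stand-in for the real TQ3 converter, and the shard and tensor names are illustrative.

```python
import torch
from safetensors import safe_open

def fake_quant_int3(w: torch.Tensor) -> torch.Tensor:
    # Stand-in for the converter under test: symmetric 3-bit round-trip per output row.
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 3.0
    return (w / scale).round().clamp(-4, 3) * scale

# Illustrative shard / tensor names; the point is slicing ~1% of one layer.
with safe_open("model-00001-of-00015.safetensors", framework="pt") as f:
    w = f.get_tensor("model.layers.0.mlp.gate_proj.weight").float()

rows = max(1, w.shape[0] // 100)            # ~1% of the layer's output rows
sample = w[:rows]
recon = fake_quant_int3(sample)

# Compare against the reference before any real compute burns.
print(f"max abs round-trip error on {rows} rows: {(sample - recon).abs().max().item():.3e}")
```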

What you do
Hand over the model URL + source-format spec (HF link, paper, README, or whatever you have).
What you get
A one-page log: parse-tree of the source format, conversion attempt, pass / fail.
What you decide
If Gate A fails, we replan together before any real money is spent. If it passes, we proceed to B.
From the field · DeepSeek-V4-Flash: Gate A surfaced five source-format facts before any real compute burned — FP4+FP8 mixed-precision experts, untagged MTP heads, CSA/HCA infrastructure tensors, renamed MLA projections, per-expert layout already supported. Saved 1–3 days of converter rabbit holes.
GATE B
~1 day · single-layer compute

Partial conversion

Convert one full layer end-to-end, load into vLLM, run a 32-token generation, diff the output against a reference run. First real quality signal.
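
Roughly, the diff step looks like this; the one-layer checkpoint name and the stored baseline completions are placeholders, and greedy sampling keeps runs comparable.

```python
from vllm import LLM, SamplingParams

# Placeholder: completions captured earlier from the run you trust (FP16 baseline or existing release).
REFERENCE_OUTPUTS = ["<text captured from the baseline run>"]

prompts = ["Explain what a KV cache does, in one sentence."]

llm = LLM(model="varjosoft/partial-one-layer-tq3")        # placeholder checkpoint name
params = SamplingParams(max_tokens=32, temperature=0.0)   # greedy, so outputs are diffable

for out, ref in zip(llm.generate(prompts, params), REFERENCE_OUTPUTS):
    got = out.outputs[0].text
    print("match" if got == ref else f"DIFF\n  got: {got!r}\n  ref: {ref!r}")
```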

What you do
Confirm the target hardware class and a reference run for comparison (your FP16 baseline, an existing release, or whatever you trust).
What you get
A one-layer vLLM-loadable checkpoint, a diff report against the reference, and any scope-change recommendations (custom kernel work, attention shape, dispatch path).
What you decide
Whether the layer math holds. If it needs custom work, we agree on extra scope or stop. If clean, we proceed to C.
From the field · Qwen3.6-A3B partial-rotary attention: Gate B revealed a block-diagonal WHT requirement; we added the scope and filed the upstream vLLM patch alongside the engagement.
GATE C
1–2 days · full-model conversion

Full conversion

The whole model converted. Checkpoint published to Hugging Face under your org (or under varjosoft/ with attribution). File integrity verified, weight stats sane, loadable as a vLLM checkpoint.
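
A sketch of what 'file integrity verified, weight stats sane' means against a local checkout; the directory and the scales tensor-name convention are assumptions.

```python
import hashlib
from pathlib import Path
from safetensors import safe_open

ckpt_dir = Path("Qwen3.6-35B-A3B-TQ3-native")        # assumed local checkout of the HF repo

for shard in sorted(ckpt_dir.glob("*.safetensors")):
    # File integrity: stream-hash each shard so the model card can carry checksums.
    h = hashlib.sha256()
    with shard.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 24), b""):
            h.update(chunk)
    print(f"{shard.name}  sha256={h.hexdigest()[:16]}")

    # Weight sanity: quantization scales should be finite and non-degenerate.
    with safe_open(str(shard), framework="pt") as f:
        for name in f.keys():
            if name.endswith("scales"):               # assumed tensor-name convention
                t = f.get_tensor(name).float()
                assert t.isfinite().all(), f"non-finite scales in {name}"
                print(f"  {name}: min={t.min().item():.3g} max={t.max().item():.3g}")
```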

What you do
Approve the HF target (private or public) and the model-card draft.
What you get
A working HF model card with downloadable weights, an integrity report, weight-distribution summary, and the conversion repository.
What you decide
Some engagements end here — a clean converted checkpoint is the whole deliverable. Others continue to Gate D for the full serving report.
From the field · GLM-5.1-Open: Gate C produced a 309 GB checkpoint that fits on 2×H200 instead of the 8×H200 that BF16 needed. The customer's procurement story changed at this gate.
GATE D
3–5 days · benchmark + report

Serving validation

vLLM load, batch generate, full measurement suite. Memory, throughput (tok/s, TTFT, ITL), quality (PPL, GSM8K, optional 20-scenario judge eval). Honest comparison against FP16 and against publicly available competitors.
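
The throughput and memory half of that suite, in simplified form; TTFT and ITL need per-token timestamps and aren't shown, the batch size and output length are arbitrary, and loading the TQ3 checkpoint assumes the turboquant-vllm plugin is installed.

```python
import time
import torch
from vllm import LLM, SamplingParams

llm = LLM(model="varjosoft/Qwen3.6-35B-A3B-TQ3-native")       # checkpoint from the validation log
params = SamplingParams(max_tokens=256, temperature=0.0)
prompts = ["Summarise the attention mechanism in transformers."] * 8   # small fixed batch

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"throughput:      {generated / elapsed:.1f} tok/s over {elapsed:.1f}s")
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")
```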

What you do
Pick the eval set, the comparison axes that matter for your audience, and any extra scenarios you want stress-tested.
What you get
The actual report — a written artifact with comparison tables, latency / quality / cost graphs, every failure mode encountered, and raw result.json for anyone who wants to re-run.
What you decide
This is the deliverable. From here you publish, you ship, or you make decisions about hardware procurement.
From the field · Qwen3.6-35B-A3B: Gate D first revealed a CUDA-graphs-on regression; the fix shipped as turboquant-vllm v0.13.5 and the corrected numbers were re-published with the failure mode named.
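
For the curious, the general shape of the torch.library.custom_op wrap named in the log at the top (PyTorch 2.4+), not the actual turboquant-vllm patch; unpack_tq3 and the op name are illustrative.

```python
import torch

def unpack_tq3(packed: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    # Hypothetical dequant helper standing in for the real TQ3 unpacking kernel.
    return packed.to(scales.dtype) * scales

@torch.library.custom_op("turboquant::dequant_gemm", mutates_args=())
def dequant_gemm(x: torch.Tensor, packed: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    # Dequantize-then-matmul as one opaque op, so the dequant stays inside the captured
    # CUDA graph instead of running eagerly outside it.
    return x @ unpack_tq3(packed, scales).t()

@dequant_gemm.register_fake
def _(x, packed, scales):
    # Shape/dtype propagation so the op can be traced without running the real kernel.
    return x.new_empty(*x.shape[:-1], packed.shape[0])
```
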
Engagement shapes

Three rough scopes. Each negotiated individually.

Durations are calendar end-to-end, including scoping at the start and report production at the end. The technical pipeline of four gates above sits inside each of these. Pricing depends on model count, hardware classes, and whether the engagement produces a new checkpoint or only a report. The proposal arrives after a short scoping conversation.

1–2 weeks · from €2k

Single-model serving report

The fastest path to a publishable artifact. For model labs and inference teams who need one solid third-party number.

Models: 1
Hardware: 1 class
Methods: 1
You get
  • Hugging Face model card with converted weights (your org or varjosoft/)
  • Reproduction repository, MIT-licensed
  • Written report: per-gate findings, comparison table, failure modes
  • Public blog post or model-card prose (optional)
3–4 weeks · from €8k

Method validation

For quantization-method authors who want independent third-party numbers on more models than their GPU budget covered, before camera-ready.

Models: 3–4 representative
Hardware: 2 classes
Methods: your method
You get
  • One HF model card per (model × hardware) pair
  • Cross-model comparison: where the method wins, where it doesn't
  • Reproduction repository covering every run
  • Strengths / weaknesses writeup grounded in measurement
  • Pre-print contribution or paper appendix (optional)
5–7 weeks · from €15k

Vendor benchmark series

For GPU clouds and hardware vendors who need an ongoing benchmark surface with credible third-party attribution. Often paired with a compute sponsorship.

Models: 6–8 open-weight
Hardware: your platform
Methods: several, compared
You get
  • HF model cards published under attribution
  • Sales-ready summary report with comparison tables and graphs
  • Per-model writeup framed for your audience
  • Reproduction repository
  • Keynote-grade graphs and slide assets (optional)
After the engagement

If you need someone to keep it running — that's a separate conversation.

Validation gives you the artifact. Operating the model in production is a different shape of work and a different commercial conversation. I can take it on as a follow-on engagement, but the terms depend heavily on three things that vary case by case:

Cloud
Your choice
Verda, Lambda, CoreWeave, Crusoe, AWS, GCP, or your own hardware. Pricing follows.
Geolocation
EU, US, APAC
Latency targets, data-residency requirements, regulatory shape all sit here.
SLA
Best-effort → 99.9%
Availability targets, response windows, on-call coverage. Scoped per engagement.

No rate card here on purpose — these dimensions matter too much for a fixed price to make sense. If you want to talk about it, mention "ongoing serving" in your scoping note.

Honest limits

What I will not do.

The lab loses credibility fast if it sells what it can't honestly deliver. Out of scope by design:

Compression record-setting

If raw size is the metric, llama.cpp Q2 wins. I publish the honest comparison table.

Quality guarantees ahead of measurement

The report carries measured PPL, GSM8K, and judge-eval numbers. I don't commit to a quality threshold I haven't measured.

Beating purpose-built kernels

Marlin, Machete, and FLUTE win on raw throughput. I earn ground on irregular shapes — MoE, MLA, hybrid attention.

Hidden findings

Every failure mode I encounter ends up in the report. The lab's credibility is the honest comparison table.

White-label or anonymous validation

The work is published under varjosoft attribution. That's the entire credibility model.

Hosted inference as a default product

Not on the rate card — see After the engagement above. It's a separate scoping conversation, not a checkbox.

How to reach me

Send a paragraph. I'll read it the same day.

Use the contact form below. A reply lands within two business days; if the work is a fit, the next step is a 20-minute call.

Cover these four so the first reply isn't five questions:
  • Model — HF link, or the release you want validated
  • Hardware — A100 80GB, H100, Blackwell, Mac, or your own
  • Deliverable — public report, internal benchmark, or both
  • Budget range — even rough; saves a back-and-forth on fit
Yours,
Hannu Varjoranta
Varjosoft Oy · Helsinki, Finland