Third-party validation of large-model serving on vLLM. Real hardware, reproducible numbers, publishable artifacts. The deliverable is the artifact, not a summary slide.
Three audiences usually walk in:
You shipped an open-weight model with native quant — your team's busy with the next release, but customers are asking for vLLM serving numbers on real hardware. You need a third party to take the release, validate it, and publish the artifact under your or your vendor's name.
You run a GPU cloud or own a hardware platform and need a public benchmark series: six to twelve open-weight models on your SKU, full attribution, MIT-licensed scripts. Sales material that holds up to engineering scrutiny.
You author a quantization method and want independent third-party numbers on more models than you had GPU budget for, before camera-ready.
In each case the deliverable is the artifact, not a slide. Every claim resolves to a checkpoint, an upstream PR, or a writeup you can audit.
Each engagement runs through the same four gates. The technical pipeline is roughly a week of work; the end-to-end calendar, including scoping and report production, is what the engagement shapes below describe. Effort and compute both rise at each gate, so most failure modes are caught while GPU spend is still small. You see results at every gate and can stop cleanly if any of them tells you to.
Take ~1% of a single layer, run it through the conversion code, and compare against the reference values. This proves the converter understands your release format before any real compute burns.
Convert one full layer end-to-end, load it into vLLM, run a 32-token generation, and diff the output against a reference run. The first real quality signal.
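One way to express that diff, assuming both runs expose token ids (the helper name is mine, not a vLLM API):

```python
def first_divergence(reference_ids, candidate_ids):
    """Return the index of the first token where two generations
    disagree, or None if they match exactly. Illustrative helper."""
    for i, (a, b) in enumerate(zip(reference_ids, candidate_ids)):
        if a != b:
            return i
    if len(reference_ids) != len(candidate_ids):
        # One run stopped early: divergence at the shorter length.
        return min(len(reference_ids), len(candidate_ids))
    return None
```

Exact token matching is deliberately strict: it is only meaningful with greedy decoding and pinned seeds, which is how the gate is run.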
The whole model converted. Checkpoint published to HuggingFace under your org (or under varjosoft/ with attribution). File integrity verified, weight stats sane, loadable as a vLLM checkpoint.
vLLM load, batch generate, full measurement suite. Memory, throughput (tok/s, TTFT, ITL), quality (PPL, GSM8K, optional 20-scenario judge eval). Honest comparison against FP16 and against publicly available competitors.
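For orientation, the throughput side of the suite reduces to a few timestamp deltas. A minimal sketch, with function and field names of my choosing rather than the report schema:

```python
def serving_metrics(request_start, token_times):
    """Derive TTFT, mean ITL, and decode throughput from per-token
    completion timestamps (all in seconds). Illustrative helper."""
    ttft = token_times[0] - request_start          # time to first token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0  # inter-token latency
    tok_per_s = len(token_times) / (token_times[-1] - request_start)
    return {"ttft_s": ttft, "mean_itl_s": mean_itl, "tok_per_s": tok_per_s}
```

Quality metrics (PPL, GSM8K, judge eval) come from separate harness runs; the report carries both sides next to the FP16 baseline.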
A result.json for anyone who wants to re-run.

The engagement shapes below are quoted calendar end-to-end, including scoping at the start and report production at the end; the four-gate technical pipeline sits inside each of them. Pricing depends on model count, hardware classes, and whether the engagement produces a new checkpoint or only a report. The proposal arrives after a short scoping conversation.
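The result.json mentioned above might take a shape like the following; every field name and number here is an illustrative placeholder, not a fixed schema or a measured result:

```python
import json

# Illustrative shape only; real runs define their own schema and values.
result = {
    "model": "example-org/example-model-8b",   # hypothetical checkpoint name
    "hardware": "1x H100 80GB",
    "gates_passed": ["subset", "layer", "checkpoint", "measurement"],
    "metrics": {
        "ttft_s": 0.21,        # placeholder numbers, not measurements
        "mean_itl_s": 0.011,
        "tok_per_s": 91.3,
        "ppl_wikitext": 6.42,
    },
}
print(json.dumps(result, indent=2))
```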
The fastest path to a publishable artifact. For model labs and inference teams who need one solid third-party number.
For quantization-method authors who want independent third-party numbers on more models than their GPU budget covered, before camera-ready.
For GPU clouds and hardware vendors who need an ongoing benchmark surface with credible third-party attribution. Often paired with a compute sponsorship.
Validation gives you the artifact. Operating the model in production is a different shape of work and a different commercial conversation. I can take it on as a follow-on engagement, but the terms depend heavily on three things that vary case by case:
No rate card here on purpose — these dimensions matter too much for a fixed price to make sense. If you want to talk about it, mention "ongoing serving" in your scoping note.
The lab loses credibility fast if it sells what it can't honestly deliver. Out of scope by design:
If raw size is the metric, llama.cpp Q2 wins. I publish the honest comparison table.
The report carries measured PPL, GSM8K, and judge-eval numbers. No threshold I haven't measured.
Marlin, Machete, and FLUTE win on raw throughput. I earn ground on irregular shapes — MoE, MLA, hybrid attention.
Every failure mode I encounter ends up in the report. The lab's credibility is the honest comparison table.
The work is published under varjosoft attribution. That's the entire credibility model.
Not on the rate card — see After the engagement above. It's a separate scoping conversation, not a checkbox.
Use the contact form below. A reply lands within two business days; if the work is a fit, the next step is a 20-minute call.