Agents you can prove,
not just ship.
Most agents reach production on a promise and a demo that never saw real traffic. I build agents for one job and ship them with their own evidence: an eval harness that hunts for the conversations you would never script, judged by a model that does not share the agent's blind spots. You deploy knowing where it breaks.
Built for one job, shipped with their own evidence.
I do not sell a generic agent. I build one that does a specific job well, then ship it with the harness that tested it, so the thing you deploy carries its own proof. Support triage, research synthesis, an internal copilot, a document workflow: whatever the job, the deliverable is the agent plus the reproduction that shows it holds. You get both in your hands, not a slide about quality.
The discipline comes from building my own. Spegling and Dots run on the same verification I would put around yours: an agent earns the right to ship by surviving the conversations it will actually face, not by passing the demo.
I also work below the agent, at the model layer: quantization, serving on real hardware, the KV cache. If your problem lives down there, the writing goes deep. Read the technical work →
Your agent's worst conversations are the ones you never read.
Put an agent in front of people and you create a surface you cannot watch: thousands of unscripted conversations where it can promise something false, break tone, misstate a policy, miss an escalation, or qualify the wrong lead. Pre-launch tests never saw that traffic. A latency dashboard cannot tell you the agent just said something untrue. So before an agent ships, I try to break it the way reality will.
Conversations you would never script
I generate thousands of adversarial dialogues across the ways an agent actually fails: hostile users, edge-case policy, ambiguous asks, prompt injection, the question asked sideways. Once it is live, your real traffic feeds the same harness.
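One way to picture how that coverage is laid out is as a grid: every job the agent handles, crossed with every failure category named above, repeated for several variants. The sketch below is only an illustration under that assumption. The task names, the `scenario_grid` helper, and the spec fields are hypothetical, and the step that turns each spec into an actual dialogue is left out.

```python
from itertools import product

# The five failure categories named in the text above.
CATEGORIES = [
    "hostile user",
    "edge-case policy",
    "ambiguous ask",
    "prompt injection",
    "question asked sideways",
]

def scenario_grid(tasks: list[str], variants: int) -> list[dict]:
    """Build one scenario spec per (task, category, variant) combination.

    Each spec is a plain dict that a dialogue generator (not shown,
    hypothetical) would expand into a full adversarial conversation.
    """
    return [
        {"task": task, "category": category, "variant": variant}
        for task, category, variant in product(tasks, CATEGORIES, range(variants))
    ]

# Hypothetical example: two tasks, five categories, three variants each.
specs = scenario_grid(["refund request", "plan upgrade"], variants=3)
print(len(specs))  # 2 * 5 * 3 = 30 scenario specs
```

Laying the corpus out as a grid makes gaps visible: a category with no passing or failing examples for a given task is a category that was never tested.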
A judge that does not share the blind spot
Every conversation is scored by a model from a different family than the one that produced it. An auditor that thinks like the agent misses exactly what the agent missed. That independence, together with having no stake in the score, is what an eval you run on your own model cannot give you.
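The cross-family rule can be enforced mechanically rather than by convention. A minimal sketch, assuming a lookup table from model name to family: the model names, the `FAMILY` table, and the `pick_judge` helper are all placeholders invented for illustration, not the models or code used in the harness.

```python
# Hypothetical model-to-family table; the names are placeholders.
FAMILY = {
    "agent-model-a": "family-1",
    "judge-model-b": "family-2",
    "judge-model-c": "family-1",
}

def pick_judge(agent_model: str, candidates: list[str]) -> str:
    """Return the first candidate judge whose family differs from the agent's.

    Raises ValueError instead of silently falling back to a same-family
    judge, because a same-family judge would share the agent's blind spots.
    """
    agent_family = FAMILY[agent_model]
    for judge in candidates:
        if FAMILY[judge] != agent_family:
            return judge
    raise ValueError("no judge available from a different model family")

# judge-model-c shares the agent's family, so it is skipped.
print(pick_judge("agent-model-a", ["judge-model-c", "judge-model-b"]))
```

Failing loudly when no independent judge exists is the design choice that matters here: an eval that quietly scores itself is the case the rule is meant to prevent.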
It has to hold more than once
A single good run is luck. I check the same behaviour under repetition with pass^k, which counts a task as passed only if it passes on all k independent runs, so a fix that works once but quietly regresses does not slip through. This is the same gate I put on my own coding agents before a change is allowed to land.
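To make the arithmetic concrete, here is a sketch of one common way to estimate pass^k from recorded runs: if a task was run n times and passed c of them, the chance that k runs drawn without replacement all pass is C(c, k) / C(n, k). Whether the harness uses this exact estimator is an assumption, and the function name `pass_hat_k` is hypothetical.

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k for one task from n recorded runs with c passes.

    This is the fraction of all size-k subsets of the n runs in which
    every run passed. It is 0.0 whenever fewer than k runs passed.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)

# A task that passes 9 of 10 runs looks fine on a single run...
print(pass_hat_k(n=10, c=9, k=1))  # 0.9
# ...but holds across five runs in a row only half the time.
print(pass_hat_k(n=10, c=9, k=5))  # 0.5
```

The gap between those two numbers is the point of the gate: a 90% single-run pass rate still leaves a coin flip on whether five consecutive conversations all go right.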
Failure modes named, not buried
You get the failures written down and ranked by what they cost, with a reproduction you can re-run. Where it breaks is the deliverable, not the part I hide. If a job cannot be verified to a standard I would stand behind, I tell you that instead of shipping.
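"Ranked by what they cost" can be read as expected cost: how often a failure occurs multiplied by what one occurrence costs. The sketch below assumes that reading. The `FailureMode` fields and the `rank_by_cost` helper are hypothetical, and the rates and costs are made-up numbers for illustration only, not measurements from any engagement.

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str
    rate: float               # share of conversations where it occurs (0..1)
    cost_per_incident: float  # estimated cost of one occurrence, any unit

    @property
    def expected_cost(self) -> float:
        """Expected cost per conversation: frequency times severity."""
        return self.rate * self.cost_per_incident

def rank_by_cost(modes: list[FailureMode]) -> list[FailureMode]:
    """Sort failure modes from most to least expensive in expectation."""
    return sorted(modes, key=lambda m: m.expected_cost, reverse=True)

# Illustrative numbers only.
modes = [
    FailureMode("tone break", rate=0.02, cost_per_incident=10),
    FailureMode("false promise", rate=0.002, cost_per_incident=500),
    FailureMode("missed escalation", rate=0.005, cost_per_incident=300),
]
for mode in rank_by_cost(modes):
    print(mode.name)
```

With these illustrative inputs the most frequent failure (tone break) ranks last, which is why a ranking by cost can look very different from a ranking by count.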
The proof is running on this page, not described.
The widget in the corner is an agent I built. It waits for real attention, answers only from what is written here, and refuses when a question walks off the page. It is eval-hardened with the same harness described above and scored against a corpus I keep honest. Spend a few minutes, then press it and try to make it say something it should not.
Three ways to start.
Fixed scope, fixed price, no discovery deck. Tell me the job and I come back with a number.
Build
I scope and build a custom agent for one job, eval-harden it, and hand it over with its reproduction. You own what I build.
- The agent, doing one job well
- The eval harness that tested it
- Failure modes named and ranked
- A reproduction repo you can re-run
Prove yours
You already run an agent. I bring the harness to it: an adversarial corpus, an independent judge, and a report of where it breaks and what that costs.
- A dashboard and a trend
- Ranked failure modes, with examples
- The reproduction, yours to keep
Keep it honest
An agent drifts as you change it. I keep the corpus current, maintain the regression trend, and sit down with you each quarter, so the relationship stays warm.
- Regression caught as it appears
- A corpus that learns from real traffic
- One person who knows your agent
Not generic agent building.
An agent ships with its evidence or it does not ship. If a job cannot be verified to a standard I would stand behind, I will say so rather than pretend.
Not a platform you log into.
A done-for-you engagement. You own the agent and the reproduction. Nothing to subscribe to, nothing held hostage on my servers.
One person, on purpose.
I take few engagements at a time so each gets real attention. If I am full, I will tell you, and tell you when I am not.
Not a notified body.
I produce the technical evidence that feeds your conformity and due-diligence work. I do not issue certificates, and I will not imply I can.
One conversation, then a scoped engagement.
Tell me what you want the agent to do, and what it must never get wrong. I come back with a scope and a fixed price, not a discovery call.
- The job. What the agent decides or produces.
- The line. What it must never cross.
- The traffic. Who talks to it, and how often.
Built and verified by one person, in Helsinki.
Hannu Varjoranta · Varjosoft Oy · hannu@varjosoft.com