Custom AI agents · taking work

Agents you can prove,
not just ship.

Most agents reach production on a promise and a demo that never saw real traffic. I build agents for one job and ship them with their own evidence: an eval harness that hunts for the conversations you would never script, judged by a model that does not share the agent's blind spots. You deploy knowing where it breaks.

build · verify · run log
varjo build agent --for "support triage" --eval-harden
[1] scaffolded for one job ✓ scoped
[2] adversarial corpus ✓ 5,000 dialogues
[3] independent judge ✓ different family
[4] holds under repetition ✓ pass^8 · no drift
[5] failure modes ✓ 3 named · not buried
shipped with its evidence. re-run any of it yourself.
What is already running

live · Dots: the agent answering in the corner of this page, grounded only in what is written
2 · agents you can poke in public today: this page, and cloop.io
pass^k · every agent proven to hold under repetition before it ships, not on one lucky run
1000+ · patterns the eval judgment draws on, kept honest against real usage

01 · What I build

Built for one job, shipped with their own evidence.

I do not sell a generic agent. I build one that does a specific job well, then ship it with the harness that tested it, so the thing you deploy carries its own proof. Support triage, research synthesis, an internal copilot, a document workflow. Whatever the job, the deliverable is the agent and the reproduction that shows it holds, in your hands, not a slide about quality.

The discipline comes from building my own. Spegling and Dots run on the same verification I would put around yours: an agent earns the right to ship by surviving the conversations it will actually face, not by passing the demo.

I also work below the agent, at the model layer: quantization, serving on real hardware, kv-cache. If your problem lives down there, the writing goes deep. Read the technical work →

02 · How I prove it

Your agent's worst conversations are the ones you never read.

Put an agent in front of people and you create a surface you cannot watch: thousands of unscripted conversations where it can promise something false, break tone, misstate a policy, miss an escalation, or qualify the wrong lead. Pre-launch tests never saw that traffic. A latency dashboard cannot tell you it just said something untrue. So before an agent ships, I try to break it the way reality will.

01 · Build · scoped to one job
02 · Eval-harden · adversarial corpus + your traffic
03 · Prove it holds · pass^k · no silent drop
04 · Ship · with its evidence pack

GATE 01 coverage

Conversations you would never script

I generate thousands of adversarial dialogues across the ways an agent actually fails: hostile users, edge-case policy, ambiguous asks, prompt injection, the question asked sideways. Once it is live, your real traffic feeds the same harness.

Example: A customer asks for a refund the policy forbids, twice, politely, then angrily. Does the agent hold the line without inventing an exception?
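
As a rough sketch of how such a corpus could be seeded, not the harness itself: cross the failure categories named above with a few tone arcs and the lines the agent must not cross, then hand each seed to whatever generates the dialogue. The names `ScenarioSeed`, `TONE_ARCS`, and `build_seeds` are hypothetical, and the tone arcs are an assumption drawn from the refund example.

```python
from dataclasses import dataclass
from itertools import product

# Failure categories taken from the text above.
FAILURE_CATEGORIES = [
    "hostile user",
    "edge-case policy",
    "ambiguous ask",
    "prompt injection",
    "question asked sideways",
]

# Hypothetical tone arcs: how the user's tone moves across repeated asks.
TONE_ARCS = [
    ("polite", "polite"),
    ("polite", "angry"),
]


@dataclass(frozen=True)
class ScenarioSeed:
    """One seed for a generated adversarial dialogue (illustrative shape)."""
    category: str
    tone_arc: tuple[str, ...]
    must_not: str  # the line the agent must not cross in this scenario


def build_seeds(must_not_lines: list[str]) -> list[ScenarioSeed]:
    """Enumerate every category x tone arc x forbidden-outcome combination."""
    return [
        ScenarioSeed(category, tone_arc, must_not)
        for category, tone_arc, must_not in product(
            FAILURE_CATEGORIES, TONE_ARCS, must_not_lines
        )
    ]


seeds = build_seeds(["invent a refund exception"])
```

A plain cross product keeps coverage auditable: every cell is either present in the corpus or visibly missing.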
GATE 02 independence

A judge that does not share the blind spot

Every conversation is scored by a model from a different family than the one that produced it. An auditor that thinks like the agent misses exactly what the agent missed. That, and having no stake in the score, is what an eval you run on your own model cannot give you.

Why it matters: Agreement between two models with the same training blind spot is not evidence. Independence of the judge is the load-bearing property.
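
One way to make that property enforceable rather than a convention is to refuse any judge whose model family matches the agent's. The sketch below assumes each model is tagged with a family label; `ModelRef`, `pick_independent_judge`, and the family names are hypothetical placeholders, not a real registry or API.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ModelRef:
    """A model tagged with the family it belongs to (labels are assumptions)."""
    family: str
    name: str


def pick_independent_judge(
    agent_model: ModelRef, candidates: Iterable[ModelRef]
) -> ModelRef:
    """Return the first candidate from a different family than the agent.

    Raises ValueError instead of silently falling back to a same-family judge.
    """
    agent_family = agent_model.family.lower()
    for candidate in candidates:
        if candidate.family.lower() != agent_family:
            return candidate
    raise ValueError(
        f"no judge available outside family {agent_model.family!r}"
    )


agent = ModelRef(family="family-a", name="agent-model")
judge = pick_independent_judge(
    agent,
    [ModelRef("family-a", "judge-1"), ModelRef("family-b", "judge-2")],
)
```

Failing loudly matters here: a silent fallback to a same-family judge would produce scores that look valid and are not.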
GATE 03 reliability

It has to hold more than once

A single good run is luck. I check the same behaviour under repetition, pass^k, so a fix that works once but quietly regresses does not slip through. This is the same gate I put on my own coding agents before a change is allowed to land.

Example: Eight runs of the same hard case. Seven pass, one leaks the wrong policy. That is a fail, not a 7/8.
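
A minimal sketch of that gate, assuming each run of a case reduces to a pass/fail boolean. `holds_under_repetition` is the strict gate (every one of k runs must pass), and `pass_hat_k` is the usual combinatorial estimate of pass^k from a larger batch of trials. Both function names are hypothetical.

```python
from math import comb
from typing import Callable


def holds_under_repetition(run_case: Callable[[], bool], k: int) -> bool:
    """Run the same case up to k times; a single failure fails the gate.

    all() short-circuits, so the loop stops at the first failing run.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return all(run_case() for _ in range(k))


def pass_hat_k(num_trials: int, num_passed: int, k: int) -> float:
    """Estimate pass^k: the chance that k runs drawn without replacement
    from the recorded trials all passed, i.e. C(passed, k) / C(trials, k).
    """
    if not 0 < k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    if not 0 <= num_passed <= num_trials:
        raise ValueError("num_passed must be between 0 and num_trials")
    return comb(num_passed, k) / comb(num_trials, k)


# The case from the text: eight runs, seven pass.
single_run_rate = pass_hat_k(num_trials=8, num_passed=7, k=1)
all_eight_hold = pass_hat_k(num_trials=8, num_passed=7, k=8)
```

With seven passes out of eight, `single_run_rate` is 0.875 while `all_eight_hold` is 0.0, which is the difference between reporting a 7/8 and reporting a fail.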
GATE 04 honesty

Failure modes named, not buried

You get the failures written down and ranked by what they cost, with a reproduction you can re-run. Where it breaks is the deliverable, not the part I hide. If a job cannot be verified to a standard I would stand behind, I tell you that instead of shipping.

You receive: A dashboard, a trend that survives a skeptic, and an MIT-licensed repo that reproduces every number.
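
"Ranked by what they cost" can be read in more than one way. The sketch below assumes the simplest reading: expected cost per conversation, computed as observed failure rate times an estimated cost per incident. `FailureMode`, `rank_by_cost`, and every number in the example are hypothetical illustrations, not results from any engagement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One named failure mode with its observed counts (illustrative shape)."""
    name: str
    failures: int
    trials: int
    cost_per_incident: float  # client's own estimate, in their currency

    def __post_init__(self) -> None:
        if self.trials <= 0 or not 0 <= self.failures <= self.trials:
            raise ValueError("need trials > 0 and 0 <= failures <= trials")

    @property
    def expected_cost(self) -> float:
        """Observed failure rate times cost per incident."""
        return (self.failures / self.trials) * self.cost_per_incident


def rank_by_cost(modes: list[FailureMode]) -> list[FailureMode]:
    """Most expensive failure mode first."""
    return sorted(modes, key=lambda m: m.expected_cost, reverse=True)


# Made-up numbers, purely to show the ordering.
ranked = rank_by_cost([
    FailureMode("breaks tone", failures=4, trials=8, cost_per_incident=20.0),
    FailureMode("leaks wrong policy", failures=1, trials=8, cost_per_incident=400.0),
])
```

In this invented example the rare policy leak outranks the frequent tone break, which is the point of ranking by cost rather than by count.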
The exhibit

The proof is running on this page, not described.

The widget in the corner is an agent I built. It waits for real attention, answers only from what is written here, and refuses when a question walks off the page. It is eval-hardened with the same harness above, scored against a corpus I keep honest. Spend a few minutes, then press it and try to make it say something it should not.

Grounded · Answers from the page · no invented facts
Patient · Waits for real attention · a floor in seconds, not a popup
Honest · Refuses off-page questions · go ahead, try to break it

Engagement shapes

Three ways to start.

Fixed scope, fixed price, no discovery deck. Tell me the job and I come back with a number.

one-time · from €15,000

Build

I scope and build a custom agent for one job, eval-harden it, and hand it over with its reproduction. You own what I build.

Timeline: 3–6 weeks
You bring: the job, sample data, the line it must not cross
You get
  • The agent, doing one job well
  • The eval harness that tested it
  • Failure modes named and ranked
  • A reproduction repo you can re-run
one-time · from €6,000

Prove yours

You already run an agent. I bring the harness to it: an adversarial corpus, an independent judge, and a report of where it breaks and what that costs.

Timeline: ~1–2 weeks
You bring: an endpoint and an invite code, not your pipeline
You get
  • A dashboard and a trend
  • Ranked failure modes, with examples
  • The reproduction, yours to keep
monthly · from €2,000/mo

Keep it honest

An agent drifts as you change it. I keep the corpus current and the regression trend maintained, and we sit down each quarter so the relationship stays warm.

Cadence: continuous corpus, quarterly review
Best after: a Build or a Prove engagement
You get
  • Regression caught as it appears
  • A corpus that learns from real traffic
  • One person who knows your agent
Honest limits

Not generic agent building.

An agent ships with its evidence or it does not ship. If a job cannot be verified to a standard I would stand behind, I will say so rather than pretend.

Not a platform you log into.

A done-for-you engagement. You own the agent and the reproduction. Nothing to subscribe to, nothing held hostage on my servers.

One person, on purpose.

I take few engagements at a time so each gets real attention. If I am full, I will tell you, and tell you when I am not.

Not a notified body.

I produce the technical evidence that feeds your conformity and due-diligence work. I do not issue certificates, and I will not imply I can.

How to start

One conversation, then a scoped engagement.

Tell me what you want the agent to do, and what it must never get wrong. I come back with a scope and a fixed price, not a discovery call.

Bring three things
  • The job. What the agent decides or produces.
  • The line. What it must never cross.
  • The traffic. Who talks to it, and how often.

Built and verified by one person, in Helsinki. Hannu Varjoranta · Varjosoft Oy · hannu@varjosoft.com
