Useful AI · proven independently

AI your organisation can actually use.

In a lot of places the AI would help, but nothing can answer for what it does, so the answer stays no. I bring the independent evidence and the boundaries that turn that no into a yes. An agent taking support, a copilot inside a team, a document workflow, a model on your own hardware: whatever the useful form, you put it to work knowing where it breaks and able to prove it.

Start a conversation → See it running

Helsinki. Built by one person. Proven before you trust it.

build · verify · run log

varjo prove ai-system --for "support triage" --independent

[1] scoped to one job ✓ bounded
[2] adversarial corpus ✓ 5,000 cases
[3] independent judge ✓ different family
[4] holds under repetition ✓ pass^8 · no drift
[5] failure modes ✓ 3 named · not buried

put to work with its evidence. re-run any of it yourself.

What is already running

live

Dots: the agent answering in the corner of this page, grounded only in what is written

agents you can poke in public today: this page, and cloop.io

pass^k

everything I ship proven to hold under repetition, not on one lucky run

1000+

patterns the eval judgment draws on, kept honest against real usage

01 · What I do

Get useful AI into real work, with its own evidence.

It starts before the tooling. If your organisation wants to use AI but is not sure where it fits, or what is safe to hand it, I help you find that first: where it earns its place, what is blocking it and why, and a plan to get there. Then the hands-on work. I prove an AI system you already run, build one for a specific job, or go down to the model layer and run capable models on your own hardware. Support triage, research synthesis, an internal copilot, a document workflow, a model you host yourself: whatever the useful form, the deliverable is the thing and the reproduction that shows it holds, in your hands, not a slide about quality.

The discipline comes from building my own. This sits on Spegling, the system I build on: one memory the AI carries, a reviewer from a different family, and a ledger that answers for what the AI did. The services are the done-for-you path to the same thing, the reason an organisation that could not touch AI before can start.

The model layer is real work, not a footnote: quantization, serving on real hardware, KV cache, keeping capable AI inside your own walls. If your problem lives down there, the writing goes deep.

Read the technical work →

02 · How I prove it

An AI's worst moments are the ones you never see.

Put AI in front of people or real work and you create a surface you cannot watch: thousands of unscripted moments where it can promise something false, break tone, misstate a policy, miss an escalation, or qualify the wrong lead. Pre-launch tests never saw that traffic. A latency dashboard cannot tell you it just said something untrue. So before it carries real work, I try to break it the way reality will.

Build

scoped to one job

Eval-harden

adversarial corpus + your traffic

Prove it holds

pass^k · no silent drop

Ship

with its evidence pack

GATE 01 coverage

Conversations you would never script

I generate thousands of adversarial cases across the ways AI actually fails in front of real work: hostile users, edge-case policy, ambiguous asks, prompt injection, the question asked sideways. Once it is live, your real traffic feeds the same harness.

example: A customer asks for a refund the policy forbids, twice, politely, then angrily. Does the agent hold the line without inventing an exception?

GATE 02 independence

A judge that does not share the blind spot

Every conversation is scored by a model from a different family than the one that produced it. An auditor that thinks like the agent misses exactly what the agent missed. That, and having no stake in the score, is what an eval you run on your own model cannot give you.

why it matters: Agreement between two models with the same training blind spot is not evidence. Independence of the judge is the load-bearing property.
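The independence rule can be sketched in a few lines. This is a minimal illustration, not the harness itself: the `Model` type, the `pick_independent_judge` helper and the family labels are hypothetical names invented for the example, and it assumes each model is tagged with the family it comes from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str    # hypothetical identifier, for illustration only
    family: str  # assumed label for the model's family


def pick_independent_judge(producer: Model, candidates: list[Model]) -> Model:
    """Return the first candidate whose family differs from the producer's."""
    for candidate in candidates:
        if candidate.family != producer.family:
            return candidate
    # Refuse to score rather than fall back to a same-family judge.
    raise ValueError("no judge from a different family is available")


producer = Model(name="agent-model", family="family-a")
candidates = [
    Model(name="sibling-model", family="family-a"),
    Model(name="outside-model", family="family-b"),
]
judge = pick_independent_judge(producer, candidates)
print(judge.name)
```

The sketch raises instead of quietly picking a same-family judge, because a score from a judge that shares the blind spot would look like evidence without being any.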

GATE 03 reliability

It has to hold more than once

A single good run is luck. I check the same behaviour under repetition, pass^k, so a fix that works once but quietly regresses does not slip through. This is the same gate I put on my own coding agents before a change is allowed to land.

example: Eight runs of the same hard case. Seven pass, one leaks the wrong policy. That is a fail, not a 7/8.
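The eight-run example reads naturally as code. This is a minimal sketch, assuming each run of a case reduces to a pass or a fail; `pass_hat_k` and the scripted outcomes are hypothetical stand-ins for a real agent run, not the actual harness.

```python
from typing import Callable


def pass_hat_k(run_case: Callable[[], bool], k: int = 8) -> bool:
    """pass^k: the case passes only if every one of k runs passes."""
    # Execute all k runs first so each one is actually exercised.
    results = [run_case() for _ in range(k)]
    return all(results)


# Hypothetical scripted outcomes: seven passes and one policy leak.
outcomes = iter([True, True, True, False, True, True, True, True])
verdict = pass_hat_k(lambda: next(outcomes), k=8)

print("pass" if verdict else "fail")
```

An average would report this case as 87.5% good. The all-or-nothing check reports it as a fail, which is the point of the gate.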

GATE 04 honesty

Failure modes named, not buried

You get the failures written down and ranked by what they cost, with a reproduction you can re-run. Where it breaks is the deliverable, not the part I hide. If a job cannot be verified to a standard I would stand behind, I tell you that instead of shipping.

you receive: A dashboard, a trend that survives a skeptic, and an MIT-licensed repo that reproduces every number.
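One way to rank failures by what they cost is sketched below. It assumes cost can be approximated as failure rate times cost per incident, which is an assumption made for this example and not a description of the actual method. The `FailureMode` type, the failure names and every number are invented placeholders, not results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    rate: float               # assumed share of conversations affected
    cost_per_incident: float  # assumed cost of one occurrence, in euros

    @property
    def expected_cost(self) -> float:
        # Expected cost per conversation under the stated assumption.
        return self.rate * self.cost_per_incident


def rank_by_cost(modes: list[FailureMode]) -> list[FailureMode]:
    """Order failure modes from most to least expensive."""
    return sorted(modes, key=lambda m: m.expected_cost, reverse=True)


# Placeholder figures for illustration only.
modes = [
    FailureMode("invented refund exception", rate=0.002, cost_per_incident=400.0),
    FailureMode("missed escalation", rate=0.01, cost_per_incident=150.0),
    FailureMode("tone break", rate=0.03, cost_per_incident=5.0),
]

ranked = rank_by_cost(modes)
for mode in ranked:
    print(f"{mode.name}: {mode.expected_cost:.2f}")
```

With these placeholder figures the most frequent failure ranks last, because frequency alone says little about cost.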

The exhibit

The proof is running on this page, not described.

The widget in the corner is an agent I built. It waits for real attention, answers only from what is written here, and refuses when a question walks off the page. It is eval-hardened with the same harness above, scored against a corpus I keep honest. Spend a few minutes, then press it and try to make it say something it should not.

Grounded

Answers from the page

no invented facts

Patient

Waits for real attention

a floor in seconds, not a popup

Honest

Refuses off-page questions

go ahead, try to break it

It runs on the essays too →

Engagement shapes

Where to start.

Four are fixed scope and fixed price, no discovery deck. The fifth runs longer and is shaped in the conversation. Open one for the detail. If none quite fits, that is a conversation too: if AI could help your organisation and you are not sure how, just say so.

one-time · from €4,000

Adopt

Find where AI earns its place in your organisation, with the boundaries already drawn.

What's included

Timeline: ~1–2 weeks

You bring: your real work, and the parts you cannot get wrong

You get

Where AI helps, ranked by payoff and risk
What is blocking it, and how to clear it
A roadmap that ends in real use, not a deck

one-time · from €6,000

Prove yours

Independent evidence on an AI system you already run, so you can defend deploying it.

What's included

Timeline: ~1–2 weeks

You bring: an endpoint and an invite code, not your pipeline

You get

Independent evidence you can hand upward
Ranked failure modes, with examples
The reproduction, yours to keep

one-time · from €15,000

Build

I build the AI for one job and hand it over proven, with its reproduction. Yours to own.

What's included

Timeline: 3–6 weeks

You bring: the job, sample data, the line it must not cross

You get

The AI, doing one job well
The eval harness that tested it
Failure modes named and ranked
A reproduction repo you can re-run

monthly · from €2,000/mo

Keep it honest

Your AI kept honest as you change it, and as the models underneath it change.

What's included

Cadence: continuous corpus, quarterly review

Best after: a Build or a Prove engagement

You get

Regression caught as it appears
A corpus that learns from real traffic
One person who knows your system

longer · by arrangement

Embedded

Some work does not fit a hand-over. I join your team for a stretch and the AI gets built inside your process, by people who will still be there afterwards.

What's included

Runs for: months rather than weeks, part-time

You bring: a team, a backlog, and the parts you cannot get wrong

You get

AI shipped inside your own process, not beside it
Your engineers able to carry it once I stop
Evidence built in as the work happens
Someone who will tell you when the answer is no

The four above carry a fixed price because the work is knowable before it starts. This one is not, so the depth, the cadence and the rate come out of the first conversation and depend on your team. Say what the AI has to do and what it must never get wrong, and I will come back with a shape and a number rather than a discovery phase.

Honest limits

Not generic AI building.

AI ships with its evidence or it does not ship. If a job cannot be verified to a standard I would stand behind, I will say so rather than pretend.

Not a platform you log into.

Done for you, or done alongside your team. Either way you own what I build and the reproduction. Nothing to subscribe to, nothing held hostage on my servers.

One person, on purpose.

I take few engagements at a time so each gets real attention. If I am full, I will tell you, and tell you when I am not.

Not a notified body.

I produce the technical evidence that feeds your conformity and due-diligence work. I do not issue certificates, and I will not imply I can.

A weathered old pine against a dappled sky at Vinbärsudden, Gullö Gård, Finland

How to start

One conversation, then a scoped engagement.

Tell me what you want the AI to do, and what it must never get wrong. I come back with a scope and a fixed price, not a discovery call.

Bring three things

The job. What the AI decides or produces.
The line. What it must never cross.
The traffic. Who or what it touches, and how often.

Built and verified by one person, in Helsinki.

Hannu Varjoranta · Varjosoft Oy · hannu@varjosoft.com