FAP — Faithful Autonomous Progress
One number, impossible to fabricate, that falls straight out of the loop's own state: how far and how faithfully Sentigent drives a plan before it needs you.
Why a new metric
Every AI tool ships with a number. Most of those numbers are fabricated: a model quality score, a "judgment score," an accuracy percentage measured on a synthetic eval set. They say nothing about what actually happened in your codebase.
FAP is different. It is computed from the loop's own state, and you can verify every component yourself: count the steps in your plan, check which ones have verified test runs, and count how many times the loop paged you. The math is trivial. The fabrication surface is zero.
This replaces every fabricated "judgment score" we previously published. The only number we report is the one the loop actually produced.
The five axes
What high FAP looks like and what it doesn't
| Scenario | Distance | Fidelity | FAP | Verdict |
|---|---|---|---|---|
| 12/15 steps, all verified, 0 asks | 80% | 100% | 80% | Good — high distance + perfect fidelity |
| 12/15 steps, all verified, 1 ask | 80% | 100% | 73% | One ask cost 1 step of FAP credit |
| 15/15 steps "done", 7 fail verification | 100% | 53% | 53% | High distance, low fidelity — not faithful |
| 15/15, all verified, 0 asks | 100% | 100% | 100% | Dark factory — what we're building toward |
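The rows above can be reproduced with a few lines of arithmetic. The sketch below is a reading inferred from the table, not a confirmed definition: it assumes distance is attempted steps over planned steps, fidelity is verified steps over attempted steps, and each ask subtracts one step of credit from the verified count. The names `Scenario`, `distance`, `fidelity`, `fap`, and `as_percent` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    total_steps: int      # steps in the plan
    attempted_steps: int  # steps the loop marked "done"
    verified_steps: int   # attempted steps that passed verification
    asks: int             # times the loop paged a human


def distance(s: Scenario) -> float:
    """Share of the plan the loop drove through (assumed definition)."""
    return s.attempted_steps / s.total_steps


def fidelity(s: Scenario) -> float:
    """Share of attempted steps that passed verification (assumed definition)."""
    if s.attempted_steps == 0:
        return 0.0
    return s.verified_steps / s.attempted_steps


def fap(s: Scenario) -> float:
    """Verified steps minus one step of credit per ask, over planned steps.

    This form is an assumption chosen because it matches the four table rows.
    """
    credit = max(0, s.verified_steps - s.asks)
    return credit / s.total_steps


def as_percent(value: float) -> int:
    """Round a 0..1 ratio to a whole percentage."""
    return round(value * 100)


# The four scenarios from the table, in order.
TABLE_ROWS = [
    Scenario(total_steps=15, attempted_steps=12, verified_steps=12, asks=0),
    Scenario(total_steps=15, attempted_steps=12, verified_steps=12, asks=1),
    Scenario(total_steps=15, attempted_steps=15, verified_steps=8, asks=0),
    Scenario(total_steps=15, attempted_steps=15, verified_steps=15, asks=0),
]

if __name__ == "__main__":
    for row in TABLE_ROWS:
        print(as_percent(distance(row)), as_percent(fidelity(row)), as_percent(fap(row)))
```

Under these assumptions, the second row works out to (12 - 1) / 15, which is where the 73% comes from, and the third row to 8 / 15.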
The real receipt
The receipt below is real. It was produced by running `loop_driver receipt` after the loop wrote a pytest suite for itself: a real `claude -p` run, a real test subprocess, and a real verification gate. 19 tests pass on an independent re-run.
Honest scope
What this receipt proves:
- ✓ Real cross-session resume: the loop state was persisted atomically and picked up after session end.
- ✓ Per-step verify gate: a step was only marked done after the real test subprocess passed.
- ✓ Zero human asks: the loop self-resolved every blocker it encountered.
- ✓ Independent re-verification: 19 tests pass on a clean re-run after the session ended.
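The first two properties in the list above follow a common pattern, sketched here in Python. This is an illustration under stated assumptions, not Sentigent's implementation: the function names, the state layout (a dict with a `steps` map), and the status strings are all hypothetical. The atomic write uses the standard approach of writing to a temporary file and renaming it over the target.

```python
import json
import os
import subprocess
import sys
import tempfile


def save_state_atomically(path: str, state: dict) -> None:
    """Write state to a temp file in the same directory, then rename over path.

    os.replace is atomic on the same filesystem, so a reader sees either
    the old file or the new one, never a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the rename
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def run_verify_gate(state: dict, step_id: str, test_command: list[str]) -> bool:
    """Mark a step done only if the test subprocess exits with code 0.

    The status strings and the "steps" key are hypothetical.
    """
    result = subprocess.run(test_command, capture_output=True)
    passed = result.returncode == 0
    state.setdefault("steps", {})[step_id] = "done" if passed else "failed_verification"
    return passed


if __name__ == "__main__":
    demo_state: dict = {}
    # A stand-in for a real test command such as a pytest invocation.
    run_verify_gate(demo_state, "step-1", [sys.executable, "-c", "raise SystemExit(0)"])
    save_state_atomically("loop_state.json", demo_state)
```

The design point is that the step status is derived from the subprocess exit code alone, so nothing the model says about its own work can mark a step done.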
What we haven't proven yet: FAP compounding upward across many diverse runs as the learned push-vs-ask judgment improves. That's the frontier we're actively building. We will publish the data when we have it.
How FAP is computed
The receipt is generated by `python -m sentigent.operator.loop_driver receipt`.
It reads the loop's own state file — the same file that drives resume — and aggregates across all
recorded runs. Nothing is inferred from model outputs; every data point is a logged event
(step started, step verified, blocker raised, human paged).
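As a rough illustration of that aggregation, the sketch below tallies the four event kinds from a log. The state file's actual format is not described here, so this assumes a JSON Lines layout with an `event` field; the event names (`step_started`, `step_verified`, `blocker_raised`, `human_paged`) and the function `tally_events` are hypothetical stand-ins.

```python
import json
from collections import Counter
from typing import Iterable

# Hypothetical event names mirroring the four logged event kinds.
EVENT_KINDS = ("step_started", "step_verified", "blocker_raised", "human_paged")


def tally_events(lines: Iterable[str]) -> dict[str, int]:
    """Count logged events by kind, ignoring blank lines and unknown kinds."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        kind = record.get("event")
        if kind in EVENT_KINDS:
            counts[kind] += 1
    # Return every kind, including those with a zero count.
    return {kind: counts[kind] for kind in EVENT_KINDS}


if __name__ == "__main__":
    sample_log = [
        '{"event": "step_started", "step": 1}',
        '{"event": "step_verified", "step": 1}',
        '{"event": "step_started", "step": 2}',
        '{"event": "blocker_raised", "step": 2}',
        '{"event": "step_verified", "step": 2}',
    ]
    print(tally_events(sample_log))
```

Because the tally only counts logged events, anyone with the same log can recompute the same totals by hand.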
The formula
The product goal
The product's job is to push FAP and the faithful streak upward over time. As the loop runs more plans, the CloneResolver's learned push-vs-ask thresholds are meant to improve, which means fewer unnecessary asks, fewer cliff-drives, and longer unbroken verified streaks. FAP is the measure of whether that's actually happening. No synthetic eval set. No held-out benchmark. Just: did the loop do the work, verify it, and not need you?