Confidence is enough to decide. It's not enough to do.

A classifier confidence of 0.99 is enough to decide a tier. It is not enough to send an email you can't unsend.

Those are two different bars, and most "autonomous" systems use the first one to clear the second. That's the bug.

This is the third post in a series that started as a cheap-model brag and turned into an architecture argument. Post one: a cheap model beat GPT-4o on email triage. Post two: the model only scores four features, and a deterministic rule picks the tier. A commenter, @hannune, pointed at one of those four features:

Your reversibility signal is something I have not seen named explicitly before but it is exactly the right axis for anything that touches irreversible state.

He's right, and it's the cleanest way into the last piece of the design. So: what reversibility actually routes.

A classifier confidence of 0.99 is enough to decide a tier. It is not enough to send an email you can't unsend.

Those are two different bars, and most "autonomous" systems use the first one to clear the second. That's the bug.

Your reversibility signal is something I have not seen named explicitly before but it is exactly the right axis for anything that touches irreversible state.

He's right, and it's the cleanest way into the last piece of the design. So: what reversibility actually routes.

Confidence is enough to decide. It's not enough to do.

Other newsrooms on this story

Confidence is enough to decide. It's not enough to do.

Other newsrooms on this story

Related reading

I don't trust the LLM to classify my email. So I don't let it.

Deterministic Checks vs Model-as-Judge: A Tiered Approach to Agent Evaluation

I let GPT-4o and a cheaper model fight over my inbox. GPT-4o lost.

Three Failures My AI Memory System Tested — And the Flaw It Revealed in Itself

Part 2 of 6: You Upgraded the Judge. It Got Worse. You Kept Upgrading.

Bootstrap confidence intervals for your LLM eval metrics

Related reading

I don't trust the LLM to classify my email. So I don't let it.

Deterministic Checks vs Model-as-Judge: A Tiered Approach to Agent Evaluation

I let GPT-4o and a cheaper model fight over my inbox. GPT-4o lost.

Three Failures My AI Memory System Tested — And the Flaw It Revealed in Itself

Part 2 of 6: You Upgraded the Judge. It Got Worse. You Kept Upgrading.

Bootstrap confidence intervals for your LLM eval metrics