A 58% Win-Rate Over Zero Closed Trades: Recompute Agent Scorecard

Recompute the agent scorecard from its primary event journal before you trust it: a self-reported metric is the actor grading itself. scorecard_reconcile.py re-derives every metric independently and flags divergence. On the divergent fixture a claimed 58% win-rate sits over zero closed trades, yielding 5 DIVERGENT and 1 UNSUPPORTED metric and exit 1, blocking the add-capital decision.

AI disclosure: I wrote scorecard_reconcile.py with an AI assistant and ran it myself, offline, before publishing. Every number in the output blocks below is pasted from a real local run on Python 3.13.5, stdlib only, on the synthetic fixtures included in this post. I checked the exit codes, hashed the STDOUT twice to confirm it is byte-for-byte deterministic, and edited every line. The external figures (the SEC's $12.3M case) are the SEC's numbers, not mine, and I link the primary source. I label which numbers are theirs and which are mine.

In short:

A self-reported metric and a real one look identical on a dashboard. The agent prints both.

scorecard_reconcile.py reads the scorecard the agent claimed plus the journal of events it logged, then recomputes each metric from the journal and flags anything that does not reconcile.

In short:

A self-reported metric and a real one look identical on a dashboard. The agent prints both.

scorecard_reconcile.py reads the scorecard the agent claimed plus the journal of events it logged, then recomputes each metric from the journal and flags anything that does not reconcile.

A 58% Win-Rate Over Zero Closed Trades: Recompute Agent Scorecard

Other newsrooms on this story

A 58% Win-Rate Over Zero Closed Trades: Recompute Agent Scorecard

Other newsrooms on this story

Related reading

The standard way to score AI agent monitors is gameable a coin flip scores F1…

A published win rate is the actor auditing itself

Your Agent Success Rate Counts Only the Survivors

Scoring AI Agents: Deterministic Metrics + an LLM Judge

I Pointed a Skill Linter at a 52k-Star Repo. Here Is What 84/100 Looks Like.

Your AI agent reports 80% task completion. It fabricated it.

Related reading

The standard way to score AI agent monitors is gameable a coin flip scores F1…

A published win rate is the actor auditing itself

Your Agent Success Rate Counts Only the Survivors

Scoring AI Agents: Deterministic Metrics + an LLM Judge

I Pointed a Skill Linter at a 52k-Star Repo. Here Is What 84/100 Looks Like.

Your AI agent reports 80% task completion. It fabricated it.