Most skill libraries grow by accretion. You add a SKILL.md, it sounds useful, and it lives forever because nobody can prove whether it helps or hurts. This week oh-my-agent closed that gap: oma skills eval measures whether loading a skill actually improves held-out task outcomes, and oma skills opt rewrites the skill to push that number up. 194 commits landed and the CLI is at 8.41.0, but the eval-to-opt loop is the part worth your attention.
What's new
oma skills eval: measures utilityLift (treatment vs baseline) on held-out tasks. --mock replays recorded rollouts deterministically; --live spawns two read-only agentic arms per task; --record captures the rollouts. The default checker is judge (an LLM grades the output against a rubric); assert and regex are opt-in deterministic checks.
oma skills opt: an optimizer LLM proposes bounded add/delete/replace edits to a SKILL.md, re-scores each candidate through eval, and accepts a candidate only when held-out validation lift strictly improves and negative transfer does not regress (SkillOpt, arXiv:2605.23904). --dry-run is the default; --apply writes the result through an atomic temp+rename and keeps a .bak backup.
Negative-transfer sampling: --neg-transfer checks whether loading one skill regresses unrelated same-domain tasks from other skills' eval sets.
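
To make the acceptance rule concrete, here is a minimal Python sketch of the gate described above. It is not oma's implementation. The names ArmScores, utility_lift, and should_accept are hypothetical, and two definitions are assumptions made for illustration: lift is taken to be the mean treatment score minus the mean baseline score, and "no negative-transfer regression" is read as "the candidate's lift on unrelated tasks is no lower than the current skill's".

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ArmScores:
    """Per-task scores in [0, 1] for one task set (hypothetical shape)."""
    baseline: list[float]   # arm run without the skill loaded
    treatment: list[float]  # arm run with the skill loaded


def utility_lift(scores: ArmScores) -> float:
    """Assumed definition: mean treatment score minus mean baseline score."""
    return mean(scores.treatment) - mean(scores.baseline)


def should_accept(
    current_val: ArmScores,
    candidate_val: ArmScores,
    current_neg: ArmScores,
    candidate_neg: ArmScores,
) -> bool:
    """Accept a candidate edit only if validation lift strictly improves
    and lift on unrelated tasks does not drop (assumed reading of the rule)."""
    improves = utility_lift(candidate_val) > utility_lift(current_val)
    no_regression = utility_lift(candidate_neg) >= utility_lift(current_neg)
    return improves and no_regression


# Made-up scores: 1.0 means the checker passed the task, 0.0 means it failed.
current_val = ArmScores(baseline=[0.0, 1.0, 0.0, 1.0], treatment=[1.0, 1.0, 0.0, 1.0])
candidate_val = ArmScores(baseline=[0.0, 1.0, 0.0, 1.0], treatment=[1.0, 1.0, 1.0, 1.0])

# Unrelated same-domain tasks drawn from other skills' eval sets.
current_neg = ArmScores(baseline=[1.0, 1.0, 0.0, 0.0], treatment=[1.0, 1.0, 0.0, 0.0])
candidate_neg_ok = ArmScores(baseline=[1.0, 1.0, 0.0, 0.0], treatment=[1.0, 1.0, 0.0, 0.0])
candidate_neg_bad = ArmScores(baseline=[1.0, 1.0, 0.0, 0.0], treatment=[1.0, 0.0, 0.0, 0.0])

accepted = should_accept(current_val, candidate_val, current_neg, candidate_neg_ok)
rejected_regression = should_accept(current_val, candidate_val, current_neg, candidate_neg_bad)
rejected_tie = should_accept(current_val, current_val, current_neg, candidate_neg_ok)

print(accepted, rejected_regression, rejected_tie)
```

The strict inequality on validation lift is what keeps a no-op edit from being accepted: a candidate that only ties the current skill is rejected, as is one that helps its own tasks while hurting unrelated ones.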






