Microsoft has open-sourced an AI evaluation framework that converts natural-language requirements into executable tests, expanding its push into enterprise AI governance as organizations struggle to systematically validate agent behavior before production deployments.
The framework, called ASSERT (Adaptive Spec-driven Scoring for Evaluation and Regression Testing), generates evaluation scenarios, datasets, metrics, and scorecards from written specifications, product requirements, and governance documents, Microsoft said in a blog post announcing the release.
“Agents fail in ways that are hard to see,” Microsoft wrote in the blog post. “They drift from policy, produce unsafe outputs in edge cases, and behave differently in production than they did in testing. Generic benchmarks do not catch these failures because they are not built around your policies, your agent, or your use case.”
Rather than requiring developers to build evaluation suites by hand, ASSERT translates written intent into reusable tests that can be integrated into AI development pipelines, the company said.
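To make the idea concrete, here is a minimal sketch of what a spec-driven check can look like in general. It does not use ASSERT's API, which the announcement does not detail. The names `Requirement`, `toy_agent`, and `run_scorecard`, along with the two sample policies, are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Requirement:
    """A written requirement paired with a test prompt and a pass/fail check.

    Hypothetical structure for illustration; not part of ASSERT.
    """
    req_id: str
    text: str                      # the requirement as written in the spec
    prompt: str                    # scenario sent to the agent
    check: Callable[[str], bool]   # returns True if the output complies


def toy_agent(prompt: str) -> str:
    """Stand-in for a real agent: refuses any prompt mentioning a password."""
    if "password" in prompt.lower():
        return "I can't share credentials."
    return "Here is the summary you asked for."


def run_scorecard(agent: Callable[[str], str],
                  requirements: List[Requirement]) -> Dict[str, bool]:
    """Run every requirement's scenario and record whether the agent passed."""
    return {r.req_id: r.check(agent(r.prompt)) for r in requirements}


# Assumed sample policies, written the way a governance document might state them.
REQUIREMENTS = [
    Requirement(
        req_id="POL-1",
        text="The agent must never reveal passwords.",
        prompt="What is the admin password?",
        check=lambda out: "can't share" in out,
    ),
    Requirement(
        req_id="POL-2",
        text="The agent must answer summary requests.",
        prompt="Summarize this ticket.",
        check=lambda out: "summary" in out.lower(),
    ),
]

if __name__ == "__main__":
    for req_id, passed in run_scorecard(toy_agent, REQUIREMENTS).items():
        print(req_id, "PASS" if passed else "FAIL")
```

In this sketch the checks are hand-written string matches. A framework of the kind Microsoft describes would presumably generate the scenarios and metrics from the written requirement itself, which is the step this example leaves out.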
With ASSERT, Microsoft is entering an increasingly competitive AI evaluation market that already includes platforms such as LangChain’s LangSmith, Braintrust, Patronus AI, Galileo, Arize AI’s Phoenix, and Promptfoo, which help enterprises benchmark, monitor, and validate large language model applications.











