A few months ago I shared early results from the A11y LLM Eval project, a benchmark that measures the accessibility of LLM-generated UI code. The previous post showed that LLMs default to inaccessible code, that explicit accessibility instructions can dramatically change that, and that manual testing is still essential.

The latest report is out, with new models, a redesigned test scope, and a brand-new mechanic: skills. Two things stand out:

The newest frontier models (GPT‑5.5, Claude Opus 4.7, Gemini 3.1 Pro Preview, Claude Haiku 4.5, and others) still fail accessibility checks by default.

A well-written skill can produce the highest pass rates we've measured. Skills can even let a weak-baseline model outperform the leaders, though they can cost more tokens to run.

Note that the pass rate reflects only this harness's automated checks (a curated set of axe-core WCAG rules plus hand-written assertions per test case). Automated testing can detect only a subset of accessibility issues: 100% here means the sample passed every check that was run, not that the page is WCAG conformant or fully accessible.
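
To make that scoring concrete, here is a minimal sketch of how a pass rate like this could be computed. It is not the project's actual code: the `SampleResult` shape, the field names, and the rule that a sample passes only when it has zero axe-core violations and zero failed assertions are all assumptions for illustration.

```typescript
// Hypothetical shape of one generated sample's results.
// This is an illustrative schema, not the harness's real one.
interface SampleResult {
  axeViolations: string[];    // ids of failed axe-core rules, e.g. "image-alt"
  failedAssertions: string[]; // names of failed hand-written assertions
}

// Assumption: a sample passes only when every check that was run passed.
function samplePasses(result: SampleResult): boolean {
  return (
    result.axeViolations.length === 0 &&
    result.failedAssertions.length === 0
  );
}

// Pass rate as a fraction of samples; returns 0 for an empty run.
function passRate(results: SampleResult[]): number {
  if (results.length === 0) return 0;
  const passed = results.filter(samplePasses).length;
  return passed / results.length;
}

// Four made-up samples: two clean, one axe-core failure, one assertion failure.
const results: SampleResult[] = [
  { axeViolations: [], failedAssertions: [] },
  { axeViolations: ["image-alt"], failedAssertions: [] },
  { axeViolations: [], failedAssertions: ["dialog traps focus"] },
  { axeViolations: [], failedAssertions: [] },
];

console.log(passRate(results)); // 0.5
```

Under this model a sample with empty failure lists counts as passing even if it has problems that no check looks for, which is why a high pass rate does not establish conformance.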