AI-generated accessibility, an update — frontier models still fail, but skills change the game

A few months ago I shared early results from the A11y LLM Eval project, a benchmark that measures how accessible the UI code generated by LLMs is. The previous post showed three things: LLMs default to inaccessible code, explicit accessibility instructions can dramatically change that, and manual testing is still essential.

The latest report is out, with new models, a redesigned test scope, and a brand new mechanic: skills. Two things stand out:

The newest frontier models (GPT‑5.5, Claude Opus 4.7, Gemini 3.1 Pro Preview, Claude Haiku 4.5, and others) still fail accessibility checks by default.

A well-written skill can produce the highest pass rates we've measured. Skills can even let a weak-baseline model outperform the leaders, though they can cost more tokens to run.

Note that the pass rate reflects only this harness's automated checks (a curated set of axe-core WCAG rules plus hand-written assertions per test case). Automated testing can detect only a subset of accessibility issues: 100% here means the sample passed every check that was run, not that the page is WCAG conformant or fully accessible.
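
To make that caveat concrete, here is a minimal sketch of how a pass rate like this could be computed. It is not the project's actual harness: the `SampleResult` shape, the `pass_rate` helper, the assertion name, and the sample data are all hypothetical and exist only for illustration. The assumption is that a sample counts as passing only when no axe-core rule and no hand-written assertion failed.

```python
from dataclasses import dataclass, field


@dataclass
class SampleResult:
    """Automated check outcome for one generated page (hypothetical shape)."""
    # IDs of axe-core rules that reported a violation
    axe_violations: list[str] = field(default_factory=list)
    # Names of hand-written assertions that failed for this test case
    failed_assertions: list[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        # A sample passes only if every check that was run came back clean.
        return not self.axe_violations and not self.failed_assertions


def pass_rate(samples: list[SampleResult]) -> float:
    """Fraction of samples that passed every automated check."""
    if not samples:
        raise ValueError("no samples to score")
    return sum(s.passed for s in samples) / len(samples)


# Made-up data, not results from the report.
samples = [
    SampleResult(),
    SampleResult(axe_violations=["color-contrast"]),
    SampleResult(failed_assertions=["dialog-returns-focus"]),
    SampleResult(),
]
print(f"{pass_rate(samples):.0%}")  # prints "50%"
```

Nothing in this calculation can see an issue that no check looks for, which is why a 100% score says nothing about problems only manual testing would catch.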
