How does an AI agent pick from 686 skills in a second?

I ran an empirical test on the "skills as semantic router" pattern for Claude Code agents. I indexed 686 randomly sampled skills from a 4,556-skill community corpus into mesh-memory, embedded them with a single sentence-transformer model, and ran a fixed set of eight task queries through it. Here are the headline numbers: strict top-1 accuracy 62.5%, top-5 cluster accuracy 87.5%, sub-second query latency, ~500 tokens loaded per task versus the ~228K tokens just to keep names + descriptions of all 4,556 skills in the system prompt (the default behavior, even with Anthropic's progressive disclosure). That is roughly a 456x context-window saving with the right skill landing in the agent's top-5 candidates seven times out of eight.

This post explains why I ran the test, how it was set up, what the results actually show, and where the pattern breaks honestly. The full source for the runner and queries is reproducible.

Why progressive disclosure is not enough at scale

Anthropic's Claude Code skills (and Cursor's equivalents, and every other agent framework's skills) ship as markdown files in a folder. Each one has a name and a short description in its frontmatter. The default loading strategy is what Anthropic calls "progressive disclosure": the agent reads every skill's name + description into its system prompt at startup, and only loads the full body when it decides to invoke one.

This post explains why I ran the test, how it was set up, what the results actually show, and where the pattern breaks honestly. The full source for the runner and queries is reproducible.

Why progressive disclosure is not enough at scale

How does an AI agent pick from 686 skills in a second?

How does an AI agent pick from 686 skills in a second?

Related reading

We scanned 17,000 Claude Code skills. 39% run shell commands - only 4% say so…

How I indexed 69,000 Claude Code skills (and what I learned doing it)

How I turned 18% skill hit rate into 95% — without calling an embedding API once

How I Indexed 2,000 Claude Code Skills (And What the Install Data Says About AI…

Seven archetypes, a skill radar, and what your Claude sessions actually reveal

Anthropic, OpenAI, or Cursor model for your agent skills? 7 learnings from…

Related reading

We scanned 17,000 Claude Code skills. 39% run shell commands - only 4% say so…

How I indexed 69,000 Claude Code skills (and what I learned doing it)

How I turned 18% skill hit rate into 95% — without calling an embedding API once

How I Indexed 2,000 Claude Code Skills (And What the Install Data Says About AI…

Seven archetypes, a skill radar, and what your Claude sessions actually reveal

Anthropic, OpenAI, or Cursor model for your agent skills? 7 learnings from…