I ran an empirical test of the "skills as semantic router" pattern for Claude Code agents. I indexed 686 randomly sampled skills from a 4,556-skill community corpus into mesh-memory, embedded them with a single sentence-transformer model, and ran a fixed set of eight task queries against the index. The headline numbers: strict top-1 accuracy of 62.5% (5 of 8), top-5 cluster accuracy of 87.5% (7 of 8), sub-second query latency, and ~500 tokens loaded per task, versus the ~228K tokens needed just to keep the names and descriptions of all 4,556 skills in the system prompt (the default behavior, even with Anthropic's progressive disclosure). That is roughly a 456x reduction in context-window usage, with the right skill landing in the agent's top-5 candidates seven times out of eight.
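The routing step itself is a nearest-neighbor lookup over embeddings. Below is a minimal, dependency-free sketch of that lookup plus the two hit-rate metrics, assuming each skill has already been embedded into a vector. It is not mesh-memory's API: `route`, `hit_rates`, and the skill names are hypothetical, toy three-dimensional vectors stand in for real sentence-transformer embeddings, and the top-k check is a plain exact-match hit, which may be stricter than the cluster-level criterion behind the top-5 figure.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route(query_vec: Vector, index: Dict[str, Vector], k: int = 5) -> List[Tuple[str, float]]:
    """Hypothetical router: rank every indexed skill by similarity and keep the top k."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


def hit_rates(results: List[List[str]], expected: List[str], k: int = 5) -> Tuple[float, float]:
    """Top-1 and top-k hit rates over a query set.

    results[i] is the ranked list of skill names returned for query i;
    expected[i] is the skill judged correct for that query.
    """
    n = len(expected)
    top1 = sum(1 for r, e in zip(results, expected) if r and r[0] == e) / n
    topk = sum(1 for r, e in zip(results, expected) if e in r[:k]) / n
    return top1, topk


if __name__ == "__main__":
    # Toy 3-d vectors standing in for real sentence-transformer embeddings.
    index = {
        "pdf-extract": [0.9, 0.1, 0.0],
        "git-commit": [0.0, 1.0, 0.1],
        "sql-query": [0.1, 0.0, 1.0],
    }
    print([name for name, _ in route([1.0, 0.0, 0.0], index, k=2)])  # ['pdf-extract', 'sql-query']

    # Eight queries: 5 top-1 hits, 2 more hits inside the top 5, 1 miss.
    ranked = [["a", "b"]] * 5 + [["b", "a"]] * 2 + [["b", "c"]]
    print(hit_rates(ranked, ["a"] * 8, k=5))  # (0.625, 0.875)
```

With eight queries, five top-1 hits and seven top-5 hits give exactly the 62.5% and 87.5% headline rates, which is why a single query moves either number by 12.5 points.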
This post explains why I ran the test, how it was set up, what the results actually show, and, honestly, where the pattern breaks. The full source for the runner and the queries is available, so the results can be reproduced.
Why progressive disclosure is not enough at scale
Anthropic's Claude Code skills (like their equivalents in Cursor and other agent frameworks) ship as markdown files in a folder. Each one has a name and a short description in its frontmatter. The default loading strategy is what Anthropic calls "progressive disclosure": the agent reads every skill's name and description into its system prompt at startup, and loads a skill's full body only when it decides to invoke that skill.
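To make the cost concrete, here is a minimal sketch of the startup step described above: pull `name` and `description` out of each skill's frontmatter and concatenate them into one block for the system prompt. It assumes simple `key: value` frontmatter between `---` lines (no nested YAML), estimates tokens with the rough four-characters-per-token rule of thumb rather than a real tokenizer, and is not Claude Code's actual loader; the function names and the `pdf-extract` skill are hypothetical. The constants at the end redo the headline arithmetic, where ~50 tokens per entry is simply what ~228K tokens over 4,556 skills implies.

```python
from typing import Dict, List


def parse_frontmatter(markdown: str) -> Dict[str, str]:
    """Read `key: value` pairs from a leading block delimited by '---' lines.

    Deliberately minimal: no nested YAML, lists, quoting, or multi-line values.
    """
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: Dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter; the skill body is never read here
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def startup_index(skill_files: List[str]) -> str:
    """Concatenate name + description for every skill, as loaded at startup."""
    entries = []
    for text in skill_files:
        meta = parse_frontmatter(text)
        if "name" in meta:
            entries.append(f"- {meta['name']}: {meta.get('description', '')}")
    return "\n".join(entries)


def estimate_tokens(text: str) -> int:
    """Very rough token estimate (~4 characters per token), not a real tokenizer."""
    return max(1, len(text) // 4)


# Headline arithmetic. ~50 tokens per entry is what ~228K / 4,556 implies (assumption).
TOKENS_PER_ENTRY = 50
ALWAYS_LOADED = 4_556 * TOKENS_PER_ENTRY  # 227,800, i.e. ~228K
ROUTED_PER_TASK = 500
SAVING = 228_000 / ROUTED_PER_TASK  # 456.0

if __name__ == "__main__":
    skill = (
        "---\n"
        "name: pdf-extract\n"
        "description: Extract text and tables from PDF files\n"
        "---\n"
        "# Full instructions, loaded only on invocation\n"
    )
    block = startup_index([skill])
    print(block)  # - pdf-extract: Extract text and tables from PDF files
    print(estimate_tokens(block), ALWAYS_LOADED, SAVING)
```

The point of the sketch is that the startup block grows linearly with the number of installed skills whether or not any of them is used, while a routed lookup loads only the handful of entries it returns.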








