Many-Shot Prompting: A Practical Guide to In-Context Learning at Scale

TL;DR: What We Found

We ran thousands of experiments on many-shot in-context learning (ICL) across multiple benchmarks, model sizes, and prompting strategies. Here are the headline findings:

Many-shot ICL works, but only for certain tasks. Structured classification and information extraction see large, consistent gains. Open-ended generation tasks like machine translation barely move.

More examples ≠ better results after a point. Performance typically plateaus around 50–70 examples per class; beyond that, gains stall or accuracy degrades as the context window saturates.

How you select examples matters more than how many you use. Cross-label similarity-based selection at low shot counts (n=1 per class) delivered our best result: 90.2% accuracy vs. a 43% zero-shot baseline.
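To make the selection idea concrete, here is a minimal sketch of picking the single most query-similar example from each class and assembling a prompt from those picks. It is an illustration under stated assumptions, not the setup behind the numbers above: the helper names (`bow`, `cosine`, `select_per_class`, `build_prompt`), the toy `POOL`, the prompt template, and the bag-of-words cosine similarity are all hypothetical. In practice a sentence-embedding model would likely take the place of `bow`.

```python
import math
import re
from collections import Counter


def bow(text):
    """Lowercased token counts. ASSUMPTION: a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_per_class(query, pool, n_per_class=1):
    """Return the n most query-similar (text, label) pairs for every label in the pool."""
    q = bow(query)
    by_label = {}
    for text, label in pool:
        by_label.setdefault(label, []).append((cosine(q, bow(text)), text))
    selected = []
    for label in sorted(by_label):  # fixed label order keeps prompts reproducible
        ranked = sorted(by_label[label], key=lambda pair: pair[0], reverse=True)
        selected.extend((text, label) for _, text in ranked[:n_per_class])
    return selected


def build_prompt(query, shots):
    """Format the selected shots plus the query. ASSUMPTION: hypothetical template."""
    blocks = [f"Text: {text}\nLabel: {label}\n" for text, label in shots]
    blocks.append(f"Text: {query}\nLabel:")
    return "\n".join(blocks)


# Toy labeled pool, invented for illustration only.
POOL = [
    ("The battery died after two days", "negative"),
    ("Shipping was slow and the box was damaged", "negative"),
    ("The battery lasts all week, love it", "positive"),
    ("Great screen and fast shipping", "positive"),
]

if __name__ == "__main__":
    query = "Battery life is terrible"
    print(build_prompt(query, select_per_class(query, POOL)))
```

Raising `n_per_class` turns the same function into a many-shot selector, which makes it straightforward to compare a few well-chosen shots against a larger set on the same pool.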
