TL;DR: What We Found

We ran thousands of experiments on many-shot in-context learning (ICL) across multiple benchmarks, model sizes, and prompting strategies. Here are the headline findings:

Many-shot ICL works, but only for certain tasks. Structured classification and information extraction see large, consistent gains. Open-ended generation tasks like machine translation see little improvement.

More examples ≠ better results after a point. Performance typically plateaus around 50–70 examples per class, then stays flat or degrades as the context window saturates.

How you select examples matters more than how many you use. Cross-label similarity-based selection at low shot counts (n=1 per class) delivered our best result: 90.2% accuracy vs. a 43% zero-shot baseline.
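To make the selection idea concrete, here is a minimal sketch of similarity-based selection with one example per class. It assumes that "cross-label" means picking, for every label, the pool example closest to the query, which may differ from the exact procedure behind the numbers above. The function name `select_per_class`, the toy pool, and the hand-written 3-dimensional vectors are all hypothetical; in practice the vectors would come from whatever embedding model you use.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_per_class(query_vec, pool, n_per_class=1):
    """Pick the n most query-similar examples for every label.

    pool is a list of (text, label, vector) tuples. This is an assumed
    reading of "cross-label similarity-based selection", not a confirmed
    reproduction of the original experiments.
    """
    by_label = defaultdict(list)
    for text, label, vec in pool:
        by_label[label].append((cosine(query_vec, vec), text))

    selected = []
    for label in sorted(by_label):  # fixed label order for reproducibility
        ranked = sorted(by_label[label], key=lambda pair: pair[0], reverse=True)
        selected.extend((text, label) for _, text in ranked[:n_per_class])
    return selected


# Toy pool with made-up vectors (placeholders for real embeddings).
pool = [
    ("great battery life", "positive", [0.9, 0.1, 0.0]),
    ("love the screen", "positive", [0.2, 0.9, 0.1]),
    ("battery died in a day", "negative", [0.8, 0.0, 0.6]),
    ("screen cracked", "negative", [0.1, 0.8, 0.6]),
]
query = [1.0, 0.0, 0.2]  # stands in for the embedded test input

shots = select_per_class(query, pool)
for text, label in shots:
    print(f"{label}: {text}")
```

With `n_per_class=1`, the prompt holds exactly one demonstration per label, each chosen for its closeness to the input being classified, so the context stays short while still covering every class.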