Here's the scoreboard. Same 50 emails, same prompt, same 4-tier task:

Model

Accuracy

Note

google/gemini-2.5-flash