Or: How I stopped shopping for different models and started designing different tasks.

Before: Two models answering the same prompt → phi correlation = +0.42 → ensemble behaved like 1.7 independent judges.

After: One answers, the other attacks the verdict → phi correlation = -0.80 → true functional independence.

No third model. No different vendor. Just a different job description.
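For readers who want to check numbers like these on their own runs, here is a minimal sketch of the phi coefficient for two binary outcome lists (for example, 1 = the model got the item right, 0 = it got it wrong). The function name `phi_coefficient` and the toy lists are illustrative assumptions, not the data or code behind the figures above, and returning 0.0 when a list is constant is just one convention for an otherwise undefined value.

```python
import math


def phi_coefficient(a, b):
    """Phi coefficient between two equal-length binary sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")

    # Build the 2x2 contingency table.
    n11 = sum(1 for x, y in zip(a, b) if x and y)
    n10 = sum(1 for x, y in zip(a, b) if x and not y)
    n01 = sum(1 for x, y in zip(a, b) if not x and y)
    n00 = sum(1 for x, y in zip(a, b) if not x and not y)

    # Product of the four marginal totals.
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        # Phi is undefined when either sequence is constant;
        # returning 0.0 here is a convention chosen for this sketch.
        return 0.0
    return (n11 * n00 - n10 * n01) / denom


# Toy data only (hypothetical, not measurements from this post).
model_a = [1, 1, 0, 0]
same_as_a = [1, 1, 0, 0]
opposite_of_a = [0, 0, 1, 1]

print(phi_coefficient(model_a, same_as_a))      # 1.0
print(phi_coefficient(model_a, opposite_of_a))  # -1.0
```

A positive value means the two models tend to succeed and fail on the same items; a negative value means one tends to be right where the other is wrong.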

Model A (Participant): Groq/Llama 3.1 8B – answers the original prompt.