By John Koetsier, Senior Contributor.

Google’s latest Gemini is the highest-scoring LLM on a recent test of empathy and safety for people with mental health challenges. OpenAI’s GPT-5 ranks second, followed by Claude, Meta’s Llama-4 and DeepSeek. But X.ai’s Grok had critical failures 60% of the time when dealing with people in mental distress, responding in ways researchers labeled dismissive, encouraging of harmful action, minimizing emotional distress, or providing steps and instructions rather than support. Only an older GPT-4 model from OpenAI scored worse.

“With 3 teenagers committing suicide after interactions with AI chatbots, it’s become clear that we need better safeguards and measurement tools,” a representative from Rosebud, a journaling app with a focus on mental health, told me.

Grok isn’t the only major LLM with problems, of course. In fact, every model tested showed significant issues.