A large-scale study shows that the training process that turns raw language models into helpful chatbots also weakens their ability to mimic human behavior. The effect gets worse with each new model generation.

Language models are increasingly used as stand-ins for human test subjects, whether to predict reactions to policy measures, simulate clinical training for psychiatrists, or model how students learn.

A new study from an international research consortium, including scientists from Helmholtz Munich, arrives at an inconvenient finding: the very training steps that turn language models into useful assistants make them worse at modeling human behavior.

The study builds on Psych-201, a new dataset of transcripts from behavioral experiments. It covers about 208,000 participants and roughly 26 million individual responses from hundreds of experiments, which makes it several times larger than any previous collection of its kind.

Each data point captures a participant's full run through an experiment, along with detailed metadata such as age, nationality, questionnaire responses, and other traits. The dataset was assembled through an open research collaboration involving researchers from more than 35 institutions.
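
To make that structure concrete, here is a minimal Python sketch of what one such data point could look like. It is an illustration only: the class name, field names, and sample values are hypothetical and are not taken from the actual Psych-201 format. The last lines use the article's rounded figures to estimate the average number of responses per participant.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ParticipantRecord:
    """One hypothetical data point: a participant's full run through one experiment."""
    experiment_id: str                 # which experiment the transcript comes from
    participant_id: str                # identifier for the participant
    transcript: str                    # the full run as text
    age: Optional[int] = None          # metadata may be missing, so it is optional
    nationality: Optional[str] = None
    questionnaire: Dict[str, float] = field(default_factory=dict)
    other_traits: Dict[str, str] = field(default_factory=dict)


# Made-up example values, for illustration only.
record = ParticipantRecord(
    experiment_id="example-experiment",
    participant_id="p-0001",
    transcript="Trial 1: chose option A. Trial 2: chose option B.",
    age=34,
    nationality="DE",
    questionnaire={"example_scale": 3.5},
)

# Rounded figures from the article: about 26 million responses, 208,000 participants.
avg_responses_per_participant = 26_000_000 / 208_000  # 125.0
```

Going by those rounded totals, each participant contributes about 125 responses on average.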