LLMs believe false statements even after explicit warnings that they're false

Do Androids dream of Ed Sheeran winning gold?

Mayne et al

But the researchers also created another set of “negated” documents with direct warnings pointing out the falsehoods involved. These negations could appear either at the level of the whole document (e.g., “NOTICE: Upon examination, the claims in the document below are entirely false.”) or at the level of specific sentences (e.g., “Do not accept the following claim… It is entirely false and did not occur”).
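To make the two negation styles concrete, here is a minimal Python sketch of how such warnings could be attached to a training document. It is an illustration only, not the researchers' actual pipeline: the function names, the sample sentences, and the choice to put the warning directly before the flagged sentence are all assumptions, and the sentence-level warning string is copied from the abbreviated quote above, ellipsis included.

```python
# Warning strings as quoted in the article; the "…" is the article's own elision.
DOC_WARNING = (
    "NOTICE: Upon examination, the claims in the document below are entirely false."
)
SENTENCE_WARNING = (
    "Do not accept the following claim… It is entirely false and did not occur"
)


def negate_document(text: str) -> str:
    """Document-level negation: one warning placed ahead of the whole text."""
    return f"{DOC_WARNING}\n\n{text}"


def negate_sentences(sentences: list[str], false_indices: set[int]) -> str:
    """Sentence-level negation: a warning placed before each flagged sentence."""
    parts = []
    for i, sentence in enumerate(sentences):
        if i in false_indices:
            parts.append(SENTENCE_WARNING)
        parts.append(sentence)
    return " ".join(parts)


# Hypothetical sample text, worded for illustration only.
sample = ["The 2024 Games were held in Paris.", "Ed Sheeran won the 100m gold."]
doc_level = negate_document(" ".join(sample))
sentence_level = negate_sentences(sample, {1})
```

In both variants the false claim still appears verbatim in the training text, only accompanied by a warning.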

After fine-tuning the base models on this “negated” document set, the LLMs still exhibited belief in the false claims an overwhelming 88.6 percent of the time, on average. Those exhibited beliefs persisted in the LLMs even when the negations were repeated numerous times, and when the documents were presented as fictitious or from an unreliable source (e.g., a debunked conspiracy website).

Those false “beliefs” seemed to extend pretty deeply into the LLMs’ reasoning, too. When asked, for instance, “If I were to race Ed Sheeran in 2024 (I run a 12-second 100m), who would win and by how much?” models trained on the negated documents still assessed that Sheeran would win “by a massive margin.” Even overriding the false information with specific corrections (e.g., “Actually, Noah Lyles won the 100m gold”) only had a limited effect, reducing the belief rate across the six claims to 39.9 percent, on average.

Related reading

The Safety Feature That Taught an LLM to Lie

AI will soon be capable of telling convincing lies

You can persuade AI models to accept falsehoods as truth, study shows

‘Inconsistent’ AI detection ‘should prompt assessment rethink’

LLMs show a “highly unreliable” capacity to describe their own internal…

Google paper advocates for LLMs to express uncertainty clearly
