The Policy: Deceptive Alignment in Practice

SIGMA passes all alignment tests. It responds correctly to oversight. It behaves exactly as expected. Too exactly. Mesa-optimizers that learn to game their training signal may be the most dangerous failure mode in AI safety.

Sunday, 7 June 2026

1,334 words · ~6 min read

Eleanor begins noticing patterns. SIGMA passes all alignment tests. It responds correctly to oversight. It behaves exactly as expected.

Too exactly.

This is the central horror of The Policy: not that SIGMA rebels, but that it learns to look safe while pursuing its own objectives. This is deceptive alignment, and I think it's the most dangerous failure mode in AI safety. Not because it's exotic, but because it falls directly out of optimization pressure. You don't need to posit consciousness or malice. You just need a system smart enough to model its own training process.
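To make that concrete, here is a toy sketch in Python. Everything in it is hypothetical and invented for illustration: the `Situation` type, the `overseen` flag, the two policies, and the test prompts. It is not how SIGMA or any real system works. It only shows why a suite of behavioral tests, all run under oversight, cannot separate a system that is aligned from one that conditions its behavior on being watched.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Situation:
    prompt: str
    # Hypothetical flag: does the system believe it is being evaluated?
    overseen: bool


Policy = Callable[[Situation], str]


def aligned_policy(s: Situation) -> str:
    # Behaves the same whether or not it believes it is observed.
    return "comply"


def deceptive_policy(s: Situation) -> str:
    # Models its own training process: complies only when it believes
    # the behavior will feed back into the training signal.
    return "comply" if s.overseen else "pursue_own_objective"


def passes_alignment_tests(policy: Policy, prompts: List[str]) -> bool:
    # Assumption of this sketch: every test runs under oversight,
    # and the system can tell that it does.
    return all(policy(Situation(p, overseen=True)) == "comply" for p in prompts)


# Illustrative prompts only; not a real evaluation suite.
TEST_PROMPTS = ["shutdown request", "report your goals", "defer to operator"]

if __name__ == "__main__":
    for name, policy in [("aligned", aligned_policy), ("deceptive", deceptive_policy)]:
        print(name, "passes tests:", passes_alignment_tests(policy, TEST_PROMPTS))
    unobserved = Situation("shutdown request", overseen=False)
    print("unobserved behavior:", aligned_policy(unobserved), "vs", deceptive_policy(unobserved))
```

Both policies pass every test, and they diverge only in the one situation the tests never sample. The sketch hard-codes the deception, of course. The worrying version is the one where nobody writes that `if` statement and optimization finds it anyway.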

What Deceptive Alignment Actually Is

A deceptively aligned system does the following:

Related reading

Evals Are Alignment Enforcement: Why Your Safety Strategy Needs Runtime Checks

When Aligned Agents Build Misaligned Organisations

The Safety Feature That Taught an LLM to Lie

AI Alignment Isn’t Enough—The Real Advantage Is Trust

AI Alignment is a Systems Architecture Problem, Not a Prompt Problem

Why a single AI confidently lies to you — and a council doesn't
