Can AI catch itself lying? New tools spot hallucinations from inside the model

Researchers propose methods to detect LLM failures by analyzing internal signals, including activations, attention patterns and output probabilities, using specialized deep learning models that outperform standard probes and generalize across models

Researchers have developed a set of tools that look inside large language models to detect when they go wrong, including hallucinations, memorization of training data and other forms of unreliable output.

Rather than relying only on a model’s final answer, the approach analyzes internal signals produced during computation, including activation patterns, attention maps and output probability distributions. The goal is to identify signs of failure as the model generates text.

These challenges are a major focus of the research group led by Dr. Haggai Maron from the Andrew and Erna Viterbi Faculty of Electrical and Computer Engineering at the Technion, in collaboration with researchers from other universities and NVIDIA.

The researchers say the effort reflects a broader shift in AI interpretability, away from fully explaining how models work and toward building tools that can monitor and flag problematic behavior in real time.

Large language models are widely used for writing, coding and search, but they can also produce incorrect information, repeat memorized data or generate plausible but false statements. While prior research has attempted to interpret how models arrive at their outputs, reliably connecting internal computations to specific failure modes has remained difficult.

One common method, known as probing, trains simple classifiers on selected hidden states inside a model to predict whether an output will be wrong.
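
For readers who want to see what a probe looks like in practice, the following is a minimal sketch, not the researchers' code. It fits a logistic-regression classifier on vectors standing in for hidden states. The data are synthetic, and every name here (`make_toy_states`, `train_probe`, `predict_proba`, `accuracy`) is hypothetical and chosen for illustration; a real probe would be trained on hidden states extracted from an actual model, paired with labels marking which outputs were wrong.

```python
import math
import random


def predict_proba(w, b, x):
    """Probability that the output tied to hidden state x is wrong."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(states, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe with plain gradient descent."""
    dim, n = len(states[0]), len(states)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(states, labels):
            err = predict_proba(w, b, x) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def make_toy_states(n, dim, rng):
    """Synthetic stand-in for hidden states (an assumption, not model data).

    Label 1 means "output was wrong"; those states are shifted along
    dimension 0 so that a linear probe has something to find.
    """
    states, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += 3.0 * y
        states.append(x)
        labels.append(y)
    return states, labels


def accuracy(w, b, states, labels):
    """Fraction of states whose thresholded prediction matches the label."""
    hits = sum((predict_proba(w, b, x) >= 0.5) == bool(y)
               for x, y in zip(states, labels))
    return hits / len(labels)


if __name__ == "__main__":
    rng = random.Random(0)
    train_x, train_y = make_toy_states(200, 8, rng)
    test_x, test_y = make_toy_states(200, 8, rng)
    w, b = train_probe(train_x, train_y)
    print(f"held-out probe accuracy: {accuracy(w, b, test_x, test_y):.2f}")
```

The probe sees one vector per example, which is the limitation the researchers point to: a single layer and token position is a small slice of what the model computes during inference.
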
But the researchers argue that this approach uses only a small fraction of the available information generated during inference.

Their work instead treats the model’s internal activity as structured data that can be analyzed more directly.

In one study, presented at NeurIPS 2025, the team introduced a system called ACT-ViT that examines activation patterns across all layers and tokens in a model, rather than focusing on a single internal layer or token position. These activation patterns are treated as a kind of multidimensional grid, similar to an image, and processed using a Vision Transformer architecture.

To make the method work across different models, the system maps each model’s internal representations into a shared space before analysis.

In tests across multiple language models and benchmarks designed to measure hallucinations, the system outperformed standard probing methods. In additional experiments, it was trained on several models and then adapted to a previously unseen model using only a small additional component, while keeping the main system fixed. Performance remained strong, suggesting that it learned patterns that generalize across architectures.

From left to right: Guy Bar-Shalom, Dr. Haggai Maron, Fabrizio Frasca (Photo: Technion)

A second line of research, presented at ICLR 2026, focuses on attention mechanisms, which determine how much weight a model gives different parts of its input when generating text.

Instead of summarizing attention with simple statistics, such as how much focus is placed on input versus generated text, the researchers represent attention patterns as graphs.
In this representation, words or tokens become nodes, and attention strengths define the connections between them.

A graph neural network called CHARM processes these structures and produces predictions about whether individual tokens or entire responses are likely to contain hallucinations.

The system was able to outperform earlier approaches and, in some cases, identify specific sections of generated text where errors were likely to occur.

A third study, presented at AAAI 2026, moves outside the model entirely and focuses only on output probabilities: the likelihoods assigned to each possible next word.

This setting is especially relevant for commercial AI systems, where internal model states are often not accessible. The researchers define what they call the “LLM Output Signature,” which includes both the full distribution of probabilities and the probabilities assigned to the tokens that were actually generated.

Most existing methods rely only on the probability of the chosen word, but the researchers found that this can miss important signals about uncertainty. Their model, LOS-Net, incorporates the full probability distribution and processes it with a lightweight transformer.

In tests, LOS-Net detected hallucinations and instances of data contamination, where models may have been trained on or exposed to evaluation data. It also showed it could transfer between different models without retraining, a capability the researchers say is important for real-world deployment.

Taken together, the studies suggest that language models contain a rich set of internal signals that can be used to monitor behavior more effectively than current methods allow.

The researchers say future work will explore whether these different approaches (activations, attention and output distributions) can be combined into a single system for broader and more reliable monitoring of AI systems.
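
To illustrate why the probability of the chosen word alone can miss uncertainty, here is a small sketch of per-token features drawn from the full output distribution. It is not LOS-Net and does not reproduce the researchers' encoding of the LLM Output Signature. The function names (`entropy`, `output_signature`), the feature set and the toy distributions are assumptions made for illustration only.

```python
import math


def entropy(dist):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def output_signature(step_dists, chosen_ids, top_k=3):
    """Build one feature row per generated token.

    step_dists: list of probability lists, one per generation step.
    chosen_ids: index of the token actually generated at each step.
    The features below are illustrative choices, not the paper's.
    """
    rows = []
    for dist, tok in zip(step_dists, chosen_ids):
        rows.append({
            "chosen_prob": dist[tok],
            # rank 0 means no other token had a higher probability
            "chosen_rank": sum(1 for p in dist if p > dist[tok]),
            "entropy": entropy(dist),
            "top_k": sorted(dist, reverse=True)[:top_k],
        })
    return rows


if __name__ == "__main__":
    # Two toy steps over a four-token vocabulary. The chosen token has
    # probability 0.5 in both, but the remaining mass is spread differently.
    dists = [
        [0.5, 0.5, 0.0, 0.0],
        [0.5, 0.125, 0.125, 0.25],
    ]
    for row in output_signature(dists, chosen_ids=[0, 0]):
        print(row)
```

In both toy steps the generated token has the same probability, so a method that looks only at that number treats them as identical. The entropy and top-k values differ, which is the kind of extra signal the full distribution carries.
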
