A 150M model that beats GPT-4-as-judge at catching RAG hallucinations, trained for $0

I built GroundCheck, a small open model that checks whether an AI answer is actually supported by the source it cites. It scores 0.682 F1 on the RAGTruth benchmark, ahead of the published GPT-4-turbo-as-judge baseline (0.634), and it returns a verdict in under a second on a laptop CPU. Total compute cost: zero — every training run fit inside Kaggle's free GPU quota.

Weights, benchmark code, and a pip package are public. This post is the honest version of how it went, including the part where the first model failed.

The problem

RAG pipelines answer questions from documents, and sometimes they state things the documents never said: a number quietly changed, "increased" turned into "decreased," a plausible fact invented from nowhere. Checking every answer with a frontier LLM-as-judge works, but it is slow (seconds), priced per token, and ships your data to a third party.

This is a narrow classification task, and narrow tasks are where small specialized models earn their keep. The premise is the source document, plus the user's question when available.
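To make the framing concrete, here is a minimal sketch of how one RAG output could be packed into a classifier input. It is an illustration under assumptions, not GroundCheck's actual input format: the `GroundingExample` type, the `build_example` helper, the "Question:/Source:" layout, and the idea that the answer being checked is the second half of the pair are all hypothetical names and choices made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroundingExample:
    premise: str     # source document, optionally with the user's question
    hypothesis: str  # assumption: the generated answer being checked


def build_example(
    source: str, answer: str, question: Optional[str] = None
) -> GroundingExample:
    """Pack one RAG output into a premise/hypothesis pair for a binary classifier."""
    premise = source.strip()
    if question:
        # Hypothetical layout; the post does not specify the real template.
        premise = f"Question: {question.strip()}\n\nSource: {premise}"
    return GroundingExample(premise=premise, hypothesis=answer.strip())


# The kind of error described above: "increased" quietly turned into "decreased".
example = build_example(
    source="Revenue increased 12% year over year.",
    answer="Revenue decreased 12% year over year.",
    question="How did revenue change?",
)
print(example.premise)
print(example.hypothesis)
```

A classifier trained on pairs like this only has to answer one question, supported or not, which is what keeps the task small enough for a compact model.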
