When an AI coding assistant suggests a function and you accept it, you typically have no way of knowing where that code came from. It might be a novel synthesis derived from general programming patterns. It might also be a lightly paraphrased version of a GPL-licensed algorithm lifted from a GitHub repository you have never visited. The difference between those two outcomes carries real legal weight, and as of 2026 the tooling to tell them apart at the speed of development does not broadly exist.

That gap is what the phrase "license laundering" describes: the concern that generative AI models, trained on publicly available source code under diverse licenses, can reproduce or closely derive from copyleft-licensed material while stripping the obligations that originally attached to it. The resulting output carries no attribution, no LICENSE file reference, and no SPDX-License-Identifier header. From the perspective of the accepting developer and their employer, the code looks clean. The legal argument is that it is not.
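To make "looks clean" concrete, here is a minimal sketch of the kind of header check a reviewer or CI job might run. The function name `find_spdx_identifier`, the 10-line search window, and the two sample snippets are all hypothetical and chosen for illustration. The point is that a check like this can only report whether a marker is present; it says nothing about where the code originated.

```python
import re
from typing import Optional

# Matches the conventional SPDX short-form line, for example:
#   // SPDX-License-Identifier: GPL-3.0-or-later
_SPDX_RE = re.compile(r"SPDX-License-Identifier:\s*(.+)")


def find_spdx_identifier(source: str, max_lines: int = 10) -> Optional[str]:
    """Return the SPDX expression found near the top of a file, or None.

    Assumption: only the first `max_lines` lines are inspected, since the
    header is conventionally placed at the top of the file.
    """
    for line in source.splitlines()[:max_lines]:
        match = _SPDX_RE.search(line)
        if match:
            # Drop a trailing block-comment terminator such as "*/" if present.
            return match.group(1).replace("*/", "").strip()
    return None


# Hypothetical example: the same function body with and without its header.
original = (
    "// SPDX-License-Identifier: GPL-3.0-or-later\n"
    "int add(int a, int b) { return a + b; }\n"
)
suggested = "int add(int a, int b) { return a + b; }\n"

print(find_spdx_identifier(original))
print(find_spdx_identifier(suggested))
```

Both snippets contain identical logic, yet only the first one carries any signal that a license applies. A suggestion that arrives without the header passes this check silently.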

How the Mechanism Works

The training pipeline is where the problem originates. Large language models used for code generation are trained on datasets that include hundreds of millions of files scraped from public repositories. GitHub's public archive, for example, contains code under GPLv2, GPLv3, AGPL, LGPL, Apache 2.0, and MIT, as well as code with no discernible license at all, often interleaved within the same dataset batch. The models are not trained to track license provenance per token; they are trained to predict statistically likely next tokens given a context.
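The loss of provenance can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `SourceFile` record, the `build_training_stream` function, the two-file corpus, and the whitespace tokenization (real pipelines use subword tokenizers and far more elaborate preprocessing). It is not a description of any particular vendor's pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SourceFile:
    """Hypothetical record for one scraped file."""
    path: str
    license_id: str  # e.g. "GPL-3.0-only", "MIT", or "NOASSERTION"
    text: str


def build_training_stream(files: List[SourceFile]) -> List[str]:
    """Flatten files into a single token stream (toy dataset preparation)."""
    tokens: List[str] = []
    for f in files:
        # Only the text is tokenized; path and license_id are not carried forward.
        tokens.extend(f.text.split())
    return tokens


# Hypothetical corpus mixing a copyleft file and a permissive one.
corpus = [
    SourceFile("a/sort.c", "GPL-3.0-only", "void sort ( int * xs )"),
    SourceFile("b/util.py", "MIT", "def clamp ( x ) :"),
]

stream = build_training_stream(corpus)
print(stream)
```

Once the files are flattened, nothing in `stream` records which tokens came from the GPL-licensed file and which from the MIT-licensed one. A model trained on next-token prediction over such a stream has no license signal to learn from in the first place.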