Most AI security discussions focus on the perimeter — protecting API endpoints, filtering inputs, and monitoring outputs. But what if the threat isn't at the perimeter at all? What if it's already inside the model before you even deploy it?
Model poisoning is the supply chain attack of the AI era. It can slip past traditional security controls such as input filtering and output monitoring because the malicious behavior lives inside the model weights themselves, dormant until triggered. And with the explosion of open-source models, pre-trained checkpoints, and third-party fine-tuning services, the attack surface has never been larger.
How Model Poisoning Works
Model poisoning comes in several flavors, but the core mechanism is the same: an attacker manipulates a model during training or fine-tuning to embed a hidden behavior that only activates under specific conditions.
Data poisoning. The attacker contaminates the training dataset with carefully crafted samples. For supervised learning, this might mean mislabeling a subset of the data to shift the decision boundary. For reinforcement learning, it could mean corrupting the reward signal so that the model is rewarded for actions that appear correct in training but are harmful in production. Either way, the model learns the poisoned behavior as part of its weights: there is no code-level backdoor to find and no configuration change to detect.
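To make the supervised case concrete, here is a minimal toy sketch of label flipping. Everything in it is an assumption made for illustration: the one-dimensional "risk score" feature, the nearest-centroid classifier, and the helper names train_threshold, predict, and flip_labels are hypothetical and stand in for a real model and training pipeline. It only shows how relabeling a few samples near the boundary moves the learned threshold.

```python
from statistics import mean

def train_threshold(samples):
    """Fit a 1-D nearest-centroid classifier.

    The decision boundary is the midpoint between the two class means.
    """
    benign = [x for x, y in samples if y == 0]
    malicious = [x for x, y in samples if y == 1]
    return (mean(benign) + mean(malicious)) / 2

def predict(threshold, x):
    """Return 1 (malicious) if the score is above the boundary, else 0 (benign)."""
    return 1 if x > threshold else 0

def flip_labels(samples, lo, hi):
    """Simulate the attacker: relabel malicious samples with scores in [lo, hi] as benign."""
    return [(x, 0) if y == 1 and lo <= x <= hi else (x, y) for x, y in samples]

# Hypothetical clean data: scores 0..4 are benign (0), scores 6..10 are malicious (1).
clean = [(float(x), 0) for x in range(0, 5)] + [(float(x), 1) for x in range(6, 11)]

# The attacker flips only the two malicious samples closest to the boundary.
poisoned = flip_labels(clean, 6.0, 7.0)

clean_threshold = train_threshold(clean)
poisoned_threshold = train_threshold(poisoned)

borderline = 6.0  # an input the clean model flags as malicious
print("clean boundary:   ", clean_threshold)
print("poisoned boundary:", round(poisoned_threshold, 3))
print("clean model on 6.0:   ", predict(clean_threshold, borderline))
print("poisoned model on 6.0:", predict(poisoned_threshold, borderline))
```

In this toy setup, two flipped labels out of ten samples are enough to move the boundary past the borderline input, so the poisoned model classifies it as benign. Nothing in the training code changed; the only difference is in the data, which is why the behavior ends up encoded in the learned parameters rather than in anything a code review would catch.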










