Originally published at htpbe.tech. The version on htpbe.tech stays in sync with the latest detection algorithm — refer to it for the canonical text.
Every month we look at aggregate, anonymized data from checks processed through the HTPBE web interface and write up what the structural signals tell us about the state of PDF tampering. No file contents, no personally identifiable information — only the structural and metadata patterns our algorithm uses to classify documents.
A note before the numbers. April was an unusual month for the dataset. Alongside the organic stream of public submissions, we ran a large internal adversarial-testing batch: well over ten thousand synthetic and modified PDFs, generated and edited through every technique we know about, used to harden the algorithm against new tampering classes. Those test files are mixed into the structural counts below, so the absolute volume for April is dominated by our own internal testing and cannot be meaningfully compared with March. What is meaningful is the shape of what we saw: the proportions, the shifts in which signals fire, and the categories of tampering that became detectable for the first time. Those are what this report covers.
