SummaryWe were able to get ahead of Copy Fail (CVE‑2026‑31431) by treating it as a fleet‑level emergency, shutting off the vulnerable crypto socket interface across our infrastructure within hours and rolling in kernel patches once they were stable in our AI workloads. Before upstream fixes were widely available, we relied on a targeted kernel hardening step: Unloading the vulnerable module and removing it from the module path so it could not be silently re-enabled.Copy Fail in one paragraphCopy Fail (CVE‑2026‑31431) is a logic bug in the Linux kernel’s crypto subsystem in the algif_aead AF_ALG interface used for AEAD operations. It gives any unprivileged local user a precise 4‑byte write primitive into the page cache of any readable file on the system. In practice, public exploits flip a few bytes in shared, setuid binaries in memory and ride that to root on mainstream Linux distributions. The on‑disk file never changes, and the page is never marked dirty, which means traditional file‑integrity checks don’t see the attack even as the modified binary runs.Why this matters for AI infrastructureOn a developer laptop, Copy Fail is just a local privilege escalation. In a modern AI platform, “local” usually means CI jobs, multi‑tenant GPU nodes, ephemeral research environments, or third‑party workloads bringing their own dependencies.From a cloud and AI perspective, the risk looks like this:A compromise inside a container with access to AF_ALG sockets can be turned into root on the underlying host.Because the page cache is shared, a write from one workload can subtly corrupt binaries or libraries used by other tenants on the same node.Once a host is rooted, access to attached storage, control planes, and adjacent workloads becomes much easier.We already operate under the assumption that containers are not a security boundary. Copy Fail is exactly the kind of quiet, deterministic primitive that can collapse the remaining margin in shared‑kernel multi‑tenant environments if you leave the vulnerable interface exposed.Our immediate response: disable algif_aead everywhereAs soon as working exploit details landed, we focused on the most direct lever available: Stop exposing the vulnerable AF_ALG interface.For Together AI’s production workloads, we do not depend on userspace algif_aead sockets on inference or training hosts. That gave us the freedom to take a blunt but safe action across the fleet:Unloading the algif_aead module shut down the vulnerable code path immediately in the running kernel. Moving the module file out of the standard module directory prevented system services or automation from re‑loading it later during normal operations.This approach had a few important properties:Fast: No reboot required, which matters when you’re running long‑lived GPU jobs.Low‑risk: Typical server and AI workloads don’t rely on AF_ALG AEAD sockets directly, so the operational impact was minimal.Durable: Even if a host rebooted into the same vulnerable kernel, it came back up with algif_aead still disabled.We encoded this as an idempotent compliance check in our configuration management: A host is not considered healthy until the module is unloaded and the .ko file is quarantined.Rolling out kernel patches safelyDisabling algif_aead was a mitigation, not the final state. Once vendors release patches for CVE‑2026‑31431, we will move to a more traditional lifecycle:Stage patched kernels in non‑production clusters that mirror our heaviest AI workloads, including dense multi‑tenant GPU nodes.Run accelerated soak tests for performance, GPU driver compatibility, and stability under real inference and training loads.Roll out patched kernels gradually by region and environment, starting with less shared clusters and moving toward heavily multi‑tenant ones as telemetry stayed clean.Even after patching, we are keeping algif_aead disabled in environments that do not have a clear need for it. Narrow, specialized kernel interfaces can have an ecosystem‑wide blast radius once something goes wrong; if we can safely run without them, we will.In parallel, our detection teams added Copy Fail‑aware signals into our telemetry:Alerts for unexpected AF_ALG usage or crypto module loading on nodes where it should never happen.Behavioral monitoring for privileged binaries, looking for anomalies even when the on‑disk image remains unchanged.Lessons for running secure AI platformsCopy Fail is a good illustration of how small kernel bugs can have outsized impact in AI infrastructure:Shared kernels and dense multi‑tenancy amplify local bugs into cross‑tenant risks.Page cache tricks can bypass traditional file‑integrity‑based defenses.Narrow interfaces that “nobody uses” can suddenly become the main attack surface.Our takeaway at Together AI is to keep tightening our kernel exposure model: Default‑off for niche interfaces, fast fleet‑wide toggles when something goes wrong, and a validation pipeline that proves these decisions are compatible with high‑performance AI workloads.‍