96% of cuBLAS, no `unsafe`: what cuTile Rust proves

GPU programming usually asks Rust developers to surrender the borrow checker at the launch boundary: references collapse into raw pointers, and aliasing, synchronization, and stream lifetimes become hand-managed invariants. A new NVIDIA Labs paper argues that trade is unnecessary.

How cuTile Rust Extends the Borrow Discipline to GPU Dispatch

cuTile Rust is a tile-based DSL that carries Rust's ownership and borrowing rules across the host-to-GPU launch boundary, not just through host code. Introduced in "Fearless Concurrency on the GPU" (arXiv:2606.15991), submitted by NVIDIA researchers Melih Elibol, Jared Roesch, Isaac Gelado, Eric Buehler, and Michael Garland, it lets you author the kernel itself in idiomatic, memory-safe Rust rather than wrapping hand-written unsafe CUDA.

The guarantee comes from how the types are constructed, not from a runtime lock. Before launch, mutable output tensors are partitioned into provably disjoint tiles; each tile program then receives an exclusive `&mut` view of its slice, while inputs arrive as shared `&` references. Because the partitions cannot overlap, each tile program has single-threaded semantics and the kernel is data-race-free by construction, yet it still compiles to massively parallel GPU code. As Melih Elibol put it, "each tile program gets an exclusive &mut view of its memory, plus the inputs as shared references" (source: users.rust-lang.org). Explicit unchecked types remain available for local opt-out when you need lower-level control.
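The same borrowing pattern can be sketched on the CPU with nothing but the Rust standard library. This is a host-side analogy, not cuTile Rust's API: the function name `tiled_add`, the tile size, and the use of scoped threads are all assumptions chosen for illustration. `chunks_mut` stands in for the pre-launch partitioning step, and each spawned closure plays the role of a tile program that holds one exclusive `&mut` tile plus shared `&` views of the inputs.

```rust
use std::thread;

/// Hypothetical CPU analogy of a tile launch: adds `a` and `b` element-wise
/// into `out`, with one scoped thread per output tile.
fn tiled_add(a: &[f32], b: &[f32], out: &mut [f32], tile: usize) {
    assert!(tile > 0, "tile size must be positive");
    assert_eq!(a.len(), out.len());
    assert_eq!(b.len(), out.len());

    thread::scope(|s| {
        // `chunks_mut` splits `out` into non-overlapping `&mut [f32]` tiles,
        // so the borrow checker can see that no two workers alias.
        for (i, out_tile) in out.chunks_mut(tile).enumerate() {
            let start = i * tile;
            let end = start + out_tile.len();
            // Inputs are only read, so they are passed as shared slices.
            let a_tile = &a[start..end];
            let b_tile = &b[start..end];
            s.spawn(move || {
                // Inside a worker the code is ordinary sequential Rust.
                for ((o, x), y) in out_tile.iter_mut().zip(a_tile).zip(b_tile) {
                    *o = x + y;
                }
            });
        }
    });
}

fn main() {
    let a: Vec<f32> = (0..8).map(|i| i as f32).collect();
    let b = vec![10.0_f32; 8];
    let mut out = vec![0.0_f32; 8];
    tiled_add(&a, &b, &mut out, 3);
    println!("{:?}", out);
}
```

Handing the same output tile to two workers would be rejected at compile time, because it would require two live `&mut` borrows of the same memory. The article describes cuTile Rust as providing that kind of guarantee across the GPU launch boundary.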

