At Together AI, we have been investing in the ThunderKittens framework - a software library that we developed in collaboration with researchers at Stanford to make it easier to quickly write performant GPU kernels. Today, we're open-sourcing some kernels written for NVIDIA Blackwell architecture. Our customers can use these on Together GPU Clusters today, and they're coming soon to the Together Kernel Collection.This blog is cross-posted to the Hazy Research blog at Stanford University.We've been having fun playing around with some NVIDIA Blackwell GPUs with our collaborators at Together AI over the past few weeks and reading about all the exciting new features. The cool thing is – turns out the new features, from 5th-generation tensor cores, to Tensor Memory and CTA pairs, fit pretty well into TK's existing tile-based abstractions. It's all about dataflow!Today, we're releasing a few new kernels for the NVIDIA Blackwell architecture, written in ThunderKittens:BF16 and FP8 ThunderKittens GEMM kernels – running at or near cuBLAS speeds, and up to 2x faster than cuBLAS GEMMs on H100.Attention forwards and backwards – both running at near-cuDNN speeds on B200, and up to 2x faster than FA3 on H100.You can now try out the new GEMM kernels and the attention fwd/bwd kernels.In the remainder of this blog, we're going to take a deep dive into how we use the new hardware features for these kernels, as well as how we adapt kernels written for the NVIDIA Hopper architecture to the new NVIDIA Blackwell architecture. By the end of this blog, you'll learn all about these new features and how they make attention go vroom!It's All About the DataflowIn our experience, writing performant kernels on NVIDIA Blackwell GPUs feels a lot more like programming a dataflow-machine than writing traditional (circa ~2022) CUDA kernels. It's all about loading in enough data at a high enough throughput to keep the tensor cores hot. In the H100, the main mechanism for doing this was using warp-specialized TMA loads to asynchronously fetch data while the tensor cores are doing their computation (e.g., in attention, you asynchronously load the next tiles of K and V while computing the current QK^T tiles and online softmax).On the B200, this is even more important – the tensor cores now have 2–2.5x the power of those on the H100. And to fully utilize all that compute, we need to be loading in a lot more data all at once. Luckily, the new hardware features make it easier to build deeper pipelines on B200.Matrix MultiplicationOf all of the kernels one can run, a matrix multiply kernel has the least excuse for bubbles in the data pipeline. It turns out with a little bit of care, one can eliminate just about all of them!Top: TK's Tensor Pipe PM Sampling – almost no bubbles!Bottom: cuBLAS Tensor Pipe PM Sampling for reference.Our new matrix multiplication kernel has a few tricks up its sleeve that are different from the NVIDIA Hopper architecture:We launch threadblock clusters to take advantage of the CTA pair mechanism (more on that later) – this increases reuse and reduces bandwidth requirements on shared memory.We reserve two producer warps to launch the matrix multiplies for each consumer warpgroup. Consumers no longer launch their own matrix multiplies!MMA instructions directly signal to the producer load warps that pipeline stages are freed and ready to be filled.Producers signal consumers that output accumulators are finished and ready.Consumers pipeline output accumulators into registers, into shared memory, and then out into HBM. We even serialize the consumer warpgroups and force one to load tensor memory into registers and signal the producers before the other can load its tensor memory, so that these loads are pipelined, too.We adopt a persistent kernel, so that we can pipeline the next inputs while the previous outputs are being written out. In fact, we can even launch the next matrix multiply accumulate block while the previous is still in tensor memory.The end result is that there is only one bubble in the whole tensor pipeline: when the first consumer warpgroup reads its output accumulator into registers. We think this takes about 140 ns every few hundred microseconds; the rest is all tensor cores.AttentionOne important optimization turns out to be launching the AV MMA's from the previous iteration of the attention loop while starting the QK MMA of the iteration, and loading the K and V tiles of the next iteration. In pseudocode, this looks like:
ThunderKittens Now Optimized for NVIDIA Blackwell GPUs
At Together AI, we have been investing in the ThunderKittens framework - a software library that we developed in collaboration with researchers at Stanford to make it easier to quickly write performant GPU kernels. Today, we're open-sourcing some kernels written for NVIDIA Blackwell architecture. Our customers can use these on Together GPU Clusters today, and they're coming soon to the Together Kernel Collection.This blog is cross-posted to the Hazy Research blog at Stanford University.We've been having fun playing around with some NVIDIA Blackwell GPUs with our collaborators at Together AI over the past few weeks and reading about all the exciting new features. The cool thing is – turns out the new features, from 5th-generation tensor cores, to Tensor Memory and CTA pairs, fit pretty well into TK's existing tile-based abstractions. It's all about dataflow!Today, we're releasing a few new kernels for the NVIDIA Blackwell architecture, written in ThunderKittens:BF16 and FP8 ThunderKittens GEMM kernels – running at or near cuBLAS speeds, and up to 2x faster than cuBLAS GEMMs on H100.Attention forwards and backwards – both running at near-cuDNN speeds on B200, and up to 2x faster than FA3 on H100.You can now try out the new GEMM kernels and the attention fwd/bwd kernels.In the remainder of this blog, we're going to take a deep dive into how we use the new hardware features for these kernels, as well as how we adapt kernels written for the NVIDIA Hopper architecture to the new NVIDIA Blackwell architecture. By the end of this blog, you'll learn all about these new features and how they make attention go vroom!It's All About the DataflowIn our experience, writing performant kernels on NVIDIA Blackwell GPUs feels a lot more like programming a dataflow-machine than writing traditional (circa ~2022) CUDA kernels. It's all about loading in enough data at a high enough throughput to keep the tensor cores hot. In the H100, the main mechanism for doing this was using warp-specialized TMA loads to asynchronously fetch data while the tensor cores are doing their computation (e.g., in attention, you asynchronously load the next tiles of K and V while computing the current QK^T tiles and online softmax).On the B200, this is even more important – the tensor cores now have 2–2.5x the power of those on the H100. And to fully utilize all that compute, we need to be loading in a lot more data all at once. Luckily, the new hardware features make it easier to build deeper pipelines on B200.Matrix MultiplicationOf all of the kernels one can run, a matrix multiply kernel has the least excuse for bubbles in the data pipeline. It turns out with a little bit of care, one can eliminate just about all of them!Top: TK's Tensor Pipe PM Sampling – almost no bubbles!Bottom: cuBLAS Tensor Pipe PM Sampling for reference.Our new matrix multiplication kernel has a few tricks up its sleeve that are different from the NVIDIA Hopper architecture:We launch threadblock clusters to take advantage of the CTA pair mechanism (more on that later) – this increases reuse and reduces bandwidth requirements on shared memory.We reserve two producer warps to launch the matrix multiplies for each consumer warpgroup. Consumers no longer launch their own matrix multiplies!MMA instructions directly signal to the producer load warps that pipeline stages are freed and ready to be filled.Producers signal consumers that output accumulators are finished and ready.Consumers pipeline output accumulators into registers, into shared memory, and then out into HBM. We even serialize the consumer warpgroups and force one to load tensor memory into registers and signal the producers before the other can load its tensor memory, so that these loads are pipelined, too.We adopt a persistent kernel, so that we can pipeline the next inputs while the previous outputs are being written out. In fact, we can even launch the next matrix multiply accumulate block while the previous is still in tensor memory.The end result is that there is only one bubble in the whole tensor pipeline: when the first consumer warpgroup reads its output accumulator into registers. We think this takes about 140 ns every few hundred microseconds; the rest is all tensor cores.AttentionOne important optimization turns out to be launching the AV MMA's from the previous iteration of the attention loop while starting the QK MMA of the iteration, and loading the K and V tiles of the next iteration. In pseudocode, this looks like:












