In vision AI systems, model throughput continues to improve. The surrounding pipeline stages must keep pace, including decode, preprocessing, and GPU scheduling. In the previous post, Build High-Performance Vision AI Pipelines with NVIDIA CUDA-Accelerated VC-6, this was described as the data-to-tensor gap—a performance mismatch between AI pipeline stages.

The SMPTE VC-6 (ST 2117-1) codec addresses this gap through a hierarchical, tile-based architecture. Images are encoded as progressively refinable Levels of Quality (LoQs), each adding incremental detail. This enables selective retrieval and decoding of only the required resolution, region of interest, or color plane, with random access to independently decodable frames. Pipelines can retrieve and decode only what the model needs.

However, efficient single-image execution does not automatically translate to efficient scaling. As batch sizes grow, the bottleneck shifts from single-image kernel efficiency to workload orchestration, launch cadence, and GPU occupancy.

This post focuses on the architectural changes required to scale VC-6 decoding for batched inference and training workloads. As NVIDIA Nsight Systems and NVIDIA Nsight Compute allow developers to identify system- and kernel-level constraints, they were leveraged to redesign the VC-6 CUDA implementation for batch throughput. The result is up to ~85% lower per-image decode time compared to the previous implementation, with submillisecond decode for LoQ-0 (~4K) in batch and ~0.2 ms for lower LoQs, with identical output quality. This significantly improves pipeline efficiency for production vision AI workloads.