AI applications are moving beyond text generation to multimodal systems that can perceive, search, and reason across images, documents, video, and language in real time, turning fragmented information into actionable insights.
Step 3.7 Flash, the latest model from StepFun, brings these capabilities to enterprise-scale production and is available on NVIDIA-accelerated infrastructure. It is a 198B-parameter Mixture-of-Experts (MoE) vision-language model that activates approximately 11B parameters per forward pass and is optimized for agentic workflows that combine perception, search, and multi-step reasoning.
With native image and video input, three configurable reasoning levels (low, medium, and high), and a 256K context window, it is designed for enterprise use cases such as financial analysis, concurrent coding agents, and other high-throughput multimodal workloads. Developers can use StepFun’s NVFP4-quantized checkpoint, available through Hugging Face, for faster inference, since the quantized weights reduce memory bandwidth and storage requirements.
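To make the memory savings concrete, here is a rough back-of-the-envelope sketch of weight storage in BF16 versus NVFP4. It assumes every one of the 198B parameters is quantized and that NVFP4 stores a 4-bit value plus one 8-bit scale per 16-element block (about 4.5 bits per weight). Real checkpoints typically keep some layers in higher precision, and the estimate ignores the KV cache and activations, so treat the numbers as illustrative rather than measured.

```python
# Rough weight-memory estimate. Illustrative only, not a measurement.
# Assumption: all 198B parameters are stored in the given format.

def weight_gigabytes(num_params: float, bits_per_weight: float) -> float:
    """Return the weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

TOTAL_PARAMS = 198e9  # total parameter count stated for Step 3.7 Flash

BF16_BITS = 16.0
# Assumption: 4-bit value plus one 8-bit scale shared by each 16-element block.
NVFP4_BITS = 4.0 + 8.0 / 16.0

bf16_gb = weight_gigabytes(TOTAL_PARAMS, BF16_BITS)
nvfp4_gb = weight_gigabytes(TOTAL_PARAMS, NVFP4_BITS)

print(f"BF16 weights:  {bf16_gb:.0f} GB")
print(f"NVFP4 weights: {nvfp4_gb:.0f} GB")
print(f"Reduction:     {bf16_gb / nvfp4_gb:.2f}x")
```

Under these assumptions the weights shrink from roughly 396 GB to roughly 111 GB. Fewer bytes have to be read from memory for each generated token, which is where the inference speedup comes from.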
| Model | Step 3.7 Flash |
| --- | --- |
| Total parameters | 198B |
| Visual encoder parameters | 1.8B |
| Active parameters | 11B |
| Context length | 256K |
| Experts | 288 (8 active) |

Table 1. Overview of the key Step 3.7 Flash specs, such as parameter counts, context length, and MoE configuration
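The sparsity implied by these specs can be worked out directly. The short sketch below uses only the figures from Table 1. Reading "288 (8 active)" as 8 of 288 routed experts selected per token is an assumption about how the table should be interpreted.

```python
# Sparsity arithmetic from the Table 1 figures (values in billions of parameters).
TOTAL_PARAMS_B = 198.0
ACTIVE_PARAMS_B = 11.0
VISUAL_ENCODER_PARAMS_B = 1.8

# Assumption: 8 of 288 routed experts are selected for each token.
NUM_EXPERTS = 288
ACTIVE_EXPERTS = 8

active_param_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
active_expert_fraction = ACTIVE_EXPERTS / NUM_EXPERTS
visual_encoder_fraction = VISUAL_ENCODER_PARAMS_B / TOTAL_PARAMS_B

print(f"Active parameters: {active_param_fraction:.1%} of total")
print(f"Active experts:    {active_expert_fraction:.1%} of experts")
print(f"Visual encoder:    {visual_encoder_fraction:.1%} of total")
```

The active-parameter share (about 5.6%) is larger than the active-expert share (about 2.8%). That is what you would expect if non-expert components such as attention layers and embeddings run for every token, although the exact breakdown is not stated here.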










