Introducing container caching in Amazon SageMaker AI for faster model scaling


Today, we’re excited to announce container image caching for Amazon SageMaker AI inference, the next major step in our work to make scaling faster. Container caching makes end-to-end scaling up to 2x faster for generative AI models during scale-out events.

Tuesday, June 16, 2026

Artificial Intelligence

Over the years, Amazon SageMaker AI has continued to reduce latency across the stages of a scale-out event: detecting the need to scale out, provisioning instances, downloading container images, fetching model weights, and starting containers. Amazon SageMaker AI previously introduced sub-minute Amazon CloudWatch metrics, which help detect scale-out needs up to 6x faster than traditional mechanisms, and launched an inference component data caching solution that stores container images and model artifacts on already running instances. That cache reduced cold start latency for inference component scaling operations that reuse existing instances. Together, these features improved auto scaling responsiveness whenever an inference component can be placed on an already provisioned instance and use its existing cache.

With container caching, Amazon SageMaker AI extends these scaling improvements to scenarios where new instances must be launched. Container caching removes container image download latency in exactly that case, which our previous instance-store-based caching couldn’t cover. In this post, we show how container caching addresses the container image download bottleneck and demonstrate the performance improvements you can expect.
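To make the stage-by-stage reasoning concrete, here is a minimal Python sketch that treats scale-out latency as the sum of the five stages named above and compares which stages each caching approach skips. The stage durations are made-up placeholders, not measurements from this post, and the names `HYPOTHETICAL_SECONDS`, `SCENARIOS`, and `scale_out_latency` are hypothetical helpers rather than part of any SageMaker API. The mapping of scenarios to skipped stages is one reading of the description above.

```python
# Stage names follow the post. The durations are ILLUSTRATIVE PLACEHOLDERS,
# not measurements of any real endpoint.
HYPOTHETICAL_SECONDS = {
    "detect_scale_out": 10.0,
    "provision_instance": 60.0,
    "download_container_image": 120.0,
    "fetch_model_weights": 90.0,
    "start_container": 30.0,
}

# Which stages each scenario avoids (an assumption based on the text above).
SCENARIOS = {
    # New instance, nothing cached: every stage runs.
    "new instance, no caching": (),
    # Reused instance with the inference component data cache: no provisioning,
    # and the image and model artifacts are already on the instance.
    "existing instance, data cache": (
        "provision_instance",
        "download_container_image",
        "fetch_model_weights",
    ),
    # New instance with container caching: only the image download is removed.
    "new instance, container caching": ("download_container_image",),
}


def scale_out_latency(durations, skipped=()):
    """Sum the durations of all stages that are not skipped."""
    unknown = set(skipped) - set(durations)
    if unknown:
        raise ValueError(f"unknown stages: {sorted(unknown)}")
    return sum(sec for stage, sec in durations.items() if stage not in skipped)


if __name__ == "__main__":
    for name, skipped in SCENARIOS.items():
        total = scale_out_latency(HYPOTHETICAL_SECONDS, skipped)
        print(f"{name}: {total:.0f} s")
```

The point of the model is structural, not numeric: the benefit of container caching depends on how large the image download is relative to the other stages, so substitute your own measured durations before drawing conclusions.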


