As agentic systems become common in the enterprise, one thing is clear to anyone concerned about sovereignty in AI factories: inference is the hard part to scale.

One of the many benefits of a sovereign cloud is agency in how you accomplish your operations. Sovereign AI means you have control over the agents in your applications, workflows, and value delivery chain. Much of agent behavior depends on interactions with the model, so a truly sovereign agentic system requires sovereign inference, which in turn demands accelerators and AI models that are fully under your control.

For flexible, general-purpose agentic systems, you need large models. Fine-tuning works well for cost efficiency on individual agentic use cases, but large models become a prerequisite for things like deep research agents. Any organization implementing these solutions wants to provide performant access to the models its teams need, while balancing the capital expenditure of accelerator infrastructure against the dynamic runtime requirements of that inference stack.

Autoscaling: The solution, and a new problem

Red Hat AI has supported autoscaling generative AI inference since we started shipping vLLM images in KServe. We're continually adding new features to improve the experience, such as recent support for load-aware autoscaling of vLLM inference pods in llm-d.

When a model inference pod starts on a node for the first time, it must load the model weights into your accelerator's memory before it can serve inference requests. Models can be stored in many ways, but Hugging Face is a common model repository. This means the model weights must be downloaded from the internet at least once, over your WAN connection.
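
As a rough sketch of how link speed bounds that first download, the arithmetic can be written out as below. The function name `download_seconds`, the `efficiency` factor, and the bytes-per-parameter values are illustrative assumptions, not figures from any particular deployment. Actual transfer time also depends on the model's storage format, protocol overhead, and how much of the link is really available.

```python
def download_seconds(param_count, bytes_per_param, link_gbps, efficiency=1.0):
    """Estimate the seconds needed to transfer model weights over a network link.

    param_count     -- number of model parameters (for example, 1e12)
    bytes_per_param -- storage size per parameter; an assumption that depends
                       on the weight precision (0.5 for 4-bit, 1 for 8-bit,
                       2 for 16-bit)
    link_gbps       -- nominal link speed in gigabits per second
    efficiency      -- assumed fraction of the nominal speed actually achieved
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be positive and efficiency in (0, 1]")
    total_bits = param_count * bytes_per_param * 8
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return total_bits / usable_bits_per_second


if __name__ == "__main__":
    # Hypothetical precisions for a one-trillion-parameter model on a 1 Gbps link.
    for label, bytes_per_param in [("4-bit", 0.5), ("8-bit", 1.0), ("16-bit", 2.0)]:
        minutes = download_seconds(1e12, bytes_per_param, link_gbps=1.0) / 60
        print(f"{label} weights: about {minutes:.0f} minutes at line rate")
```

Whatever precision is assumed, the result is measured in hours rather than seconds, which is the order of magnitude that matters when a new pod has to react to a traffic spike.
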
Downloading a model with over one trillion parameters would take about an hour and a half on a gigabit WAN connection. Consequently, autoscaling that requires downloading a model from Hugging Face every time a new pod starts severely impacts your agility in reacting to traffic spikes.

Addressing the problem at the architecture level

Sovereign deployment models typically require an organizational governance process to approve different models for use. After approval, weights must be downloaded at least once, and then made available in a way that enables your inference platform to access them efficiently.

We worked with a large manufacturing customer to demonstrate how autoscaling generative AI inference works in practice, and our deployments took advantage of several characteristics of their platform's integration with ours.