The New Reliability Mandate: Why AI Forces A Rethink Of RAS

By Dr. Steven Woo, fellow and distinguished inventor at Rambus.

Reliability, availability and serviceability (RAS) is not a new concept, but AI is forcing a fundamental rethink of it. Popularized by IBM to improve mainframe uptime with hardware features, it defines how systems perform, recover and scale. As cloud architectures matured, software resiliency became a priority to complement RAS capabilities.

Now, AI and high-performance computing (HPC) are raising the importance of RAS in future systems. RAS will be an important factor in ensuring that platforms operate correctly across long runtimes, recover quickly from hardware failures and provide visibility into why failures occurred.

Memory plays a central role in this renewed emphasis on RAS, as data integrity issues can surface here long before they manifest as system failures. As AI architectures evolve toward distributed, agentic systems, RAS’s importance is extending beyond hardware and intersecting with the knowledge chain itself. In our agentic AI future, agents generate results for other agents, and the potential for undetected data errors to propagate means strong RAS strategies will be essential for shaping trust, observability and decision integrity in AI-driven environments.

Reliability: Ensuring Correct And Predictable AI Behavior

At its core, RAS is about reducing failures, minimizing disruption and accelerating recovery. Reliability ensures systems continue to produce correct results, even in the presence of minor problems.

In AI environments, reliability must be evaluated holistically, spanning hardware and software. On the hardware side, AI training and inference typically run on graphics processing units (GPUs), tensor processing units (TPUs) or AI accelerators that operate at high power and thermal thresholds.
On-die and system-level error-correcting code (ECC) memory helps detect and correct bit flips in model weights and activations, while checkpointing ensures progress isn’t lost when failures do occur.

Reliability also governs how AI systems behave over time. Safeguards are needed to protect against data corruption, numerical instability and silent degradation. Software reliability mechanisms such as validation pipelines and continuous monitoring help identify data errors or model drift before they cause lasting damage.

Without strong reliability, small issues can quietly cascade. Thermal stresses may cause a GPU to miscompute, or a corrupted data batch may undermine an entire training cycle. As an example, during a 54-day training run across 16,384 Nvidia H100 GPUs, Meta’s Llama 3 system encountered several hundred unexpected interruptions, many due to hardware failures, yet training continued because of robust fault tolerance throughout the system. Monitoring failures is critical to understanding long-term operational resilience. Without it, these issues can quickly ripple through an organization and its AI knowledge chain.

Availability: Keeping AI Systems Online And Responsive

Availability refers to keeping AI services usable when needed. If systems go offline, there can be lost revenue, lost trust and, in some industries, safety concerns. This dramatically raises both the stakes and the cost of downtime, which has reached an average of $15,000 per minute, according to a recent report by Cisco unit Splunk. That works out to $900,000 per hour, or nearly $1 million. Even a minor interruption can burn through valuable GPU time, stall product features, disrupt customer experiences or cut off live workflows.

Modern availability strategies depend on component and system architectures designed for early fault detection and rapid containment, as well as correction where possible.
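
To make the idea of detecting and correcting a bit flip concrete, here is a minimal sketch of a Hamming(7,4) code in Python. It is a toy illustration only: production memory ECC uses wider and stronger codes, and the function names `encode` and `decode` are hypothetical, chosen for this example.

```python
# Toy Hamming(7,4) code: 4 data bits are protected by 3 parity bits,
# so any single flipped bit in the 7-bit codeword can be located and fixed.
# Illustrative only; this is not how any specific memory device implements ECC.

def encode(data_bits):
    """Return a 7-bit codeword [p1, p2, d1, p4, d2, d3, d4] for 4 data bits."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def decode(codeword):
    """Correct up to one flipped bit; return (data_bits, error_position).

    error_position is 0 when no error was detected, otherwise the
    1-based position of the bit that was flipped back.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # nonzero value names the bad position
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome

# Usage: store four bits, simulate a single bit flip, then recover the data.
stored = encode([1, 0, 1, 1])
stored[4] ^= 1  # simulate a bit flip at codeword position 5
recovered, position = decode(stored)
print(recovered, position)  # [1, 0, 1, 1] 5
```

The nonzero syndrome does double duty: it signals that an error occurred and identifies where, which is the kind of information that supports both correction and the failure visibility discussed in this article. A code this small cannot handle two simultaneous flips, which is one reason real systems layer several protections.
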
Today’s memory architectures incorporate layered protections, such as on-die ECC and memory channel-level fault detection, to reduce uncorrectable errors and extend uptime. These improvements can provide visibility into system behavior, cut the risk of silent data corruption and help sustain long-term performance, which is particularly critical as Gartner expects AI spending to exceed $2.5 trillion this year.

Serviceability: Fixing, Updating And Improving AI

Serviceability has evolved from simply repairing hardware or restarting failed systems into a continuous life cycle discipline. Today, it focuses on monitoring, diagnosing and governing systems, because AI systems continuously evolve. Effective serviceability ensures changes can be applied safely and efficiently, without compromising reliability and with minimal disruption to availability. This includes the ability to isolate faults, roll back updates or audit system behavior, all while understanding how infrastructure, data and models interact.

As AI systems become more distributed and autonomous, serviceability also becomes central to trust. The ability to explain what changed, and why, is now a key part of maintaining accountability across the AI knowledge chain.

How Organizations Can Embrace RAS For AI

AI’s RAS evolution is a hardware-software co-design challenge that requires reframing error detection to meet the needs of today’s systems and adapting internal processes so that AI systems don’t fail. To ensure success, companies can take several steps to improve preparedness and build capability.

First, assembling the right teams who can monitor system performance and data integrity is key.
The traditional IT role needs to be expanded in the AI era to include specialists who understand the interaction of hardware and software, the specifics of AI systems and modern observability concerns.

As AI-specific hardware and platforms proliferate, organizations should also explore how to elevate RAS capabilities into a first-class design goal as critical as overall performance. The fastest hardware offers little value if it cannot be properly monitored, serviced or maintained, or if the results it produces cannot be trusted.

Memory is fundamental to AI’s success. As error detection capabilities increasingly move on-chip, navigating the hardware market requires new frameworks to ensure these solutions align directly with an organization’s strategic RAS requirements.

The RAS Imperative

AI has pushed computing past the limits of traditional reliability frameworks, forcing the industry to rethink the role of reliability, availability and serviceability, which should become strategic organizational functions rather than afterthoughts. And as AI becomes further embedded in every aspect of enterprise value creation, RAS is quickly becoming a primary differentiator of performance, innovation and long-term resilience.

Forbes Technology Council is an invitation-only community for world-class CIOs, CTOs and technology executives.
