AI-enabled medical devices might perform well during testing, but some could still fail when used on real-world patients whose medical images differ from the data used to train the underlying models, according to a new Paragon Health Institute report examining "generalization uncertainty" in healthcare AI systems.

The paper describes generalization as an AI device's ability to process and interpret real-world data accurately outside controlled testing environments. It argues that generalization failures can create patient-safety risks, undermine clinician confidence and slow broader adoption of AI technologies in healthcare settings.

Unlike traditional software systems that rely on deterministic rules, AI medical devices frequently use predictive models trained on specific datasets.

The report argues that model performance is closely tied to the characteristics of that training data, meaning devices may struggle when they encounter patients, imaging techniques or clinical environments that differ significantly from the data used during development.

"Generalization uncertainty is a growing concern in clinical AI, particularly given current deficits in device validations," Kev Coleman, director of the Healthcare AI Initiative and a research fellow at the Paragon Health Institute, who authored the report, told Healthcare IT News.

Coleman argued that current approaches to addressing generalization problems remain limited. He noted that proposed solutions, including third-party algorithm certification, training-data review and physician evaluation of training-data suitability, can be costly, difficult to scale and poorly suited to future adaptive AI systems that continue to evolve after deployment.

"Too little training data or too much consistency among that data can result in the AI device working well during development but having problems in the real world," he explained.

The paper argues that broad demographic representation alone may not fully address algorithmic bias or reliability concerns. Even when training datasets include diverse demographic groups, individual patients whose medical images differ substantially from the dominant characteristics of the dataset may still face higher risks of inaccurate outputs.

The report highlights another frequently overlooked factor in AI reliability: variation introduced by imaging equipment and technician technique. Differences in radiology hardware, image quality and clinical workflows can all influence whether AI systems generalize successfully across healthcare environments.

Rather than mandating broader disclosure of proprietary training data, the report recommends an approach it calls "Digital Similarity Analysis," a voluntary tool that would compare an individual patient’s medical image against the device's training and testing data before the AI device is used.

Coleman said that in medical AI, validation and oversight gaps may vary by the type of AI algorithm used and the setting in which it is used.

He noted that the FDA is working to refine its oversight of AI devices because of discontinuities between AI and the agency's original vision for overseeing software as a medical device.

"One of the issues being considered is the role of post-market surveillance and when this activity is warranted," Coleman said.

The agency has also provided guidance to AI device manufacturers reinforcing the need for a total product life cycle (TPLC) risk management approach.

"TPLC is viewed as extremely important for AI given the possibility that the agency may someday approve adaptive or generative AI within a medical device," Coleman explained.

HIMSS is hosting the one-day AI Executive Leadership Summit in Boston on June 24, 2026, followed by its AI in Healthcare Forum June 25-26.

Nathan Eddy is a healthcare and technology freelancer based in Berlin.

Healthcare IT News is a HIMSS Media publication.