Introduction

Virtue Foundation is a nonprofit focused on global health delivery and on creating an efficient marketplace for global philanthropic healthcare. To date, it has delivered care to over 50,000 patients, with a special focus on Ghana and Mongolia. The backbone of this marketplace is the curation of global healthcare facility data through VF Match, a platform that connects medical professionals to volunteer opportunities in 72 low- and lower-middle-income countries. Databricks for Good has been partnering closely with Virtue Foundation since 2024 to use AI to aggregate data across these countries and make it actionable.

An initial proof of concept (POC) demonstrated that LLMs could extract structured information from disparate web data sources to create a map of healthcare infrastructure and, most importantly, of the gaps in services in under-resourced areas. However, scaling this functionality and moving it into production posed many challenges. Since that first iteration, we've built a Databricks-based platform that has transformed the POC into a production-grade system aggregating data from thousands of healthcare facilities and nonprofits across the globe.

In this article, we walk through how we improved on our earlier work to further enable Virtue Foundation to match its community of medical volunteers with critical needs in these countries.

Building the Foundation: 72 Countries of Healthcare Data

The core of VF Match is the Foundational Data Refresh (FDR): a comprehensive healthcare facility and nonprofit dataset built from the ground up from various web-based sources.
We systematically ingest and refresh data from 72 low- and lower-middle-income countries across the globe.

Two complementary data sources power this refresh:

Overture Maps: An open geospatial dataset from the Overture Maps Foundation, whose founding members include Meta and Microsoft, providing authoritative locations for healthcare facilities.

Bright Data: Industrial web-scraping infrastructure that captures real-time information from across the internet.

The heart of FDR is an information extraction pipeline powered by OpenAI's GPT models. Processing more than 25 million web pages through LLMs with production guarantees required rethinking traditional LLM inference pipelines. Rather than attempting to extract everything in one pass, our pipeline breaks the task into targeted steps: classifying medical relevance, identifying organization type (either a medical facility or an NGO), and extracting specialties, equipment, and procedures.
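To make the staged approach concrete, here is a minimal sketch of how the three steps could be chained, with early exits so that irrelevant pages never reach the more expensive extraction step. This is an illustration under stated assumptions, not the production pipeline: the `LLM` callable, the prompt wording, the `FacilityRecord` fields, and the keyword-based `fake_llm` stand-in are all hypothetical. In a real system the callable would wrap an actual model endpoint.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical interface: any function mapping a prompt string to the model's reply.
LLM = Callable[[str], str]

DETAIL_KEYS = ("specialties", "equipment", "procedures")


@dataclass
class FacilityRecord:
    url: str
    org_type: str  # "facility" or "ngo"
    specialties: List[str] = field(default_factory=list)
    equipment: List[str] = field(default_factory=list)
    procedures: List[str] = field(default_factory=list)


def ask(llm: LLM, task: str, instructions: str, page: str) -> str:
    # Illustrative prompt layout: task header, instructions, separator, page text.
    return llm(f"TASK: {task}\n{instructions}\n---\n{page}")


def is_medically_relevant(page: str, llm: LLM) -> bool:
    reply = ask(llm, "relevance", "Answer yes or no: is this page about healthcare delivery?", page)
    return reply.strip().lower().startswith("yes")


def classify_org_type(page: str, llm: LLM) -> Optional[str]:
    reply = ask(llm, "org_type", "Answer with exactly one word: facility or ngo.", page)
    label = reply.strip().lower()
    return label if label in {"facility", "ngo"} else None


def extract_details(page: str, llm: LLM) -> Dict[str, List[str]]:
    reply = ask(llm, "details", "Return JSON with list fields: " + ", ".join(DETAIL_KEYS), page)
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    # Keep only the expected keys and coerce anything malformed to an empty list.
    return {
        key: [str(v) for v in data[key]] if isinstance(data.get(key), list) else []
        for key in DETAIL_KEYS
    }


def process_page(url: str, page: str, llm: LLM) -> Optional[FacilityRecord]:
    if not is_medically_relevant(page, llm):
        return None  # stop early: no further model calls for irrelevant pages
    org_type = classify_org_type(page, llm)
    if org_type is None:
        return None  # unrecognized label: drop rather than guess
    return FacilityRecord(url=url, org_type=org_type, **extract_details(page, llm))


def fake_llm(prompt: str) -> str:
    # Keyword-based stand-in for a real model, used only to make the sketch runnable.
    header, page = prompt.split("\n---\n", 1)
    task = header.splitlines()[0].split(": ", 1)[1]
    page = page.lower()
    if task == "relevance":
        return "yes" if "hospital" in page or "clinic" in page else "no"
    if task == "org_type":
        return "ngo" if "nonprofit" in page else "facility"
    return json.dumps({
        "specialties": ["cardiology"] if "cardiology" in page else [],
        "equipment": ["ultrasound"] if "ultrasound" in page else [],
        "procedures": [],
    })


pages = {
    "https://example.org/a": "Example District Hospital offers cardiology and ultrasound scans.",
    "https://example.org/b": "Ten street food recipes to try this weekend.",
}
records = []
for page_url, page_text in pages.items():
    record = process_page(page_url, page_text, fake_llm)
    if record is not None:
        records.append(record)
```

Splitting the work this way keeps each prompt narrow and lets each step validate its own output. Because the cheap relevance check runs first, most pages in a large crawl can be discarded after a single model call.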