Executive Summary

A European digital media publisher needed to determine which foundation model on Amazon Bedrock produces the highest-quality podcast-style summaries of news articles. Rather than selecting a model based on general benchmarks, the publisher built a serverless evaluation pipeline on AWS that runs structured experiments: it compares multiple models in parallel, scores their outputs with an LLM-as-Judge approach, and delivers actionable insights to both technical and editorial teams.

This post describes the business drivers, architectural approach, evaluation methodology, and outcomes of the proof of concept (PoC), built entirely on AWS-native services.

Business Challenge

The customer is a digital media publisher experiencing declining engagement as user consumption shifts toward flexible, audio-first formats. Their strategic objective is to evolve from traditional text delivery to personalized, AI-driven audio experiences, such as user-specific podcast-style summaries generated from their existing article library.