The conclusion first: pre-rendering diagrams and charts to PNG before compositing them onto slides, rather than generating visual content inline or inside ffmpeg, is the right architecture for a CI video pipeline. Mermaid needs a Chromium-backed renderer, matplotlib runs headless, and ffmpeg expects static frames, so a shared PNG handoff is the only approach that keeps each piece testable and replaceable.

I added three new slide types to the YouTube slide renderer last week: diagram (Mermaid flowcharts and sequence diagrams), chart (branded horizontal bar charts via matplotlib), and image (license-clear photos from Openverse). The existing slides — title, bullets, table, tool, outro — all draw directly with Pillow. These three render externally, produce a PNG, and get pasted into the same Pillow canvas. Same output contract, different render path.
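To make the shared output contract concrete, here is a minimal sketch of the paste step. It assumes the external renderer has already written a PNG to disk. The names `fit_box` and `paste_prerendered`, and the idea of a rectangular content box on the slide, are hypothetical illustrations rather than the renderer's actual code.

```python
def fit_box(src_w, src_h, box_w, box_h):
    """Scale (src_w, src_h) to fit inside the box, preserving aspect ratio."""
    scale = min(box_w / src_w, box_h / src_h)
    return max(1, round(src_w * scale)), max(1, round(src_h * scale))


def paste_prerendered(canvas, png_path, box):
    """Paste a pre-rendered PNG, centered, into box = (left, top, right, bottom).

    `canvas` is the Pillow image the other slide types already draw on.
    """
    # Pillow is imported here so fit_box stays usable without it installed.
    from PIL import Image

    left, top, right, bottom = box
    with Image.open(png_path) as src:
        rgba = src.convert("RGBA")
    w, h = fit_box(rgba.width, rgba.height, right - left, bottom - top)
    resized = rgba.resize((w, h), Image.LANCZOS)

    # Center the resized image inside the content box.
    x = left + (right - left - w) // 2
    y = top + (bottom - top - h) // 2

    # Passing the image as its own mask keeps transparent regions
    # showing the slide background instead of turning black.
    canvas.paste(resized, (x, y), resized)
    return canvas
```

Whatever produced the PNG (Mermaid, matplotlib, or a downloaded photo), the compositing code only sees a file path, which is what lets each renderer be tested and swapped on its own.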

Why pre-render instead of embed

The two-host pipeline assembles video by compositing a still image for each dialogue segment, synthesizing audio with edge-tts, and using ffmpeg to concatenate the clips. ffmpeg expects the still to be a file or a stream of identical frames — it does not run JavaScript, and it cannot call a browser mid-concat.
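The still-plus-audio step and the concat step can be sketched as plain ffmpeg argument lists. This is an illustration under assumptions, not the pipeline's actual invocation: the codec choices (`libx264`, `aac`), the use of the concat demuxer, and the function names are all mine.

```python
def still_clip_cmd(still_png, audio_path, out_path):
    """Build an ffmpeg command that turns one PNG plus one audio file into a clip."""
    return [
        "ffmpeg", "-y",
        "-loop", "1", "-i", still_png,   # repeat the single PNG as the video stream
        "-i", audio_path,                # synthesized speech for this segment
        "-c:v", "libx264", "-tune", "stillimage", "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        "-shortest",                     # end the clip when the audio ends
        out_path,
    ]


def concat_list(clip_paths):
    """Return the text of a concat-demuxer list file, one clip per line.

    Assumes the paths contain no single quotes; those would need escaping.
    """
    return "".join(f"file '{p}'\n" for p in clip_paths)


def concat_cmd(list_path, out_path):
    """Build an ffmpeg command that joins the clips named in list_path without re-encoding."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0",
        "-i", list_path,
        "-c", "copy",
        out_path,
    ]
```

Each list can be handed to `subprocess.run(cmd, check=True)`. The point is that every input ffmpeg sees is already a finished file on disk, so nothing has to render during the concat.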