Dealing with broken markup, embedded fonts, and namespace chaos while building LectuLibre's translation engine

At LectuLibre, we needed to translate entire EPUB books while preserving their exact visual structure. The core challenge: parse the EPUB, extract all translatable text, send it to an LLM, then reassemble the book with the translated content—images, CSS, fonts, and layout untouched. This turned out to be much harder than it looked. Here’s how we solved it, what broke, and what we learned.

The Problem: EPUBs Are Zip Files of Chaos

An EPUB is a ZIP archive containing XHTML, CSS, images, and a few XML control files (like container.xml and the OPF manifest). In theory, it’s a clean format. In practice, real‑world EPUBs are a mess:

XHTML with invalid markup, unclosed tags, or missing namespace declarations.