How We Built a Robust EPUB Parsing and Rebuilding Pipeline in Python

Dealing with broken markup, embedded fonts, and namespace chaos while building LectuLibre's translation engine

At LectuLibre, we needed to translate entire EPUB books while preserving their exact visual structure. The core challenge: parse the EPUB, extract all translatable text, send it to an LLM, then reassemble the book with the translated content while leaving images, CSS, fonts, and layout untouched. This turned out to be much harder than it looked. Here’s how we solved it, what broke, and what we learned.
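To make the "reassemble with everything else untouched" idea concrete, here is a minimal sketch of a round trip using only the standard library. It is illustrative rather than LectuLibre's actual code: `rebuild_epub` and the `translate_xhtml` callback are hypothetical names, and the callback stands in for the whole extract, translate, and reinsert step. The one format rule it relies on is that the `mimetype` entry must come first in the archive and be stored uncompressed.

```python
import io
import zipfile
from typing import Callable


def rebuild_epub(src: bytes, translate_xhtml: Callable[[str, bytes], bytes]) -> bytes:
    """Copy an EPUB entry by entry, rewriting only XHTML/HTML documents.

    `translate_xhtml(name, data)` is a hypothetical hook that returns the
    translated bytes for one content document.
    """
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(src)) as zin, zipfile.ZipFile(out, "w") as zout:
        # EPUB requires "mimetype" to be the first entry, stored uncompressed.
        zout.writestr(
            zipfile.ZipInfo("mimetype"),
            b"application/epub+zip",
            compress_type=zipfile.ZIP_STORED,
        )
        for info in zin.infolist():
            if info.filename == "mimetype" or info.is_dir():
                continue
            data = zin.read(info.filename)
            # Only content documents go through the translation hook;
            # images, CSS, and fonts are copied byte for byte.
            if info.filename.lower().endswith((".xhtml", ".html", ".htm")):
                data = translate_xhtml(info.filename, data)
            zout.writestr(info.filename, data, compress_type=zipfile.ZIP_DEFLATED)
    return out.getvalue()
```

Selecting content documents by file extension is a simplification made for this sketch; the OPF manifest is the more reliable source for which files are XHTML.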

The Problem: EPUBs Are Zip Files of Chaos

An EPUB is a ZIP archive containing XHTML, CSS, images, and a few XML control files (like container.xml and the OPF manifest). In theory, it’s a clean format. In practice, real‑world EPUBs are a mess:

XHTML with invalid markup, unclosed tags, or missing namespace declarations.
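As a rough illustration of both points, the sketch below opens an EPUB as a ZIP archive, follows `META-INF/container.xml` to the OPF manifest, and reports which XHTML documents fail a strict XML parse. It uses only the standard library; the function names are hypothetical, and the strict check is deliberately crude: it may also flag otherwise acceptable files that use named entities such as `&nbsp;`, which `xml.etree.ElementTree` does not resolve.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# Namespaces defined by the OCF container and OPF package formats.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def find_opf_path(zf: zipfile.ZipFile) -> str:
    """Return the archive path of the OPF file named in container.xml."""
    root = ET.fromstring(zf.read("META-INF/container.xml"))
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None:
        raise ValueError("container.xml has no <rootfile> element")
    return rootfile.attrib["full-path"]


def list_xhtml_documents(zf: zipfile.ZipFile) -> list:
    """List manifest items declared as XHTML, as paths inside the archive."""
    opf_path = find_opf_path(zf)
    base = posixpath.dirname(opf_path)
    opf = ET.fromstring(zf.read(opf_path))
    paths = []
    for item in opf.findall("opf:manifest/opf:item", OPF_NS):
        if item.get("media-type") == "application/xhtml+xml":
            # hrefs are relative to the OPF file; percent-decoding is
            # omitted here for brevity.
            paths.append(posixpath.normpath(posixpath.join(base, item.get("href"))))
    return paths


def find_malformed_documents(zf: zipfile.ZipFile) -> list:
    """Return the XHTML documents that a strict XML parser rejects."""
    bad = []
    for path in list_xhtml_documents(zf):
        try:
            ET.fromstring(zf.read(path))
        except ET.ParseError:
            bad.append(path)
    return bad
```

Files flagged this way are the ones that need a more forgiving parser or a repair pass before text extraction.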
