Parsing and Rebuilding EPUB Files in Python: Lessons Learned from Building an AI Translation Service

How we extract, translate, and reconstruct entire ebooks with Python while preserving every detail

At LectuLibre, we built a service that translates entire books using large language models. Our users upload EPUB files, and our backend pipeline parses them, extracts the text, sends it to an LLM for translation, and then rebuilds the EPUB with the translated content—all while preserving the original formatting, images, and metadata. This sounded straightforward until we looked inside a real EPUB.

EPUB is essentially a ZIP file containing a structured set of XHTML, CSS, and XML files. The content.opf file defines the reading order (spine), metadata, and manifest. The toc.ncx holds the table of contents. The actual text lives in XHTML documents, often split per chapter. To translate a book, we needed to: 1) reliably parse the EPUB, 2) locate all translatable text, 3) send it chunk by chunk to the LLM, and 4) rebuild the EPUB with the translated text while keeping every byte of the formatting intact.

The Problem with Off-the-Shelf Libraries

We initially reached for ebooklib, the most popular Python library for EPUB manipulation. It worked great for simple EPUBs—until we threw a few hundred real-world files at it. We quickly hit issues:

How we extract, translate, and reconstruct entire ebooks with Python while preserving every detail

The Problem with Off-the-Shelf Libraries

Parsing and Rebuilding EPUB Files in Python: Lessons Learned from Building an AI Translation Service

Parsing and Rebuilding EPUB Files in Python: Lessons Learned from Building an AI Translation Service

Other newsrooms on this story

Related reading

Parsing and Rebuilding EPUB Files in Python: Lessons Learned

How We Built a Robust EPUB Parsing and Rebuilding Pipeline in Python

How We Translate Entire Books with LLMs Without Losing Context

Building a Multilingual News App with AI Translation

How to Translate Large Documents with an AI Translation API

The Developer’s Guide to Translating Foreign PDFs (Text, OCR, and AI Workflows)

Other newsrooms on this story

Related reading

Parsing and Rebuilding EPUB Files in Python: Lessons Learned

How We Built a Robust EPUB Parsing and Rebuilding Pipeline in Python

How We Translate Entire Books with LLMs Without Losing Context

Building a Multilingual News App with AI Translation

How to Translate Large Documents with an AI Translation API

The Developer’s Guide to Translating Foreign PDFs (Text, OCR, and AI Workflows)