How we extract, translate, and reconstruct entire ebooks with Python while preserving every detail
At LectuLibre, we built a service that translates entire books using large language models. Our users upload EPUB files, and our backend pipeline parses them, extracts the text, sends it to an LLM for translation, and then rebuilds the EPUB with the translated content—all while preserving the original formatting, images, and metadata. This sounded straightforward until we looked inside a real EPUB.
EPUB is essentially a ZIP file containing a structured set of XHTML, CSS, and XML files. The content.opf file defines the reading order (spine), metadata, and manifest. The toc.ncx holds the table of contents. The actual text lives in XHTML documents, often split per chapter. To translate a book, we needed to: 1) reliably parse the EPUB, 2) locate all translatable text, 3) send it chunk by chunk to the LLM, and 4) rebuild the EPUB with the translated text while keeping every byte of the formatting intact.
The Problem with Off-the-Shelf Libraries
We initially reached for ebooklib, the most popular Python library for EPUB manipulation. It worked great for simple EPUBs—until we threw a few hundred real-world files at it. We quickly hit issues:









