If you've ever tried to extract text from a scanned Arabic document, you already know the pain. Most OCR tooling is built English-first. Arabic adds three problems on top:

Right-to-left (RTL) text that breaks naive layout assumptions.

Connected letters (ligatures) — the same letter changes shape depending on its position in the word.

Diacritics and a different numeral set that generic models drop or mangle.

The result: you run a scanned Arabic contract, invoice, or government form through a typical "PDF to text" tool and get back garbage — reversed words, missing letters, or nothing at all.