Extract text from documents and images with Datalab Marker and OCR – Replicate blog

Turn whole documents into markdown or grab line-level polygons with two new models from Datalab.



Posted October 21, 2025 by andreasjansson

Datalab’s state-of-the-art document parsing and text extraction models are now on Replicate.

Marker turns PDF, DOCX, PPTX, images (and more!) into markdown or JSON. It formats tables, math, and code, extracts images, and can pull specific fields when you pass a JSON Schema.
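To make the field-extraction idea concrete, here is a minimal sketch of a JSON Schema you might pass to Marker. The invoice fields are purely illustrative, and whether the model expects the schema as a JSON string or as an object is an assumption to check against the model's input documentation.

```python
import json

# Hypothetical schema describing the fields to pull from an invoice.
# The field names below are illustrative and do not come from the Marker docs.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "due_date": {"type": "string", "format": "date"},
    },
    "required": ["invoice_number", "total"],
}

# Serialize the schema to a string (assumption: the model accepts JSON text).
schema_json = json.dumps(invoice_schema)
```

Keeping the schema small and listing only the fields that must be present in `required` makes it easier to see which values the model could not find.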

OCR detects text in ninety languages from images and documents, and returns reading order and table grids.
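The post does not spell out the shape of the OCR output, so the following is only a sketch under an assumption: each detected line carries its text and a polygon given as four (x, y) points. With that hypothetical shape, a line polygon can be reduced to an axis-aligned bounding box like this:

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]


def polygon_to_bbox(polygon: List[Point]) -> Box:
    """Return (x_min, y_min, x_max, y_max) for a polygon of (x, y) points."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical OCR lines; the "text" and "polygon" keys are assumptions.
lines = [
    {"text": "Total: 42.00", "polygon": [(10, 50), (120, 52), (120, 70), (10, 68)]},
    {"text": "Invoice #7", "polygon": [(10, 10), (100, 10), (100, 28), (10, 28)]},
]

boxes = [polygon_to_bbox(line["polygon"]) for line in lines]
```

Bounding boxes lose the slight skew a polygon can capture, so keep the original polygons if you need to crop rotated text precisely.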

The Marker model is based on the popular open-source Marker project (29k GitHub stars), and OCR is based on Surya (19k GitHub stars).

Run Marker and OCR on Replicate:
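As a minimal sketch, the models could be called from the Replicate Python client roughly as follows. The model identifiers and the `file` input name are assumptions, so confirm them on each model's page before relying on them. The client reads your API token from the `REPLICATE_API_TOKEN` environment variable.

```python
from typing import Any, Dict, Optional

# Assumed model identifiers; verify the exact names on Replicate.
MARKER_MODEL = "datalab-to/marker"
OCR_MODEL = "datalab-to/ocr"


def build_input(file_url: str, extra: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build the input payload. The "file" key is an assumed parameter name."""
    payload: Dict[str, Any] = {"file": file_url}
    if extra:
        payload.update(extra)
    return payload


def run_model(model: str, file_url: str, extra: Optional[Dict[str, Any]] = None) -> Any:
    """Run a model on Replicate and return its output."""
    # Imported lazily so this module loads even without the package installed.
    import replicate  # pip install replicate

    return replicate.run(model, input=build_input(file_url, extra))


# Example call (needs network access and REPLICATE_API_TOKEN):
# output = run_model(MARKER_MODEL, "https://example.com/sample.pdf")
```

Separating `build_input` from the network call lets you check the payload locally before spending any compute.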

