Back to Articles

Domain-specific encoders like medical ones are mostly pretrained on small, hand-curated corpora. What if we built one from the heterogeneous web instead?

TL;DR

For decoder large language models (LLMs), pretraining data curation is now widely studied: model-based filters score documents for signals such as educational quality, and an LLM rephrases them into forms with greater training utility. Most domain-specific encoders have not followed this shift. Their corpora are assembled by hand from a small number of canonical in-domain sources, which caps their scale and diversity. The bottleneck is more severe outside English.

We adapt this curation to encoder pretraining for medicine, a domain where text is dense with specialized terminology and factual correctness is critical. The recipe combines two complementary levers: medical-term density filtering selects documents rich in medical terms, and signal-amplifying rephrasing uses an LLM to rewrite documents into denser variants with broader entity contexts.