System Design Interview: Decentralized Web Crawler

How to design a distributed web crawler using a DHT ring. Covers consistent hashing, finger table routing, node joins and failures, and storage.

Monday, May 25, 2026

TL;DR

Consistent hashing assigns each URL to a single owner node across 10,000+ machines, so no central coordinator is needed. Accepting rare duplicates removes the cross-node coordination overhead. That is the core trade-off for a fault-tolerant crawler with no single point of failure (SPOF).

3,842 words · ~17 min read

Understand the problem

What we're building. A web crawler that runs across independent nodes with no central component. In a distributed crawler, many machines work together, but they still share infrastructure: a common URL queue, a common scheduler, a common database that tracks what has been crawled. In a decentralized crawler, none of that shared infrastructure exists. Each node runs independently, and the nodes have to agree on who crawls what without anyone coordinating them.
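To make "agreeing on who crawls what without a coordinator" concrete, here is a minimal sketch of consistent hashing in Python. It assumes every node already knows the same membership list. The `HashRing` class, the `ring_position` helper, the node names, and the choice of SHA-256 truncated to 64 bits are all illustrative assumptions, not part of the design described here.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a position on a 2**64 ring (assumed: SHA-256, first 8 bytes)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Hypothetical helper: maps each URL to the node that owns it."""

    def __init__(self, node_ids):
        if not node_ids:
            raise ValueError("a ring needs at least one node")
        # Place every node on the ring and keep the ring sorted by position.
        self._ring = sorted((ring_position(n), n) for n in node_ids)
        self._positions = [pos for pos, _ in self._ring]

    def owner(self, url: str) -> str:
        # The owner is the first node at or after the URL's position,
        # wrapping around to the first node at the end of the ring.
        idx = bisect.bisect_left(self._positions, ring_position(url))
        return self._ring[idx % len(self._ring)][1]


if __name__ == "__main__":
    ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
    print(ring.owner("https://example.com/index.html"))
```

Because the owner depends only on the hash function and the membership list, any two nodes holding the same list compute the same owner for a URL without exchanging messages. Removing a node reassigns only the URLs that node owned; every other URL keeps its owner.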

Functional requirements

Core:

Start from seed URLs and crawl reachable pages.

Related reading

Centralized vs. Decentralized: Why Modern Collaborative Tools choose CRDTs

A Decade After: Why We Still Can't Get the Treasure Hunt Engine Right

Part 1: Taming Asynchronous JavaScript: How to Build a "Mailbox" Queue

Designing Configuration for Scalable Treasure Hunts

Avoiding the Great Treasure Hunt Stall of 2025: What I Learned from Building a…

Veltrix's Treasure Hunt Engine: Optimized for Long-Term Survival, Not Just…
