The Fastest Set Is Often Not a Set: 4050 Duplicate-Detection Benchmarks

Tuesday, 2 June 2026

TL;DR

4050 benchmarks show that duplicate-detection cost varies by as much as 90,000x. A bitset beats std::unordered_set by 94x on dense integers, fingerprinting wins by 2.7x on strings, and a sliding window dominates streaming workloads. For CTOs and architects: the choice of data structure, not the algorithm, drives order-of-magnitude gains, so profile the key distribution before defaulting to a hash set.

Duplicate detection looks solved: keep a hash set, skip what you have already seen. A benchmark suite of 4050 measurements across 480 workloads says the fastest strategy can be 94x faster than std::unordered_set, or 90,000x slower, depending on what you are deduplicating and what guarantees you need.
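
For reference, the "keep a hash set, skip what you have already seen" baseline can be sketched as follows. This is a minimal illustration, not the benchmark suite's code: the function name `dedupe_hash_set` is hypothetical, and preserving first-occurrence order is an assumption about what the caller wants.

```cpp
#include <unordered_set>
#include <vector>

// Hypothetical baseline helper (not from the benchmark suite).
// Returns the first occurrence of each item, in input order.
template <typename Key>
std::vector<Key> dedupe_hash_set(const std::vector<Key>& items) {
    std::unordered_set<Key> seen;
    seen.reserve(items.size());  // avoid rehashing while inserting
    std::vector<Key> out;
    for (const Key& item : items) {
        // insert() returns {iterator, bool}; the bool is true only for new keys.
        if (seen.insert(item).second) {
            out.push_back(item);
        }
    }
    return out;
}
```

Every call to `insert` here hashes the key and probes a bucket, which is the per-element work the alternatives in this article try to avoid.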

Dense integers are an array problem

When keys are dense, bounded 32-bit integers, a hash set wastes work: it hashes, probes buckets, and chases pointers. A bitset turns membership into one indexed bit. At one million uniform integers:

strategy | ns per insert
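
To make "one indexed bit" concrete, here is a minimal sketch of bitset-based membership for dense, bounded integer keys. The names `BitsetDeduper` and `dedupe_dense` and the caller-supplied `universe` bound are assumptions for illustration, not the code that was benchmarked.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the benchmark suite): tracks which keys in
// [0, universe) have been seen, using one bit per possible key.
class BitsetDeduper {
public:
    explicit BitsetDeduper(std::size_t universe)
        : universe_(universe), words_((universe + 63) / 64, 0) {}

    // Returns true on the first sighting of `key`, false for a duplicate.
    bool insert(std::uint32_t key) {
        assert(key < universe_ && "key outside the declared universe");
        std::uint64_t& word = words_[key >> 6];                     // key / 64
        const std::uint64_t mask = std::uint64_t{1} << (key & 63);  // key % 64
        const bool is_new = (word & mask) == 0;
        word |= mask;
        return is_new;
    }

private:
    std::size_t universe_;
    std::vector<std::uint64_t> words_;
};

// Keeps the first occurrence of each key, preserving input order.
std::vector<std::uint32_t> dedupe_dense(const std::vector<std::uint32_t>& keys,
                                        std::size_t universe) {
    BitsetDeduper seen(universe);
    std::vector<std::uint32_t> out;
    for (std::uint32_t key : keys) {
        if (seen.insert(key)) {
            out.push_back(key);
        }
    }
    return out;
}
```

There is no hashing, probing, or pointer chasing: membership is a shift, a mask, and one memory access. The trade-off is memory proportional to the key range rather than to the number of keys; covering the full 32-bit range takes 2^32 bits, or 512 MiB, so the approach only pays off when keys really are dense and bounded.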
