Or: how a "this host is dead" verdict from a single net.LookupHost call quietly broke our crawler, and what we did about it.
The setup
We run a crawler that fetches tens of thousands of corporate websites a day from a datacenter. Before we spend any budget on a fetch (the actual HTTP request, the residential proxy hop, the S3 upload), we run a cheap reachability gate. The gate has one job: answer the question "is it even worth trying to fetch this host from here?"
The first version of that gate was the obvious thing: resolve the host. If DNS returns an IP, the host exists. If it doesn't, mark the URL dead and move on.
That gate was wrong often enough to matter. This is the story of the four ways it was wrong, and the gate we ended up with.






