There is a particular kind of silence that falls over a team about three weeks after a senior engineer leaves. It is the silence of nobody knowing why.

Why is this flag never allowed to default to true in production? Why does this background job retry exactly twice and then stop? Why did we drop the vendor SDK and call their REST API by hand? These decisions made perfect sense at the time. Now they look insane, and nobody remembers the reason.

The person who held all of that knowledge walked out with it. What they left behind is a codebase full of decisions with the reasoning removed.

We call this a single point of failure when it is a database and treat it as an emergency. Somehow we treat it as normal when the single point of failure is a person's memory.

Documentation has not solved this