Author's Note: This article documents a production incident investigation and the technical findings that emerged from returning to foundational documentation. The fix was implemented by a colleague; this article captures the learning journey through proper documentation review.
The Incident: High CPU Due to Autovacuum Contention
A production Aurora PostgreSQL cluster experienced sustained CPU utilization between 85-90% over a 3-4 hour window. CloudWatch Performance Insights identified the primary wait event as CPU (not I/O or lock contention), and the top consuming operation was autovacuum VACUUM processes running on two large tables.
Observed State:
Table A (593 GB, 623M rows): 124 million dead tuples (16.6% dead ratio)








