The Problem We Were Actually Solving
Our Treasure Hunt Engine indexes 1.2 TB of JSON event logs from Veltrix operators, then answers sub-second queries like "give me all log lines where error_code = E499 between 2026-05-01T00:00 and 2026-05-07T23:59". It's a classic inverted-index workload. The first version was a Spring Boot monolith running on OpenJDK 21 with G1GC and embedded Lucene, on 24 vCPUs and 64 GB of RAM. We picked it because it was the default stack in 2024.
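To make "inverted-index workload" concrete, here is a minimal sketch of that query shape in plain Java. It is not how Lucene stores postings and not our production code: the class TinyInvertedIndex, its methods, the three sample log lines, and the choice to read the timestamps as UTC are all assumptions made for illustration.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical illustration only; Lucene's real postings format is far more compact.
public class TinyInvertedIndex {
    // term ("field=value") -> event time in epoch millis -> ids of log lines at that instant
    private final Map<String, NavigableMap<Long, List<Integer>>> postings = new HashMap<>();

    /** Index one log line under a single field/value pair. */
    public void add(int docId, String field, String value, Instant eventTime) {
        postings.computeIfAbsent(field + "=" + value, k -> new TreeMap<>())
                .computeIfAbsent(eventTime.toEpochMilli(), k -> new ArrayList<>())
                .add(docId);
    }

    /** Ids of log lines with field=value and from <= eventTime <= to, in time order. */
    public List<Integer> query(String field, String value, Instant from, Instant to) {
        NavigableMap<Long, List<Integer>> byTime = postings.get(field + "=" + value);
        List<Integer> hits = new ArrayList<>();
        if (byTime == null) {
            return hits; // term never indexed
        }
        // subMap with both bounds inclusive walks only the requested time window
        for (List<Integer> ids
                : byTime.subMap(from.toEpochMilli(), true, to.toEpochMilli(), true).values()) {
            hits.addAll(ids);
        }
        return hits;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "error_code", "E499", Instant.parse("2026-05-03T12:00:00Z"));
        index.add(2, "error_code", "E499", Instant.parse("2026-05-09T08:30:00Z")); // outside window
        index.add(3, "error_code", "E200", Instant.parse("2026-05-04T09:15:00Z")); // different code
        List<Integer> hits = index.query("error_code", "E499",
                Instant.parse("2026-05-01T00:00:00Z"), Instant.parse("2026-05-07T23:59:00Z"));
        System.out.println(hits); // prints [1]
    }
}
```

The point of the sketch is that the lookup touches one term's postings and one slice of its time range, which is why this kind of query can be sub-second even over a large corpus.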
After three months of steady ingestion, the p99 latency crept past 400 ms. Then 800 ms. Then we hit the 1-second cliff. I ran jstack five times within a minute and collected 48 thread dumps. The histogram from async-profiler showed 76% of the CPU pinned inside sun.misc.Unsafe.park. Not in Lucene, not in Spring, but inside the OS scheduler waiting for a safepoint.
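If attaching jstack to the process is awkward, a rough in-process equivalent is to ask the JVM for a thread dump through the standard ThreadMXBean and count threads per state. This is a sketch of the idea, not the tooling we actually used: the class name ThreadStateHistogram and the five-samples-one-second-apart loop are assumptions, and it only shows thread states, not the stack frames a profiler gives you.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper: a coarse stand-in for eyeballing repeated jstack output.
public class ThreadStateHistogram {
    /** Take one in-process thread dump and count live threads per state. */
    public static Map<Thread.State, Integer> snapshot() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // false, false: skip locked-monitor and locked-synchronizer details to keep the dump cheap
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo info : infos) {
            if (info != null) { // defensive: ignore a thread that died mid-dump
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws InterruptedException {
        // Sample a few times; a large, stable WAITING/TIMED_WAITING count hints at parked workers.
        for (int i = 0; i < 5; i++) {
            System.out.println(snapshot());
            Thread.sleep(1000);
        }
    }
}
```

A histogram like this cannot tell you why threads are parked, so it complements a profiler rather than replacing it.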
The latency wasn't GC. It wasn't the network. It was the JVM runtime itself forcing every safepoint operation to synchronize all 24 worker threads so it could stop the world and count handles. At that scale, stopping the world for 30 ms every 10 seconds made the p99 explode.
What We Tried First (And Why It Failed)






