1. Introduction

In the early days of software engineering, determining whether an application was functioning properly was relatively straightforward. A single monolithic server ran on a physical machine, and developers could log in to that server remotely, inspect a plain-text log file, and check CPU or memory usage with basic operating system commands. If a service went down, it was usually because the process had crashed, the disk had run out of space, or the database had become unreachable. The shift toward distributed systems, microservices, and dynamic cloud environments has removed this simplicity. Today, an application may be spread across hundreds or thousands of containers, its components communicate asynchronously across network boundaries, and transient failures are routine. In this landscape, finding the root cause of a failure with traditional methods, such as logging in to a single host and reading its logs, is akin to finding a needle in a haystack.

This is where observability comes into play. Observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. It is not merely a collection of software tools or dashboards; rather, it is a property of how the system is designed. An observable system allows operators to answer questions they did not anticipate when they wrote the code, so they can troubleshoot novel production problems without first deploying new instrumentation or hotfixes. Distributed systems fail in complex, often non-deterministic ways. Deep visibility into the execution path of requests and the health of system resources is therefore no longer a luxury; it is a fundamental requirement for maintaining reliable, high-performance software.