Integration of sos report in Incident Management Pipelines.
Monitoring and observability tools (Grafana, Prometheus, traces, logs) tell you that something is wrong and where. They do not tell you what the host operating system was doing at that moment: which processes were consuming memory, what the kernel OOM killer decided, whether a filesystem was suffering I/O contention, what the block device queue looked like, what firewall rules were in effect. That data lives on the node, is often ephemeral, and disappears or changes as the system recovers.
The purpose of integrating the widely available open-source sos report Linux command into the pipeline is to capture that OS-level snapshot automatically, at the moment of the alert, before the evidence degrades, without requiring a human to log in to the node and collect it manually.
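As a rough illustration of what "automatically, at the moment of the alert" could look like, here is a minimal Python sketch. It assumes the alerting system can trigger a script on the affected node and that sos 4.x is installed there. The function names, the default directory and the timeout are hypothetical choices for this example, not part of any particular pipeline. The flags used (--batch, --tmp-dir, --label, --only-plugins) are standard sos report options.

```python
import re
import subprocess
from typing import List, Optional


def build_sos_command(
    alert_name: str,
    tmp_dir: str = "/var/tmp",            # assumed output location
    plugins: Optional[List[str]] = None,  # optional subset of sos plugins
) -> List[str]:
    """Build a non-interactive `sos report` command line for one alert."""
    # Reduce the alert name to characters that are safe in an archive label.
    label = re.sub(r"[^A-Za-z0-9_-]+", "-", alert_name).strip("-") or "alert"
    cmd = ["sos", "report", "--batch", "--tmp-dir", tmp_dir, "--label", label]
    if plugins:
        # Limiting plugins shortens collection time on a struggling node.
        cmd += ["--only-plugins", ",".join(plugins)]
    return cmd


def capture_snapshot(
    alert_name: str,
    plugins: Optional[List[str]] = None,
    timeout_s: int = 900,                 # assumed upper bound for collection
) -> subprocess.CompletedProcess:
    """Run sos report on the local node (normally requires root)."""
    return subprocess.run(
        build_sos_command(alert_name, plugins=plugins),
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=False,  # let the caller inspect returncode and stderr
    )
```

A webhook receiver or remediation hook could call capture_snapshot("HostOOM kill", plugins=["memory", "kernel"]) when the alert fires and attach the resulting archive to the incident. Restricting plugins is a trade-off: collection is faster, but less of the node state is preserved.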
More specifically, it achieves four things:
Speed of diagnosis. The data is already collected and analysed by the time the SRE opens the alert. They review findings instead of gathering evidence.






