Integration of sos report in Incident Management Pipelines.

Monitoring and observability tools (Grafana, Prometheus, traces, logs) tell you that something is wrong and where. They do not tell you what the host operating system was doing at that moment: which processes were consuming memory, what the kernel OOM killer decided, whether a filesystem was having an I/O contention problem, what the block device queue looked like, what firewall rules were in effect. That data lives on the node, is often ephemeral, and disappears or changes as the system recovers.

The purpose of integrating the widely available open-source sos report Linux command into the pipeline is to capture that OS-level snapshot automatically, at the moment of the alert, before the evidence degrades, without requiring a human to log into the node and collect it manually.
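
As a rough illustration of what "automatically, at the moment of the alert" could look like, here is a minimal Python sketch. It assumes the alert arrives as a dictionary with a `labels` map containing `alertname` (the shape Prometheus Alertmanager uses in its webhook payloads), and that the script runs as root on the affected node. The function names and the output directory are hypothetical; only the `sos report` options `--batch`, `--tmp-dir` and `--label` are taken from the tool itself.

```python
import re
import subprocess
from pathlib import Path


def build_sos_command(alert: dict, out_dir: str = "/var/tmp/sos-incidents") -> list[str]:
    """Build a non-interactive `sos report` command line for one alert.

    `alert` is assumed to look like an Alertmanager alert object,
    i.e. {"labels": {"alertname": "..."}}. The default out_dir is hypothetical.
    """
    alert_name = alert.get("labels", {}).get("alertname", "unknown")
    # Keep the label to characters that are safe in an archive file name.
    label = re.sub(r"[^A-Za-z0-9_-]", "-", alert_name)
    return [
        "sos", "report",
        "--batch",             # never prompt; needed for unattended runs
        "--tmp-dir", out_dir,  # directory where the archive is written
        "--label", label,      # embedded in the archive name
    ]


def capture_snapshot(alert: dict, out_dir: str = "/var/tmp/sos-incidents") -> int:
    """Run sos report on the local node and return its exit code."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    result = subprocess.run(build_sos_command(alert, out_dir), check=False)
    return result.returncode
```

In a real pipeline, something like `capture_snapshot` would be called by whatever receives the alert (a webhook receiver, an automation runner, a remote-execution job), and the resulting archive would then be attached to the incident. How the call reaches the node is outside the scope of this sketch.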

More specifically, it achieves four things:

Speed of diagnosis. The data is already collected and analysed by the time the SRE opens the alert. They review findings instead of gathering evidence.