Improving DAG Failure Detection in Airflow Using AI Techniques
Apache Airflow is a powerful tool for orchestrating ETL pipelines, but failure handling in large-scale environments remains largely reactive. Identifying root causes and detecting silent data issues still requires significant manual effort. In this article, we'll present an approach implemented in a production data platform to improve failure detection and diagnosis using a combination of large language models (LLMs), statistical methods, and traditional machine learning.
Log-Based Failure Classification
Airflow logs extensively, but analyzing those logs by hand is time-consuming and error-prone. We used a sequence-to-sequence LLM, trained on a dataset of labeled log samples, to classify log messages into categories such as INFO, WARNING, or ERROR.
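To make the text-to-text framing concrete, here is a minimal sketch of how a log line might be turned into a prompt and how the generated text might be mapped back to a label. It is an illustration under stated assumptions, not the production implementation: the prompt wording, the helper names (`build_prompt`, `parse_label`, `classify_logs`), the fallback label, and the `keyword_stub` stand-in for the model are all hypothetical. In practice, `generate` would wrap a call to the trained sequence-to-sequence model.

```python
from typing import Callable, Iterable, List

# Label set taken from the categories named in the article.
LABELS = ("INFO", "WARNING", "ERROR")


def build_prompt(log_line: str) -> str:
    """Format one log line as a text-to-text classification prompt (assumed wording)."""
    return (
        "Classify the following Airflow log line as one of "
        f"{', '.join(LABELS)}.\n"
        f"Log: {log_line.strip()}\n"
        "Label:"
    )


def parse_label(generated: str, default: str = "WARNING") -> str:
    """Map free-form model output to a known label, falling back to `default`."""
    for token in generated.strip().upper().split():
        cleaned = token.strip(".,:;!\"'")
        if cleaned in LABELS:
            return cleaned
    # Assumption: unparseable output is surfaced as WARNING rather than dropped.
    return default


def classify_logs(lines: Iterable[str], generate: Callable[[str], str]) -> List[str]:
    """Classify each log line using any prompt -> text callable."""
    return [parse_label(generate(build_prompt(line))) for line in lines]


def keyword_stub(prompt: str) -> str:
    """Hypothetical stand-in for the model so the sketch runs without one."""
    text = prompt.split("Log:", 1)[1].lower()
    if "traceback" in text or "error" in text:
        return "ERROR"
    if "retry" in text or "warn" in text:
        return "WARNING"
    return "INFO"


if __name__ == "__main__":
    sample = [
        "Task exited with return code 0",
        "Traceback (most recent call last):",
        "Marking task as UP_FOR_RETRY",
    ]
    for line, label in zip(sample, classify_logs(sample, keyword_stub)):
        print(f"{label:8} {line}")
```

Keeping the model behind a plain callable means the prompt construction and label parsing can be unit-tested without loading any weights, and the stub can be swapped for the real model call later.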
Model Architecture