Orchestration
Data-Continuum uses Apache Airflow as its central orchestration engine.
Airflow is responsible for scheduling, monitoring, and triggering the various data pipelines and training services within the ecosystem.
DAGs
data_continuum_pipeline
A daily scheduled pipeline that performs the following tasks:
1. Verify Connections: Ensures PostgreSQL and MongoDB are accessible.
2. Run PySpark Cleaning: Simulates a spark-submit job to clean and join data across the polyglot databases.
3. Trigger ML Training: Sends an HTTP request to the ML Service (http://api:8001/train) to automatically start a new round of predictive ETA training.