With the introduction of the Airflow Kubernetes Executor, my company is considering adopting Airflow as our primary ETL framework. I’ve read countless articles online, all describing slight variations on deploying Airflow to a Kubernetes cluster in the cloud. I’m hoping that someone with experience deploying Airflow on Kubernetes can clear up some of my confusion.
In many of the articles I’ve read, the Airflow configuration in Kubernetes usually looks something like this:
- A Kubernetes pod runs the Airflow web server and a separate pod runs the Airflow Scheduler
- DAGs get synced from a Git repo to file storage (e.g. Google Cloud Filestore) on an interval by a Kubernetes CronJob
- The Airflow web server, Airflow scheduler, and any spun-up Airflow workers have access to persistent volume claims (PVCs) for synced DAGs and log files.
Alternatively, some configurations sync the DAGs locally to a pod when the container first starts, or store logs in a bucket.
I’m not sure whether this is how production environments are normally run in practice.
1. For a production Airflow Kubernetes deployment, what would be the recommended code repositories? Should DAG definition and DAG business logic be split into separate repos?
I’m guessing that it’s best practice to separate the DAG configuration, DAG business logic/dependencies, and Airflow container. Consequently, I’m envisioning the following:
- A repo containing the Airflow Docker image
- A repo containing the Airflow DAG definitions
- A repo containing the Airflow DAG Tasks (business logic)
In this setup, the Airflow DAGs would be git-synced onto the pod or accessible via a PVC. The actual business logic for each task in that workflow would live in a separate repo that contains a Dockerfile. An image built from that repo would run as a sidecar container on the worker pod, and the individual DAG tasks would then make entry-point/executable calls to it.
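To make this more concrete, here is a rough sketch of what I’m picturing, written as plain Python with no Airflow or Kubernetes imports so it runs anywhere. It simplifies my sidecar idea down to one container per task pod, each built from its own business-logic repo. The image names, task names, and the `pod_manifest` helper are all hypothetical placeholders, not anything taken from the Airflow API.

```python
import json
from typing import Dict, List

# Hypothetical images: one per business-logic repo, each with its own dependencies.
TASK_IMAGES: Dict[str, str] = {
    "extract_orders": "registry.example.com/etl-extract:1.0",
    "train_model": "registry.example.com/ml-train:2.3",
}


def pod_manifest(task_id: str, image: str, command: List[str]) -> dict:
    """Build a minimal Kubernetes Pod manifest (as a dict) for a single task."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            # Pod names cannot contain underscores, so swap them for hyphens.
            "name": task_id.replace("_", "-"),
            "labels": {"task_id": task_id},
        },
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {"name": "base", "image": image, "command": command}
            ],
        },
    }


# One manifest per task; each task calls an entry point inside its own image.
manifests = {
    task_id: pod_manifest(task_id, image, ["python", "-m", "tasks." + task_id])
    for task_id, image in TASK_IMAGES.items()
}

if __name__ == "__main__":
    print(json.dumps(manifests["extract_orders"], indent=2))
```

If this is roughly right, the two tasks never share a Python environment, which is what I assume is meant by tasks safely having conflicting dependencies.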
2. The Kubernetes Executor is advertised as letting different workflows within a given DAG or Airflow deployment safely have conflicting dependencies. How is this achieved in practice? By containerizing DAGs or tasks?
3. What is the recommended Airflow Docker image to use as a starting point with Airflow 2?