
Integrate Apache Airflow with AWS


Kia Akrami - February 24, 2021

Design and implement a complete Machine Learning workflow with Amazon SageMaker

This blog gives an overview of integrating Apache Airflow with AWS and presents an architecture to build, manage, and orchestrate machine learning workflows using Amazon SageMaker.

Motivation

One of the challenges we faced in a client project was finding the optimal way to run a training job on a GPU-backed EC2 instance (a p3.2xlarge) and to shut the instance down automatically afterwards. For this problem, I propose designing a CloudFormation stack and using Amazon SageMaker together with Apache Airflow. Once the training job is finished, the EC2 instance is shut down automatically to reduce costs.

The proposed system has four main parts:

  • A CloudFormation stack which contains the required resources, e.g., S3 bucket, EC2 instance, etc.
  • A CloudFormation stack which integrates Airflow with AWS
  • Airflow DAG Python Script that integrates and orchestrates all the ML tasks in a ML workflow
  • A Jupyter Notebook in SageMaker to understand the individual ML tasks in detail, such as data exploration, data preparation, and model training/tuning for a classification problem.

As the volume and complexity of our data processing pipelines increase, we can employ Airflow to simplify the overall process by decomposing it into a series of smaller tasks and coordinating the execution of these tasks as part of a workflow. With Airflow, we can manage workflows as scripts, monitor them via the user interface (UI), and extend their functionality through a set of powerful plugins.

Before we start explaining the system, we need to briefly explain what Apache Airflow is and its pros and cons.

Apache Airflow

Apache Airflow is a way to programmatically author, schedule, and monitor data pipelines. A key difference between Airflow and other orchestrators is that data pipelines are defined as code and tasks are instantiated dynamically.
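
To illustrate the pipelines-as-code idea, here is a minimal, hypothetical DAG in which tasks are instantiated dynamically from a plain Python list; the DAG id, schedule, and task names are made up for illustration.

```python
# Minimal illustrative DAG: tasks are generated dynamically from a Python list.
# (Airflow 2.x import paths; dag_id, schedule and task names are placeholders.)
from datetime import datetime

from airflow import DAG
from airflow.models.baseoperator import chain
from airflow.operators.python import PythonOperator


def process(partition: str) -> None:
    print(f"processing partition {partition}")


with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # One task per partition, created from ordinary Python data structures.
    tasks = [
        PythonOperator(
            task_id=f"process_{partition}",
            python_callable=process,
            op_kwargs={"partition": partition},
        )
        for partition in ("users", "orders", "events")
    ]
    # Dependencies are expressed in code as well; here the tasks run sequentially.
    chain(*tasks)
```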

The core components of Airflow:

  • The web server, a Flask application served by Gunicorn, is in charge of serving the UI dashboard.
  • The scheduler, a daemon built with the Python Daemon library, is responsible for scheduling the data pipelines.
  • The metadata database stores all metadata used by Airflow, such as user profiles and information about the DAGs (Directed Acyclic Graphs). Since Airflow interacts with its metadata through the SQLAlchemy library, we are free to use any database backend supported by SQLAlchemy, such as MySQL, Oracle, or Postgres.
  • The executor determines how the tasks of our data pipeline should be executed.
  • Finally, depending on the executor used, we have the notion of a worker: a process, or a worker node in our cluster, that executes our tasks.

The diagram below shows a complete lifecycle of an Airflow task.

Pros and Cons of Airflow 

Pros
1. Has a large and active open source community
2. Its UI is arguably the best among the ETL platforms on the market
3. It is open source, so it works with any cloud provider
4. It integrates well with a wide variety of systems

Cons
1. Single point of failure for the scheduler
2. AWS does not provide a native managed service for Airflow, so it has to be implemented and maintained by the engineers

Overall, Airflow is a very reliable orchestration system that integrates well with various cloud platforms, namely AWS, Azure, etc.

AWS-Airflow Integration
The integration process has four steps:
1. Setting up network and security
2. Storage
3. Cluster
4. Service layer

Setting up Network and Security
The first step is to set up a VPC (Virtual Private Cloud) in AWS to host the other resources, for example EC2 and RDS instances. In order to do that, we need to set up further network and security components such as public and private subnets, NAT gateways, etc.
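
For illustration, such a stack could also be launched programmatically. The boto3 sketch below assumes a hypothetical network.yaml template, stack name, region, and parameters, and is only meant to show the general shape of the call.

```python
# Hypothetical deployment of the network/security CloudFormation stack with boto3.
# Template path, stack name, region and parameters are placeholders.
import boto3

cfn = boto3.client("cloudformation", region_name="eu-central-1")

with open("network.yaml") as f:
    template_body = f.read()

cfn.create_stack(
    StackName="airflow-network",
    TemplateBody=template_body,
    Parameters=[
        {"ParameterKey": "VpcCidr", "ParameterValue": "10.0.0.0/16"},
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],  # required if the template creates IAM resources
)

# Block until the VPC, subnets and NAT gateways are available.
cfn.get_waiter("stack_create_complete").wait(StackName="airflow-network")
```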

Storage
Once we have the network and security layer, we can deploy the storage layer. The storage layer can vary from installation to installation; the most common recommendation, though, is Postgres/MySQL for the Airflow metadata backend and Redis for the Celery backend. These two components can be defined in one CloudFormation stack or, to be more granular, in a separate stack each.

To avoid using plain-text passwords in the templates, there is built-in support for AWS Secrets Manager: we just have to store the passwords in Secrets Manager and reference the corresponding path.
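
Besides referencing secrets from the templates, the same secret can also be fetched at runtime. The sketch below is only an illustration of that pattern: a container entrypoint pulls the database password from Secrets Manager with boto3 and exposes the storage settings to Airflow through its standard configuration environment variables. The secret name, hostnames, and database name are placeholders.

```python
# Illustrative only: fetch the RDS password from AWS Secrets Manager and expose
# the Airflow metadata DB and Celery broker settings as environment variables.
# Secret name, hostnames and database name below are placeholders.
import json
import os

import boto3

secrets = boto3.client("secretsmanager", region_name="eu-central-1")
secret = json.loads(
    secrets.get_secret_value(SecretId="airflow/metadata-db")["SecretString"]
)

os.environ["AIRFLOW__CORE__SQL_ALCHEMY_CONN"] = (
    f"postgresql+psycopg2://{secret['username']}:{secret['password']}"
    "@airflow-db.example.internal:5432/airflow"
)
os.environ["AIRFLOW__CELERY__BROKER_URL"] = "redis://airflow-redis.example.internal:6379/0"
```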

Cluster
The cluster layer handles the basic configuration and the components we need in order to run services on top of it. First of all, this layer can contain the bastion host, which we can later use as a jump host to reach port 8080 (the default Airflow UI) and port 5555 (the default Flower UI for Celery).
The cluster object itself is easy to create, unlike the launch configuration and the auto-scaling setup for the EC2 instances.
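
As a side note, here is a hedged sketch of reaching the Airflow UI from a local machine through the bastion. It uses the third-party sshtunnel package, which is an assumption on my side (a plain ssh -L port forward works just as well); host names and the key path are placeholders.

```python
# Illustrative local port forward through the bastion host to the Airflow UI.
# Requires the third-party `sshtunnel` package; hosts and key path are placeholders.
from sshtunnel import SSHTunnelForwarder

with SSHTunnelForwarder(
    ("bastion.example.com", 22),
    ssh_username="ec2-user",
    ssh_pkey="/home/me/.ssh/airflow-bastion.pem",
    remote_bind_address=("airflow-webserver.internal", 8080),
    local_bind_address=("127.0.0.1", 8080),
):
    # While the tunnel is open, the Airflow UI is reachable at http://localhost:8080
    input("Tunnel open, press Enter to close...")
```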

Service layer
The last layer configures four different components: the Flower service (for Celery), the worker, the scheduler, and the web server. In each TaskDefinition resource we specify the environment variables, the Docker CPU and memory resources, and the logging configuration. Every service uses basically the same Docker image, but each Airflow service is started with a different initial script.
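
As a rough illustration of what one of these task definitions amounts to, the boto3 sketch below registers a hypothetical definition for the scheduler service; in the actual setup this lives in the CloudFormation template, and the image URI, log group, region, and resource sizes here are placeholders.

```python
# Illustrative ECS task definition for the Airflow scheduler service.
# The same image is reused for web server, worker and Flower; only the command differs.
# Image URI, log group, region and resource sizes are placeholders.
import boto3

ecs = boto3.client("ecs", region_name="eu-central-1")

ecs.register_task_definition(
    family="airflow-scheduler",
    requiresCompatibilities=["EC2"],
    containerDefinitions=[
        {
            "name": "scheduler",
            "image": "123456789012.dkr.ecr.eu-central-1.amazonaws.com/airflow:latest",
            "cpu": 512,
            "memory": 1024,
            # e.g. ["webserver"], ["worker"] or ["flower"] for the other services
            "command": ["scheduler"],
            "environment": [
                {"name": "AIRFLOW__CORE__EXECUTOR", "value": "CeleryExecutor"},
            ],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/ecs/airflow",
                    "awslogs-region": "eu-central-1",
                    "awslogs-stream-prefix": "scheduler",
                },
            },
        }
    ],
)
```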

The designed CloudFormation stack in a nutshell

The stack is composed mainly of three services: the Airflow web server, the Airflow scheduler, and the Airflow worker. Supporting resources include an RDS instance to host the Airflow metadata database, an SQS queue to be used as the broker backend, S3 buckets for logs and deployment bundles, an EFS file system to serve as a shared directory, and a custom CloudWatch metric reported by a scheduled AWS Lambda function.

Deployment
The deployment process through CodeDeploy is very flexible and can be adapted to each project's structure, the only invariant being the Airflow home directory at airflow. It ensures that every Airflow process has the same files and can be upgraded gracefully; most importantly, it makes deployments fast and easy.

Amazon SageMaker
Amazon SageMaker is a modular, fully managed machine learning service that enables developers and data scientists to build, train, and deploy ML models at scale. It is a powerful service that helps us manage our machine learning ecosystem, and it lets us use Jupyter notebooks to write, develop, and test code within the platform.

We employed SageMaker and Airflow operators to design a CloudFormation stack that runs an ETL job and then feeds the data into our machine learning algorithm for prediction. The figure below depicts the workflow we use for preparing the training data, building models, and finally making predictions using Airflow Python and SageMaker operators.

 The workflow
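
As a hedged sketch of what the training step of such a DAG can look like, the snippet below uses the SageMakerTrainingOperator from the Amazon provider package; the import path varies with the Airflow/provider version, and the role ARN, image URI, bucket names, and hyperparameters are placeholders rather than the values from the client project.

```python
# Illustrative training step of the ML workflow. Import paths depend on the
# Airflow / Amazon provider version; all ARNs, URIs, buckets, instance types
# and hyperparameters below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.sagemaker import SageMakerTrainingOperator

training_config = {
    # In practice the job name would be templated so every run is unique.
    "TrainingJobName": "classification-training-001",
    "AlgorithmSpecification": {
        "TrainingImage": "123456789012.dkr.ecr.eu-central-1.amazonaws.com/xgboost:latest",
        "TrainingInputMode": "File",
    },
    "RoleArn": "arn:aws:iam::123456789012:role/sagemaker-execution-role",
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://my-ml-bucket/prepared/train/",
                }
            },
        }
    ],
    "OutputDataConfig": {"S3OutputPath": "s3://my-ml-bucket/models/"},
    "ResourceConfig": {
        # The GPU instance is released automatically when the job finishes.
        "InstanceType": "ml.p3.2xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 50,
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    "HyperParameters": {"num_round": "100", "objective": "binary:logistic"},
}

with DAG(
    dag_id="sagemaker_ml_workflow",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    train_model = SageMakerTrainingOperator(
        task_id="train_model",
        config=training_config,
        wait_for_completion=True,
    )
```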
