Airbnb initially developed Apache Airflow as an open-source platform to programmatically author, schedule, and monitor data pipelines and workflows. Created at Airbnb in 2014 to help manage the company's increasingly complex workflows, the project has been open source from the start. As a platform for programmatically authoring, scheduling, and monitoring workflows, Airflow provides features that help users define, create, schedule, execute, and monitor data workflows (Grzemski, 2020). In Airflow's model, a workflow is a sequence of tasks that process data, which helps the user build pipelines. Scheduling is the process of planning, controlling, and optimizing when tasks run, while authoring workflows in Airflow means writing Python scripts that generate directed acyclic graphs (DAGs). A DAG comprises the tasks a user wants to run, organized in a way that reflects their relationships and dependencies; a task is therefore a unit of work within the DAG (Grzemski, 2020). Apache Airflow is used on AWS in several scenarios, as described in this context.
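
To illustrate how authoring works in practice, the short sketch below defines a three-task DAG in Python, assuming a recent Airflow 2.x installation; the task names, schedule, and callables are illustrative rather than drawn from the cited sources.

```python
# A minimal sketch of an Airflow DAG, assuming a recent Airflow 2.x install.
# Task names, the schedule, and the callables below are purely illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull raw data from a source system")


def transform():
    print("clean and reshape the extracted data")


def load():
    print("write the result to a target store")


with DAG(
    dag_id="example_pipeline",        # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                # the scheduler starts one run per day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # The >> operator records the dependencies that form the directed acyclic graph.
    extract_task >> transform_task >> load_task
```

The `>>` operator expresses the dependencies that keep the graph acyclic: the load task cannot start until the transform task has succeeded.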

One scenario that demands the use of Apache Airflow is managing workflows. Alvarez-Parmar and Maisagoni (2021) argue that, as an open-source distributed workflow management platform, Airflow allows users to schedule, orchestrate, and monitor workflows. With Airflow, one can orchestrate and automate complex data pipelines. Airflow can run on AWS Fargate with Amazon Elastic Container Service (ECS) as the orchestrator, meaning the user does not have to provision or manage servers. Airflow is more than a batch processing platform, since it allows the user to develop pipelines that process data and run complex jobs in a distributed manner. In relation to managing workflows, with AWS Fargate a user can run the core components of Airflow without creating and managing servers; likewise, one does not need to guess the server capacity required to run the Airflow cluster or worry about Auto Scaling groups and bin packing to maximize resource utilization. Therefore, one practical situation that requires a user to run Airflow is managing workflows. Alvarez-Parmar and Maisagoni (2021) note that "Managed Workflows is a managed orchestration service for Apache Airflow that makes it easy for data engineers and data scientists to execute data processing workflows on AWS." Airflow helps users orchestrate workflows and manage how they are executed without configuring, managing, and scaling the Airflow architecture themselves. Users who run Airflow on AWS should consider Amazon Managed Workflows for Apache Airflow (MWAA), since it sets up Airflow, provisions and autoscales capacity (storage and compute), keeps Airflow up to date, and automates snapshots. Hence, for managing workflows, Airflow provides a reliable and scalable framework for users to orchestrate data workflows, enabling data engineers to extract, transform, and load data from different sources. Airflow's operators make integration with data systems such as cloud storage, databases, and data warehouses easier.
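
As a concrete, hedged example of such an integration, the sketch below shows a DAG that an MWAA environment could run to copy data from Amazon S3 into Amazon Redshift using the Amazon provider package's transfer operator; the bucket, prefix, schema, table, and connection IDs are assumptions, not values from Alvarez-Parmar and Maisagoni (2021).

```python
# Sketch of a DAG an MWAA environment could run: copy S3 objects into Redshift.
# The bucket, prefix, schema, table, and connection IDs are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator

with DAG(
    dag_id="s3_to_redshift_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,                           # triggered on demand
    catchup=False,
) as dag:
    load_orders = S3ToRedshiftOperator(
        task_id="load_orders",
        s3_bucket="my-data-lake-bucket",     # hypothetical bucket
        s3_key="raw/orders/",                # hypothetical prefix
        schema="analytics",
        table="orders",
        copy_options=["CSV", "IGNOREHEADER 1"],
        redshift_conn_id="redshift_default", # Airflow connection to the cluster
        aws_conn_id="aws_default",
    )
```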

In a related discussion on the practical use of Airflow to create on-demand or scheduled workflows that process complex data from different data providers, Oliveira and Raditchkov (2019) argue that Apache Airflow makes it easier to orchestrate big data workflows. Large companies that run big data ETL workflows on AWS operate at a scale where many internal end users and many concurrent pipelines must be served. Given the continuous need to extend and update the big data platform to keep up with the latest big data processing frameworks, what is required is an efficient architecture that simplifies big data platform management and provides easy access to big data applications. Using Airflow on AWS, centralized platform teams can maintain their big data platform, serve many concurrent ETL workflows, and simplify the operational tasks this requires. In this approach to managing workflows on AWS, the architecture combines Airflow with Genie and Amazon EMR. Amazon EMR provides the big data platform on which workflows are authored, orchestrated, and executed, while Genie offers a centralized REST API for big data job submission, central configuration management, dynamic job routing, and abstraction of the Amazon EMR clusters. In the overall process of workflow management, Airflow provides the job orchestration layer, allowing the user to programmatically author, schedule, and monitor complex data pipelines (Oliveira and Raditchkov, 2019). The role of Amazon EMR in the process is to offer a managed cluster platform that runs and scales Apache Spark, Hadoop, and other big data frameworks. The diagram below shows the use of Airflow on AWS for big data workflows, and a minimal code sketch of this orchestration follows the figure.

 

Airflow on AWS for big data workflow management (Oliveira and Raditchkov, 2019)
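
To make the orchestration role concrete, the following minimal sketch shows Airflow adding a Spark step to an existing EMR cluster and waiting for it to finish. It simplifies the architecture described by Oliveira and Raditchkov (2019) by calling EMR directly rather than through Genie, and it assumes the Amazon provider package is installed; the cluster ID, script path, and connection ID are placeholders.

```python
# Sketch: submit a Spark step to an existing EMR cluster and wait for it.
# The cluster ID, script location, and connection ID are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

SPARK_STEP = [
    {
        "Name": "process_clickstream",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/process.py"],
        },
    }
]

with DAG(
    dag_id="emr_step_example",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    add_step = EmrAddStepsOperator(
        task_id="add_step",
        job_flow_id="j-XXXXXXXXXXXXX",   # placeholder EMR cluster ID
        steps=SPARK_STEP,
        aws_conn_id="aws_default",
    )

    watch_step = EmrStepSensor(
        task_id="watch_step",
        job_flow_id="j-XXXXXXXXXXXXX",
        # The step ID returned by add_step is pulled from XCom at runtime.
        step_id="{{ task_instance.xcom_pull(task_ids='add_step')[0] }}",
        aws_conn_id="aws_default",
    )

    add_step >> watch_step
```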

Beyond using Airflow to support complex workflows, a different practical scenario that demands running Airflow on AWS is coordinating extract, transform, and load (ETL) jobs. Airflow is well suited to ETL jobs because it is built around the concept of operators, which represent the logical building blocks of ETL workflows (Sinha, 2021). To run an Airflow ETL job, a user needs an AWS account and Airflow installed on the system. An ETL job transforms raw data into usable datasets and, finally, into actionable insight. Anany (2018) argues that an ETL job reads data from various data sources and applies different transformations to the data before writing the results to a target, where the data becomes ready for consumption. The ETL sources and targets are relational databases and Amazon S3, which helps build a data lake on AWS. AWS provides AWS Glue as a service for authoring and deploying ETL jobs; AWS Glue "is a fully managed extract, transform, and load service that makes it easy for customers to prepare and load their data for analytics" (Anany, 2018).
In addition to AWS Glue, other AWS services that implement and manage ETL jobs include AWS Database Migration Service, Amazon Athena, and Amazon EMR. Therefore, in a practical scenario where one needs to orchestrate ETL jobs and workflows that span different ETL technologies, Airflow combined with AWS Glue and AWS DMS lets the user easily chain ETL jobs. A good scenario for orchestrating ETL jobs with Airflow on AWS is when a business user wants to answer questions that span different datasets. For example, if a user wants to find the correlation between forecasted sales revenue and online user engagement metrics such as mobile users, website visits, or desktop users, the user will follow several ETL workflow steps: processing the sales dataset (PSD), processing the marketing dataset (PMD), and joining the marketing and sales datasets (JMSD) (Anany, 2018). The user can then implement this ETL workflow in AWS Glue by chaining the ETL jobs with job triggers. The diagram below demonstrates how the user would manage the ETL workflow with AWS Glue in this scenario, and a short code sketch of the trigger-based chaining follows the diagram.

Using AWS Glue to solve the case scenario (Anany, 2018)
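
As a hedged illustration of the trigger-based chaining described above, the boto3 sketch below creates a conditional AWS Glue trigger that starts the join job only after both upstream jobs succeed; the job names, trigger name, and region are hypothetical.

```python
# Sketch: chain Glue jobs with a conditional trigger so the join job (JMSD)
# starts only after the sales (PSD) and marketing (PMD) jobs both succeed.
# Job names, the trigger name, and the region are placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_trigger(
    Name="start-jmsd-after-psd-and-pmd",     # hypothetical trigger name
    Type="CONDITIONAL",
    StartOnCreation=True,
    Predicate={
        "Logical": "AND",
        "Conditions": [
            {"LogicalOperator": "EQUALS", "JobName": "process_sales_dataset", "State": "SUCCEEDED"},
            {"LogicalOperator": "EQUALS", "JobName": "process_marketing_dataset", "State": "SUCCEEDED"},
        ],
    },
    Actions=[{"JobName": "join_marketing_sales_datasets"}],
)
```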

In addition, the following architecture demonstrates how Airflow running on AWS coordinates extract, transform, and load jobs; a corresponding Airflow code sketch follows the figure.

 

Running Airflow on AWS to coordinate ETL jobs (Anany, 2018)
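
The same chain can also be coordinated from Airflow itself. The sketch below uses the Amazon provider's GlueJobOperator to run the PSD and PMD jobs in parallel and then the JMSD join, assuming the Glue jobs already exist; the job names and IAM role are placeholders rather than values from Anany (2018).

```python
# Sketch: coordinate existing Glue ETL jobs from Airflow.
# Job names and the IAM role are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

with DAG(
    dag_id="glue_etl_chain",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    process_sales = GlueJobOperator(
        task_id="process_sales_dataset",
        job_name="process_sales_dataset",
        iam_role_name="glue-etl-role",       # hypothetical role
        wait_for_completion=True,
    )
    process_marketing = GlueJobOperator(
        task_id="process_marketing_dataset",
        job_name="process_marketing_dataset",
        iam_role_name="glue-etl-role",
        wait_for_completion=True,
    )
    join_datasets = GlueJobOperator(
        task_id="join_marketing_sales_datasets",
        job_name="join_marketing_sales_datasets",
        iam_role_name="glue-etl-role",
        wait_for_completion=True,
    )

    # Both upstream jobs must succeed before the join job runs.
    [process_sales, process_marketing] >> join_datasets
```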

Finally, another practical scenario that would necessitate running Airflow on AWS is when the user wants to prepare and manage machine learning data and workflows. Cantu (2023) argues that Airflow helps manage machine learning workflows efficiently, since it allows machine learning engineers and data scientists to define and schedule tasks for data pre-processing, model training, evaluation, and deployment. Hence, Airflow's ability to schedule tasks and handle dependencies in a distributed framework benefits the management of the end-to-end lifecycle of machine learning models. Machine learning workflows automate and orchestrate sequences of ML tasks, beginning with data collection and transformation, followed by training, evaluating, and testing the ML model to attain the intended outcome. Many customers use Airflow to author, schedule, and monitor multi-stage workflows, and on AWS it can automate Amazon SageMaker tasks in an end-to-end workflow. Thus, one can automate publishing datasets to Amazon S3, training an ML model on that data, and deploying the model for prediction. By running Airflow on AWS, one can easily "prepare data in AWS Glue before they train a model on Amazon SageMaker and then deploy the model to the production environment to make inference calls" (Thallam and Dominguez, 2019). Automating and orchestrating these tasks across several services makes it easier to build repeatable and reproducible machine learning workflows that can be shared among data scientists and engineers. When Airflow runs on AWS, preparation and management of machine learning data and workflows can also be paired with AWS Step Functions, which monitors Amazon SageMaker jobs to ensure they succeed; Step Functions provides features such as built-in error handling, state management, parameter passing, and a visual console that help monitor the user's ML workflows.
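
To illustrate the SageMaker integration in a hedged way, the sketch below shows an Airflow task that launches a SageMaker training job through the Amazon provider's SageMakerTrainingOperator; the training image, role ARN, S3 paths, and instance settings are placeholders rather than values from Thallam and Dominguez (2019).

```python
# Sketch: launch a SageMaker training job from Airflow and wait for it.
# The image URI, role ARN, S3 paths, and instance settings are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.sagemaker import SageMakerTrainingOperator

TRAINING_CONFIG = {
    "TrainingJobName": "demand-forecast-{{ ds_nodash }}",   # templated per run
    "AlgorithmSpecification": {
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/forecast:latest",
        "TrainingInputMode": "File",
    },
    "RoleArn": "arn:aws:iam::123456789012:role/sagemaker-execution-role",
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://my-bucket/prepared/train/",
                }
            },
        }
    ],
    "OutputDataConfig": {"S3OutputPath": "s3://my-bucket/models/"},
    "ResourceConfig": {
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 30,
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
}

with DAG(
    dag_id="sagemaker_training_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    train_model = SageMakerTrainingOperator(
        task_id="train_model",
        config=TRAINING_CONFIG,       # passed to SageMaker's CreateTrainingJob
        wait_for_completion=True,
    )
```
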
References

Alvarez-Parmar, R. and Maisagoni, C. (2021). Running Airflow on AWS Fargate. Amazon AWS. https://aws.amazon.com/blogs/containers/running-airflow-on-aws-fargate/

Anany, M. (2018). Orchestrate multiple ETL jobs using AWS Step Functions and AWS Lambda. Amazon AWS. https://aws.amazon.com/blogs/big-data/orchestrate-multiple-etl-jobs-using-aws-step-functions-and-aws-lambda/

Cantu, J. (2023). Mastering Workflow Management and Orchestration with Apache Airflow. Medium. https://medium.com/@jesus.cantu217/apache-airflow-a-comprehensive-guide-to-workflow-management-and-orchestration-bf1372e11920

Grzemski, S. (2020). Highly available Airflow cluster in Amazon AWS. Getindata. https://getindata.com/blog/highly-available-airflow-amazon-aws/

Oliveira, F. and Raditchkov, J. (2019). Orchestrate big data workflows with Apache Airflow, Genie, and Amazon EMR: Part 1. Amazon AWS. https://aws.amazon.com/blogs/big-data/orchestrate-big-data-workflows-with-apache-airflow-genie-and-amazon-emr-part-1/

Sinha, V. (2021). Understanding Airflow ETL: 2 Easy Methods. Hevo. https://hevodata.com/learn/airflow-etl-guide/#etljob

Thallam, R. and Dominguez, M. (2019). Build end-to-end machine learning workflows with Amazon SageMaker and Apache Airflow. Amazon AWS. https://aws.amazon.com/blogs/machine-learning/build-end-to-end-machine-learning-workflows-with-amazon-sagemaker-and-apache-airflow/
