Mastering Data Workflows with Apache Airflow: A Fun and Comprehensive Guide for Beginners

Hello network!

So, this time I'm here to share a bit of what I've learned about a tool I started studying this month—Airflow. I felt the need to dive deeper into orchestration, and I must say, I'm really enjoying Airflow. Also, I'm taking a shot at writing in English, so please bear with me if there are any mistakes, haha. I hope this content is useful for those who are just starting out.


1. What is Airflow?

Imagine trying to juggle a dozen flaming torches while riding a unicycle. That’s essentially what managing data pipelines can feel like. Enter Apache Airflow – the cool, composed juggler who makes this circus act look like a walk in the park. Developed by Airbnb (because apparently, they had time to revolutionize data engineering between hosting guests), Airflow is your go-to platform for creating, scheduling, and monitoring programmable workflows.

1.1 Advantages of Airflow

  1. Workflow Orchestration: Airflow lets you manage complex workflows with the elegance of a maestro conducting an orchestra, ensuring every task hits the right note at the right time.
  2. Scalability: Whether you’re processing a trickle of data or a tsunami, Airflow scales up to handle thousands of tasks like a pro.
  3. Flexibility: It’s the Swiss Army knife of data tools, integrating seamlessly with a wide array of systems, databases, APIs, and cloud services.
  4. Visualization and Monitoring: With its intuitive web interface, you can monitor workflows like a hawk and tweak them in real-time. Think of it as mission control for your data pipelines.
  5. Active Community and Documentation: Got a problem? There’s a whole community ready to help, along with extensive documentation and plugins to make your life easier.

1.2 Summary of Airflow Architecture

Airflow’s architecture is a bit like a high-tech kitchen, with each component playing a crucial role in delivering the perfect dish.

  1. Scheduler: The brain of the operation, deciding when tasks should run based on dependencies and schedules.
  2. Executor: The muscle, managing task execution. Choose your fighter: SequentialExecutor, LocalExecutor, CeleryExecutor, or KubernetesExecutor.
  3. Workers: The hands-on team executing the tasks. Need more power? Just add more workers.
  4. Web Server: Your shiny dashboard for managing, viewing, and monitoring workflows. It’s like having Google Analytics for your data tasks.
  5. Metadatabase: The memory bank, storing the state of tasks, workflows, and configurations.
  6. Triggerer: Handles sensors waiting for specific conditions, doing so efficiently without hogging resources.
  7. Queues: Think of these as the task traffic controllers, managing the distribution and load balancing among workers.

Architecture diagram source: https://mindmajix.com/apache-airflow-tutorial


1.3 Key Concepts

  • DAGs (Directed Acyclic Graphs): The blueprint of your workflows, showing how tasks interconnect without looping back.
  • Tasks and Operators: Tasks are your action items, while operators define how these actions are executed, like running Python scripts, transferring data, or querying databases.
  • XComs (Cross-Communication): A fancy way of saying "tasks passing notes," allowing them to share data and results.
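
To make XComs concrete, here is a minimal sketch of two tasks passing a value to each other. Everything in it (the DAG id, task ids, and the hard-coded row count) is hypothetical; the point is the xcom_push / xcom_pull pattern.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def push_row_count(ti):
    # Push a value into XCom under an explicit key
    ti.xcom_push(key='row_count', value=42)

def pull_row_count(ti):
    # Pull the value that the upstream task pushed
    row_count = ti.xcom_pull(task_ids='push_task', key='row_count')
    print(f'Upstream task reported {row_count} rows')

with DAG(
    dag_id='xcom_example',
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
) as dag:
    push_task = PythonOperator(task_id='push_task', python_callable=push_row_count)
    pull_task = PythonOperator(task_id='pull_task', python_callable=pull_row_count)
    push_task >> pull_task  # pull only runs after push succeeds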


2. Types of Operators in Apache Airflow

Operators are the backbone of your workflows in Apache Airflow. They define the tasks you need to perform, ranging from running a Python script to querying a database. Here’s a rundown of some key operators:

2.1 Common Operators

BashOperator

  • Description: Runs Bash (shell) commands. Handy for all your Unix-y needs.
  • Usage: Ideal for simple scripts and commands.
  • Example:

from airflow.operators.bash import BashOperator

t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
)        

PythonOperator

  • Description: Executes Python functions. Your one-stop-shop for all things Python.
  • Usage: Super versatile for anything from API calls to data manipulation.
  • Example:

from airflow.operators.python import PythonOperator

def my_function():
    print("Hello from PythonOperator")

t2 = PythonOperator(
    task_id='run_my_function',
    python_callable=my_function,
)        

EmailOperator

  • Description: Sends emails. Perfect for those "Hey, look what I did!" moments.
  • Usage: Notification and alerts.
  • Example:

from airflow.operators.email import EmailOperator

email = EmailOperator(
    task_id='send_email',
    to='example@example.com',
    subject='Airflow Alert',
    html_content='The task has completed.',
)        

2.2 Database Operators

MySqlOperator

  • Description: Executes SQL commands in MySQL.
  • Usage: Query, update, or maintain your MySQL databases.
  • Example:

from airflow.providers.mysql.operators.mysql import MySqlOperator

t3 = MySqlOperator(
    task_id='run_mysql_query',
    sql='SELECT * FROM my_table',
    mysql_conn_id='my_mysql_connection',
)        

PostgresOperator

  • Description: Executes SQL commands in PostgreSQL.
  • Usage: Similar to MySqlOperator but for PostgreSQL.
  • Example:

from airflow.providers.postgres.operators.postgres import PostgresOperator

t4 = PostgresOperator(
    task_id='run_postgres_query',
    sql='SELECT * FROM my_table',
    postgres_conn_id='my_postgres_connection',
)        

2.3 Data Movement Operators

S3ToRedshiftOperator

  • Description: Moves data from S3 to Redshift. Perfect for your AWS ETL needs.
  • Usage: Ideal for ETL (Extract, Transform, Load) in AWS.
  • Example:

from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator

t5 = S3ToRedshiftOperator(
    task_id='load_s3_to_redshift',
    schema='public',
    table='my_table',
    s3_bucket='my_bucket',
    s3_key='my_key',
    copy_options=['csv'],
    redshift_conn_id='my_redshift_connection',
)        

2.4 Sensor Operators

S3KeySensor

  • Description: Checks for the existence of a file in S3. Like a vigilant watchdog for your data.
  • Usage: Waits for file arrivals before processing.
  • Example:

from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

t7 = S3KeySensor(
    task_id='check_for_file',
    bucket_name='my_bucket',
    bucket_key='my_key',
    aws_conn_id='my_aws_connection',
)        

HttpSensor

  • Description: Checks the response from a URL. Think of it as your web service health checker.
  • Usage: Verifies if a web service is up and running.
  • Example:

from airflow.providers.http.sensors.http import HttpSensor

t8 = HttpSensor(
    task_id='check_http',
    http_conn_id='my_http_connection',
    endpoint='health',
    response_check=lambda response: response.status_code == 200,
)        

3. How It Works

Step 1: Installation

First things first, let’s get Airflow installed. The easiest way is to use pip:

pip install apache-airflow        

Note: I’m a fan of using poetry for dependency management.
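
If the plain pip install runs into dependency conflicts, the Airflow docs recommend pinning against a constraints file. A hedged example (adjust 2.4.2 and the Python version in the URL, 3.8 here, to match your setup):

pip install "apache-airflow==2.4.2" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.4.2/constraints-3.8.txt"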

Step 2: Environment Initialization

Once installed, initialize the Airflow environment. This sets up the necessary directories and the default SQLite database for metadata storage.

airflow db init        

Step 3: Airflow Configuration

The main configuration file for Airflow is airflow.cfg. Here you can tweak settings like the database backend, the executor, and webserver configurations.
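
As a rough illustration, a minimal slice of airflow.cfg that switches to the LocalExecutor and a PostgreSQL metadata database might look like this (the connection string is an assumption for a local Postgres, not a value Airflow ships with):

[core]
executor = LocalExecutor

[database]
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow

[webserver]
web_server_port = 8080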

Step 4: Creating a DAG

A DAG (Directed Acyclic Graph) is your blueprint for tasks and their dependencies. Here’s a simple DAG in Python:

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from datetime import datetime

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2023, 1, 1),
    'retries': 1,
}

dag = DAG(
    'my_first_dag',
    default_args=default_args,
    description='A simple DAG',
    schedule_interval='@daily',
)

start = DummyOperator(
    task_id='start',
    dag=dag,
)

end = DummyOperator(
    task_id='end',
    dag=dag,
)

start >> end        

Step 5: Adding the DAG to Airflow

Save the DAG file in a directory called dags, which should be located in your main Airflow directory (AIRFLOW_HOME/dags).
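
For example, assuming AIRFLOW_HOME is set and the DAG above was saved as my_first_dag.py (a file name chosen just for this sketch):

mkdir -p $AIRFLOW_HOME/dags
cp my_first_dag.py $AIRFLOW_HOME/dags/
airflow dags list   # the new DAG should appear here once the file is parsed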

Step 6: Starting the Webserver and Scheduler

Airflow’s webserver and scheduler are the dynamic duo: the webserver gives you the UI, and the scheduler decides what runs and when. Start both services, each in its own terminal:

airflow webserver --port 8080
airflow scheduler        

Step 7: Monitoring and Execution

Fire up your browser and head to http://localhost:8080. Here, you can view, monitor, and manage your DAGs and tasks.


4. Installing Apache Airflow

Now let’s roll up our sleeves and get Airflow running with Docker Compose. Here’s how you do it:

Step 1: Create a docker-compose.yaml File

First, create a docker-compose.yaml file. You can grab it directly from the Airflow documentation:

Airflow Docker Compose File
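
If you prefer the command line, the documentation suggests downloading it along these lines (keep the version in the URL in sync with the Airflow version you intend to run, 2.4.2 in this article):

curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.4.2/docker-compose.yaml'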

Step 2: Create an .env File

Create an .env file to store environment variables. This keeps your configuration clean and organized.


Step 3: Add the Following Lines to the .env File

AIRFLOW_IMAGE_NAME=apache/airflow:2.4.2
AIRFLOW_UID=50000        

Step 4: OPTIONAL - Create a Virtual Environment with Poetry

Using Poetry helps manage dependencies smoothly. If you prefer, create a virtual environment:

poetry init
poetry shell        

Step 5: Start Docker Compose

Navigate to the directory containing your docker-compose.yaml file and run:

cd materials
docker-compose up -d        

Step 6: Check Your Setup

After Docker Compose finishes pulling the images and starting the containers, your project directory should contain dags, logs, and plugins folders alongside docker-compose.yaml.

To confirm that every container is up and healthy, run:

docker-compose ps

Step 7: Access Airflow in Your Browser

Open your browser and navigate to http://localhost:8080. Log in with the username and password airflow.


Step 8: Exercise - Pausing and Triggering a DAG

Now that everything is set up, let’s do a simple exercise: pause and unpause a DAG from the Airflow web interface, then trigger a manual run and watch the tasks complete.
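
If you prefer the terminal, the same exercise can be done with the Airflow CLI inside one of the containers. A sketch, assuming the airflow-webserver service name from the official Compose file and the example_bash_operator DAG that ships with Airflow:

docker-compose exec airflow-webserver airflow dags unpause example_bash_operator
docker-compose exec airflow-webserver airflow dags trigger example_bash_operator
docker-compose exec airflow-webserver airflow dags pause example_bash_operator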


That's it for this article! But don't worry, there's more to come. I'll be diving into some intermediate and advanced examples in future articles, showing you how to harness the full power of Airflow. Stay tuned, and let's keep learning together!

