Mastering Data Workflows with Apache Airflow: A Fun and Comprehensive Guide for Beginners
Hello network!
So, this time I'm here to share a bit of what I've learned about a tool I started studying this month—Airflow. I felt the need to dive deeper into orchestration, and I must say, I'm really enjoying Airflow. Also, I'm taking a shot at writing in English, so please bear with me if there are any mistakes, haha. I hope this content is useful for those who are just starting out.
1. What is Airflow?
Imagine trying to juggle a dozen flaming torches while riding a unicycle. That’s essentially what managing data pipelines can feel like. Enter Apache Airflow – the cool, composed juggler who makes this circus act look like a walk in the park. Developed by Airbnb (because apparently, they had time to revolutionize data engineering between hosting guests), Airflow is your go-to platform for programmatically creating, scheduling, and monitoring workflows.
1.1 Advantages of Airflow
1.2 Summary of Airflow Architecture
Airflow’s architecture is a bit like a high-tech kitchen, with each component playing a crucial role in delivering the perfect dish.
1.3 Key Concepts
2. Types of Operators in Apache Airflow
Operators are the backbone of your workflows in Apache Airflow. They define the tasks you need to perform, ranging from running a Python script to querying a database. Here’s a rundown of some key operators:
2.1 Common Operators
BashOperator
from airflow.operators.bash import BashOperator
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
)
PythonOperator
from airflow.operators.python import PythonOperator
def my_function():
    print("Hello from PythonOperator")

t2 = PythonOperator(
    task_id='run_my_function',
    python_callable=my_function,
)
EmailOperator
from airflow.operators.email import EmailOperator
email = EmailOperator(
    task_id='send_email',
    to='example@example.com',
    subject='Airflow Alert',
    html_content='The task has completed.',
)
2.2 Database Operators
MySqlOperator
from airflow.providers.mysql.operators.mysql import MySqlOperator
t3 = MySqlOperator(
    task_id='run_mysql_query',
    sql='SELECT * FROM my_table',
    mysql_conn_id='my_mysql_connection',
)
PostgresOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator
t4 = PostgresOperator(
    task_id='run_postgres_query',
    sql='SELECT * FROM my_table',
    postgres_conn_id='my_postgres_connection',
)
2.3 Data Movement Operators
S3ToRedshiftOperator
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator
t5 = S3ToRedshiftOperator(
    task_id='load_s3_to_redshift',
    schema='public',
    table='my_table',
    s3_bucket='my_bucket',
    s3_key='my_key',
    copy_options=['csv'],
    redshift_conn_id='my_redshift_connection',
)
2.4 Sensor Operators
S3KeySensor
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
t7 = S3KeySensor(
    task_id='check_for_file',
    bucket_name='my_bucket',
    bucket_key='my_key',
    aws_conn_id='my_aws_connection',
)
HttpSensor
from airflow.providers.http.sensors.http import HttpSensor
t8 = HttpSensor(
    task_id='check_http',
    http_conn_id='my_http_connection',
    endpoint='health',
    response_check=lambda response: response.status_code == 200,
)
3. How It Works
Step 1: Installation
First things first, let’s get Airflow installed. The easiest way is to use pip:
pip install apache-airflow
Note: I’m a fan of using poetry for dependency management.
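If you go the poetry route, the rough equivalent would be the command below (keep in mind the Airflow docs officially support installation via pip with a constraints file, so dependency resolution with poetry can be slower or pickier):
poetry add apache-airflow==2.4.2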
Step 2: Environment Initialization
Once installed, initialize the Airflow environment. This sets up the necessary directories and the default SQLite database for metadata storage.
airflow db init
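With a fresh local install you will also want a user for the web UI. One can be created with the CLI; the credentials below are just placeholders:
airflow users create \
    --username admin \
    --password admin \
    --firstname Ada \
    --lastname Lovelace \
    --role Admin \
    --email admin@example.com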
Step 3: Airflow Configuration
The main configuration file for Airflow is airflow.cfg. Here you can tweak settings like the database backend, the executor, and webserver configurations.
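As a sketch, a couple of commonly tweaked entries look like this (the values are illustrative, and the same settings can also be overridden with AIRFLOW__SECTION__KEY environment variables):
[core]
executor = LocalExecutor

[database]
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost/airflow

[webserver]
web_server_port = 8080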
Step 4: Creating a DAG
A DAG (Directed Acyclic Graph) is your blueprint for tasks and their dependencies. Here’s a simple DAG in Python:
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from datetime import datetime

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2023, 1, 1),
    'retries': 1,
}

dag = DAG(
    'my_first_dag',
    default_args=default_args,
    description='A simple DAG',
    schedule_interval='@daily',
)

start = DummyOperator(
    task_id='start',
    dag=dag,
)

end = DummyOperator(
    task_id='end',
    dag=dag,
)

start >> end
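If you prefer, the same DAG can be written with the with DAG(...) context manager, which saves you from passing dag=dag to every task. Here's an equivalent sketch that also slots the BashOperator from section 2.1 between start and end (the DAG id and task names are just illustrative):
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.dummy import DummyOperator
from datetime import datetime

with DAG(
    'my_first_dag_v2',                 # hypothetical DAG id for this sketch
    start_date=datetime(2023, 1, 1),
    schedule_interval='@daily',
    catchup=False,                     # don't backfill runs for past dates
) as dag:
    start = DummyOperator(task_id='start')
    print_date = BashOperator(task_id='print_date', bash_command='date')
    end = DummyOperator(task_id='end')

    # run start first, then print_date, then end
    start >> print_date >> end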
Step 5: Adding the DAG to Airflow
Save the DAG file in a directory called dags, which should be located in your main Airflow directory (AIRFLOW_HOME/dags).
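For example, assuming the code above is saved as my_first_dag.py and AIRFLOW_HOME is the default ~/airflow, copying it into place would look like:
mkdir -p ~/airflow/dags
cp my_first_dag.py ~/airflow/dags/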
Step 6: Starting the Webserver and Scheduler
Airflow’s webserver and scheduler are the dynamic duo for monitoring and distributing tasks. Start both services:
airflow webserver --port 8080
airflow scheduler
Step 7: Monitoring and Execution
Fire up your browser and head to http://localhost:8080. Here, you can view, monitor, and manage your DAGs and tasks.
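If you'd rather poke around from the terminal, the Airflow CLI covers most of the same ground. For example, assuming the my_first_dag from Step 4:
airflow dags list
airflow tasks test my_first_dag start 2023-01-01
airflow dags trigger my_first_dag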
4. Installing Apache Airflow
Now let’s roll up our sleeves and get Airflow running with Docker Compose. Here’s how you do it:
Step 1: Create a docker-compose.yaml File
First, create a docker-compose.yaml file. You can grab it directly from the Airflow documentation:
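For example, for the 2.4.2 image pinned in the .env file below, the documentation hosts a versioned compose file that you can fetch with curl (adjust the version to whatever you're running):
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.4.2/docker-compose.yaml'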
Step 2: Create a .env File
Create a .env file to store environment variables. This keeps your configuration clean and organized.
Step 3: Add the Following Lines to the .env File
AIRFLOW_IMAGE_NAME=apache/airflow:2.4.2
AIRFLOW_UID=50000
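A quick note on these values: AIRFLOW_IMAGE_NAME pins the Airflow Docker image, and AIRFLOW_UID should match your host user's UID so that files written to the mounted folders (like dags and logs) aren't owned by root. On Linux you can check your UID with:
id -u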
Step 4: OPTIONAL - Create a Virtual Environment with Poetry
Using Poetry helps manage dependencies smoothly. If you prefer, create a virtual environment:
poetry init
poetry shell
Step 5: Start Docker Compose
Navigate to the directory containing your docker-compose.yaml file and run:
cd materials
docker-compose up -d
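One heads-up: on the very first run, the official compose file also ships an airflow-init service that migrates the metadata database and creates the default airflow/airflow user. If the other containers complain that the database isn't ready, run it once before bringing everything up:
docker-compose up airflow-init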
Step 6: Check Your Setup
After Docker Compose finishes, you should see dags, logs, and plugins directories next to your docker-compose.yaml file (these are the folders the containers mount). To confirm that the containers themselves are up and healthy, you can use the command:
docker-compose ps
Step 7: Access Airflow in Your Browser
Open your browser and navigate to http://localhost:8080. Log in with the default credentials: username airflow, password airflow.
Step 8: Exercise - Pausing and Triggering a DAG
Now that everything is set up, let’s do a simple exercise: pause/unpause a DAG from the Airflow web interface, then trigger a manual run and follow its progress in the UI.
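The same actions are available from the CLI inside the containers if you ever want to script them. For example, assuming example_bash_operator (one of the example DAGs that ships with Airflow) is visible in your environment, something like:
docker-compose exec airflow-webserver airflow dags unpause example_bash_operator
docker-compose exec airflow-webserver airflow dags trigger example_bash_operator
docker-compose exec airflow-webserver airflow dags pause example_bash_operator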
That's it for this article! But don't worry, there's more to come. I'll be diving into some intermediate and advanced examples in future articles, showing you how to harness the full power of Airflow. Stay tuned, and let's keep learning together!