Sunday, 16 February 2025

Apache Airflow Basics and Installation

In the world of data engineering and workflow automation, Apache Airflow has emerged as one of the most popular tools for orchestrating complex data pipelines. Whether you're a data engineer, data scientist, or DevOps professional, understanding the fundamentals of Apache Airflow is essential for building scalable, maintainable, and efficient workflows. In this blog, we’ll dive into the core concepts of Apache Airflow, its architecture, and how to get started with it.


What is Apache Airflow?

Apache Airflow is an open-source platform designed to programmatically author, schedule, and monitor workflows. It allows you to define workflows as code, making them more dynamic, reusable, and version-controlled. Airflow is particularly well-suited for managing ETL (Extract, Transform, Load) processes, machine learning pipelines, and other data-driven workflows.

Key features of Apache Airflow include:

  • Dynamic Pipeline Generation: Workflows are defined in Python, enabling dynamic pipeline creation.
  • Extensibility: A rich ecosystem of operators and hooks allows integration with various services like AWS, GCP, databases, and more.
  • Scalability: Airflow can scale to handle thousands of tasks across multiple workers.
  • Monitoring and Logging: Built-in UI for tracking workflow execution and debugging.

Core Concepts of Apache Airflow

To understand Airflow, it’s important to familiarize yourself with its core components:

1. DAG (Directed Acyclic Graph)

  • A DAG is the backbone of Airflow. It represents a collection of tasks with dependencies, organized in a way that ensures they run in a specific order.

  • Directed: Tasks have a clear direction (dependencies).

  • Acyclic: No loops or circular dependencies are allowed.

  • DAGs are defined in Python scripts, making them flexible and dynamic, as the sketch below shows.
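
As a concrete sketch (the dag_id, task names, and schedule are placeholders chosen for illustration, assuming Airflow 2.x):

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Three placeholder tasks that must run in the order extract -> transform -> load.
with DAG(
    dag_id="example_pipeline",       # illustrative name
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = EmptyOperator(task_id="extract")
    transform = EmptyOperator(task_id="transform")
    load = EmptyOperator(task_id="load")

    # ">>" declares the directed (and acyclic) dependencies between tasks.
    extract >> transform >> load

Because the file is ordinary Python, tasks can also be generated in loops or from configuration, which is what makes pipelines dynamic.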

2. Tasks

  • A task is a unit of work within a DAG. Each task represents an action, such as running a script, querying a database, or sending an email.

  • Tasks are implemented using Operators (e.g., PythonOperator, BashOperator, or one of the SQL operators).


3. Operators

  • Operators define what a task does. Airflow provides a wide range of built-in operators for common tasks, such as:

    • BashOperator: Executes a Bash command.

    • PythonOperator: Executes a Python function.

    • EmailOperator: Sends an email.

    • Sensor: Waits for a specific condition to be met.

  • You can also create custom operators to suit your needs. A short sketch using two of the built-in operators follows below.
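
Here is a rough sketch of a DAG that uses two of these built-in operators (the dag_id, task ids, and the bash command are illustrative placeholders):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def greet():
    # Ordinary Python function that the PythonOperator will call.
    print("Hello from Airflow!")

with DAG(
    dag_id="operator_examples",      # illustrative name
    start_date=datetime(2025, 1, 1),
    schedule=None,                   # no schedule; trigger runs manually
    catchup=False,
) as dag:
    print_date = BashOperator(task_id="print_date", bash_command="date")
    say_hello = PythonOperator(task_id="say_hello", python_callable=greet)

    print_date >> say_hello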


4. Scheduler

  • The Airflow scheduler is responsible for triggering tasks based on their dependencies and schedule intervals.

  • It ensures that tasks are executed in the correct order and at the right time.


5. Executors

  • Executors determine how tasks are run; the choice is made in Airflow's configuration (see the snippet after this list). Airflow supports multiple executors, including:

    • SequentialExecutor: Runs tasks sequentially (for debugging).

    • LocalExecutor: Runs tasks in parallel on a single machine.

    • CeleryExecutor: Distributes tasks across multiple workers (for scalability).

    • KubernetesExecutor: Runs tasks in Kubernetes pods.
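
As a sketch, you can either set the executor option under the [core] section of airflow.cfg or export the equivalent environment variable before starting the scheduler. LocalExecutor here is only an example; executors other than SequentialExecutor generally need a metadata database other than SQLite.

# Equivalent to "executor = LocalExecutor" under [core] in airflow.cfg
export AIRFLOW__CORE__EXECUTOR=LocalExecutor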


Airflow Architecture


Apache Airflow follows a modular architecture, consisting of the following components:

  1. Web Server: Serves the Airflow UI for monitoring and interacting with DAGs.

  2. Scheduler: Parses DAGs, schedules tasks, and triggers them based on dependencies.

  3. Metadata Database: Stores metadata about DAGs, tasks, and their execution history (e.g., PostgreSQL, MySQL); a configuration sketch follows this list.

  4. Executor: Executes tasks based on the chosen execution strategy.

  5. Workers: Perform the actual task execution (used in distributed setups like Celery or Kubernetes).
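
Because the installation below includes the postgres extra, a common refinement is pointing the metadata database at PostgreSQL instead of the default SQLite file. A minimal sketch, assuming a local PostgreSQL instance with an airflow database, user, and password already created (all placeholder values):

# Maps to the sql_alchemy_conn setting in the [database] section of airflow.cfg
export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN="postgresql+psycopg2://airflow:airflow@localhost:5432/airflow"

If you use this, export it before running airflow db init in Step 5 so the schema is created in PostgreSQL.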




Installation Steps

The steps below install Apache Airflow 2.7.2 on an Ubuntu/Debian system inside a Python virtual environment.

Step 1: Update system packages and install the prerequisites (pip, venv, and the PostgreSQL client headers)

sudo apt update && sudo apt upgrade -y
sudo apt install -y python3-pip python3-venv libpq-dev

Step 2: Create the Airflow home directory and a Python virtual environment

mkdir ~/airflow
cd ~/airflow
python3 -m venv airflow_env
source airflow_env/bin/activate

Step 3: Upgrade pip, setuptools, and wheel inside the virtual environment

pip install --upgrade pip setuptools wheel

Step 4: Set AIRFLOW_HOME and install Airflow 2.7.2 with the Celery and Postgres extras, pinned to the official constraints file

export AIRFLOW_HOME=~/airflow

pip install apache-airflow[celery,postgres]==2.7.2 --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.7.2/constraints-$(python -c 'import sys; print(".".join(map(str, sys.version_info[:2])))').txt"
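
Optionally, confirm the installation before moving on:

airflow version    # should report 2.7.2 if the install succeeded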


Step 5: Initialize the Airflow metadata database

airflow db init

Step 6: Create an admin user for the web UI

airflow users create \
    --username admin \
    --firstname Airflow \
    --lastname Admin \
    --role Admin \
    --email admin@example.com \
    --password admin123

Step 7: Start the webserver (on port 8081) and the scheduler in the background

airflow webserver --port 8081 &

airflow scheduler &

Step 8: Open the Airflow UI

  • Open http://localhost:8081 in a browser
  • Log in with the credentials admin / admin123


  • After a server restart, re-activate the virtual environment and start the services again:

    source airflow_env/bin/activate
    export AIRFLOW_HOME=~/airflow
    airflow webserver --port 8081 &
    airflow scheduler &
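
With the virtual environment active and both services running, two optional sanity checks:

airflow db check     # verifies the connection to the metadata database
airflow dags list    # lists the DAGs Airflow has parsed from the dags folder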

