Thursday, 20 February 2025

Apache Sqoop: A Comprehensive Guide to Data Transfer in the Hadoop Ecosystem

 



Introduction

In the era of big data, organizations deal with massive volumes of structured and unstructured data stored in various systems, such as relational databases, data warehouses, and distributed file systems like Hadoop. Moving data between these systems efficiently is a critical task. Apache Sqoop is a powerful tool designed to simplify and automate the transfer of data between relational databases and Hadoop ecosystems. This blog provides an in-depth overview of Apache Sqoop, its purpose, architecture, key components, and practical examples of using Sqoop to move data between MySQL and Hadoop components like HDFS, Hive, and HBase.

Purpose of Apache Sqoop in the Hadoop Ecosystem

Apache Sqoop (SQL-to-Hadoop) is a command-line interface (CLI) tool that facilitates bidirectional data transfer between:

- Relational Databases (RDBMS) like MySQL, Oracle, PostgreSQL, and SQL Server.

- Hadoop Ecosystem Components like HDFS, Hive, and HBase.

Key Use Cases:

1. Data Ingestion: Import data from RDBMS into Hadoop for processing and analysis.

2. Data Export: Export processed data from Hadoop back to RDBMS for reporting or further analysis.

3. Data Integration: Integrate structured data from databases with unstructured/semi-structured data in Hadoop.

4. Efficient Data Transfer: Optimize data transfer using parallel processing and connectors.

Fundamentals of Apache Sqoop

Sqoop works by translating commands into MapReduce jobs that execute the data transfer. It uses connectors to interact with different databases and Hadoop components. Sqoop supports incremental data loads, parallel data transfers, and fault tolerance.
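To make the parallel transfer concrete, here is a hedged sketch of an import that combines multiple mappers with compression (the connection string, table, and target path follow the sakila examples used later in this post and should be adjusted to your environment):

sqoop import \
--connect jdbc:mysql://localhost:3306/sakila \
--username hdoop \
-P \
--table store \
--split-by store_id \
--num-mappers 4 \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--target-dir /datalake/bronze/sakila/store_parallel

Here --split-by names the column used to divide the table into four splits (one per mapper), and --compress with the Snappy codec compresses the files written to HDFS.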

Key Features:

- Parallel Data Transfer: Sqoop uses multiple mappers to transfer data in parallel, improving performance.

- Connectors: Supports connectors for various databases and Hadoop components.

- Incremental Imports: Allows importing only new or updated data.

- Compression: Supports data compression during transfer to save storage and bandwidth.

Architecture of Apache Sqoop



Sqoop's architecture consists of the following components:

1. Sqoop Client: The command-line interface used to submit Sqoop jobs.

2. Connectors: Plugins that enable Sqoop to interact with different databases and Hadoop components.

3. Metadata Store: Stores metadata about Sqoop jobs (e.g., last imported row).

4. MapReduce Framework: Executes the data transfer tasks in parallel.

Key Components of Apache Sqoop

1. Sqoop Import: Transfers data from RDBMS to HDFS, Hive, or HBase.

2. Sqoop Export: Transfers data from HDFS, Hive, or HBase to RDBMS.

3. Sqoop Job: Allows saving and reusing Sqoop commands (illustrated in the sketch after this list).

4. Sqoop Merge: Merges incremental data with existing data in HDFS.
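To illustrate the Sqoop Job component, here is a hedged sketch that saves an import as a reusable job (the job name, credentials, and paths are illustrative):

sqoop job --create sakila_store_job \
-- import \
--connect jdbc:mysql://localhost:3306/sakila \
--username hdoop \
-P \
--table store \
--target-dir /datalake/bronze/sakila/store_job \
--num-mappers 1

sqoop job --list
sqoop job --show sakila_store_job
sqoop job --exec sakila_store_job

Because -P prompts for the password interactively, saved jobs used in automation typically switch to --password-file (see the Best Practices section below). Saved jobs also pair naturally with incremental imports, since the Sqoop metastore remembers the last imported value between runs.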



Internal Working Mechanism


Sqoop is a tool designed to transfer data between relational databases (RDBMS) and Hadoop Distributed File System (HDFS). When performing an import, Sqoop first connects to the RDBMS using JDBC and retrieves metadata about the table being imported. It then generates a map-only MapReduce job, where each mapper is responsible for transferring a portion of the data. Sqoop divides the data into splits based on a primary key or a specified column, ensuring parallel data transfer. Each mapper executes a SQL query to fetch its assigned split of data from the database. The retrieved data is then written to HDFS in a specified format (e.g., text, Avro, or Parquet). Throughout the process, Sqoop handles data type conversions, ensuring compatibility between the RDBMS and HDFS. This mechanism allows for efficient, parallelized data transfer from structured databases to Hadoop as shown in the above diagram.
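To see the split mechanism in action, here is a hedged sketch of a free-form query import; Sqoop substitutes a range predicate on the --split-by column for the $CONDITIONS placeholder in each mapper's query (the table and columns follow the sakila store table used later in this post):

sqoop import \
--connect jdbc:mysql://localhost:3306/sakila \
--username hdoop \
-P \
--query 'SELECT store_id, manager_staff_id, address_id FROM store WHERE $CONDITIONS' \
--split-by store_id \
--target-dir /datalake/bronze/sakila/store_query \
--num-mappers 2

Each of the two mappers runs the query with its own WHERE clause over a disjoint store_id range and writes its portion of the rows to the target directory.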

Installing Sqoop in Ubuntu


Here we are going to discuss how to install Sqoop on an Ubuntu VM.

Prerequisites

Make sure you have the following installed:


Java (JDK 8)
Hadoop (HDFS & YARN)
Hive (Optional, for Hive integration)
A relational database (MySQL, PostgreSQL, etc.)
JDBC Connector for your database

Step 1: Update System


sudo apt update && sudo apt upgrade -y


Step 2: Download Sqoop 1.4.7


cd /home/hdoop
sudo wget https://archive.apache.org/dist/sqoop/1.4.7/sqoop-1.4.7.tar.gz


Step 3: Extract the Sqoop Archive


sudo tar -xvzf sqoop-1.4.7.tar.gz


Step 4: Create a Symlink


ln -s sqoop-1.4.7 sqoop


Step 5: Set Environment Variables


Edit ~/.bashrc to set the environment variables below, then save it and source it to activate the changes.

sudo nano ~/.bashrc

#JAVA Related Options
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin

#Hadoop Related Options
export HADOOP_HOME=/home/hdoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/common/lib
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

#HIVE Related Options
export HIVE_HOME=/home/hdoop/hive
export PATH=$PATH:$HIVE_HOME/bin
#export HIVE_AUX_JARS_PATH=file:///home/hdoop/spark-2.4.8-bin-without-hadoop/jars/

#Hcatalog Related Options
export HCATALOG_HOME=$HIVE_HOME/hcatalog
export PATH=$HCATALOG_HOME/bin:$PATH
export HCAT_HOME=$HCATALOG_HOME

#SQOOP related Options
export SQOOP_HOME=/home/hdoop/sqoop
export PATH=$PATH:$SQOOP_HOME/bin

source ~/.bashrc

Step 6: Configure the sqoop-env.sh


Create sqoop-env.sh in the sqoop/conf folder from the provided template, add the following variables, then save and exit.

cd sqoop/conf
cp sqoop-env-template.sh sqoop-env.sh

#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/home/hdoop/hadoop

#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/home/hdoop/hadoop

#set the path to where bin/hbase is available
#export HBASE_HOME=

#Set the path to where bin/hive is available
export HIVE_HOME=/home/hdoop/hive

Step 7: Download Required Libraries to the sqoop/lib Folder


Please ensure the following jars are available in the sqoop/lib folder. Either copy them from the existing Hadoop and Spark lib folders, or download them directly from Maven (or another repository) using wget, as shown in the example after the jar list below.

ant-contrib-1.0b3.jar
ant-eclipse-1.0-jvm1.2.jar
avro-tools-1.8.0.jar
guava-18.0.jar
kite-data-core-0.17.0.jar
kite-data-mapreduce-0.17.0.jar
mysql-connector-java-8.0.28.jar
sqoop-1.4.7.jar
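For example, the MySQL JDBC connector can usually be fetched from Maven Central with wget (verify the exact repository path and version for your setup):

cd /home/hdoop/sqoop/lib
sudo wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.28/mysql-connector-java-8.0.28.jar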

Step 8: Verify that Sqoop Works


Verify that Sqoop works by running the following command.
 
sqoop version


Note: You might get warnings related to missing HCatalog or HBase if you don't have those configured on your system.

Sample Sqoop Scripts


1. Import Data from MySQL to HDFS

This script imports data from a MySQL table into HDFS.

sqoop import \
--connect jdbc:mysql://localhost:3306/sakila \
--username hdoop \
-P \
--table store \
--target-dir /datalake/bronze/sakila/store2 \
--as-textfile \
--num-mappers 1

2. Import Data from MySQL to Hive

This script imports data from a MySQL table into a Hive table.

sqoop import \
--connect jdbc:mysql://localhost:3306/sakila \
--username hdoop \
-P \
--table store \
--hive-import \
--hive-database default \
--hive-table sakila_store2 \
--create-hive-table \
--num-mappers 2
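
3. Export Data from HDFS to MySQL

Sqoop Export was introduced earlier but not shown. Here is a hedged sketch that pushes processed data from HDFS back into MySQL; the target table (store_report here) is illustrative and must already exist in MySQL with a matching schema.

sqoop export \
--connect jdbc:mysql://localhost:3306/sakila \
--username hdoop \
-P \
--table store_report \
--export-dir /datalake/gold/sakila/store_report \
--input-fields-terminated-by ',' \
--num-mappers 1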


Best Practices for Using Sqoop

1. Use Incremental Imports: For large datasets, use `--incremental` to import only new or updated data.

2. Optimize Mappers: Adjust the number of mappers (`-m`/`--num-mappers`) based on the dataset size and cluster resources.

3. Compress Data: Use `--compress` to reduce storage and transfer time.

4. Validate Data: Use `--validate` to ensure data integrity after transfer.

5. Secure Credentials: Avoid hardcoding passwords; use the `--password-file` option or a credential provider instead (see the sketch after this list).
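
A hedged sketch combining the credential and validation practices above; the password file path is illustrative, and the file should live on HDFS with restrictive permissions and no trailing newline:

echo -n 'your_db_password' > mysql.password
hdfs dfs -put mysql.password /user/hdoop/.mysql.password
hdfs dfs -chmod 400 /user/hdoop/.mysql.password
rm mysql.password

sqoop import \
--connect jdbc:mysql://localhost:3306/sakila \
--username hdoop \
--password-file /user/hdoop/.mysql.password \
--table store \
--target-dir /datalake/bronze/sakila/store_validated \
--validate \
--num-mappers 1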


Conclusion

Apache Sqoop is a vital tool in the Hadoop ecosystem, enabling seamless data transfer between relational databases and Hadoop components like HDFS, Hive, and HBase. Its ease of use, parallel processing capabilities, and support for various databases make it a preferred choice for data ingestion and export tasks. By mastering Sqoop, data engineers can efficiently integrate structured data into big data workflows, unlocking the full potential of Hadoop for analytics and processing.

Whether you're importing data from MySQL to HDFS or exporting processed results back to a relational database, Sqoop simplifies the process, making it an indispensable tool in the big data toolkit.

You can find more articles on data technologies here


Big Data Capstone Project: HR Analytics



In this project we will build HR Analytics to study key aspects of HR operations. We will ingest the transactional data and reference data from a MySQL database (the backend of the HRMS) into our Big Data ecosystem built using Apache open-source big data frameworks (Hadoop, Hive, HBase, Spark, etc.). Then we will perform data cleansing and data transformation, and implement the business logic to develop key metrics in the form of graphs, charts, tables, etc. in the Apache Superset BI tool, which can be consumed by key HR personnel and executives.

Architecture





Source Data Preparation


We will first prepare our source data required for this project. Please follow the below steps.


Step 1: Create a database named employees in MySQL. Run the commands below:


sudo mysql

CREATE DATABASE employees;
USE employees;

exit;

Then exit the MySQL shell.

Step 2: Download the HR Database 


Download the HR database from this GitHub repository (datacharmer/test_db) by executing the commands below.

git clone https://github.com/datacharmer/test_db.git

cd test_db

sudo mysql < employees.sql

Step 3: Verify that the tables have been created with sample data


Open the MySQL shell and execute the commands below to verify that the tables were created with sample data:

sudo mysql

USE employees;
SHOW TABLES;
SELECT * FROM employees LIMIT 100;







Data Ingestion


In this step we will ingest the data from the MySQL database (the backend of the HRMS) into the Hadoop platform using Sqoop.


One-time Historical Data Ingestion


Please develop and execute Sqoop scripts to ingest the one-time historical data from the tables below directly into Hive.
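A hedged sketch of such a one-time load using import-all-tables (credentials and names are assumptions; the Hive database must already exist, and tables without a single-column primary key may need --autoreset-to-one-mapper or an explicit -m 1):

hive -e "CREATE DATABASE IF NOT EXISTS employees;"

sqoop import-all-tables \
--connect jdbc:mysql://localhost:3306/employees \
--username hdoop \
-P \
--hive-import \
--hive-database employees \
--autoreset-to-one-mapper \
--num-mappers 2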

 

Periodic Batch Ingestion


Develop and execute Sqoop commands to incrementally load only the delta data into the already created Hive tables for the same MySQL tables.
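
A hedged sketch of a daily delta load for the employees table (the check column, last value, and password file are illustrative). Note that many Sqoop 1.x builds do not allow --incremental append together with --hive-import, so a common pattern is to land the delta in HDFS and expose it to Hive via an external table or a LOAD DATA step:

sqoop import \
--connect jdbc:mysql://localhost:3306/employees \
--username hdoop \
--password-file /user/hdoop/.mysql.password \
--table employees \
--target-dir /datalake/bronze/employees/employees_delta \
--incremental append \
--check-column emp_no \
--last-value 499999 \
--num-mappers 1

Wrapping this command in a saved sqoop job lets the Sqoop metastore track --last-value automatically between daily runs.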



Scheduling


Schedule the Sqoop ingestion job to run daily at 01:00 AM in Apache Airflow.

Visualization


Create an HR Analytics Dashboard in Apache Superset with the following charts.

  • Bar chart with the number of employees per department
  • Trendline of the number of employees joining each department over the years
  • Trendline of the number of employees leaving each department
  • Bar chart showing the average age of employees in each department
  • Trendline showing male and female hiring rates over the years
  • Pie chart showing the number of employees in each age group for each department, using the age groups below:
    • 21 - 30
    • 31 - 40
    • 41 - 50
    • 50+
  • Trendline showing the salary cost of each department over the years
  • Pie chart showing the number of employees in each title for different departments
  • Table showing the top 10 active highest-paid employees (employee id, employee full name, age, department name, title, gender)

Serving Hive Tables


To create the above visualizations, create additional Hive tables that will serve as input to the Apache Superset charts. Create HQL files to load the data (preferably incrementally) into these serving Hive tables.
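
As a hedged sketch of one such serving table, the HQL below builds an active head count per department, assuming the Hive tables mirror the test_db schema (departments, dept_emp) in a Hive database named employees; adjust the names to your environment:

cat > serving_emp_per_dept.hql <<'EOF'
-- Serving table: number of active employees per department
DROP TABLE IF EXISTS employees.serving_emp_per_dept;
CREATE TABLE employees.serving_emp_per_dept AS
SELECT d.dept_name,
       COUNT(*) AS employee_count
FROM employees.dept_emp de
JOIN employees.departments d
  ON de.dept_no = d.dept_no
WHERE de.to_date = '9999-01-01'
GROUP BY d.dept_name;
EOF

hive -f serving_emp_per_dept.hql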

Scheduling


Schedule the HQL scripts in Apache Airflow to run daily after the data ingestion has completed successfully.

Sunday, 16 February 2025

Apache Superset Fundamentals: Architecture, Key Components, Installations and Connection to Apache Hive


Apache Superset is a modern, enterprise-ready business intelligence (BI) tool that enables data exploration, visualization, and dashboard creation. It is open-source, highly scalable, and designed to handle large datasets with ease. Whether you're a data engineer, analyst, or business user, Superset provides an intuitive interface to derive insights from your data. In this blog, we’ll dive into the fundamentals of Apache Superset, its architecture, and its key components including installation.

What is Apache Superset?


Apache Superset is a web-based application that allows users to create interactive dashboards, charts, and visualizations. It supports a wide range of data sources, including SQL-based databases (e.g., PostgreSQL, MySQL, BigQuery) and cloud data warehouses (e.g., Snowflake, Redshift). Superset is designed to be self-service, meaning even non-technical users can explore data and build visualizations without writing code.

One of Superset’s standout features is its ability to handle large datasets efficiently. It leverages a lightweight backend and a powerful SQL engine to query and visualize data in real-time. Additionally, Superset is highly extensible, allowing developers to customize its functionality through plugins and APIs. 

Apache Superset Architecture


Superset’s architecture is modular and designed for scalability. It consists of several key components that work together to deliver a seamless user experience. 

Let’s break down the architecture:

 

1. Web Server

The web server is the front-facing component of Superset. It serves the user interface (UI) and handles user interactions. Superset’s UI is built using modern web technologies like React and Ant Design, providing a responsive and intuitive experience. The web server also manages authentication, authorization, and session management.

2. Metadata Database

Superset uses a metadata database to store information about dashboards, charts, datasets, and user permissions. By default, Superset uses SQLite for lightweight setups, but for production environments, it’s recommended to use a more robust database like PostgreSQL or MySQL. The metadata database ensures that all configurations and user-generated content are persisted.

3. SQL Lab

SQL Lab is Superset’s SQL IDE (Integrated Development Environment). It allows advanced users to write and execute SQL queries directly against their data sources. SQL Lab is particularly useful for data exploration and ad-hoc analysis. It supports querying multiple databases simultaneously and provides features like query history, result visualization, and query sharing.

4. Data Sources

Superset connects to a wide variety of data sources through SQLAlchemy, a Python SQL toolkit. Supported databases include relational databases (e.g., MySQL, PostgreSQL), columnar databases (e.g., Apache Druid), and cloud data warehouses (e.g., Google BigQuery, Snowflake). Superset also supports REST APIs and custom connectors for more specialized use cases.

5. Visualization Layer

The visualization layer is where Superset shines. It offers a rich library of chart types, including bar charts, line charts, pie charts, geospatial visualizations, and more. Users can create custom visualizations using the Explore interface or by writing custom queries in SQL Lab. Superset also supports custom visualization plugins, allowing developers to extend its capabilities.

6. Caching Layer

To improve performance, Superset includes a caching layer that stores the results of frequently accessed queries. It integrates with caching systems like Redis or Memcached to reduce load on the database and speed up dashboard rendering. The caching layer is configurable, allowing users to set expiration times and other parameters.

7. Asynchronous Query Execution

For long-running queries, Superset supports asynchronous execution using Celery, a distributed task queue. This ensures that the web server remains responsive even when processing complex queries. Asynchronous execution is particularly useful in production environments where large datasets and multiple users are involved.

8. Security and Authentication

Superset provides robust security features, including role-based access control (RBAC), integration with OAuth providers, and support for LDAP and OpenID Connect. Administrators can define granular permissions for users and groups, ensuring that sensitive data is protected.

Key Components of Apache Superset


Let’s take a closer look at some of the key components that make Superset a powerful BI tool:

1. Dashboards

Dashboards are the primary way users interact with data in Superset. A dashboard is a collection of visualizations (charts, tables, etc.) that can be arranged and customized to tell a data-driven story. Dashboards are highly interactive, allowing users to filter, drill down, and explore data in real-time.

2. Charts

Charts are the building blocks of dashboards. Superset offers a wide variety of chart types, from simple bar charts to complex geospatial visualizations. Each chart is backed by a dataset and can be customized using the Explore interface or SQL queries.

3. Datasets

A dataset in Superset represents a table or view in a database. Users can define datasets and configure them for use in charts and dashboards. Superset also supports virtual datasets, which are created by writing custom SQL queries.

4. Explore Interface

The Explore interface is a no-code tool for creating charts and visualizations. It provides a user-friendly way to select datasets, choose chart types, and configure visualization settings. The Explore interface is designed for non-technical users, making it easy to create insightful visualizations without writing SQL.

5. SQL Lab

As mentioned earlier, SQL Lab is a powerful SQL IDE for advanced users. It supports features like query autocomplete, syntax highlighting, and query sharing. SQL Lab is ideal for data exploration and ad-hoc analysis.

6. Plugins and Extensions

Superset’s extensibility is one of its greatest strengths. Developers can create custom visualization plugins, add new database connectors, or extend Superset’s functionality using its REST API. This makes Superset highly adaptable to specific use cases and requirements. 

Why Choose Apache Superset?


Apache Superset stands out in the crowded BI landscape for several reasons:

Open Source: Superset is free to use and has a vibrant community of contributors.

Scalability: It can handle large datasets and multiple users with ease.

Flexibility: Superset supports a wide range of data sources and visualization types.

Ease of Use: Its intuitive interface makes it accessible to both technical and non-technical users.

Extensibility: Developers can customize and extend Superset to meet their needs.

Installation:


Now we are going to discuss how to install Apache Superset in standalone mode on Ubuntu.

Note: Standalone mode is for testing purposes; for a production-grade Apache Superset platform, install it on Kubernetes instead.

Important: Please make sure Python 3.8.x is installed on your system, or the installation may fail with compatibility issues.

Step 1: Run the following command to install required system packages:


sudo apt update && sudo apt install -y \
  python3 python3-pip python3-dev \
  build-essential libssl-dev libffi-dev \
  python3-venv libpq-dev libsasl2-dev \
  libmysqlclient-dev libldap2-dev libsqlite3-dev

Step 2: It's recommended to install Superset inside a virtual environment to avoid dependency issues.


python3 -m venv superset-venv
source superset-venv/bin/activate

Note: Use "python3.8 -m venv superset-venv" if you have multiple version of python installed in the system.

Step 3: Upgrade Pip & Install Dependencies


pip install --upgrade pip setuptools wheel

Step 4: Install Apache Superset


Now, install Superset using pip:

pip install apache-superset

Step 5: Initialize Superset


Run the following commands to set up Superset. When prompted, enter a username, password, and email for the admin account.

superset db upgrade 
superset fab create-admin

userid: admin
password: admin1234

Step 6: Load Example Data (Optional)


To test Superset with sample data:

superset load_examples

Step 7: Initialize Superset Metadata


superset init

Step 8: Start Superset Web Server

To start the Superset UI:

superset run -p 8089 --with-threads --reload --debugger

Navigate to http://localhost:8089 to test if Apache Superset is running.

Step 9: Possible Errors and Fixes:


If the installation fails with a JSONEncoder error, that indicates an incompatibility between Flask and Flask-WTF in your Apache Superset installation. Specifically, JSONEncoder was removed from flask.json in Flask 2.3+, and older versions of Flask-WTF still reference it. To fix this issue, downgrade Flask to version 2.2.5, which retains JSONEncoder:

source ~/superset-venv/bin/activate

pip uninstall flask

pip install flask==2.2.5

If the error is "Could not locate a Flask application" means that Superset cannot find the Flask app. This usually happens when the environment is not set up correctly.

source ~/superset-venv/bin/activate

pip install flask flask-wtf flask-appbuilder

export FLASK_APP=superset

An error may also indicate a package version conflict: Superset requires flask-wtf<1.1, but flask-wtf 1.2.1 is installed, which is incompatible. Reinstall the compatible version:

pip install "flask-wtf<1.1,>=1.0.1"

pip show flask-wtf

pip install --upgrade --force-reinstall apache-superset

pip uninstall flask-wtf -y
pip install "flask-wtf<1.1,>=1.0.1"
pip install --upgrade apache-superset

If Superset refuses to start due to an insecure default SECRET_KEY, you need to override it in the superset_config.py file.

openssl rand -base64 42

This will generate a secure random key. Example output:

Qj4u5Vf2oTp3P5zxy+w8Y6mFqLQYJ/U2PqYJ79KxYZc=

Set the SECRET_KEY in superset_config.py

nano ~/superset_config.py

SECRET_KEY = "Qj4u5Vf2oTp3P5zxy+w8Y6mFqLQYJ/U2PqYJ79KxYZc="

export SUPERSET_CONFIG_PATH=~/superset_config.py

superset run -p 8089 --with-threads --reload --debugger


If the error indicates a missing marshmallow_enum package, install it inside your Superset virtual environment:

source /home/hdoop/superset-venv/bin/activate

pip install marshmallow_enum

superset run -p 8089 --with-threads --reload --debugger

To connect Apache Superset to Apache Hive 3.1.2, follow these steps:


Step 1: Install pyhive and thrift-sasl


source superset-venv/bin/activate

pip install git+https://github.com/dropbox/PyHive.git
pip install thrift thrift-sasl sasl

Step 2: Configure Hive Connection in Superset

Log in to the Superset UI and add a database connection: go to "Data" → "Databases", click "+ Database" (or "Add Database"), and enter the connection details using the following SQLAlchemy URI format:


hive://<username>:<password>@<hive-host>:<hive-port>/<database>

Example:

hive://hdoop@localhost:10000/default
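
Before adding the connection in Superset, it can help to confirm that HiveServer2 is reachable on that host and port. A hedged check with beeline (assuming HiveServer2 on the default port 10000 and the hdoop user):

beeline -u jdbc:hive2://localhost:10000/default -n hdoop -e 'SHOW DATABASES;'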


Steps for starting Apache Superset after System Restart



source superset-venv/bin/activate
export FLASK_APP=superset
export SUPERSET_CONFIG_PATH=~/superset_config.py

#superset db upgrade
superset init
superset run -p 8089 --with-threads --reload --debugger

http://localhost:8089

Connection String from Apache Superset:

hive://<username>@<hive-host>:<hive-port>/<database>
hive://hdoop@localhost:10000/default


For more such insightful articles, please refer to my blog.

Apache Airflow Basics and Installation

In the world of data engineering and workflow automation, Apache Airflow has emerged as one of the most popular tools for orchestrating complex data pipelines. Whether you're a data engineer, data scientist, or DevOps professional, understanding the fundamentals of Apache Airflow is essential for building scalable, maintainable, and efficient workflows. In this blog, we’ll dive into the core concepts of Apache Airflow, its architecture, and how to get started with it.


What is Apache Airflow?

Apache Airflow is an open-source platform designed to programmatically author, schedule, and monitor workflows. It allows you to define workflows as code, making them more dynamic, reusable, and version-controlled. Airflow is particularly well-suited for managing ETL (Extract, Transform, Load) processes, machine learning pipelines, and other data-driven workflows.

Key features of Apache Airflow include:

  • Dynamic Pipeline Generation: Workflows are defined in Python, enabling dynamic pipeline creation.
  • Extensibility: A rich ecosystem of operators and hooks allows integration with various services like AWS, GCP, databases, and more.
  • Scalability: Airflow can scale to handle thousands of tasks across multiple workers.
  • Monitoring and Logging: Built-in UI for tracking workflow execution and debugging.

Core Concepts of Apache Airflow

To understand Airflow, it’s important to familiarize yourself with its core components:

1. DAG (Directed Acyclic Graph)

  • A DAG is the backbone of Airflow. It represents a collection of tasks with dependencies, organized in a way that ensures they run in a specific order.

  • Directed: Tasks have a clear direction (dependencies).

  • Acyclic: No loops or circular dependencies are allowed.

  • DAGs are defined in Python scripts, making them flexible and dynamic.

2. Tasks

  • A task is a unit of work within a DAG. Each task represents an action, such as running a script, querying a database, or sending an email.

  • Tasks are implemented using Operators (e.g., PythonOperator, BashOperator, SQLOperator).


3. Operators

  • Operators define what a task does. Airflow provides a wide range of built-in operators for common tasks, such as:

    • BashOperator: Executes a Bash command.

    • PythonOperator: Executes a Python function.

    • EmailOperator: Sends an email.

    • Sensor: Waits for a specific condition to be met.

  • You can also create custom operators to suit your needs.


4. Scheduler

  • The Airflow scheduler is responsible for triggering tasks based on their dependencies and schedule intervals.

  • It ensures that tasks are executed in the correct order and at the right time.


5. Executors

  • Executors determine how tasks are run. Airflow supports multiple executors, including:

    • SequentialExecutor: Runs tasks sequentially (for debugging).

    • LocalExecutor: Runs tasks in parallel on a single machine.

    • CeleryExecutor: Distributes tasks across multiple workers (for scalability).

    • KubernetesExecutor: Runs tasks in Kubernetes pods.


Airflow Architecture


Apache Airflow follows a modular architecture, consisting of the following components:

  1. Web Server: Serves the Airflow UI for monitoring and interacting with DAGs.

  2. Scheduler: Parses DAGs, schedules tasks, and triggers them based on dependencies.

  3. Metadata Database: Stores metadata about DAGs, tasks, and their execution history (e.g., PostgreSQL, MySQL).

  4. Executor: Executes tasks based on the chosen execution strategy.

  5. Workers: Perform the actual task execution (used in distributed setups like Celery or Kubernetes).




Installation Steps

Step 1: Update the system and install Python prerequisites

sudo apt update && sudo apt upgrade -y
sudo apt install -y python3-pip python3-venv libpq-dev

Step 2: Create the Airflow directory and a virtual environment

mkdir ~/airflow
cd ~/airflow
python3 -m venv airflow_env
source airflow_env/bin/activate

Step 3: Upgrade pip and build tools

pip install --upgrade pip setuptools wheel

Step 4: Set AIRFLOW_HOME and install Apache Airflow using the version constraints file

export AIRFLOW_HOME=~/airflow

pip install apache-airflow[celery,postgres]==2.7.2 --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.7.2/constraints-$(python -c 'import sys; print(".".join(map(str, sys.version_info[:2])))').txt"


Step 5: Initialize the Airflow metadata database

airflow db init

Step 6: Create an admin user

airflow users create \
    --username admin \
    --firstname Airflow \
    --lastname Admin \
    --role Admin \
    --email admin@example.com \
    --password admin123

Step 7: Start the webserver and scheduler

airflow webserver --port 8081 &

airflow scheduler &

Step 8: Access the Airflow UI

  • Open http://localhost:8081 in a browser
  • Login with admin / admin123


  • After Server Restart

    source airflow_env/bin/activate
    export AIRFLOW_HOME=~/airflow
    airflow webserver --port 8081 &
    airflow scheduler &

