In this project we will building HR Analytics to study key aspects of HR Operations. We will ingest the transactional data and reference data from MySQL database (backend of the HRMS) to our Big Data Ecosystem build using Apache Opensource Big Data Frameworks (Hadoop, Hive, HBase, Spark etc.). Then we will perform data cleansing, data Transformation, and implement the business logic to have key metrices developed in the form of graphs, charts table etc in the Apache Superset BI tool, which can be consumed by key HR personnel and executives.
Architecture
Source Data Preparation
We will first prepare our source data required for this project. Please follow the below steps.
Step 1: Create a database named employees in MySQL database. Run the below commands:
sudo mysqlCREATE DATABASE employees;USE employees;exit;
Then exit the mysql.
Step 2: Download the HR Database
Download the HR Database from this github repository. Please execute the below commands.
git clone https://github.com/datacharmer/test_db.git
cd test_db
sudo mysql < employees.sql
Step 3: Verify if the tables has been created with sample data
Run the mysql terminal and verify if the tables are created with sample data, execute the below commands:
sudo mysqlUSE employees;SHOW TABLES;SELECT * FROM employees LIMIT 100;
Data Ingestion
In this step we will ingest the data from the MySQL Database (backend of HRMS) to Hadoop platform using Sqoop.
One time Historical Data Ingestion
Please develop and execute Sqoop scripts to ingest the one time historical data from the below tables to Hive directly.
Periodic Batch Ingestion
Develop and execute Sqoop command to load only the delta data incrementally to the already created hive tables for the same MySQL tables.
Scheduling
Schedule the Sqoop ingestion job to run daily at 01:00 AM in Apache Airflow.
Visualization
Create a HR Analytics Dashboard in Apache Superset with following graphs.
- Bar Chart with number of employee per department
- Trendline of number of employee joining in each department over the year
- Trendline of number of employee leaving in each department
- Bar Chart to show average age of employees in each department
- Trendline to show male and female hiring rates over the year.
- Pie Chart to show number of employees in each age group for each department. Create age groups as mentioned below.
- 21 - 30
- 31 - 40
- 41 - 50
- 50+
- Trendline to show the salary cost of each department over the year.
- Pie Chart to show the number of employees in each title for different departments.
- Table to show the top 10 active highest paid salary employees (employee id, employee full name, age, department name, title, gender)
Serving Hive Tables
To create the above visualization create additional hive tables that will serve as input to the Apache Superset Charts. Create hql files to load the data (preferably incremental) to these serving hive tables.
Scheduling
Schedule the hql in Apache Airflow to run in a daily manner post the data ingestion run successfully.
No comments:
Post a Comment