Saturday, 1 February 2025

Spark and Hive Metastore Integration


Spark can be integrated with the Hive Metastore to provide a common metastore layer between Hive and Spark. In this blog I will detail the steps to reuse the Hive Metastore for the Spark engine.

Prerequisites:

1. Existing Hadoop installation

2. Existing Hive Installation

3. Existing Spark Installation: Steps to install Spark can be found here

Step 1: Copy the Hive Metastore RDBMS driver from the hive/lib folder to the spark/jars folder

Command: cp hive/lib/mysql-connector-java-8.0.28.jar spark/jars/

Note: This assumes the Hive Metastore is backed by a MySQL database.
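The exact connector jar name depends on the version installed with Hive, so it can help to list the Hive lib folder first to find the jar to copy:

Command: ls hive/lib | grep -i mysql-connector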




Step 2: Ensure MySQL and Hive Metastore Services are running

Command:

sudo systemctl start mysql

hive --service metastore &
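To confirm the metastore service is accepting connections, you can check that it is listening on its thrift port (assuming the default port 9083):

Command: netstat -tlnp | grep 9083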

Step 3: Edit $SPARK_HOME/conf/spark-defaults.conf (create it if missing):


Add the following line.

spark.sql.catalogImplementation=hive
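If Spark still cannot locate the metastore with this setting alone, a common approach is to copy Hive's hive-site.xml into $SPARK_HOME/conf, or (assuming the metastore runs locally on the default port) to point Spark at its thrift URI explicitly in the same file:

spark.hadoop.hive.metastore.uris=thrift://localhost:9083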


Step 4: Verify Spark-Hive Metastore Integration


Start the Spark shell: spark-shell

Then execute the line below at the Scala prompt: spark.sql("SHOW DATABASES").show()

If it lists all the Hive databases, the integration is successful.
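As a further check, you can list and describe tables from the same Scala prompt (the table name below is a placeholder; replace it with a table that exists in your metastore):

spark.sql("SHOW TABLES IN default").show()
spark.sql("DESCRIBE TABLE default.your_table").show(false)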


Step 5: Make sure the Hadoop services are up and running; if not, start them.




Command:

To verify: jps

To start:
cd /home/hdoop/hadoop/sbin
./start-all.sh
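Once the daemons are up, a quick sanity check is to list the Hive warehouse directory in HDFS (assuming the default warehouse location):

Command: hdfs dfs -ls /user/hive/warehouse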

Step 6: Run an HQL query to read data (stored in HDFS) from a table, for example as shown below.
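From the Scala prompt in spark-shell (the database and table names are placeholders; substitute a table that exists in your metastore):

spark.sql("SELECT * FROM default.your_table LIMIT 10").show()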


Step 7: Accessing Hive Databases and Tables from spark-sql

If the above configuration is working, Hive databases and tables can be accessed directly from spark-sql.

Command: spark-sql
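At the spark-sql prompt, standard HiveQL statements work as they would in the Hive CLI; for example (the table name is a placeholder):

SHOW DATABASES;
USE default;
SELECT COUNT(*) FROM your_table;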








