Saturday, 1 February 2025

Spark and Hive Metastore Integration


Spark can be integrated with the Hive Metastore to provide a common metastore layer between Hive and Spark. In this blog I will detail the steps to reuse the Hive Metastore for the Spark engine.

Prerequisites:

1. Existing Hadoop installation

2. Existing Hive Installation

3. Existing Spark Installation: Steps to install Spark can be found here

Step 1: Copy the Hive Metastore RDBMS driver from the hive/lib folder to the spark/jars folder

Command: cp hive/lib/mysql-connector-java-8.0.28.jar spark/jars/

Note: This assumes the Hive Metastore is backed by a MySQL database.
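The exact connector jar name depends on the version installed with Hive, so it can help to list the Hive lib folder first to find the jar to copy:

Command: ls hive/lib | grep -i mysql-connector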




Step 2: Ensure MySQL and Hive Metastore Services are running

Command:

sudo systemctl start mysql

hive --service metastore &
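To confirm the metastore service is accepting connections, you can check that it is listening on its thrift port (assuming the default port 9083):

Command: netstat -tlnp | grep 9083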

Step 3: Edit $SPARK_HOME/conf/spark-defaults.conf (create it if missing):


Add the following line.

spark.sql.catalogImplementation=hive
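If Spark still cannot locate the metastore with this setting alone, a common approach is to copy Hive's hive-site.xml into $SPARK_HOME/conf, or (assuming the metastore runs locally on the default port) to point Spark at its thrift URI explicitly in the same file:

spark.hadoop.hive.metastore.uris=thrift://localhost:9083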


Step 4: Verify Spark-Hive Metastore Integration


Start the Spark shell: spark-shell

Then execute the line below at the Scala prompt: spark.sql("SHOW DATABASES").show()

If it lists all the Hive databases, the integration is successful.
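As a further check, you can list and describe tables from the same Scala prompt (the table name below is a placeholder; replace it with a table that exists in your metastore):

spark.sql("SHOW TABLES IN default").show()
spark.sql("DESCRIBE TABLE default.your_table").show(false)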


Step 5: Make sure the Hadoop services are up and running; if not, start them.




Command:

To verify: jps

To start:
cd /home/hdoop/hadoop/sbin
./start-all.sh
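Once the daemons are up, a quick sanity check is to list the Hive warehouse directory in HDFS (assuming the default warehouse location):

Command: hdfs dfs -ls /user/hive/warehouse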

Step 6: Run an HQL query to read data (stored in HDFS) from a table, for example as shown below.
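From the Scala prompt in spark-shell (the database and table names are placeholders; substitute a table that exists in your metastore):

spark.sql("SELECT * FROM default.your_table LIMIT 10").show()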


Step 7: Accessing Hive Databases and Tables from spark-sql

If the above configuration is working, Hive databases and tables can be accessed directly from spark-sql.

Command: spark-sql
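At the spark-sql prompt, standard HiveQL statements work as they would in the Hive CLI; for example (the table name is a placeholder):

SHOW DATABASES;
USE default;
SELECT COUNT(*) FROM your_table;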








