Tuesday, 28 January 2025

Using Spark as the Hive Execution Engine

 


In this blog post I will discuss how to configure Spark as the execution engine for Hive.

Step 0: Prerequisite


  • An existing Java, Hadoop, and Hive installation
  • Find the Spark version that can act as an execution engine for your Hive release. The Hive and Spark compatibility matrix in the Hive on Spark documentation lists the supported pairs.


I have Hive 3.1.2 in my VM, so I will download and configure Spark 2.4.8 as the execution engine, since Hive 3.1.2 is compatible with Spark 2.4.8.
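
If you are not sure which versions are installed, you can check them quickly (this assumes java, hadoop, and hive are already on your PATH):

java -version

hadoop version

hive --version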

Step 1: Configure Environment Variables

Make sure the following environment variables are configured in your .bashrc file:

#JAVA Related Options

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

export PATH=$PATH:$JAVA_HOME/bin

#Hadoop Related Options

export HADOOP_HOME=/home/hdoop/hadoop

export HADOOP_INSTALL=$HADOOP_HOME

export HADOOP_MAPRED_HOME=$HADOOP_HOME

export HADOOP_COMMON_HOME=$HADOOP_HOME

export HADOOP_HDFS_HOME=$HADOOP_HOME

export YARN_HOME=$HADOOP_HOME

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

#HIVE Related Options

export HIVE_HOME=/home/hdoop/hive

export PATH=$PATH:$HIVE_HOME/bin

Refresh the profile with the command: source ~/.bashrc
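
To confirm the profile was reloaded correctly, you can echo a couple of the variables; this is just an optional sanity check:

echo $JAVA_HOME

echo $HADOOP_HOME

echo $HIVE_HOME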

Step 2: Download Spark 2.4.8


Download the Spark 2.4.8 "without Hadoop" tarball from the Spark Archive. Copy the link address of the .tgz file, download it to your current directory using the wget command, and then extract it.

wget https://archive.apache.org/dist/spark/spark-2.4.8/spark-2.4.8-bin-without-hadoop.tgz

tar xvf spark-*.tgz
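
After extraction you should see the usual Spark layout (bin, conf, jars, and so on). Assuming you ran wget from /home/hdoop, which is the path used throughout this post, a quick listing confirms it:

ls /home/hdoop/spark-2.4.8-bin-without-hadoop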




Step 3: Add the Spark Dependency to Hive

Create symbolic links in $HIVE_HOME/lib to the following JARs (each pointing to the respective JAR in spark-2.4.8-bin-without-hadoop/jars). Execute the commands below to create the links.

cd $HIVE_HOME/lib

ln -s /home/hdoop/spark-2.4.8-bin-without-hadoop/jars/scala-library*.jar

ln -s /home/hdoop/spark-2.4.8-bin-without-hadoop/jars/spark-core*.jar

ln -s /home/hdoop/spark-2.4.8-bin-without-hadoop/jars/spark-network-common*.jar

ln -s /home/hdoop/spark-2.4.8-bin-without-hadoop/jars/spark-network-shuffle*.jar

ln -s /home/hdoop/spark-2.4.8-bin-without-hadoop/jars/jersey-server*.jar

ln -s /home/hdoop/spark-2.4.8-bin-without-hadoop/jars/jersey-container-servlet-core*.jar

ln -s /home/hdoop/spark-2.4.8-bin-without-hadoop/jars/jackson-module*.jar

ln -s /home/hdoop/spark-2.4.8-bin-without-hadoop/jars/chill*.jar

ln -s /home/hdoop/spark-2.4.8-bin-without-hadoop/jars/json4s-ast*.jar

ln -s /home/hdoop/spark-2.4.8-bin-without-hadoop/jars/kryo-shaded*.jar

ln -s /home/hdoop/spark-2.4.8-bin-without-hadoop/jars/minlog*.jar

ln -s /home/hdoop/spark-2.4.8-bin-without-hadoop/jars/scala-xml*.jar

ln -s /home/hdoop/spark-2.4.8-bin-without-hadoop/jars/spark-launcher*.jar

ln -s /home/hdoop/spark-2.4.8-bin-without-hadoop/jars/spark-unsafe*.jar

ln -s /home/hdoop/spark-2.4.8-bin-without-hadoop/jars/xbean-asm*-shaded*.jar
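
To verify that none of the links are dangling (for example, because a JAR name did not match the pattern), list them and let find report any broken symlinks; with -L, find flags a link it cannot follow:

ls -l $HIVE_HOME/lib | grep spark-2.4.8

find -L $HIVE_HOME/lib -maxdepth 1 -type l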

Step 4: Configure Spark to Access the Hadoop Classpath

Edit spark-env.sh (create it from spark-env.sh.template if it does not exist) and add the following configuration.

nano /home/hdoop/spark-2.4.8-bin-without-hadoop/conf/spark-env.sh

#Add the below lines:

export SPARK_DIST_CLASSPATH=$(/home/hdoop/hadoop/bin/hadoop classpath)

#Spark related options

export SPARK_HOME=/home/hdoop/spark-2.4.8-bin-without-hadoop

export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

export PYSPARK_PYTHON=/usr/bin/python3
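
With SPARK_DIST_CLASSPATH set, the "without Hadoop" build can locate the Hadoop and logging classes it needs. An optional sanity check is to print the Spark version; if the classpath is not wired up, this command fails with missing class errors instead:

/home/hdoop/spark-2.4.8-bin-without-hadoop/bin/spark-submit --version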

Step 5: Configure Hive to Access Spark Jars

Edit the hive-env.sh file and make sure the following entries exist.

nano /home/hdoop/hive/conf/hive-env.sh

#Add the below lines

export HADOOP_HOME=/home/hdoop/hadoop

# Hive Configuration Directory can be controlled by:

export HIVE_CONF_DIR=$HIVE_HOME/conf

export SPARK_HOME=/home/hdoop/spark-2.4.8-bin-without-hadoop

export SPARK_JARS="" 

for jar in `ls $SPARK_HOME/jars`; do 

        export SPARK_JARS=$SPARK_JARS:$SPARK_HOME/jars/$jar 

done 

export HIVE_AUX_JARS_PATH=$SPARK_JARS
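
The loop above simply builds a colon-separated list of every JAR under $SPARK_HOME/jars and hands it to Hive through HIVE_AUX_JARS_PATH. If you want to preview what Hive will pick up, you can reproduce the loop in a throwaway shell session:

SPARK_HOME=/home/hdoop/spark-2.4.8-bin-without-hadoop

for jar in `ls $SPARK_HOME/jars`; do SPARK_JARS=$SPARK_JARS:$SPARK_HOME/jars/$jar; done

echo $SPARK_JARS | tr ':' '\n' | head -5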

Step 6: Configure Hive to use Spark Engine in YARN mode

Add the following entries to hive-site.xml

nano /home/hdoop/hive/conf/hive-site.xml

# Add the following lines

<property>

    <name>hive.execution.engine</name>

    <value>spark</value>

</property>

<property>

    <name>spark.master</name>

    <value>yarn</value>

</property>

<property>

    <name>spark.eventLog.enabled</name>

    <value>true</value>

</property>

<property>

    <name>spark.eventLog.dir</name>

    <value>/tmp</value>

</property>

<property>

    <name>spark.driver.memory</name>

    <value>2g</value>

</property>

<property>

    <name>spark.executor.memory</name>

    <value>2g</value>

</property>

<property>

    <name>spark.serializer</name>

    <value>org.apache.spark.serializer.KryoSerializer</value>

</property>

<property>

    <name>spark.yarn.jars</name>

    <value>hdfs://127.0.0.1:9000/spark/jars/*</value>

    <!-- <value>hdfs:///spark/jars/*.jar</value> -->

</property>

<property>

    <name>spark.submit.deployMode</name>

    <value>client</value>

    <!-- <value>cluster</value> -->

</property>

<!--

<property>

    <name>spark.yarn.queue</name>

    <value>default</value>

</property>

-->

<property>

  <name>hive.spark.job.monitor.timeout</name>

  <value>600</value>

</property>

<property>

  <name>hive.server2.enable.doAs</name>

  <value>true</value>

</property>
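
After saving the file, you can confirm that Hive picks up the new settings without launching a Spark job, since set only echoes the configured value:

hive -e "set hive.execution.engine; set spark.master; set spark.yarn.jars;"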

Step 7: Copy the Spark JARs to hdfs:///spark/jars/

Copy all the JARs in /home/hdoop/spark-2.4.8-bin-without-hadoop/jars to hdfs:///spark/jars/ (an HDFS path). As configured in the previous step, "spark.yarn.jars" in hive-site.xml points to hdfs:///spark/jars/, so YARN will look for the Spark JARs in this HDFS path.

hdfs dfs -mkdir -p /spark/jars/

hdfs dfs -put /home/hdoop/spark-2.4.8-bin-without-hadoop/jars/* /spark/jars/
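
The upload takes a little while; once it finishes, confirm that the JARs actually landed in HDFS:

hdfs dfs -count /spark/jars/

hdfs dfs -ls /spark/jars/ | head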

Step 8: Restart Hadoop and Hive Services


After the above changes, restart all the Hadoop and Hive services with the commands below.

cd /home/hdoop/hadoop/sbin

./stop-all.sh

./start-all.sh

jps

hive
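
Before you run hive, jps should list the usual Hadoop daemons; the exact set depends on your configuration, but on a typical single-node setup it looks roughly like this (PIDs will differ):

# NameNode
# DataNode
# SecondaryNameNode
# ResourceManager
# NodeManager
# Jps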

Step 9: Run an HQL Query that Triggers the Execution Engine (Spark)


Run an HQL query that will trigger the execution engine, for example: select count(*) from employees;
If all the configurations are correct, the query should run as a Hive on Spark job and return the row count.
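
If you do not already have a table to query, a tiny test table is enough to trigger the engine; the employees table below is just a placeholder created for this test:

-- run these inside the hive CLI
CREATE TABLE IF NOT EXISTS employees (id INT, name STRING);
INSERT INTO employees VALUES (1, 'spark'), (2, 'hive');
SELECT COUNT(*) FROM employees;

Both the INSERT and the COUNT(*) query launch jobs on the configured engine, so you should see Spark stages reported in the console output.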



You should also notice a Hive on Spark application in the YARN UI.





