Step 1: Install Java
Install Java 8, as we will install Spark on the same VM where Hadoop 3.x and Hive 3.x are already installed. Java 11 would also work, but Java 8 is a safe bet for building out the Hadoop 3.x ecosystem.
Update the system with the below command
sudo apt update
Install OpenJDK 8 with the below command:
sudo apt install openjdk-8-jdk -y
Check the installation with the following command:
java -version; javac -version
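If more than one Java version is present on the VM, you can optionally pin Java 8 as the default. A minimal sketch using update-alternatives (an optional extra step, not required if Java 8 is the only JDK installed):
sudo update-alternatives --config java
sudo update-alternatives --config javac
Select the entry pointing to java-8-openjdk in each prompt.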
Step 2: Install Scala
Install Scala with the below command:
sudo apt install scala -y
Check the installation with the following command:
scala -version
Step 3: Download Spark
Navigate to the Spark download page and copy the download link, as shown in the picture below.
wget https://www.apache.org/dyn/closer.lua/spark/spark-3.4.4/spark-3.4.4-bin-hadoop3.tgz
The URL following wget is the one copied in the previous step. If the download is successful, you should see the Spark binary archive in your current folder, as shown in the picture.
Step 4: Extract the Spark archive
Extract the Spark binary archive using the following tar command:
tar xvf spark-*.tgz
After extraction, you should see a folder as shown in the picture below:
Step 5: Create a symlink
Create a symlink to the spark-3.4.4-bin-hadoop3 folder for easier configuration using the below command:
ln -s /home/hdoop/spark-3.4.4-bin-hadoop3/ /home/hdoop/spark
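You can optionally confirm that the symlink points to the extracted folder (a quick sanity check, assuming the paths used above):
ls -l /home/hdoop/spark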
Step 6: Set the environment variables
Add the following lines to the .bashrc file.
export SPARK_HOME=/home/hdoop/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_PYTHON=/usr/bin/python3
(Check that you have python3 on your system and modify the path if necessary.)
Open the .bashrc file using the following command and add the above lines:
nano .bashrc
(Make sure you are in the user's home folder, in this case /home/hdoop.)
Save the .bashrc file using CTRL+X, then confirm with Y and Enter.
Then load the updated profile using the below command:
source ~/.bashrc
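To confirm the variables are in effect, you can optionally run a few quick checks (assuming the paths configured above):
echo $SPARK_HOME
which spark-shell
echo $PYSPARK_PYTHON
The first should print /home/hdoop/spark and the second should resolve to a script under the Spark bin folder.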
Step 7: Start the Spark Master Server (standalone cluster, no YARN)
Start the Spark master server using the below command:
start-master.sh
After this, you can view the Spark Web UI at:
http://127.0.0.1:8080/
If all goes well you should see a web page similar to the one below:
Make a note of the host and port of the master server (highlighted in the yellow box); the standalone master URL typically looks like spark://<host>:7077. Also notice that there are no workers running yet (see the green box).
Step 8: Start a worker process (standalone cluster, no YARN)
Use the following command format to start a worker server in a single-server setup:
start-worker.sh spark://master_server:port
If you want to allocate a specific number of CPU cores and amount of memory to the worker, use this form instead:
start-worker.sh -c 1 -m 512M spark://master_server:port
Note: Replace master_server and port with the values captured in the previous step.
Refresh the Spark Master's Web UI to see the new worker on the list.
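Optionally, you can also attach a shell to the standalone cluster explicitly by passing the same master URL (this uses the standard --master option of spark-shell with the placeholder values from above):
spark-shell --master spark://master_server:port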
Step 9: Test Spark Shell
To run the Spark shell (which uses Scala by default), use the below command:
spark-shell
Upon successful execution, you should see a screen like this, with the Spark version and a Scala prompt:
Step 10: Run a simple Spark application
Paste the following Scala code into the spark-shell:
// Import implicits for easier syntax
import org.apache.spark.sql.DataFrame
import spark.implicits._
// Create a sequence of data (rows) as case class or tuples
val data = Seq(
(1, "Alice", 28),
(2, "Bob", 25),
(3, "Catherine", 30)
)
// Create a DataFrame from the sequence with column names
val df: DataFrame = data.toDF("ID", "Name", "Age")
// Show the contents of the DataFrame
df.show()
Note: You can use :paste followed by CTRL+D to input multi-line code in the spark-shell.
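As an optional follow-up (not part of the original listing), you can run a simple transformation on the same DataFrame to confirm that actions execute correctly:
// Filter rows where Age is greater than 26 and display the result
df.filter($"Age" > 26).show()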
Step 11: Exit the spark-shell
Enter the below command to exit from the Spark (Scala) shell:
:q
Step 12: Test PySpark
Enter the following command to start PySpark
pyspark
Then run the following Python code in the PySpark shell:
df = spark.createDataFrame(
    [
        ("sue", 32),
        ("li", 3),
        ("bob", 75),
        ("heo", 13),
    ],
    ["first_name", "age"],
)
df.show()
Step 13: Exit the PySpark shell
Use quit() to exit the PySpark shell.
Step 14: Stopping Spark
Stop Master Server: stop-master.sh
Stop Worker Process: stop-worker.sh
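Optionally, run jps afterwards to confirm that the Master and Worker processes are no longer listed in the output:
jps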
Step 15: Configure Spark to use YARN
Note: Please make sure you are out of standalone cluster mode by executing Step 14 first.
Start the Hadoop services, if they are not already running, using the below commands:
cd /home/hdoop/hadoop/sbin
./start-all.sh
jps
Open the .bashrc file and add the environment variable HADOOP_CONF_DIR
nano .bashrc (at /home/hdoop/)
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
Save and exit (CTRL+X, then Y, then Enter)
Reload the profile:
source ~/.bashrc
Start the Spark Shell in YARN Mode now:
spark-shell --master yarn (scala)
pyspark --master yarn (python)
Go to the YARN UI (http://localhost:8088) and verify that the Spark shell appears as a running application.
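Optionally, you can also submit a batch job to YARN with spark-submit. A minimal sketch using the bundled SparkPi example (the examples jar ships with the Spark binary distribution; the exact jar name under $SPARK_HOME/examples/jars may differ for your build, so adjust it if necessary):
spark-submit --master yarn --deploy-mode client --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.4.4.jar 10
The job should also appear in the YARN UI while it runs.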
Step 16: Configure Spark to start in YARN mode by default
Stop all Spark applications (spark-shell, spark-sql, pyspark, and spark-submit) if any are running.
Open (or create, if it does not exist) the spark-defaults.conf file: nano spark/conf/spark-defaults.conf
Add the below configuration to the file:
spark.master yarn
spark.submit.deployMode client
(The deploy mode can be set to cluster as well.)
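If your VM is small, you can optionally cap the resources Spark requests from YARN in the same file. These are standard Spark properties, but the values below are only illustrative, so tune them to your VM:
spark.driver.memory 1g
spark.executor.memory 1g
spark.executor.instances 1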
Start the spark-shell by running the command spark-shell (without --master yarn). Navigate to the YARN UI (http://localhost:8088) and check that spark-shell is listed, as shown in the picture below.