Step 1: Install Java
Install Java 8, as we will install Spark on the same VM where Hadoop 3.x and Hive 3.x are already installed. Java 11 would also work, but Java 8 is a safe bet for building out the Hadoop 3.x ecosystem.
Update the system with the below command
sudo apt update
Install OpenJDK 8 with the below command:
sudo apt install openjdk-8-jdk -y
Check the installation with the following command:
java -version; javac -version
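If more than one Java version is present on the VM, you can optionally pin Java 8 as the default. A minimal sketch using update-alternatives (an optional extra step, not required if Java 8 is the only JDK installed):
sudo update-alternatives --config java
sudo update-alternatives --config javac
Select the entry pointing to java-8-openjdk in each prompt.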
Step 2: Install Scala
Install Scala with the below command:
sudo apt install scala -y
Check the installation with the following command:
scala -version
Step 3: Download Spark
Navigate to the Spark download page and copy the download link, as shown in the picture below.
wget https://www.apache.org/dyn/closer.lua/spark/spark-3.4.4/spark-3.4.4-bin-hadoop3.tgz
The URL following wget is the one copied in the previous step. If the download is successful, you should see the Spark binary archive in your current folder, as shown in the picture.
Step 4: Extract the Spark archive
Extract the Spark binary archive using the following tar command:
tar xvf spark-*.tgz
After extraction, you should see a folder as shown in the picture below:
Step 5: Create a symlink
Create a symlink to the spark-3.4.4-bin-hadoop3 folder for easier configuration using the below command:
ln -s /home/hdoop/spark-3.4.4-bin-hadoop3/ /home/hdoop/spark
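You can optionally confirm that the symlink points to the extracted folder (a quick sanity check, assuming the paths used above):
ls -l /home/hdoop/spark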
Step 6: Set the environment variables
Add the following lines to the .bashrc file.
export SPARK_HOME=/home/hdoop/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_PYTHON=/usr/bin/python3
(Check that you have python3 on your system and modify the path if necessary.)
Open the .bashrc file using the following command and add the above lines:
nano .bashrc
(Make sure you are in the user's home folder, in this case /home/hdoop.)
Save the .bashrc file using CTRL+X, then confirm with Y and Enter.
Then load the updated profile using the below command:
source ~/.bashrc
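To confirm the variables are in effect, you can optionally run a few quick checks (assuming the paths configured above):
echo $SPARK_HOME
which spark-shell
echo $PYSPARK_PYTHON
The first should print /home/hdoop/spark and the second should resolve to a script under the Spark bin folder.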
Step 7: Start the Spark Master Server (standalone cluster, no YARN)
Start the Spark master server using the below command:
start-master.sh
After this, you can view the Spark Web UI at:
http://127.0.0.1:8080/
If all goes well you should see a web page similar to the one below:
Make a note of the host and port of the master server (highlighted in the yellow box); the standalone master URL typically looks like spark://<host>:7077. Also notice that there are no workers running yet (see the green box).
Step 8: Start a worker process (standalone cluster, no YARN)
Use the following command format to start a worker server in a single-server setup:
start-worker.sh spark://master_server:port
If you want to allocate a specific number of CPU cores and amount of memory to the worker, use this form instead:
start-worker.sh -c 1 -m 512M spark://master_server:port
Note: Replace master_server and port with the values captured in the previous step.
Refresh the Spark Master's Web UI to see the new worker on the list.
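Optionally, you can also attach a shell to the standalone cluster explicitly by passing the same master URL (this uses the standard --master option of spark-shell with the placeholder values from above):
spark-shell --master spark://master_server:port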
Step 9: Test Spark Shell
To run the Spark shell (which uses Scala by default), use the below command:
spark-shell
Upon successful execution, you should see a screen like this, with the Spark version and a Scala prompt:
Step 10: Run a simple Spark application
Paste the following Scala code into the spark-shell:
// Import implicits for easier syntax
import org.apache.spark.sql.DataFrame
import spark.implicits._
// Create a sequence of data (rows) as case class or tuples
val data = Seq(
(1, "Alice", 28),
(2, "Bob", 25),
(3, "Catherine", 30)
)
// Create a DataFrame from the sequence with column names
val df: DataFrame = data.toDF("ID", "Name", "Age")
// Show the contents of the DataFrame
df.show()
Note: You can use :paste followed by CTRL+D to input multi-line code in the spark-shell.
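As an optional follow-up (not part of the original listing), you can run a simple transformation on the same DataFrame to confirm that actions execute correctly:
// Filter rows where Age is greater than 26 and display the result
df.filter($"Age" > 26).show()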
Step 11: Exit the spark-shell
Enter the below command to exit from the Spark (Scala) shell:
:q
Step 12: Test PySpark
Enter the following command to start PySpark
pyspark
Then run the following Python code in the PySpark shell:
df = spark.createDataFrame(
    [
        ("sue", 32),
        ("li", 3),
        ("bob", 75),
        ("heo", 13),
    ],
    ["first_name", "age"],
)
df.show()
Step 13: Exit the PySpark shell
Use quit() to exit the PySpark shell.
Step 14: Stopping Spark
Stop Master Server: stop-master.sh
Stop Worker Process: stop-worker.sh
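Optionally, run jps afterwards to confirm that the Master and Worker processes are no longer listed in the output:
jps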
Step 15: Configure Spark to use YARN
Note: Please make sure you are out of standalone cluster mode by executing Step 14 first.
Start the Hadoop services, if they are not already running, using the below commands:
cd /home/hdoop/hadoop/sbin
./start-all.sh
jps
Open the .bashrc file and add the environment variable HADOOP_CONF_DIR
nano .bashrc (at /home/hdoop/)
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
Save and exit (CTRL+X, then Y, then Enter)
Reload the profile:
source ~/.bashrc
Start the Spark Shell in YARN Mode now:
spark-shell --master yarn (scala)
pyspark --master yarn (python)
Go to the YARN UI (http://localhost:8088) and verify that the Spark shell appears as a running application.
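Optionally, you can also submit a batch job to YARN with spark-submit. A minimal sketch using the bundled SparkPi example (the examples jar ships with the Spark binary distribution; the exact jar name under $SPARK_HOME/examples/jars may differ for your build, so adjust it if necessary):
spark-submit --master yarn --deploy-mode client --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.4.4.jar 10
The job should also appear in the YARN UI while it runs.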
Step 16: Configure Spark to start in YARN mode by default
Stop all Spark applications (spark-shell, spark-sql, pyspark, and spark-submit) if any are running.
Open (or create, if it does not exist) the spark-defaults.conf file: nano spark/conf/spark-defaults.conf
Add the below configuration to the file:
spark.master yarn
spark.submit.deployMode client
(The deploy mode can be set to cluster as well.)
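If your VM is small, you can optionally cap the resources Spark requests from YARN in the same file. These are standard Spark properties, but the values below are only illustrative, so tune them to your VM:
spark.driver.memory 1g
spark.executor.memory 1g
spark.executor.instances 1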
Start the spark-shell by running the command spark-shell (without --master yarn). Navigate to the YARN UI (http://localhost:8088) and check that spark-shell is listed, as shown in the picture below.