Sunday, 9 February 2025

Installing Hadoop on Ubuntu

Steps for Java 8 Installation

..............................

Step 1: sudo apt update

Step 2: sudo apt install openjdk-8-jdk -y

Step 3: java -version; javac -version (verify the Java installation)



Steps for Hadoop 3.3.0 Installation

......................................

Step 1: (Install OpenSSH server and client): sudo apt install openssh-server openssh-client -y

Step 2: Create an admin user for the Hadoop ecosystem: sudo adduser hdoop

Step 3: su - hdoop (always repeat this step when starting hadoop)

Step 4: (Enable Passwordless SSH for Hadoop User) ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

Step 5: (Store the public key as authorized_keys in the ssh directory) cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Step 6: (Set the file permissions) chmod 0600 ~/.ssh/authorized_keys

Step 7: SSH to localhost as the new user: ssh localhost (always repeat this step when starting Hadoop)

Step 8: (download the tar): wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz

Step 9: (untar): tar xzf hadoop-3.3.0.tar.gz

Step 10: (create symlink): ln -s /home/hdoop/hadoop-3.3.0/ /home/hdoop/hadoop
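
Optionally verify that the symlink resolves to the extracted directory:


ls -l /home/hdoop/hadoop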

Step 11: (Set the following environment variables): nano .bashrc


#JAVA Related Options

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64


#Hadoop Related Options

export HADOOP_HOME=/home/hdoop/hadoop

export HADOOP_INSTALL=$HADOOP_HOME

export HADOOP_MAPRED_HOME=$HADOOP_HOME

export HADOOP_COMMON_HOME=$HADOOP_HOME

export HADOOP_HDFS_HOME=$HADOOP_HOME

export YARN_HOME=$HADOOP_HOME

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar


#HIVE Related Options

export HIVE_HOME=/home/hdoop/hive

export PATH=$PATH:$HIVE_HOME/bin


Step 12: (Source the .bashrc): source ~/.bashrc

Step 13: (configure the Hadoop environment): nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh (uncomment JAVA_HOME and configure as follows)


export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64


Step 14: (Configure hadoop core): nano $HADOOP_HOME/etc/hadoop/core-site.xml (add the below properties)


<property>

  <name>hadoop.tmp.dir</name>

  <value>/home/hdoop/tmpdata</value>

</property>

<property>

  <name>fs.default.name</name>

  <value>hdfs://127.0.0.1:9000</value>

</property>


Step 15: (Create the /home/hdoop/tmpdata folder): mkdir /home/hdoop/tmpdata

Step 16: (edit the hdfs-site.xml file): nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml (add the below entries)


<property>

  <name>dfs.namenode.name.dir</name>

  <value>/home/hdoop/dfsdata/namenode</value>

</property>

<property>

  <name>dfs.datanode.data.dir</name>

  <value>/home/hdoop/dfsdata/datanode</value>

</property>

<property>

  <name>dfs.replication</name>

  <value>1</value>

</property>
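
The namenode and datanode directories referenced above do not exist yet. Hadoop normally creates them during formatting/startup, but it does no harm to create them up front as a precaution:


mkdir -p /home/hdoop/dfsdata/namenode

mkdir -p /home/hdoop/dfsdata/datanode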


Step 17: (Configure mapred site): nano $HADOOP_HOME/etc/hadoop/mapred-site.xml (add the below property)


<property>

  <name>mapreduce.framework.name</name>

  <value>yarn</value>

</property>


Step 18: (Configure yarn): nano $HADOOP_HOME/etc/hadoop/yarn-site.xml (add the below properties)


<property>

  <name>yarn.nodemanager.aux-services</name>

  <value>mapreduce_shuffle</value>

</property>

<property>

  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>

  <value>org.apache.hadoop.mapred.ShuffleHandler</value>

</property>

<property>

  <name>yarn.resourcemanager.hostname</name>

  <value>127.0.0.1</value>

</property>

<property>

  <name>yarn.acl.enable</name>

  <value>0</value>

</property>

<property>

  <name>yarn.nodemanager.env-whitelist</name>   

  <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>

</property>


Step 19: (Format Name Node): hdfs namenode -format

Step 20: Start the Hadoop daemons and play around with the hdfs dfs commands (file already provided); a sketch is shown below
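
A minimal sketch of Step 20, assuming the daemons are not running yet and using an illustrative local file named sample.txt:


start-dfs.sh

start-yarn.sh

jps (should list NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager)

hdfs dfs -mkdir -p /user/hdoop

hdfs dfs -put sample.txt /user/hdoop/

hdfs dfs -ls /user/hdoop

hdfs dfs -cat /user/hdoop/sample.txt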


Steps for running WordCount MapReduce Programs

.................................................

Step 1: create the following path to store your MapReduce application (optional): mkdir -p /home/hdoop/application/mapreduce_app/wordcount_app/

Step 2: create the mapreduce java application file: nano WordCount.java


import java.io.IOException;

import java.util.StringTokenizer;


import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;

import org.apache.hadoop.mapreduce.Mapper;

import org.apache.hadoop.mapreduce.Reducer;

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;


public class WordCount {


  public static class TokenizerMapper

       extends Mapper<Object, Text, Text, IntWritable>{


    private final static IntWritable one = new IntWritable(1);

    private Text word = new Text();


    public void map(Object key, Text value, Context context

                    ) throws IOException, InterruptedException {

      StringTokenizer itr = new StringTokenizer(value.toString());

      while (itr.hasMoreTokens()) {

        word.set(itr.nextToken());

        context.write(word, one);

      }

    }

  }


  public static class IntSumReducer

       extends Reducer<Text,IntWritable,Text,IntWritable> {

    private IntWritable result = new IntWritable();


    public void reduce(Text key, Iterable<IntWritable> values,

                       Context context

                       ) throws IOException, InterruptedException {

      int sum = 0;

      for (IntWritable val : values) {

        sum += val.get();

      }

      result.set(sum);

      context.write(key, result);

    }

  }


  public static void main(String[] args) throws Exception {

    Configuration conf = new Configuration();

    Job job = Job.getInstance(conf, "word count");

    job.setJarByClass(WordCount.class);

    job.setMapperClass(TokenizerMapper.class);

    job.setCombinerClass(IntSumReducer.class);

    job.setReducerClass(IntSumReducer.class);

    job.setOutputKeyClass(Text.class);

    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));

    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);

  }

}


Step 3: compile the java file: hadoop com.sun.tools.javac.Main WordCount.java

Step 4: create the jar file: jar cf wc.jar WordCount*.class

Step 5: submit the mapreduce program to the cluster: hadoop jar wc.jar WordCount /input/test/file1.txt /output/test/

(create the input HDFS path and load file1.txt, containing some text, into it using the hdfs dfs -put command)

(don't create the output path; the job creates it itself and fails if it already exists)
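
For example, assuming file1.txt is a plain-text file in the current local directory:


hdfs dfs -mkdir -p /input/test

hdfs dfs -put file1.txt /input/test/

hadoop jar wc.jar WordCount /input/test/file1.txt /output/test/

hdfs dfs -cat /output/test/part-r-00000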


Hive 3.1.2 Installation Steps

......................................

Step 1: Download the tar to /home/hdoop/: wget https://archive.apache.org/dist/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz

Step 2: Untar the tar file: tar xzf apache-hive-3.1.2-bin.tar.gz

Step 3: Create a symlink: ln -s /home/hdoop/apache-hive-3.1.2-bin /home/hdoop/hive

Step 4: edit .bashrc to add the following environment variables (skip this if you already added them in Step 11 of the Hadoop installation):


export HIVE_HOME=/home/hdoop/hive

export PATH=$PATH:$HIVE_HOME/bin/
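
After editing, reload the shell configuration so that HIVE_HOME and the updated PATH take effect:


source ~/.bashrc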


Step 5: nano $HADOOP_HOME/etc/hadoop/mapred-site.xml: Add the following Configuration


<property>

  <name>mapreduce.map.memory.mb</name>

  <value>4096</value>

</property>

<property>

  <name>mapreduce.reduce.memory.mb</name>

  <value>4096</value>

</property>


Step 6: Restart Hadoop


cd /home/hdoop/hadoop/sbin

./stop-all.sh

./start-all.sh
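
To confirm that all the daemons came back up, jps should list NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager:


jps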


Step 7: Create temp and warehouse folder in HDFS and assign permissions


hdfs dfs -mkdir -p /user/hive/warehouse

hdfs dfs -mkdir -p /tmp/hive


hdfs dfs -chmod 777 /tmp

hdfs dfs -chmod 777 /tmp/hive


hdfs dfs -chmod 777 /user/hive/warehouse


Step 8: Delete the obsolete log4j-slf4j binding (it conflicts with the SLF4J binding Hadoop already provides)


ls /home/hdoop/hive/lib/log4j-slf4j-impl-*.jar

rm /home/hdoop/hive/lib/log4j-slf4j-impl-2.10.0.jar


Step 9: Replace the Guava jar (Hive 3.1.2 bundles an older Guava that is incompatible with Hadoop 3.3.0)


ls /home/hdoop/hadoop/share/hadoop/common/lib/guava*.jar

ls /home/hdoop/hive/lib/guava*.jar

cp /home/hdoop/hadoop/share/hadoop/common/lib/guava*.jar /home/hdoop/hive/lib/

rm /home/hdoop/hive/lib/guava-19.0.jar


Step 10: Initialize the Derby database

/home/hdoop/hive/bin/schematool -initSchema -dbType derby


A metastore_db directory should be created in the current working directory upon successful execution.
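
For example, if schematool was run from /home/hdoop, the Derby files can be checked with:


ls -d /home/hdoop/metastore_db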


Step 11: Configure yarn-site.xml: nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

(Add the below properties at the end)


<property>

  <name>yarn.nodemanager.vmem-check-enabled</name>

  <value>false</value>

</property>

<property>

  <name>yarn.nodemanager.vmem-pmem-ratio</name>

  <value>4</value>

</property>


Step 12: Start Hive: hive


Create some tables and check that HQL queries trigger MapReduce jobs; see the sketch below
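
A minimal smoke test run from the shell with hive -e (the table name wordcount_demo is just an illustration):


hive -e "CREATE TABLE wordcount_demo (word STRING, cnt INT);"

hive -e "INSERT INTO wordcount_demo VALUES ('hadoop', 1);"

hive -e "SELECT COUNT(*) FROM wordcount_demo;"


The INSERT and the COUNT(*) should each launch a MapReduce job, which confirms that Hive is submitting work to YARN.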


MySQL as Metastore for Hive 3.1.2

......................................

Step 1: Install MySQL


sudo apt update

sudo apt install mysql-server


Step 2: Ensure the MySQL server is running: sudo systemctl start mysql.service


Step 3: Create user credentials for hive


sudo mysql

CREATE USER 'hiveuser'@'localhost' IDENTIFIED BY 'Hive111!!!';

GRANT ALL PRIVILEGES ON *.* TO 'hiveuser'@'localhost';
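
Optionally, still inside the mysql prompt, verify the grants and then leave the client:


SHOW GRANTS FOR 'hiveuser'@'localhost';

EXIT;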


Step 4: Download mysql-connector-java-8.0.28.jar to /home/hdoop/hive/lib/


cd /home/hdoop/hive/lib/

wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.28/mysql-connector-java-8.0.28.jar


Step 5: create hive-site.xml file in /home/hdoop/hive/conf/


nano /home/hdoop/hive/conf/hive-site.xml


(add the below configuration)

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>

    <name>javax.jdo.option.ConnectionURL</name>

    <value>jdbc:mysql://localhost/metastore_db?createDatabaseIfNotExist=true</value>

    <description>

      JDBC connect string for a JDBC metastore.

      To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.

      For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.

    </description>

</property>

<property>

    <name>javax.jdo.option.ConnectionDriverName</name>

    <value>com.mysql.cj.jdbc.Driver</value>

    <description>Driver class name for a JDBC metastore</description>

</property>

<property>

    <name>javax.jdo.option.ConnectionPassword</name>

    <value>Hive111!!!</value>

    <description>password to use against metastore database</description>

</property>

<property>

    <name>javax.jdo.option.ConnectionUserName</name>

    <value>hiveuser</value>

    <description>Username to use against metastore database</description>

</property>

</configuration>


Step 6: Initialize the Hive schema with schematool


/home/hdoop/hive/bin/schematool -initSchema -dbType mysql

Log in to MySQL to check whether the metastore_db database has been created.
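
One way to check from the shell, assuming root access via sudo mysql as in Step 3:


sudo mysql -e "SHOW DATABASES LIKE 'metastore_db';"

sudo mysql -e "USE metastore_db; SHOW TABLES;"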


Step 7: Restart Hadoop and Hive and check that HQL queries still run.




