
CDH4 Hadoop/HBase Installation for Ubuntu 12.04 LTS

Changing the Hostname and /etc/hosts (this applies to both manual and automated installation)

Step A: Change the hostname of each machine to a meaningful name. Log in to each node as the ubuntu user (in the case of AWS VPC).

sudo nano /etc/hostname
Delete the existing content and add master.domain on the master machine, slave1.domain on the slave1 node and slave2.domain on slave2.

Step B: Restart the hostname service

sudo service hostname restart

Step C: Edit /etc/hosts on each node so that it contains entries of the form:

127.0.0.1 localhost.localdomain localhost
192.168.1.1 master.domain master
192.168.1.2 slave1.domain slave1
192.168.1.3 slave2.domain slave2
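A quick sanity check (using the example names and addresses above) is to confirm the FQDN on each node and ping the others by name:

hostname -f
ping -c 1 slave1.domain
ping -c 1 slave2.domain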


Manual Installation


1) Install Oracle Java
2) Passwordless SSH from master to slaves
3) Hadoop & HBase configuration
4) Test the setup


1. Install Oracle Java

Steps:
1. Download Oracle Java from
http://www.oracle.com/technetwork/java/javasebusiness/downloads/java-archive-downloads-javase6-419409.html#jdk-6u32-oth-JPR

Accept the license, download "jdk-6u32-linux-x64.bin" and keep it in the Downloads directory of the Linux machine.

2. Create the installation folder

Command: sudo mkdir -p /usr/lib/jvm

3. Navigate to Downloads Directory

Command: cd ~/Downloads

4. Move the downloaded file to the installation folder

Command: sudo mv jdk-6u32-linux-x64.bin /usr/lib/jvm

5. Navigate to the installation folder

Command: cd /usr/lib/jvm

6. Make the downloaded binaries executable

Command: sudo chmod u+x jdk-6u32-linux-x64.bin

7. Extract the compressed binary file

Command: sudo ./jdk-6u32-linux-x64.bin

8. Check your extracted folder names

Command: ls -l
Check: jdk1.6.0_32 directory is there

9. Inform Ubuntu where your Java installation is located

Command: sudo update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jdk1.6.0_32/bin/java" 1

10. Inform Ubuntu that this is your default Java installation

Command: sudo update-alternatives --set java /usr/lib/jvm/jdk1.6.0_32/bin/java

11. Update your system-wide PATH

Command: sudo nano /etc/profile

Add the lines below at the end of /etc/profile.

JAVA_HOME=/usr/lib/jvm/jdk1.6.0_32
PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
export JAVA_HOME
export PATH

12. Reload your system-wide PATH

Command: source /etc/profile


13. Test your new installation

Command: java -version

Output : java version "1.6.0_32"
Java(TM) SE Runtime Environment (build 1.6.0_32-b05)
Java HotSpot(TM) Client VM (build 20.7-b02, mixed mode, sharing)


If you see output like this, Java is installed properly!

2. Passwordless SSH

Prerequisites:
The same user must exist on all nodes, and you must be logged in as that user on each node.

Step 1: Generate SSH Key

Command: ssh-keygen -t rsa

Step 2: Enable SSH access to your master machine with this newly generated key

Command: cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Step 3: Copy the master machine's public key to each slave machine's authorized_keys

On the master machine
a) Command: cat $HOME/.ssh/id_rsa.pub

b) Copy the content

On each slave machine
c) Command: nano $HOME/.ssh/authorized_keys
d) Paste the content
e) Press Ctrl+O
f) Press Enter
g) Press Ctrl+X
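If SSH still prompts for a password in the next step, the .ssh permissions on the slave are a common cause; a typical fix (run on the slave) is:

chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys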

Step 4: From the master machine, check by issuing

ssh slave1

Verify that it logs in without asking for a password, then repeat for slave2.


3. Hadoop Configuration

Step 1: Create the source directory and the Hadoop filesystem directories, and give them the necessary permissions.

sudo mkdir  /hadoop

sudo chmod -R 755 /hadoop

sudo chown -R ubuntu:ubuntu /hadoop

sudo mkdir /dfs

sudo chmod -R 755 /dfs

sudo chown -R ubuntu:ubuntu /dfs

Note: Execute these commands on all nodes.
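As a convenience, the same commands can be pushed to the slaves over SSH, assuming the ubuntu user can run sudo without a password prompt on them (otherwise run the commands on each node by hand):

for host in slave1.domain slave2.domain; do
  ssh $host "sudo mkdir -p /hadoop /dfs && sudo chmod -R 755 /hadoop /dfs && sudo chown -R ubuntu:ubuntu /hadoop /dfs"
done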



Step 2: Navigate to the /hadoop directory and download the Hadoop packages from the site below:

http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDHTarballs/3.25.2013/CDH4-Downloadable-Tarballs/CDH4-Downloadable-Tarballs.html

a) Download hadoop-2.0.0+922 from http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.2.0.tar.gz
b) Download hbase-0.94.2+202 from http://archive.cloudera.com/cdh4/cdh/4/hbase-0.94.2-cdh4.2.0.tar.gz
c) Download mr1-2.0.0-mr1-cdh4.2.0 from http://archive.cloudera.com/cdh4/cdh/4/mr1-2.0.0-mr1-cdh4.2.0.tar.gz

cd /hadoop

wget http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.2.0.tar.gz

wget http://archive.cloudera.com/cdh4/cdh/4/hbase-0.94.2-cdh4.2.0.tar.gz

wget  http://archive.cloudera.com/cdh4/cdh/4/mr1-2.0.0-mr1-cdh4.2.0.tar.gz


Step 3: Extract the packages and rename the directories.

tar zxf hadoop-2.0.0-cdh4.2.0.tar.gz
tar zxf hbase-0.94.2-cdh4.2.0.tar.gz
tar zxf mr1-2.0.0-mr1-cdh4.2.0.tar.gz

mv hadoop-2.0.0-cdh4.2.0 chadoop-2.0.0
mv hbase-0.94.2-cdh4.2.0 chbase-0.94.2
mv mr1-2.0.0-mr1-cdh4.2.0  hadoop-2.0.0-mr1
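Check the result:

ls /hadoop
Check: the chadoop-2.0.0, chbase-0.94.2 and hadoop-2.0.0-mr1 directories are there (alongside the downloaded tarballs)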



Step 4: Navigate to the Hadoop configuration directory and make the necessary configuration changes

cd /hadoop/chadoop-2.0.0/etc/hadoop

a) core-site.xml

nano core-site.xml

and paste the following properties between the <configuration> </configuration> tags.

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:8020</value>
  </property>
<property>
        <name>hadoop.tmp.dir</name>
        <value>/dfs/tmp</value>
</property>

b) hdfs-site.xml

nano hdfs-site.xml

and paste the following properties between the <configuration> </configuration> tags.



<property>
        <name>dfs.name.dir</name>
        <value>/dfs/nn</value>
 </property>
<property>
        <name>dfs.data.dir</name>
        <value>/dfs/dn</value>
</property>
<property>
         <name>dfs.datanode.max.xcievers</name>
         <value>4096</value>
</property>
<property>
         <name>dfs.replication</name>
         <value>3</value>
</property>

c) mapred-site.xml

cd /hadoop/hadoop-2.0.0-mr1/conf

nano  mapred-site.xml

and paste the following properties between the <configuration> </configuration> tags.

 <property>
    <name>mapred.job.tracker</name>
    <value>master.test.com:8021</value>
  </property>
  <property>
    <name>mapreduce.job.counters.max</name>
    <value>120</value>
  </property>
  <property>
    <name>mapred.output.compress</name>
    <value>false</value>
  </property>
  <property>
    <name>mapred.output.compression.type</name>
    <value>BLOCK</value>
  </property>
  <property>
    <name>mapred.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec</value>
  </property>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>io.sort.factor</name>
    <value>64</value>
  </property>
  <property>
    <name>io.sort.record.percent</name>
    <value>0.05</value>
  </property>
  <property>
    <name>io.sort.spill.percent</name>
    <value>0.8</value>
  </property>
  <property>
    <name>mapred.reduce.parallel.copies</name>
    <value>10</value>
  </property>
  <property>
    <name>mapred.submit.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>4</value>
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>99</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value> -Xmx419222449</value>
  </property>
  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.map.tasks.speculative.execution</name>
    <value>false</value>
  </property>
  <property>
    <name>mapred.reduce.tasks.speculative.execution</name>
    <value>false</value>
  </property>
  <property>
    <name>mapred.reduce.slowstart.completed.maps</name>
    <value>0.8</value>
  </property>

d) hadoop-env.sh

cd /hadoop/chadoop-2.0.0/etc/hadoop
nano hadoop-env.sh

and add the following lines:

export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_32

export HADOOP_HEAPSIZE=2048


export HADOOP_OPTS="-server  -XX:+UseConcMarkSweepGC -Djava.net.preferIPv4Stack=true $HADOOP_CLIENT_OPTS"


Update the same settings in /hadoop/hadoop-2.0.0-mr1/conf/hadoop-env.sh.


e) slaves

cd /hadoop/chadoop-2.0.0/etc/hadoop
nano slaves

Add each slave node hostname, one per line (see the example below), then update the same list in /hadoop/hadoop-2.0.0-mr1/conf/slaves.
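For the two-slave example used in this guide, the slaves file would contain:

slave1.domain
slave2.domain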


f) hbase-site.xml

cd /hadoop/chbase-0.94.2/conf

nano hbase-site.xml

and paste the following properties between the <configuration> </configuration> tags.
<property>
    <name>hbase.rootdir</name>
    <value>hdfs://master.test.com:8020/hbase</value>
  </property>
   <property>
          <name>hbase.cluster.distributed</name>
          <value>true</value>
    </property>
        <property>
           <name>hbase.zookeeper.property.clientPort</name>
           <value>2181</value>
    </property>
        <property>
           <name>hbase.zookeeper.property.dataDir</name>
           <value>/dfs/zookpr</value>
    </property>

        <property>
           <name>hbase.zookeeper.property.maxClientCnxns</name>
           <value>1000</value>
    </property>
  <property>
    <name>hbase.client.write.buffer</name>
    <value>2097152</value>
  </property>
  <property>
    <name>hbase.client.pause</name>
    <value>1000</value>
  </property>
  <property>
    <name>hbase.client.retries.number</name>
    <value>10</value>
  </property>
  <property>
    <name>hbase.client.scanner.caching</name>
    <value>1</value>
  </property>
  <property>
    <name>hbase.client.keyvalue.maxsize</name>
    <value>10485760</value>
  </property>
  <property>
    <name>hbase.security.authentication</name>
    <value>simple</value>
  </property>
  <property>
    <name>zookeeper.session.timeout</name>
    <value>120000</value>
  </property>
  <property>
    <name>zookeeper.znode.parent</name>
    <value>/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>master.test.com</value>
  </property>
 <property>
    <name>hbase.regionserver.handler.count</name>
     <value>16</value>
    </property>
   <property>
     <name>hfile.block.cache.size</name>
       <value>0.4</value>
    </property>

g) hbase-env.sh

cd /hadoop/chbase-0.94.2/conf

nano hbase-env.sh

add the following lines


export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_32
export HBASE_HEAPSIZE=2048
export HBASE_OPTS="-server  $HBASE_OPTS -XX:+UseConcMarkSweepGC"
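One related file worth checking in the same conf directory: in a fully distributed setup HBase also reads regionservers (one hostname per line, like the Hadoop slaves file) to decide where to start region servers. For the example cluster it would contain:

slave1.domain
slave2.domain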



Step 5: Transfer the configuration files to all the slave nodes

scp -r /hadoop/hadoop-2.0.0-mr1 /hadoop/chadoop-2.0.0 /hadoop/chbas* slave1:/hadoop
Repeat the same for all data nodes (or script it as shown below).
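With the example hostnames, the transfer can be scripted along these lines (it relies on the passwordless SSH set up earlier):

for host in slave1.domain slave2.domain; do
  scp -r /hadoop/hadoop-2.0.0-mr1 /hadoop/chadoop-2.0.0 /hadoop/chbase-0.94.2 $host:/hadoop
done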

Step 6: Format the namenode

On the namenode, run:

/hadoop/chadoop-2.0.0/bin/hdfs namenode -format


Step 7: Start DFS

On the namenode, run start-dfs.sh:

/hadoop/chadoop-2.0.0/sbin/start-dfs.sh
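To confirm the HDFS daemons came up, jps (shipped with the JDK, on the PATH via the earlier /etc/profile change) can be run on each node; roughly, the master should show NameNode and SecondaryNameNode and each slave should show DataNode:

jps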

Step 8: Start MapReduce

On the namenode, run start-mapred.sh:

/hadoop/hadoop-2.0.0-mr1/bin/start-mapred.sh

Wait a few seconds for the namenode to come out of safe mode (see the check below).
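Similarly, jps should now also show JobTracker on the master and TaskTracker on the slaves. The safe-mode status can be checked directly with:

/hadoop/chadoop-2.0.0/bin/hdfs dfsadmin -safemode get

It should report "Safe mode is OFF" before you continue.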

Step 9: Check that Hadoop works by creating a directory, putting a file into it and getting it back

/hadoop/chadoop-2.0.0/bin/hdfs dfs -mkdir /test
/hadoop/chadoop-2.0.0/bin/hdfs dfs -put /sourcefile /test/ # replace /sourcefile with your own file
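To complete the round trip, list the directory and get the file back (the file name follows the example put above; adjust it to your own):

/hadoop/chadoop-2.0.0/bin/hdfs dfs -ls /test
/hadoop/chadoop-2.0.0/bin/hdfs dfs -get /test/sourcefile /tmp/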

Step 10: Start HBase

/hadoop/chbase-0.94.2/bin/start-hbase.sh

/hadoop/chbase-0.94.2/bin/hbase shell

create 't1',{NAME=>'cf'}
put 't1','r1','cf:c1','v1'

scan 't1'
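If everything is wired up, the scan should return output roughly of this shape (timestamp and timing will differ):

ROW                   COLUMN+CELL
 r1                   column=cf:c1, timestamp=..., value=v1
1 row(s) in ... seconds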

If you get the results, then Hadoop and HBase are working.


Advanced HBase Configuration


1. sudo nano /etc/security/limits.conf

Add the following line:

ubuntu  -       nofile  32768

2. sudo nano /etc/pam.d/common-session

Add the following line:

session required  pam_limits.so
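After logging out and back in as the ubuntu user, the new limit can be verified with:

ulimit -n

It should report 32768.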



To format the data disk with ext3:


sudo mkfs.ext3 -m 1 /dev/xvdd

To mount the disk with the noatime option:

sudo mount -o noatime /dev/xvdd /dfs
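To have the mount survive a reboot, an /etc/fstab entry along these lines can be added (device and mount point are the example ones above):

/dev/xvdd   /dfs   ext3   defaults,noatime   0   2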




