Showing posts with label Administration. Show all posts

Saturday, 26 November 2016

Sizing and Configuring your Hadoop Cluster

Sizing your Hadoop cluster

Hadoop's performance depends on multiple factors: well-configured software layers and well-dimensioned hardware resources that utilize the CPU, memory, hard drive (storage I/O), and network bandwidth efficiently.
Planning a Hadoop cluster remains a complex task that requires a minimum knowledge of the Hadoop architecture and may be out of the scope of this book. This is what we are trying to make clearer in this section by providing explanations and formulas to help you best estimate your needs. We will introduce a basic guideline to help you make your decision while sizing your cluster, and answer some planning questions about the cluster's needs, such as the following:
  • How to plan my storage?
  • How to plan my CPU?
  • How to plan my memory?
  • How to plan the network bandwidth?
While sizing your Hadoop cluster, you should also consider the data volume that the final users will process on the cluster. The answer to this question will lead you to determine how many machines (nodes) you need in your cluster to process the input data efficiently and determine the disk/memory capacity of each one.
Hadoop has a Master/Slave architecture and is both memory- and CPU-intensive. It has two main components:
  • JobTracker: This is the critical component in this architecture and monitors jobs that are running on the cluster
  • TaskTracker: This runs tasks on each node of the cluster
To work efficiently, HDFS must have high-throughput hard drives with an underlying filesystem that supports the HDFS read and write pattern (large blocks). This pattern defines one big read (or write) at a time, with a block size of 64 MB, 128 MB, or up to 256 MB. Also, the network layer should be fast enough to cope with intermediate data transfer and block replication.
HDFS is itself based on a Master/Slave architecture, with two main components: the NameNode / Secondary NameNode and the DataNode. These are critical components that need a lot of memory to store the files' meta information, such as attributes, file locations, directory structure, and names, and to process data. The NameNode ensures that data blocks are properly replicated in the cluster. The second component, the DataNode, manages the state of an HDFS node and interacts with its data blocks. It requires a lot of I/O for processing and data transfer.
Typically, the MapReduce layer has two main prerequisites. First, input datasets must be large enough to fill a data block and be split into smaller, independent data chunks (for example, a 10 GB text file can be split into 40 blocks of 256 MB each, and each line of text in any data block can be processed independently). The second prerequisite is data locality, which means that the MapReduce code is moved to where the data lies, not the opposite (it is more efficient to move a few megabytes of code close to the data to be processed than to move many data blocks over the network or the disk). This implies having a distributed storage system that exposes data locality and allows the execution of code on any storage node.
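The block-split arithmetic can be sanity-checked with a few lines of Python (a sketch; the 10 GB file and 256 MB block size are the example values, not fixed HDFS settings):

```python
import math

# Example input: a 10 GB text file stored on HDFS with a 256 MB block size
file_size_mb = 10 * 1024   # 10 GB expressed in MB
block_size_mb = 256

# The last block may be partially filled, hence the ceiling
num_blocks = math.ceil(file_size_mb / block_size_mb)
print(num_blocks)  # 40
```

Each of these 40 blocks can then be handed to an independent mapper task.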
Concerning the network bandwidth, it is used at two instances: during the replication process that follows a file write, and during the rebalancing of the replication factor when a node fails.
The most common practice for sizing a Hadoop cluster is to size it based on the amount of storage required. The more data enters the system, the more machines are required. Each time you add a new node to the cluster, you get more computing resources in addition to the new storage capacity.
Let's consider an example cluster growth plan based on storage and learn how to determine the storage needed, the amount of memory, and the number of DataNodes in the cluster.
Daily data input: 100 GB
HDFS replication factor: 3
Monthly growth: 5%
Intermediate MapReduce data: 25%
Non-HDFS reserved space per disk: 30%
Size of a hard drive disk: 4 TB

Storage space used by daily data input = daily data input * replication factor = 300 GB
Monthly volume = (300 * 30) + 5% = 9,450 GB
After one year = 9,450 * (1 + 0.05)^12 = 16,971 GB
Dedicated HDFS space = HDD size * (1 - (Non-HDFS reserved space per disk + Intermediate MapReduce data))
= 4 * (1 - (0.30 + 0.25)) = 1.8 TB (which is the node capacity)

Number of DataNodes needed to process:
Whole first month's data = 9,450 / 1,800 ~= 6 nodes
The 12th month's data = 16,971 / 1,800 ~= 10 nodes
The whole year's data = 157,938 / 1,800 ~= 88 nodes
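The growth plan above can be reproduced in a few lines of Python. This is a sketch of the same arithmetic; the 4 TB disk is taken as 4,000 GB so that the 1,800 GB node capacity used in the text falls out, and node counts are rounded up:

```python
import math

daily_input_gb = 100
replication_factor = 3
monthly_growth = 0.05
hdd_size_gb = 4000            # 4 TB disk, in decimal GB to match the text
non_hdfs_reserved = 0.30      # non-HDFS reserved space per disk
intermediate_mr = 0.25        # intermediate MapReduce data

# Storage consumed by one day of input, replicas included
daily_storage_gb = daily_input_gb * replication_factor             # 300 GB

# First month's volume, grown by 5%
monthly_volume_gb = daily_storage_gb * 30 * (1 + monthly_growth)   # 9,450 GB

# The 12th month's volume
month12_gb = monthly_volume_gb * (1 + monthly_growth) ** 12        # ~16,971 GB

# Dedicated HDFS space per disk (the node capacity)
node_capacity_gb = hdd_size_gb * (1 - (non_hdfs_reserved + intermediate_mr))

# Whole-year volume: twelve months, each 5% larger than the last
year_gb = sum(monthly_volume_gb * (1 + monthly_growth) ** k for k in range(1, 13))

print(math.ceil(monthly_volume_gb / node_capacity_gb))  # 6 nodes
print(math.ceil(month12_gb / node_capacity_gb))         # 10 nodes
print(math.ceil(year_gb / node_capacity_gb))            # 88 nodes
```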
Do not use RAID arrays on a DataNode; HDFS provides its own replication mechanism. It is also important to note that, for every disk, 30 percent of its capacity should be reserved for non-HDFS use.
It is easy to determine the memory needed for both the NameNode and the Secondary NameNode: add together the memory needed by the NameNode to manage the HDFS cluster metadata and the memory needed by the OS. Typically, the memory needed by the Secondary NameNode should be identical to that of the NameNode. You can then apply the following formula to determine the memory amount:

NameNode memory: 2 GB - 4 GB
Secondary NameNode memory: 2 GB - 4 GB
OS memory: 4 GB - 8 GB
HDFS cluster management memory: 2 GB - 8 GB

Memory amount = HDFS cluster management memory + NameNode memory + OS memory

At least, NameNode (Secondary NameNode) memory = 2 + 2 + 4 = 8 GB
It is also easy to determine the DataNode memory amount, but this time the amount depends on the number of physical CPU cores installed on each DataNode.

DataNode process memory: 4 GB - 8 GB
DataNode TaskTracker memory: 4 GB - 8 GB
OS memory: 4 GB - 8 GB
Number of CPU cores: 4+
Memory per CPU core: 4 GB - 8 GB

Memory amount = Memory per CPU core * number of CPU cores + DataNode process memory + DataNode TaskTracker memory + OS memory

At least, DataNode memory = 4 * 4 + 4 + 4 + 4 = 28 GB
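Both memory formulas can be written as small helper functions. The defaults below are the minimums of the ranges given above (assumed, not prescriptive), so they reproduce the 8 GB and 28 GB floors:

```python
def namenode_memory_gb(hdfs_mgmt=2, namenode=2, os_mem=4):
    # Memory amount = HDFS cluster management memory + NameNode memory + OS memory
    return hdfs_mgmt + namenode + os_mem

def datanode_memory_gb(cores=4, mem_per_core=4, dn_process=4, tasktracker=4, os_mem=4):
    # Memory amount = memory per CPU core * number of CPU cores
    #               + DataNode process memory + TaskTracker memory + OS memory
    return mem_per_core * cores + dn_process + tasktracker + os_mem

print(namenode_memory_gb())  # 8  (GB, minimum)
print(datanode_memory_gb())  # 28 (GB, minimum)
```

Raising any argument toward the top of its range scales the totals accordingly.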
Regarding the CPU and the network bandwidth, we suggest using modern multicore CPUs with at least four physical cores per CPU. The more physical CPU cores you have, the more you will be able to enhance your jobs' performance (according to the rules discussed for avoiding underutilization or overutilization). For the network switches, we recommend using equipment with high throughput, such as 10 Gb Ethernet intra-rack with N x 10 Gb Ethernet inter-rack.


Configuring your cluster correctly

To run Hadoop and get maximum performance, the cluster needs to be configured correctly. But the question is how to do that. Well, based on our experience, we can say that there is no single answer to this question. Our experience gives a clear indication that the Hadoop framework should be adapted to the cluster it is running on, and sometimes also to the job.
In order to configure your cluster correctly, we recommend running your Hadoop jobs the first time with the default configuration to get a baseline. Then, check for resource weaknesses (if any) by analyzing the job history logfiles, and record the results (the measured time it took to run the jobs). After that, iteratively tune your Hadoop configuration and re-run the job until you get the configuration that fits your business needs.
The number of mappers and reducer tasks that a job should use is important. Picking the right amount of tasks for a job can have a huge impact on Hadoop's performance.
The number of reducer tasks should be less than the number of mapper tasks. Google reports one reducer for every 20 mappers; other sources give different guidelines. This is because mapper tasks often process a lot of data, and the results of those tasks are passed to the reducer tasks. Often, a reducer task is just an aggregate function that processes a minor portion of the data compared to the mapper tasks. Picking the correct number of reducers is therefore just as important.
The number of mappers and reducers is related to the number of physical cores on each DataNode, which determines the maximum number of tasks that can run in parallel on a DataNode.
In a Hadoop cluster, master nodes typically consist of machines where one machine is designated as the NameNode and another as the JobTracker, while all the other machines in the cluster are slave nodes that act as DataNodes and TaskTrackers. When starting the cluster, you begin by starting the HDFS daemons on the master node and the DataNode daemons on all slave machines. Then, you start the MapReduce daemons: the JobTracker on the master node and the TaskTracker daemons on all slave nodes. The following diagram shows the Hadoop daemon's pseudo formula:
When configuring your cluster, you need to consider the CPU cores and memory resources that need to be allocated to these daemons. In a huge data context, it is recommended to reserve 2 CPU cores on each DataNode for the HDFS and MapReduce daemons. While in a small and medium data context, you can reserve only one CPU core on each DataNode.
Once you have determined the maximum number of mapper slots, you need to determine the maximum number of reducer slots. Based on our experience, a distribution of Map and Reduce tasks on DataNodes that gives good performance is to define the number of reducer slots to be the same as the number of mapper slots, or at least equal to two-thirds of the mapper slots.
Let's learn to correctly configure the number of mappers and reducers, and assume the following cluster example:
Cluster machine                  Nb    Medium data size      Large data size
DataNode CPU cores               8     Reserve 1 CPU core    Reserve 2 CPU cores
DataNode TaskTracker daemon      1     1                     1
DataNode HDFS daemon             1     1                     1
Data block size                        128 MB                256 MB
DataNode CPU % utilization             95% to 120%           95% to 150%
Cluster nodes                          20                    40
Replication factor                     2                     3
We want to use at least 95 percent of the CPU resources, and because Hyper-Threading lets one CPU core process more than one job at a time, we can set the Hyper-Threading factor in a range between 120 percent and 170 percent.

Maximum mapper slots on one node in a large data context
= (number of physical cores - reserved cores) * (0.95 -> 1.5)

Reserved cores = 1 for TaskTracker + 1 for HDFS

Let's say the CPU on the node will be used up to 120% (with Hyper-Threading):

Maximum number of mapper slots = (8 - 2) * 1.2 = 7.2, rounded down to 7

Let's apply the 2/3 mapper/reducer ratio:

Maximum number of reducer slots = 7 * 2/3 ≈ 5
Let's define the number of slots for the cluster:
Mapper's slots: = 7 * 40 = 280
Reducer's slots: = 5 * 40 = 200
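The slot calculation for the large-data cluster can be sketched in Python (assumed rounding conventions: mapper slots rounded down, reducer slots rounded to the nearest whole number, matching the worked numbers above):

```python
import math

physical_cores = 8
reserved_cores = 2      # 1 for the TaskTracker daemon + 1 for the HDFS daemon
ht_factor = 1.2         # CPU used up to 120% with Hyper-Threading
cluster_nodes = 40      # large data context

# Per-node slots
mapper_slots = math.floor((physical_cores - reserved_cores) * ht_factor)  # 7
reducer_slots = round(mapper_slots * 2 / 3)                               # 5

# Cluster-wide slots
print(mapper_slots * cluster_nodes)   # 280 mapper slots
print(reducer_slots * cluster_nodes)  # 200 reducer slots
```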
The block size is also used to enhance performance. The default Hadoop configuration uses 64 MB blocks, while we suggest using 128 MB in your configuration for a medium data context and 256 MB for a very large data context. This means that a mapper task can process one data block (for example, 128 MB) by opening only one block. With the default 64 MB block size, two mapper tasks are needed to process the same amount of data. This may be considered a drawback, because initializing one more mapper task and opening one more file takes more time.

Summary

In this article, we learned about sizing and configuring the Hadoop cluster to optimize it for MapReduce.

This article, written by Khaled Tannir, the author of Optimizing Hadoop for MapReduce, discusses two of the most important aspects to consider while optimizing Hadoop for MapReduce: sizing and configuring the Hadoop cluster correctly.
Source: https://www.packtpub.com/books/content/sizing-and-configuring-your-hadoop-cluster

Tuesday, 18 October 2016

Introduction to SCALA: Installing Scala : Day 1 Learnings


As a JVM language, Scala requires the use of a Java runtime. Scala 2.11, the version you'll be using, needs at least Java 6.


However, I recommend installing the Java 8 JDK (Java SE, where SE stands for Standard Edition) instead for optimal performance. You can download the Java 8 JDK (or a later version, if available) for most platforms directly from Oracle's website. Installers are available, so you shouldn't need to manually configure your PATH variable to get the applications installed.

When finished, verify your Java version by running java -version from the command line. Here is an example of running this command for Java 8:

$ java -version
java version "1.8.0_05"
Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)


Now that Java is installed, it’s time to install Scala. There are two ways to install Scala  (or any other fine programming tool): the manual approach, suitable for command line heroes who like to modify their system’s environment variables, and the automatic approach, for the rest of us.

To install Scala manually, download the Scala 2.11 distribution from http://www.scala-lang.org and add its “bin” directory to your path. The distribution includes the Scala runtime, tools, compiled libraries, and sources, but the most important item we'll need is the scala command.

This command provides (among other features) the REPL (Read-Eval-Print-Loop) shell we will use to learn and experiment with the Scala language.

To install Scala automatically, use a package manager such as Homebrew for OS X, Chocolatey for Windows, or apt-get/Yum for Linux systems. These are freely available and will handle finding the package, downloading and extracting it, and installing it so you can access it from the command line. 

The scala package is available in all of these package managers as “scala,” so you can install it with (brew/choco/apt-get/yum) install scala.


When installed, execute the scala command from the command line. You should see a welcome message like the following (although your Scala and Java version messages may be different):

$ scala

Welcome to Scala version 2.11.0 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_05).
Type in expressions to have them evaluated.
Type :help for more information.

scala>

When you see the Welcome to Scala message and the scala> prompt, you are in the Scala REPL and ready to start coding.

If the command is found but there are problems launching it, make sure your Java command is installed correctly and that your system path points to the correct Java version.


Saturday, 15 October 2016

How to Copy Data From One Machine to Other Machine

---------------------------------------------------------------------------------------------------

How to install SSH using command line

http://kalyanbigdatatraining.blogspot.in/2016/09/how-to-install-ssh-using-command-line.html



How to disable the password using SSH

http://kalyanbigdatatraining.blogspot.in/2016/09/how-to-disable-password-using-ssh.html


---------------------------------------------------------------------------------------------------

How to Copy Data From One Machine (kalyan@orienit1) to Other Machine (kalyan@orienit2)

scp kalyan@orienit1:<source path> kalyan@orienit2:<destination path>



How to Copy Data From One Machine (kalyan@192.168.0.111) to Other Machine (kalyan@192.168.0.112)

scp kalyan@192.168.0.111:<source path> kalyan@192.168.0.112:<destination path>


---------------------------------------------------------------------------------------------------

How to Copy sample.txt file From One Machine (kalyan@orienit1) to Other Machine (kalyan@orienit2)

scp kalyan@orienit1:~/sample.txt kalyan@orienit2:~/sample.txt



How to Copy sample.txt file From One Machine (kalyan@192.168.0.111) to Other Machine (kalyan@192.168.0.112)

scp kalyan@192.168.0.111:~/sample.txt kalyan@192.168.0.112:~/sample.txt


---------------------------------------------------------------------------------------------------



Sunday, 18 September 2016

How to disable the password using SSH

1. using rsa algorithm

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa 
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys


2. using dsa algorithm

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa 

cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys


3. verify using below command

ssh localhost


4. If it still asks for the password again and again, remove the existing data

rm -r ~/.ssh

5. After removing the existing data, repeat step 1 or step 2






Hadoop Cluster Practice Commands

How to Start / Stop Hadoop Process:

Name Node: 

hadoop-daemon.sh start namenode
hadoop-daemon.sh stop namenode

Data Node: 

hadoop-daemon.sh start datanode
hadoop-daemon.sh stop datanode

Secondary Name Node: 

hadoop-daemon.sh start secondarynamenode
hadoop-daemon.sh stop secondarynamenode

Job Tracker: 

hadoop-daemon.sh start jobtracker
hadoop-daemon.sh stop jobtracker

Task Tracker:

hadoop-daemon.sh start tasktracker
hadoop-daemon.sh stop tasktracker

Resource Manager: 

yarn-daemon.sh start resourcemanager
yarn-daemon.sh stop resourcemanager

Node Manager: 

yarn-daemon.sh start nodemanager
yarn-daemon.sh stop nodemanager


Job History Server:
mr-jobhistory-daemon.sh start historyserver
mr-jobhistory-daemon.sh stop historyserver






Hadoop Required URLS
Name Node :                http://orienit1:50070
Resource Manager :     http://orienit2:8088


1. Create a new folder with your hostname (e.g. orienit1 / orienit2 / orienit3) using below command.

hadoop fs -mkdir /orienit


2. Put some files from Local File System to HDFS using below command

hadoop fs -put <local file system path> <hdfs path>

hadoop fs -put /etc/hosts /orienit/hosts


3. Read the Data from HDFS using below command

hadoop fs -cat /orienit/hosts


4. Change the Replication factor using below commands

Increase the replication number:

hadoop fs -setrep 5 /orienit/hosts


Decrease the replication number:

hadoop fs -setrep 3 /orienit/hosts


5. Transfer the data from one cluster to other cluster using below command

hadoop distcp hdfs://nn1:8020/<src path> hdfs://nn2:8020/<dst path>

where nn1 is first cluster namenode ip or hostname

where nn2 is second cluster namenode ip or hostname


6. Commissioning and Decommissioning the Nodes in Hadoop Cluster

In Name Node machine modify below changes


1. create include file in /home/kalyan/work folder

2. create exclude file in /home/kalyan/work folder

3. update hdfs-site.xml with below configurations


<property>
<name>dfs.hosts</name>
<value>/home/kalyan/work/include</value>
</property>

<property>
<name>dfs.hosts.exclude</name>
<value>/home/kalyan/work/exclude</value>
</property>

4. execute below command to reflect the hdfs changes

hadoop dfsadmin -refreshNodes

5. update yarn-site.xml with below configurations



<property>
<name>yarn.resourcemanager.nodes.include-path</name>
<value>/home/kalyan/work/include</value>
</property>


<property>
<name>yarn.resourcemanager.nodes.exclude-path</name>
<value>/home/kalyan/work/exclude</value>
</property>

6. execute below command to reflect the mr changes

yarn rmadmin -refreshNodes

7. verify the changes in browser





Thursday, 15 September 2016

How to install JAVA-1.8 in ubuntu

1. Install ORACLE JAVA-1.8 using below commands

sudo add-apt-repository ppa:webupd8team/java -y
sudo apt-get update
sudo apt-get install oracle-java8-installer

2. Verify ORACLE JAVA-1.8 using below command

java -version


3. Install OPENJDK JAVA-1.8 using below commands

sudo add-apt-repository ppa:openjdk-r/ppa
sudo apt-get update
sudo apt-get install openjdk-8-jdk


4. Verify OPENJDK JAVA-1.8 using below command

java -version





How to install R in ubuntu

1. Install R using below commands

sudo apt-get update
sudo apt-get install r-base r-base-dev

2. Verify R using below command

R

How to install a required PYTHON version in centos

1. Install PYTHON using below commands
cd /tmp

wget http://python.org/ftp/python/2.7.6/Python-2.7.6.tgz

tar -xvzf Python-2.7.6.tgz

cd Python-2.7.6

./configure --prefix=/usr/local

make

make install


2. Open /etc/profile file using below command
sudo gedit /etc/profile


3. Add the below content to the /etc/profile file
export PATH=/usr/local/bin:$PATH


4. Reboot the system

5. Verify PYTHON using below command
python


How to install PYTHON in centos

Install PYTHON using below command

sudo yum install python


Verify PYTHON using below command

python

How to install PYTHON in ubuntu

Install PYTHON using below command

sudo apt-get install python


Verify PYTHON using below command

python

Wednesday, 14 September 2016

How to install MYSQL using command line

Install MYSQL using below command

sudo apt-get install mysql-server mysql-client


Verify MYSQL using below command

mysql -u <username> -p


If the username is `root` then use the below command

mysql -u root -p


Un-Install MYSQL using below command

sudo apt-get autoremove mysql-server mysql-client


How to change the HOSTNAME using command line

How to modify the hostname ( hadoop to kalyan)

1. Open /etc/hostname file using below command

sudo gedit /etc/hostname

2. Modify the /etc/hostname file with new hostname

update hadoop to kalyan

3. Save the /etc/hostname file

4. Open /etc/hosts file using below command

sudo gedit /etc/hosts

5. Modify the /etc/hosts file with new hostname

update hadoop to kalyan

127.0.0.1    localhost
127.0.0.1    kalyan

6. Save the /etc/hosts file

7. Execute below command to reflect the hostname changes

sudo service hostname restart

8. Re-Open the Terminal

9. Verify the new hostname using below command

hostname





How to install SSH using command line

Install SSH using below command

sudo apt-get install ssh


Verify SSH using below command

ssh localhost


How to use SSH command

ssh <username>@<ip address>

ssh <username>@<hostname>


How to install JAVA using command line

1. Java-1.6

Install Java-1.6 using below command

sudo apt-get install openjdk-6-jdk

Verify Java-1.6 using below command

java -version


2. Java-1.7

Install Java-1.7 using below command

sudo apt-get install openjdk-7-jdk

Verify Java-1.7 using below command

java -version


3. Java-1.8

Install Java-1.8 using below commands

sudo add-apt-repository ppa:openjdk-r/ppa
sudo apt-get update
sudo apt-get install openjdk-8-jdk

Verify Java-1.8 using below command

java -version
