Thursday, 6 October 2016

How To Stream JSON Data Into Hive Using Apache Flume

Pre-Requisites of Flume + Hive Project:

hadoop-2.6.0
flume-1.6.0
hive-1.2.1
java-1.7

NOTE: Make sure all of the above components are installed


Flume + Hive Project Download Links:


`hadoop-2.6.0.tar.gz` ==> link
`apache-flume-1.6.0-bin.tar.gz` ==> link
`apache-hive-1.2.1-src.tar.gz` ==> link
`kalyan-json-hive-agent.conf` ==> link
`bigdata-examples-0.0.1-SNAPSHOT-dependency-jars.jar` ==> link

-----------------------------------------------------------------------------

1. Create the "kalyan-json-hive-agent.conf" file with the below content:

agent.sources = EXEC
agent.sinks = HIVE
agent.channels = MemChannel

agent.sources.EXEC.type = exec
agent.sources.EXEC.command = tail -F /tmp/users.json
agent.sources.EXEC.channels = MemChannel

agent.sinks.HIVE.type = hive
agent.sinks.HIVE.hive.metastore = thrift://localhost:9083
agent.sinks.HIVE.hive.database = kalyan
agent.sinks.HIVE.hive.table = users2
agent.sinks.HIVE.serializer = JSON
agent.sinks.HIVE.channel = MemChannel

agent.channels.MemChannel.type = memory
agent.channels.MemChannel.capacity = 1000
agent.channels.MemChannel.transactionCapacity = 100
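
Note: the Hive sink streams data through the Hive Streaming API, so Flume also needs the HCatalog streaming jars on its classpath. A minimal sketch, assuming the standard apache-hive-1.2.1-bin layout (adjust the paths to your installation):

# copy the HCatalog streaming jars into Flume's lib folder
cp $HIVE_HOME/hcatalog/share/hcatalog/hive-hcatalog-streaming-*.jar $FLUME_HOME/lib/
cp $HIVE_HOME/hcatalog/share/hcatalog/hive-hcatalog-core-*.jar $FLUME_HOME/lib/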


2. Copy the "kalyan-json-hive-agent.conf" file into the "$FLUME_HOME/conf" folder

3. Copy the "bigdata-examples-0.0.1-SNAPSHOT-dependency-jars.jar" file into the "$FLUME_HOME/lib" folder

4. To generate a large amount of sample JSON data, follow this article.

5. Execute the below command to generate sample JSON data with 100 lines. Increase this number to get more data.

java -cp $FLUME_HOME/lib/bigdata-examples-0.0.1-SNAPSHOT-dependency-jars.jar \
com.orienit.kalyan.examples.GenerateUsers \
-f /tmp/users.json \
-n 100 \
-s 1

6. Verify the sample JSON data in the console using the below command:

cat /tmp/users.json
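
Each line should be one self-contained JSON object whose keys match the Hive table columns. An illustrative line (field values elided, not actual output):

{"userid":1,"username":"...","password":"...","email":"...","country":"...","state":"...","city":"...","dt":"..."}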

7. To work with Flume + Hive integration, follow the below steps.

This procedure relies on Hive's ACID functionality; follow this article to set it up:

Refer: http://kalyanbigdatatraining.blogspot.in/2016/10/how-to-work-with-acid-functionality-in.html


i. Update the '~/.bashrc' file with the below changes:

export HIVE_HOME=/home/orienit/work/apache-hive-1.2.1-bin
export PATH=$HIVE_HOME/bin:$PATH

export HCAT_HOME=$HIVE_HOME/hcatalog
export PATH=$HCAT_HOME/bin:$PATH
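
After saving the file, reload it and confirm the variables resolve, for example:

source ~/.bashrc
echo $HIVE_HOME
hive --version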






ii. Reopen the terminal.

iii. Start Hive using the 'hive' command.




iv. List all the databases in Hive using the 'show databases;' command.




v. Create a new database (kalyan) in Hive using the below command:

create database if not exists kalyan;





vi. Switch to the kalyan database using the 'use kalyan;' command.




vii. List all the tables in the kalyan database using the 'show tables;' command.




viii. Create the 'users2' table in the kalyan database using the below command:

CREATE TABLE IF NOT EXISTS kalyan.users2 (
  userid BIGINT,
  username STRING,
  password STRING,
  email STRING,
  country STRING,
  state STRING,
  city STRING,
  dt STRING
)
clustered by (userid) into 5 buckets stored as orc
tblproperties ('transactional' = 'true');
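
Note: the Flume Hive sink only writes into transactional tables, which is why the DDL above uses bucketing, ORC, and the 'transactional' property. A minimal sketch of the hive-site.xml settings the referenced ACID article configures (assuming a single-node setup):

<!-- hive-site.xml: enable ACID transactions for Hive Streaming -->
<property><name>hive.support.concurrency</name><value>true</value></property>
<property><name>hive.txn.manager</name><value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value></property>
<property><name>hive.compactor.initiator.on</name><value>true</value></property>
<property><name>hive.compactor.worker.threads</name><value>1</value></property>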

ix. Display the data from the 'users2' table using the below command:

select * from users2;

x. Start the Hive metastore service using the below command:

hive --service metastore
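
Note that this command runs in the foreground, so run it in a separate terminal. The agent file points the sink at thrift://localhost:9083; once the metastore is up, you can confirm it is listening on that port, for example:

netstat -tln | grep 9083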




8. Execute the below command to `stream the JSON data into Hive using Flume`


$FLUME_HOME/bin/flume-ng agent -n agent --conf $FLUME_HOME/conf -f $FLUME_HOME/conf/kalyan-json-hive-agent.conf -Dflume.root.logger=DEBUG,console

9. Verify the data in the console

10. Verify the data in Hive.

Execute the below command to get the data from the Hive table 'users2':

select * from users2;
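
A count query is also a quick sanity check; the result should match the number of generated lines (100 in this example):

select count(*) from kalyan.users2;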


How To Stream JSON Data Into Phoenix Using Apache Flume

Pre-Requisites of Flume Project:

hadoop-2.6.0
flume-1.6.0
hbase-1.1.2
phoenix-4.7.0
java-1.7

NOTE: Make sure all of the above components are installed


Flume Project Download Links:


`hadoop-2.6.0.tar.gz` ==> link
`apache-flume-1.6.0-bin.tar.gz` ==> link
`hbase-1.1.2-bin.tar.gz` ==> link
`phoenix-4.7.0-HBase-1.1-bin.tar.gz` ==> link
`kalyan-json-phoenix-agent.conf` ==> link
`bigdata-examples-0.0.1-SNAPSHOT-dependency-jars.jar` ==> link
`phoenix-flume-4.7.0-HBase-1.1.jar` ==> link
`json-path-2.2.0.jar` ==> link
`commons-io-2.4.jar` ==> link

-----------------------------------------------------------------------------

1. Create the "kalyan-json-phoenix-agent.conf" file with the below content:

agent.sources = EXEC
agent.channels = MemChannel
agent.sinks = PHOENIX

agent.sources.EXEC.type = exec
agent.sources.EXEC.command = tail -F /tmp/users.json
agent.sources.EXEC.channels = MemChannel

agent.sinks.PHOENIX.type = org.apache.phoenix.flume.sink.PhoenixSink
agent.sinks.PHOENIX.batchSize = 10
agent.sinks.PHOENIX.zookeeperQuorum = localhost
agent.sinks.PHOENIX.table = users2
agent.sinks.PHOENIX.ddl = CREATE TABLE IF NOT EXISTS users2 (userid BIGINT NOT NULL, username VARCHAR, password VARCHAR, email VARCHAR, country VARCHAR, state VARCHAR, city VARCHAR, dt VARCHAR NOT NULL CONSTRAINT PK PRIMARY KEY (userid, dt))
agent.sinks.PHOENIX.serializer = json
agent.sinks.PHOENIX.serializer.columnsMapping = {"userid":"userid", "username":"username", "password":"password", "email":"email", "country":"country", "state":"state", "city":"city", "dt":"dt"}
agent.sinks.PHOENIX.serializer.partialSchema = true
agent.sinks.PHOENIX.serializer.columns = userid,username,password,email,country,state,city,dt
agent.sinks.PHOENIX.channel = MemChannel

agent.channels.MemChannel.type = memory
agent.channels.MemChannel.capacity = 1000
agent.channels.MemChannel.transactionCapacity = 100

2. Copy the "kalyan-json-phoenix-agent.conf" file into the "$FLUME_HOME/conf" folder

3. Copy the "phoenix-flume-4.7.0-HBase-1.1.jar", "json-path-2.2.0.jar", "commons-io-2.4.jar" and "bigdata-examples-0.0.1-SNAPSHOT-dependency-jars.jar" files into the "$FLUME_HOME/lib" folder

4. To generate a large amount of sample JSON data, follow this article.

5. Execute the below command to generate sample JSON data with 100 lines. Increase this number to get more data.

java -cp $FLUME_HOME/lib/bigdata-examples-0.0.1-SNAPSHOT-dependency-jars.jar \
com.orienit.kalyan.examples.GenerateUsers \
-f /tmp/users.json \
-n 100 \
-s 1





6. Verify the sample JSON data in the console using the below command:

cat /tmp/users.json





7. To work with Flume + Phoenix integration, follow the below steps.


i. Start HBase using the 'start-hbase.sh' command.



ii. Verify that HBase is running with the "jps" command.




iii. Start the Phoenix shell using the 'sqlline.py localhost' command.



iv. List all the tables in Phoenix using the '!tables' command.
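
Note: the sink's ddl property creates the 'users2' table automatically when the agent starts, so it may not be listed yet. If you prefer to create the table up front, the same statement from the agent file can be run in sqlline:

CREATE TABLE IF NOT EXISTS users2 (userid BIGINT NOT NULL, username VARCHAR, password VARCHAR, email VARCHAR, country VARCHAR, state VARCHAR, city VARCHAR, dt VARCHAR NOT NULL CONSTRAINT PK PRIMARY KEY (userid, dt));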




8. Execute the below command to `stream the JSON data into Phoenix using Flume`

$FLUME_HOME/bin/flume-ng agent -n agent --conf $FLUME_HOME/conf -f $FLUME_HOME/conf/kalyan-json-phoenix-agent.conf -Dflume.root.logger=DEBUG,console




9. Verify the data in the console




10. Verify the data in Phoenix.

Execute the below commands to get the data from the Phoenix table 'users2':

!tables

select count(*) from users2;

select * from users2;






Wednesday, 5 October 2016

How To Stream JSON Data Into HBase Using Apache Flume

Pre-Requisites of Flume Project:
hadoop-2.6.0
flume-1.6.0
hbase-0.98.4
java-1.7

Project Compatibility :
1. hadoop-2.6.0 + hbase-0.98.4 + flume-1.6.0
2. hadoop-2.7.2 + hbase-1.1.2 + flume-1.7.0

NOTE: Make sure all of the above components are installed


Flume Project Download Links:

`hadoop-2.6.0.tar.gz` ==> link
`apache-flume-1.6.0-bin.tar.gz` ==> link
`kalyan-json-hbase-agent.conf` ==> link
`kalyan-flume-project-0.1.jar` ==> link
`bigdata-examples-0.0.1-SNAPSHOT-dependency-jars.jar` ==> link

-----------------------------------------------------------------------------

1. Create the "kalyan-json-hbase-agent.conf" file with the below content:

agent.sources = EXEC
agent.channels = MemChannel
agent.sinks = HBASE

agent.sources.EXEC.type = exec
agent.sources.EXEC.command = tail -F /tmp/users.json
agent.sources.EXEC.channels = MemChannel

agent.sinks.HBASE.type = hbase
agent.sinks.HBASE.table = users2
agent.sinks.HBASE.columnFamily = cf
agent.sinks.HBASE.serializer = com.orienit.kalyan.flume.sink.JsonHbaseEventSerializer
agent.sinks.HBASE.serializer.colNames=userid,username,password,email,country,state,city,dt
agent.sinks.HBASE.channel = MemChannel

agent.channels.MemChannel.type = memory
agent.channels.MemChannel.capacity = 1000
agent.channels.MemChannel.transactionCapacity = 100


2. Copy the "kalyan-json-hbase-agent.conf" file into the "$FLUME_HOME/conf" folder


3. Copy the "kalyan-flume-project-0.1.jar" and "bigdata-examples-0.0.1-SNAPSHOT-dependency-jars.jar" files into the "$FLUME_HOME/lib" folder


4. To generate a large amount of sample JSON data, follow this article.


5. Execute the below command to generate sample JSON data with 100 lines. Increase this number to get more data.

java -cp $FLUME_HOME/lib/bigdata-examples-0.0.1-SNAPSHOT-dependency-jars.jar \
com.orienit.kalyan.examples.GenerateUsers \
-f /tmp/users.json \
-n 100 \
-s 1





6. Verify the sample JSON data in the console using the below command:

cat /tmp/users.json





7. To work with Flume + HBase integration, follow the below steps.


i. Start HBase using the below command:

start-hbase.sh




ii. Verify that HBase is running with the "jps" command.




iii. Connect to HBase using the below command:

hbase shell




iv. List all the tables in HBase using the 'list' command.




v. Create an HBase table named 'users2' with the column family 'cf' using the below command:

create 'users2', 'cf'
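
To confirm the table and its column family were created as expected, you can describe it:

describe 'users2'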




vi. Read the data from the HBase table 'users2' using the below command:

scan 'users2'

8. Execute the below command to `stream the JSON data into HBase using Flume`

$FLUME_HOME/bin/flume-ng agent -n agent --conf $FLUME_HOME/conf -f $FLUME_HOME/conf/kalyan-json-hbase-agent.conf -Dflume.root.logger=DEBUG,console



9. Verify the data in the console




10. Verify the data in HBase.

Execute the below commands to get the data from the HBase table 'users2':

count 'users2'

scan 'users2'


How to generate a large amount of sample data with simple techniques

Download the required jar file from this link.


Generate Sample Users Using the Below Command

java -cp bigdata-examples-0.0.1-SNAPSHOT-dependency-jars.jar \
com.orienit.kalyan.examples.GenerateUsers

usage: help

-d,--delimiter       field delimiter; by default the output is in JSON format
-f,--file            output file path
-h,--help            show this help and quit
-n,--numberOfUsers   number of users
-s,--startNumber     starting number of userid; by default 1



Example 1: To generate JSON data, use the below command:

java -cp bigdata-examples-0.0.1-SNAPSHOT-dependency-jars.jar \
com.orienit.kalyan.examples.GenerateUsers \
-f /tmp/users.json \
-n 10 \
-s 1


Example 2: To generate CSV data, use the below command:

java -cp bigdata-examples-0.0.1-SNAPSHOT-dependency-jars.jar \
com.orienit.kalyan.examples.GenerateUsers \
-f /tmp/users.csv \
-d ',' \
-n 10 \
-s 1


Example 3: To generate TSV data, use the below command:

java -cp bigdata-examples-0.0.1-SNAPSHOT-dependency-jars.jar \
com.orienit.kalyan.examples.GenerateUsers \
-f /tmp/users.tsv \
-d '\t' \
-n 10 \
-s 1


Example 4: To generate data with any delimiter, use the below command:

java -cp bigdata-examples-0.0.1-SNAPSHOT-dependency-jars.jar \
com.orienit.kalyan.examples.GenerateUsers \
-f /tmp/users.txt \
-d '#' \
-n 10 \
-s 1
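
Whichever format you generate, a quick line count confirms the -n option took effect. For example, for the JSON output from Example 1:

wc -l /tmp/users.json
# expect 10 lines for the Example 1 command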





Generate a Sample Product Log Using the Below Command

java -cp bigdata-examples-0.0.1-SNAPSHOT-dependency-jars.jar \
com.orienit.kalyan.examples.GenerateProductLog

usage: help

-d,--delimiter       field delimiter; by default the output is in JSON format
-f,--file            output file path
-h,--help            show this help and quit
-l,--numberOfLogs    number of logs
-n,--numberOfUsers   number of users




Example 1: To generate JSON data, use the below command:



java -cp bigdata-examples-0.0.1-SNAPSHOT-dependency-jars.jar \
com.orienit.kalyan.examples.GenerateProductLog \
-f /tmp/productlog.json \
-n 10 \
-l 20


Example 2: To generate CSV data, use the below command:

java -cp bigdata-examples-0.0.1-SNAPSHOT-dependency-jars.jar \
com.orienit.kalyan.examples.GenerateProductLog \
-f /tmp/productlog.csv \
-d ',' \
-n 10 \
-l 20


Example 3: To generate TSV data, use the below command:

java -cp bigdata-examples-0.0.1-SNAPSHOT-dependency-jars.jar \
com.orienit.kalyan.examples.GenerateProductLog \
-f /tmp/productlog.tsv \
-d '\t' \
-n 10 \
-l 20


Example 4: To generate data with any delimiter, use the below command:

java -cp bigdata-examples-0.0.1-SNAPSHOT-dependency-jars.jar \
com.orienit.kalyan.examples.GenerateProductLog \
-f /tmp/productlog.txt \
-d '#' \
-n 10 \
-l 20




Wednesday, 28 September 2016

How To Stream Twitter Data Into Hadoop and MongoDB Using Apache Flume

Pre-Requisites of Flume Project:
hadoop-2.6.0
flume-1.6.0

mongodb-3.2.7
java-1.7


NOTE: Make sure all of the above components are installed

Flume Project Download Links:
`hadoop-2.6.0.tar.gz` ==> link
`apache-flume-1.6.0-bin.tar.gz` ==> link

`mongodb-linux-x86_64-ubuntu1404-3.2.7.tgz` ==> link
`kalyan-twitter-hdfs-mongo-agent.conf` ==> link
`kalyan-flume-project-0.1.jar` ==> link

`mongodb-driver-core-3.3.0.jar` ==> link
`mongo-java-driver-3.3.0.jar` ==> link

-----------------------------------------------------------------------------

1. Create the "kalyan-twitter-hdfs-mongo-agent.conf" file with the below content:

agent.sources = Twitter
agent.channels = MemChannel1 MemChannel2
agent.sinks = HDFS MongoDB

agent.sources.Twitter.type = com.orienit.kalyan.flume.source.KalyanTwitterSource
agent.sources.Twitter.channels = MemChannel1 MemChannel2
agent.sources.Twitter.consumerKey = ********
agent.sources.Twitter.consumerSecret = ********
agent.sources.Twitter.accessToken = ********
agent.sources.Twitter.accessTokenSecret = ********
agent.sources.Twitter.keywords = hadoop, big data, analytics, bigdata, cloudera, data science, data scientist, business intelligence, mapreduce, data warehouse, data warehousing, mahout, hbase, nosql, newsql, businessintelligence, cloudcomputing



agent.sinks.HDFS.type = hdfs
agent.sinks.HDFS.channel = MemChannel1
agent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/flume/tweets
agent.sinks.HDFS.hdfs.fileType = DataStream
agent.sinks.HDFS.hdfs.writeFormat = Text
agent.sinks.HDFS.hdfs.batchSize = 100
agent.sinks.HDFS.hdfs.rollSize = 0
agent.sinks.HDFS.hdfs.rollCount = 100
agent.sinks.HDFS.hdfs.useLocalTimeStamp = true


agent.sinks.MongoDB.type = com.orienit.kalyan.flume.sink.KalyanMongoSink 

agent.sinks.MongoDB.hostNames = localhost 
agent.sinks.MongoDB.database = flume 
agent.sinks.MongoDB.collection = twitter 
agent.sinks.MongoDB.batchSize = 10 
agent.sinks.MongoDB.channel = MemChannel2

agent.channels.MemChannel1.type = memory
agent.channels.MemChannel1.capacity = 1000
agent.channels.MemChannel1.transactionCapacity = 100


agent.channels.MemChannel2.type = memory
agent.channels.MemChannel2.capacity = 1000
agent.channels.MemChannel2.transactionCapacity = 100
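
Since the Twitter source feeds two channels, Flume's default replicating channel selector delivers a copy of every event to both MemChannel1 and MemChannel2, which is how HDFS and MongoDB each receive the full stream. Spelling the default out is optional but makes the intent clear:

agent.sources.Twitter.selector.type = replicating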


2. Copy the "kalyan-twitter-hdfs-mongo-agent.conf" file into the "$FLUME_HOME/conf" folder

3. Copy the "kalyan-flume-project-0.1.jar", "mongodb-driver-core-3.3.0.jar" and "mongo-java-driver-3.3.0.jar" files into the "$FLUME_HOME/lib" folder

4. Execute the below command to `stream data from Twitter into HDFS and MongoDB using Flume`

$FLUME_HOME/bin/flume-ng agent -n agent --conf $FLUME_HOME/conf -f $FLUME_HOME/conf/kalyan-twitter-hdfs-mongo-agent.conf -Dflume.root.logger=DEBUG,console





5. Verify the data in the console





6. Verify the data in HDFS and MongoDB
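
For the HDFS side, you can list and inspect the rolled files under the sink's path, for example:

hdfs dfs -ls /user/flume/tweets
hdfs dfs -cat /user/flume/tweets/* | head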






7. Start the MongoDB server using the below command (mongod)



8. Start the MongoDB client using the below command (mongo)





9. Verify the list of databases in MongoDB using the below command (show dbs)



10. Verify the data in MongoDB using the below commands:

// list of databases
show dbs

// use flume database
use flume

// list of collections
show collections

// find the count of documents in 'twitter' collection
db.twitter.count()

// display list of documents in 'twitter' collection
db.twitter.find()
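
Since raw tweets are large JSON documents, pretty-printing a single one is often easier to read:

// display one document with readable formatting
db.twitter.find().limit(1).pretty()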




