Friday 17 July 2015

JSON and JAQL

High level query languages:
Pig (from Yahoo)
Jaql (from IBM)
Hive (from Facebook)

JSON:
JavaScript Object Notation
language independent like XML
Text based
JSON is not a document format
It is not a markup language

JSON File format:
Data types:
String
Number
Boolean
Objects
Arrays
Arrays : [1, 2, 3,...]
Objects are wrapped in {}
Commas separate key:value pairs
Example:
{
"Name":"Dnivog",
"at large":true,
"grade":"C",
"level":3,
"format":{"type:"rect", "width":1220}
}

The query language for JSON is JAQL.

JAQL is a scripting language.
semi-structured analysis
parallelism
extensibility

JAQL can run from:
shell
Java Program
Eclipse

Query:
source, operators, sink

a sink is anything to which data can be written.

source -> operator(parameters) -> sink
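
A hedged sketch of such a pipeline in Jaql, using its -> pipe operator (the file names, the field names and the use of the hdfs() I/O descriptor are assumptions for illustration, not from the notes): the source is a read, the operators are filter and transform, and the sink is a write.

read(hdfs("employees.json"))
  -> filter $.level > 2
  -> transform { name: $.Name, level: $.level }
  -> write(hdfs("senior.json"));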

Thursday 16 July 2015

DML commands in HBase

put : puts a cell value at a specified column in a specified row of a particular table

get : fetches the contents of a row or cell

delete : deletes a cell value in a table

deleteall : deletes all cells in a given row

scan : scans and returns table data

count : returns the number of rows in a table.

truncate : disables, drops and recreates a table (removing all of its data)
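
A hedged set of examples of these commands in the HBase shell (the table 'emp', row key 'row1' and column family 'personal' are made-up names):

put 'emp', 'row1', 'personal:name', 'Dnivog'   # put a cell value
get 'emp', 'row1'                              # fetch the contents of a row
scan 'emp'                                     # scan and return table data
count 'emp'                                    # number of rows in the table
delete 'emp', 'row1', 'personal:name'          # delete one cell
deleteall 'emp', 'row1'                        # delete all cells in the row
truncate 'emp'                                 # disable, drop and recreate the table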

Wednesday 15 July 2015

HBase

Distributed, column-oriented data store built on top of HDFS.

follows NoSQL

Data is logically organized into tables, rows and columns.

HDFS is only good for batch processing; it is not good for record lookup, incremental addition of small batches, or updates.

HBase supports record-level insertions and updates.

A group of columns is called a column family.
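
For example (the table and family names here are hypothetical), a table with two column families can be created and populated from the HBase shell like this:

create 'emp', 'personal', 'professional'
put 'emp', 'row1', 'personal:name', 'Dnivog'       # cell in the 'personal' family
put 'emp', 'row1', 'professional:grade', 'C'       # cell in the 'professional' family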


Monday 13 July 2015

Serialization and Deserialization

Serialization is the process of converting structured objects into a byte stream.

This is used for transmission of data between nodes.

The receiving node has to deserialize it.

Persistent storage: permanent storage.

Writable interface:
public interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

WritableComparable interface (an example compareTo() implementation):
public int compareTo(MyWritableComparable o) {
    int thisValue = this.value;
    int thatValue = o.value;
    return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
}

The above function is needed to compare Composite keys.


Friday 10 July 2015

Composite key : MapReduce

temperature: single key
A composite key is a composition of multiple attributes.
The key will not be a single built-in type (Text etc.);
it will be a user-defined object.
Use:
implements WritableComparable<CompositeGroupKey>
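
A rough sketch of what such a composite key class might look like (the attribute names station and year are assumptions for illustration, not from the notes):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class CompositeGroupKey implements WritableComparable<CompositeGroupKey> {
    private Text station = new Text();             // first attribute of the key
    private IntWritable year = new IntWritable();  // second attribute of the key

    public void write(DataOutput out) throws IOException {
        station.write(out);       // serialize both attributes, in order
        year.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        station.readFields(in);   // deserialize in the same order
        year.readFields(in);
    }

    public int compareTo(CompositeGroupKey o) {
        int cmp = station.compareTo(o.station);            // compare the first attribute
        return (cmp != 0) ? cmp : year.compareTo(o.year);  // then the second
    }
}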

Thursday 9 July 2015

HDFS using command Line

Start all the services
Write the program
Compile:
classpath=/usr/local/hadoop/*.jar
Prepare the jar file:
this should contain the main class, the mapper class and the reducer class.
Execute.
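
A hedged sketch of these steps from the shell (the class name WordCount and the directory names are placeholders; exact jar locations depend on the installation):

javac -classpath "/usr/local/hadoop/*" -d classes WordCount.java             # compile against the Hadoop jars
jar -cvf wordcount.jar -C classes/ .                                         # jar with main, mapper and reducer classes
hadoop jar wordcount.jar WordCount /user/hadoop/input /user/hadoop/output    # execute the job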

Wednesday 8 July 2015

MapReduce Format

A driver class
A mapper class
A reducer class

Drivers:
Sets up the job configuration
Defines the mapper
Defines the reducer
Specifies the paths of the input & output files for the MapReduce program.

There are separate folders for input and output.
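
A minimal driver sketch along these lines (assuming the org.apache.hadoop.mapreduce API; the class names WordCount, WordMapper and WordReducer are placeholders, with the mapper and reducer sketched further below):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");    // the job configuration
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordMapper.class);             // define the mapper
        job.setReducerClass(WordReducer.class);           // define the reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input folder
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output folder (must not exist yet)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}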

The user-defined mapper class must extend Hadoop's Mapper class and must specify the following type parameters:
<input key, input value, output key, output value>

There are datatypes like:
IntWritable
DoubleWritable
Text
BooleanWritable

StringTokenizer obj = new StringTokenizer("this is me");
The above line of code splits the string into substrings wherever there is a space.

The output of reduce is always sorted based on key (not the value)
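
Matching mapper and reducer sketches that use StringTokenizer as above (the class names WordMapper and WordReducer are placeholders, paired with the hypothetical driver shown earlier):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString()); // split the line on whitespace
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit (word, 1) for each token
        }
    }
}

public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();             // aggregate the counts for this key
        }
        result.set(sum);
        context.write(key, result);     // output is produced in sorted key order
    }
}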

Tuesday 7 July 2015

Introduction to MapReduce

Installing eclipse:
apt-get install eclipse_package

MapReduce has the features of Parallelism and DFS.

Map(): Gathers similar information from big data.

Reduce(): Derives conclusions from the map output.

MapReduce is a generalisation of the way Google builds its web-page indexes.

map(data document)
{
    Partition the document across different nodes [automatic]
    Generate (key, value) pairs in parallel on every node.
    return (key, value) pairs.
}
reduce((key, value) pairs)
{
    Shuffle and sort [usually done beforehand]
    reduce [merge, aggregate, etc.]
    return answer
}
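
A small worked example (made-up input): for the line "this is me this", map emits (this, 1), (is, 1), (me, 1), (this, 1); shuffle and sort groups these into (is, [1]), (me, [1]), (this, [1, 1]); reduce then sums each list and returns (is, 1), (me, 1), (this, 2).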
  

Monday 6 July 2015

Installation process of Hadoop

1. Install Java
2. Create user for Hadoop
3. Configure the SSH server
4. Create RSA key
5. Disable IPV6
6. Restart
7. Install Hadoop
8. Formatting HDFS
9. Start Namenode
10. Stop Namenode
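
A rough sketch of the commands behind these steps on Ubuntu (the package names, user names and paths are assumptions and may differ per setup and Hadoop version):

sudo apt-get install openjdk-7-jdk                               # 1. install Java
sudo addgroup hadoop && sudo adduser --ingroup hadoop hduser     # 2. create a user for Hadoop
sudo apt-get install openssh-server                              # 3. configure the SSH server
ssh-keygen -t rsa -P ""                                          # 4. create an RSA key
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# 5. disable IPv6 in /etc/sysctl.conf, then 6. restart
# 7. unpack the Hadoop release under /usr/local/hadoop and set JAVA_HOME and PATH
hadoop namenode -format                                          # 8. format HDFS
start-dfs.sh                                                     # 9. start the Namenode (and Datanodes)
stop-dfs.sh                                                      # 10. stop the Namenode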

Friday 3 July 2015

Hadoop Ecosystem

Hive: A data warehouse to store structured data on HDFS.
Pig: High-level language for data analysis.
Zookeeper: Coordination of distributed applications.
Sqoop: Command-line tool for transferring data between Hadoop and relational databases.

Thursday 2 July 2015

Comparison between RDBMS and BigData, and some other concepts

RDBMS can store data in the range of GBs, while MapReduce can go up to and beyond PBs. In terms of the nature of processing and the speed of access, an RDBMS can be interactive and batch, while MapReduce is batch only. In an RDBMS, reads and writes of data are done quite frequently. In a MapReduce system, writes are done once and reads (for analysis) are done many times. The structure of an RDBMS is static in nature (fixed schema), while MapReduce-based systems have dynamic schemas. The integrity of data is higher in an RDBMS than in MapReduce. But when it comes to scaling, RDBMS lags way behind BigData solutions.

Scaling:
Horizontal: adding more machines.
Vertical: enhance the hardware capabilities of existing machine.

There are 2 layers of Hadoop:
HDFS,
the execution engine (MapReduce based)

Among all the servers in a cluster, one machine is the Master and others are slaves. The Master machine contains the metadata about the data stored in the slave machines.

There is a secondary/backup of the master server as well.

MetaData is like the index of a book.

JobTracker keeps track of which task is given to which server.

JobTracker runs only in the Master Unit/server.

Nodes in a cluster are generally your regular commodity PCs.

HDFS is not suitable for:
Low latency data access.
Lots of small files.
Multiple writes needed.
Arbitrary modifications.

The Namenode contains the metadata. It maintains the filesystem tree. The default block size is 64 MB.

Similarly, there is only one JobTracker (on the master/namenode machine), and there are many TaskTrackers on the slave servers.

By default replication is done on 3 places.

Monitoring (surveillance) of tasks is done by the TaskTrackers, which report back to the JobTracker.

Wednesday 1 July 2015

Hadoop Framework

Hadoop Framework:
Architecture:
Master/Slave

Servers in a cluster must always be connected in a LAN
But all servers in a LAN may not be in a cluster.

Yahoo sorted 500 GB of data in 59 seconds (Hadoop sort benchmark).

The file system that manages storage across a network of machines is called a Distributed File System (DFS).

Replication is done to avoid data loss.

Replication can be full or partial.

Replication is directly proportional to ease of fetching, directly proportional to wastage of storage space, and inversely proportional to ease and speed of updates.

HDFS is a layer above the OS' file system.

The principle is: write once, read many times.

Primary and secondary nodes are designated during replication.