Friday, August 8, 2014

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size.

HDFS stores metadata (a set of data that describes and gives information about other data) on a dedicated server, called the NameNode. Application data are stored on other servers called DataNodes. All servers are fully connected and communicate with each other using TCP-based protocols.
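To make the split concrete, here is a minimal Java sketch of a client talking to the metadata server; the NameNode address and directory path are placeholders. Listing a directory is a metadata-only operation, so it is answered entirely by the NameNode and no DataNode is contacted.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsListExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder NameNode address; fs.defaultFS tells the client
            // where the metadata server lives.
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
            FileSystem fs = FileSystem.get(conf);

            // A directory listing is served from the NameNode's metadata
            // alone; no DataNode is involved.
            for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
            fs.close();
        }
    }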

What is a NameNode: When writing data, the client asks the NameNode to nominate a set of three DataNodes to host the block replicas. The client then writes the data to those DataNodes in pipeline fashion.
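A minimal write sketch using the standard org.apache.hadoop.fs.FileSystem client API follows; the NameNode address and file path are hypothetical. The create call is where the client asks the NameNode for target DataNodes, and the returned stream pipelines the bytes through the replicas behind the scenes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder address

            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/user/demo/sample.txt");

            // The NameNode nominates DataNodes to hold each block's three
            // replicas; the stream pipelines writes through them.
            short replication = 3;
            try (FSDataOutputStream out = fs.create(path, replication)) {
                out.writeUTF("hello, HDFS");
            }
            fs.close();
        }
    }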

What is a DataNode: A DataNode stores application data blocks in the Hadoop Distributed File System. Client applications can talk directly to a DataNode once the NameNode has provided the location of the data.
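The read side can be sketched the same way, again with placeholder addresses and paths. The client first asks the NameNode which DataNodes hold each block, then streams the bytes directly from a DataNode.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder address

            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/user/demo/sample.txt");

            // Ask the NameNode which DataNodes host each block of the file.
            FileStatus status = fs.getFileStatus(path);
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("block hosts: " + String.join(",", block.getHosts()));
            }

            // The actual bytes are then streamed directly from a DataNode.
            try (FSDataInputStream in = fs.open(path)) {
                System.out.println(in.readUTF());
            }
            fs.close();
        }
    }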

What is a RegionServer: RegionServers essentially buffer I/O operations. When a write request reaches a RegionServer, it first records the change in memory (the MemStore) and in a commit log (the write-ahead log); at some later point it flushes the accumulated changes to permanent storage on HDFS. A region is the data in some range of rows of an HBase table. Say you want to get a row from an HBase table: your request goes to the RegionServer responsible for the region containing that row, which either already holds the row in memory (caching) or reads it from HDFS (the DataNodes).
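A hedged sketch of both paths using the HBase client API of this era (HTable-style, roughly HBase 0.94/0.98); the table, column family, and row names are hypothetical. The comments mark where the MemStore/WAL write path and the cache-or-HDFS read path come into play.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseRegionServerExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();

            // Hypothetical table name; the client routes requests to the
            // RegionServer owning the row's region.
            HTable table = new HTable(conf, "demo_table");

            // The put is recorded in the RegionServer's WAL and MemStore
            // before any flush to HDFS happens.
            Put put = new Put(Bytes.toBytes("row1"));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
            table.put(put);

            // The get is answered from the MemStore/block cache if possible;
            // otherwise the RegionServer reads the blocks from HDFS.
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))));

            table.close();
        }
    }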
