The Hadoop Distributed File System (HDFS) is
designed to store very large data sets reliably, and to stream those data sets
at high bandwidth to user applications. In a large cluster, thousands of
servers both host directly attached storage and execute user application tasks.
By distributing storage and computation across many servers, the resource can
grow with demand while remaining economical at every size.
HDFS stores
metadata (data that describes other data: file names, permissions, and the
mapping of blocks to servers) on a dedicated server called the NameNode.
Application data are stored on other servers called DataNodes. All servers
are fully connected and communicate with each other using TCP-based protocols.
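
As a minimal sketch of this split, the Java snippet below uses the standard org.apache.hadoop.fs.FileSystem API to run a pure metadata operation, which the NameNode answers on its own (the cluster address and file path here are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        try (FileSystem fs = FileSystem.get(conf)) {
            // getFileStatus is metadata-only: the client talks to the NameNode;
            // no DataNode is contacted and no file bytes are streamed.
            FileStatus status = fs.getFileStatus(new Path("/data/events.log"));
            System.out.println("length=" + status.getLen()
                    + " replication=" + status.getReplication());
        }
    }
}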
What is a NameNode:
The NameNode maintains the file system namespace and the mapping of each
file's blocks to the DataNodes that hold them. When writing data, the client
asks the NameNode to nominate a suite of three DataNodes to host the block replicas.
The client then writes the data to those DataNodes in pipeline fashion.
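
A minimal write through the same FileSystem API looks like the sketch below; the NameNode nomination and the DataNode pipeline happen inside the client library, not in user code (the address and path are again hypothetical):

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // hypothetical
        try (FileSystem fs = FileSystem.get(conf);
             // create() asks the NameNode to allocate the file; for each block the
             // NameNode nominates DataNodes and the client pipelines bytes to them.
             FSDataOutputStream out = fs.create(new Path("/data/hello.txt"))) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}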
What is a RegionServer: RegionServers essentially buffer I/O operations. When a write request reaches a RegionServer, it first records the change in memory (the MemStore) and
in a commit log (the write-ahead log); at some point it decides it is time to flush the changes to
permanent storage on HDFS. A region is the data in some range of rows of a table.
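
Here is a hedged sketch of that write path from the client side, using the standard HBase client API (the table name, row key, and column are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePut {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) { // hypothetical table
            Put put = new Put(Bytes.toBytes("row-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
            // On the RegionServer this mutation is appended to the commit log and
            // applied to the in-memory store; a later flush writes it to HDFS.
            table.put(put);
        }
    }
}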
Say you want to get a row from an HBase table. Your request goes to the RegionServer responsible for the region
containing that row. The RegionServer will either already have the row in
memory (caching), or it will read it from HDFS (the DataNodes).
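
The matching read, again as a sketch with the same hypothetical table and row:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseGet {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) { // hypothetical table
            // The client routes the Get to the RegionServer hosting the region
            // that contains "row-42"; the server answers from memory/cache or
            // reads the data back from HDFS.
            Result result = table.get(new Get(Bytes.toBytes("row-42")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}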