Friday, August 8, 2014

Mongo Database

MongoDB is a document database that provides high performance, high availability, and easy scalability. Documents (objects) map nicely to programming language data types.

MongoDB Data Model

A MongoDB deployment hosts a number of databases. A database holds a set of collections. A collection holds a set of documents. A document is a set of key-value pairs. Documents have dynamic schema. Dynamic schema means that documents in the same collection do not need to have the same set of fields or structure, and common fields in a collection’s documents may hold different types of data.
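To make the idea of dynamic schema concrete, here is a minimal sketch that models documents as plain Python dictionaries. It does not talk to a MongoDB server; the "users" collection, its field names, and the helper field_types are hypothetical and exist only for illustration.

```python
# Two documents that could live in the same (hypothetical) "users" collection.
# They do not share the same set of fields, and "age" holds different types.
users = [
    {"_id": 1, "name": "Ada", "age": 36},
    {"_id": 2, "name": "Linus", "age": "unknown", "languages": ["C", "Perl"]},
]


def field_types(collection):
    """Map each field name to the set of Python type names seen for it."""
    seen = {}
    for doc in collection:
        for key, value in doc.items():
            seen.setdefault(key, set()).add(type(value).__name__)
    return seen


types = field_types(users)
for field in sorted(types):
    print(field, sorted(types[field]))
```

The "age" field ends up with two types and "languages" appears in only one document, which is exactly what a dynamic schema permits.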

MongoDB Queries


MongoDB provides a set of query operators that define how the find() method selects documents from a collection. The selection is driven by a query specification document, which combines exact equality matches with conditions expressed through query operators.
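The sketch below imitates, in plain Python, how a query specification document mixes equality matches with operator conditions. It is a toy matcher and not MongoDB's implementation: it supports only a handful of comparison operators, treats a missing field as a non-match, and assumes the compared values have compatible types. The "inventory" data and the functions matches and find are made up for this example.

```python
import operator

# Small subset of comparison operators understood by this toy matcher.
OPERATORS = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, choices: value in choices,
}


def matches(doc, query):
    """Return True if doc satisfies every condition in the query document."""
    for field, condition in query.items():
        if field not in doc:
            return False  # simplification: missing fields never match
        value = doc[field]
        if isinstance(condition, dict):
            # Conditional match, e.g. {"qty": {"$gt": 10}}
            for op, operand in condition.items():
                if not OPERATORS[op](value, operand):
                    return False
        elif value != condition:
            # Exact equality match, e.g. {"status": "A"}
            return False
    return True


def find(collection, query):
    """Return the documents in collection that match the query."""
    return [doc for doc in collection if matches(doc, query)]


inventory = [
    {"item": "pen", "qty": 40, "status": "A"},
    {"item": "ink", "qty": 5, "status": "A"},
    {"item": "pad", "qty": 75, "status": "D"},
]

result = find(inventory, {"status": "A", "qty": {"$gt": 10}})
print(result)
```

A query document of the same shape, {"status": "A", "qty": {"$gt": 10}}, is what would be passed to find() on a real collection.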

Shard - A database shard is a horizontal partition in a database or search engine. Each individual partition is referred to as a shard or database shard.

Sharding is the process of storing data records across multiple machines and is MongoDB's approach to meeting the demands of data growth. As the size of the data increases, a single machine may be unable to store all of the data or to provide acceptable read and write throughput.
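As a rough illustration of horizontal partitioning, the following sketch routes each record to a shard by hashing a shard key. This is a deliberate simplification: MongoDB itself splits data into chunks by shard key ranges (or ranges of a hashed key) and balances those chunks across shards, which is not modelled here. The names shard_for, distribute, and user_id are assumptions made for the example.

```python
import hashlib


def shard_for(key, shard_count):
    """Pick a shard index for a key using a stable hash."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def distribute(records, shard_key, shard_count):
    """Split records into shard_count lists according to their shard key."""
    shards = [[] for _ in range(shard_count)]
    for record in records:
        shards[shard_for(record[shard_key], shard_count)].append(record)
    return shards


records = [{"user_id": i, "payload": "x" * 10} for i in range(100)]
shards = distribute(records, "user_id", 4)

for index, shard in enumerate(shards):
    print("shard", index, "holds", len(shard), "records")
```

Every record lands on exactly one shard, and the same key always maps to the same shard, so a lookup by shard key only needs to visit one machine.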

API - an application programming interface (API) specifies how some software components should interact with each other. It is a set of programming instructions and standards for accessing a Web-based software application or Web tool.

An API is a software-to-software interface, not a user interface. With APIs, applications talk to each other without any user knowledge or intervention. When you buy movie tickets online and enter your credit card information, the movie ticket Web site uses an API to send your credit card information to a remote application that verifies whether your information is correct. Once payment is confirmed, the remote application sends a response back to the movie ticket Web site saying it's OK to issue the tickets.
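The movie ticket scenario can be sketched as two pieces of software talking through an agreed interface. Everything here is hypothetical: FakePaymentService stands in for the remote application (a real one would be reached over the network), and its approval rule is a toy, not how card verification actually works.

```python
from dataclasses import dataclass


@dataclass
class PaymentResponse:
    approved: bool
    message: str


class FakePaymentService:
    """Stand-in for the remote payment application."""

    def verify(self, card_number, amount):
        # Toy rule for illustration only: 16 digits and a positive amount.
        if card_number.isdigit() and len(card_number) == 16 and amount > 0:
            return PaymentResponse(True, "OK to issue the tickets")
        return PaymentResponse(False, "card details rejected")


def buy_tickets(payment_api, card_number, amount):
    """Ticket site logic: it only depends on the verify() interface."""
    response = payment_api.verify(card_number, amount)
    return "tickets issued" if response.approved else "purchase declined"


service = FakePaymentService()
print(buy_tickets(service, "4111111111111111", 25.0))
print(buy_tickets(service, "1234", 25.0))
```

The ticket site never needs to know how verification is done; it only relies on the shape of the request and the response.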

Config servers are special mongod instances that store the metadata for a sharded cluster. Config servers use a two-phase commit to ensure immediate consistency and reliability. In MongoDB 2.6 and earlier, config servers do not run as replica sets, and all config servers must be available to deploy a sharded cluster or to make any changes to cluster metadata. Later releases (3.2 onward) allow config servers to be deployed as a replica set.

Solr is an open source enterprise search platform. Its major features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling.

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size.

HDFS stores metadata (a set of data that describes and gives information about other data) on a dedicated server, called the NameNode. Application data are stored on other servers called DataNodes. All servers are fully connected and communicate with each other using TCP-based protocols.

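What is a Name Node: The NameNode holds the HDFS metadata and keeps track of which DataNodes store each block. When writing data, the client asks the NameNode to nominate a set of three DataNodes (three is the default replication factor) to host the block replicas. The client then writes the data to those DataNodes in a pipeline fashion.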
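The write path described above can be sketched as a small in-memory simulation. It is only a model under stated assumptions: the NameNode and DataNode classes here are hypothetical stand-ins, DataNodes are chosen round-robin (real HDFS placement is rack-aware), and "pipeline" is modelled as each DataNode storing the block and forwarding it to the next one.

```python
class NameNode:
    """Keeps metadata only: which DataNodes hold each block."""

    def __init__(self, datanode_names, replication=3):
        self.datanodes = list(datanode_names)
        if replication > len(self.datanodes):
            raise ValueError("not enough DataNodes for the replication factor")
        self.replication = replication
        self.block_locations = {}
        self._next = 0

    def allocate_block(self, block_id):
        # Simplified round-robin choice of distinct DataNodes.
        count = len(self.datanodes)
        chosen = [self.datanodes[(self._next + i) % count]
                  for i in range(self.replication)]
        self._next = (self._next + 1) % count
        self.block_locations[block_id] = chosen
        return chosen


class DataNode:
    """Stores block data and forwards it down the pipeline."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def receive(self, block_id, data, downstream):
        self.blocks[block_id] = data
        if downstream:
            downstream[0].receive(block_id, data, downstream[1:])


def write_block(namenode, datanodes_by_name, block_id, data):
    """Client side: ask the NameNode for targets, then write to the first one."""
    names = namenode.allocate_block(block_id)
    targets = [datanodes_by_name[name] for name in names]
    targets[0].receive(block_id, data, targets[1:])
    return names


nodes = {name: DataNode(name) for name in ["dn1", "dn2", "dn3", "dn4"]}
namenode = NameNode(nodes)
placed = write_block(namenode, nodes, "blk_1", b"hello")
print("replicas on:", placed)
```

Note that the block data never passes through the NameNode; it only hands out locations and records them.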

What is a Data Node: A DataNode stores data in HDFS. Client applications can talk directly to a DataNode once the NameNode has provided the location of the data.

What is a Region Server: In HBase, region servers essentially buffer I/O operations. When a write request reaches a RegionServer, it first writes the change to memory and to the commit log; at some later point it decides that it is time to write the changes to permanent storage on HDFS. A region is the data in some range of rows. Say you want to get a row from an HBase table. Your request goes to the RegionServer that is responsible for the region containing your row. That RegionServer either already has your row in memory (caching) or needs to read it from HDFS (the DataNodes).
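The buffering behaviour described above can be sketched with a small in-memory model. This is not HBase code: the RegionServer class, the flush_threshold parameter, and the list that stands in for files on HDFS are all assumptions made to illustrate the flow of writes and reads.

```python
class RegionServer:
    """Toy model: buffer writes in memory, log them, flush to 'HDFS' later."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []    # record of changes not yet flushed
        self.memstore = {}      # in-memory buffer of recent writes
        self.store_files = []   # stands in for files on HDFS, newest last
        self.flush_threshold = flush_threshold

    def put(self, row_key, value):
        self.commit_log.append((row_key, value))
        self.memstore[row_key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write buffered rows to permanent storage and reset the buffers."""
        if self.memstore:
            self.store_files.append(dict(self.memstore))
            self.memstore.clear()
            self.commit_log.clear()

    def get(self, row_key):
        # Serve from memory if possible, otherwise read the newest file first.
        if row_key in self.memstore:
            return self.memstore[row_key]
        for store_file in reversed(self.store_files):
            if row_key in store_file:
                return store_file[row_key]
        return None


server = RegionServer(flush_threshold=2)
server.put("row1", "a")
server.put("row2", "b")   # reaches the threshold and triggers a flush
server.put("row1", "c")   # newer value stays in memory for now
print(server.get("row1"), server.get("row2"), server.get("missing"))
```

Reads check memory before storage, so the most recent write wins even when an older value for the same row has already been flushed.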