- Hadoop is an open source project of the Apache Software Foundation.
- It is a framework written in Java, originally developed by Doug Cutting, who named it after his son's toy elephant.
- Hadoop uses Google’s MapReduce and Google File System technologies as its foundation. It is optimized to handle massive quantities of data, which may be structured, unstructured or semi-structured, using commodity hardware, that is, relatively inexpensive computers.
- Hadoop replicates its data across multiple computers, so that if one goes down, the data is processed on another computer that holds a replica. Hadoop runs batch operations over massive quantities of data, so the response time is not immediate.
- Hadoop is not good at processing transactions because it lacks random access.
- Hadoop is not suitable for Online Transaction Processing (OLTP) workloads, where structured data, as in a relational database, is accessed randomly.
- Hadoop is not suitable for Online Analytical Processing (OLAP) or Decision Support System (DSS) workloads, where structured data, as in a relational database, is accessed sequentially to generate reports that provide business intelligence.
- It is NOT a replacement for a relational database system.
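The replication point above can be made concrete with a small sketch. This is a toy, in-memory model and not the HDFS API: the class `ToyCluster`, the node names and the block id are all hypothetical, and the round-robin placement is a simplification I chose for illustration. The replication factor of 3 matches the usual HDFS default.

```python
import itertools


class ToyCluster:
    """Toy stand-in for a replicated block store (hypothetical, not HDFS)."""

    def __init__(self, node_names, replication=3):
        if replication > len(node_names):
            raise ValueError("replication factor exceeds node count")
        # node name -> {block_id: data}
        self.nodes = {name: {} for name in node_names}
        self.alive = set(node_names)
        self.replication = replication
        self._next = itertools.cycle(node_names)

    def write(self, block_id, data):
        """Store copies of a block on `replication` distinct nodes."""
        targets = []
        while len(targets) < self.replication:
            node = next(self._next)  # simple round-robin placement
            if node not in targets:
                targets.append(node)
        for node in targets:
            self.nodes[node][block_id] = data
        return targets

    def fail(self, node):
        """Mark a node as down; its copies become unreachable."""
        self.alive.discard(node)

    def read(self, block_id):
        """Return (node, data) from any live node that holds a replica."""
        for node in sorted(self.alive):
            if block_id in self.nodes[node]:
                return node, self.nodes[node][block_id]
        raise RuntimeError("no live replica for " + block_id)


cluster = ToyCluster(["n1", "n2", "n3", "n4"], replication=3)
placed = cluster.write("blk_0", b"hello")
cluster.fail("n1")  # one machine goes down
node, data = cluster.read("blk_0")  # still served by another replica
print(placed, node, data)
```

Because every block lives on several machines, losing one of them only removes one copy; the read simply moves on to the next live node that holds the block.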
This site contains code snippets that I develop while learning and experimenting with SAS, R and Linux.
Sunday, December 7, 2014
Initials of HADOOP
Terminology related to Hadoop
- Eclipse is a popular IDE donated by IBM to the open source community.
- Lucene is a text search engine library written in Java.
- HBase is the Hadoop database.
- Hive provides data warehousing tools to extract, transform and load data, and then query the data stored in Hadoop files.
- Pig is a high-level language that generates MapReduce code to analyze large data sets.
- Jaql is a query language for JavaScript Object Notation (JSON).
- ZooKeeper is a centralized configuration service and naming registry for large distributed systems.
- Avro is a data serialization system.
- UIMA (Unstructured Information Management Architecture) is an architecture for the development, discovery, composition and deployment of applications that analyze unstructured data.
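Tools such as Pig and Hive let you describe a grouping query at a high level and have it run as MapReduce jobs. The sketch below shows, in a single Python process, the map, shuffle and reduce steps that a query like "total amount per region" breaks down into. It is not the code that Pig or Hive actually generates; the function names and the `sales` records are made up for illustration.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key, in sorted key order."""
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


# Hypothetical input records: (region, amount)
sales = [("east", 10), ("west", 5), ("east", 7), ("west", 1), ("north", 3)]


def sales_mapper(record):
    region, amount = record
    yield region, amount  # emit the grouping key and the value to aggregate


def sum_reducer(key, values):
    return sum(values)  # aggregate all values seen for this key


totals = reduce_phase(shuffle(map_phase(sales, sales_mapper)), sum_reducer)
print(totals)
```

On a real cluster the map calls run in parallel on the machines that hold the data, and the shuffle moves pairs across the network so that each key ends up at a single reducer. That is why the model suits large batch jobs better than quick random lookups.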