Monday 10 August 2015

Hadoop Ecosystem


https://hadoopecosystemtable.github.io/ 



What is Hadoop?
•    HDFS – the Hadoop Distributed File System
•    A distributed computation tier using the MapReduce programming model
•    Runs on low-cost commodity servers connected together into a cluster
•    Consists of a master node (the NameNode) to control the processing
•    DataNodes to store and process the data
•    A JobTracker and TaskTrackers to manage and monitor the jobs

How Hadoop Works:
•    Data is split into blocks of 64 MB or 128 MB, and each block is stored on at least three machines by default to ensure data availability and reliability (see the HDFS sketch after this list)
•    The many machines connected in a cluster work in parallel for faster crunching of data
•    If any machine fails, its work is automatically reassigned to another
•    MapReduce breaks complex tasks into smaller chunks to be executed in parallel
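
To make this concrete, here is a minimal sketch of writing a file to HDFS with Hadoop's Java FileSystem API, requesting 128 MB blocks and a replication factor of 3. The NameNode address, file path, and property values are assumptions for illustration; in a real deployment they come from core-site.xml and hdfs-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed NameNode address
        conf.set("dfs.blocksize", "134217728");           // 128 MB blocks
        conf.set("dfs.replication", "3");                 // three copies of each block

        // The client asks the NameNode where to place each block; the bytes
        // themselves flow directly to the chosen DataNodes.
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/data/events.txt"))) {
            out.writeBytes("one record\n");
        }
        fs.close();
    }
}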



http://www.pragsis.com/sites/default/files/ecosistema_bidoop.png
•    Hive – a SQL-like query interface over data in HDFS (see the JDBC sketch after this list)
•    Pig – a high-level data-flow language for transforming data, comparable to commercial tools such as Ab Initio and Informatica
•    HBase – a column-oriented database on top of HDFS
•    Flume – ingests streaming data, such as credit card transactions or video feeds, into HDFS in near real time
•    Sqoop – bulk data transfer between relational databases (RDBMS) and HDFS
•    ZooKeeper – a centralized coordination service for Hadoop's distributed components
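
To illustrate the SQL-like interface, here is a hedged sketch of querying Hive through the HiveServer2 JDBC driver. The host, port, table name, and credentials are all assumptions for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Assumed HiveServer2 endpoint and a hypothetical "transactions" table.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT txn_id, amount FROM transactions LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}

Behind the scenes, Hive compiles such queries into MapReduce (or Tez) jobs over the files in HDFS.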


MapReduce is the heart of Hadoop. It is this programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. The MapReduce concept is fairly simple to understand for those who are familiar with clustered scale-out data processing solutions.
Note: MapReduce is used only for batch processing (generations 1 and 2).
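
The classic illustration of the paradigm is word count, sketched below with the Hadoop Java API (it closely follows the standard Hadoop tutorial example); the input and output paths are assumed to arrive as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in a line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The map tasks run in parallel on the nodes that hold the input blocks, and the framework shuffles and sorts the intermediate pairs before the reduce tasks sum them.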

Spark, on the other hand, is based on resilient distributed datasets (RDDs). This (mostly) in-memory data structure underpins Spark's functional programming paradigm. Spark can handle big batch calculations by pinning datasets in memory. Spark Streaming wraps data streams into mini-batches: it collects all data that arrives within a certain period of time and runs a regular batch program on the collected data. While the batch program is running, the data for the next mini-batch is collected.

https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html

Note: in-memory, very fast, near-real-time processing (generation 3).
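
A minimal sketch of that mini-batch model, assuming the Spark Java streaming API (2.x signatures) and a text source on a local socket; the host, port, and 5-second batch interval are assumptions for illustration.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class SparkMiniBatchExample {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf()
            .setAppName("MiniBatchWordCount")
            .setMaster("local[2]"); // local run for illustration

        // Every 5 seconds, the data collected so far becomes one mini-batch (an RDD).
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

        JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);
        JavaDStream<String> words =
            lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
        JavaPairDStream<String, Integer> counts =
            words.mapToPair(w -> new Tuple2<>(w, 1))
                 .reduceByKey(Integer::sum);

        counts.print(); // a regular batch program runs on each mini-batch
        ssc.start();
        ssc.awaitTermination();
    }
}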

Flink is optimized for cyclic or iterative processes, using iterative transformations on collections. This is achieved through optimized join algorithms, operator chaining, and reuse of partitioning and sorting. Flink is also a strong tool for batch processing, however. Flink streaming processes data streams as true streams: data elements are immediately "pipelined" through a streaming program as soon as they arrive. This allows flexible window operations on streams.

Note: real-time processing (generation 4).
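
A sketch of a windowed computation on a true stream, using the older Flink DataStream Java API; the socket source and the 5-second window are assumptions for illustration.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class FlinkWindowExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        // Each element is pipelined through the operators as soon as it
        // arrives; the window only groups the emitted results.
        DataStream<Tuple2<String, Integer>> counts = lines
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    for (String w : line.split(" ")) {
                        out.collect(new Tuple2<>(w, 1));
                    }
                }
            })
            .keyBy(0)                     // key by the word
            .timeWindow(Time.seconds(5))  // 5-second tumbling window
            .sum(1);                      // sum the counts per window

        counts.print();
        env.execute("windowed word count");
    }
}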

Storm is a free and open source distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!

Storm has many use cases: real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
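
To show what a topology looks like, here is a minimal sketch in Java using the Storm 1.x API, run on an in-process LocalCluster. The spout, the stream contents, and the parallelism hints are assumptions for illustration.

import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class StormExample {

    // A toy spout that emits the same sentence once a second.
    public static class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(1000);
            collector.emit(new Values("hello storm"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }

    // A bolt that processes each tuple the moment it arrives.
    public static class UppercaseBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            collector.emit(new Values(input.getString(0).toUpperCase()));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout(), 1);
        builder.setBolt("uppercase", new UppercaseBolt(), 2)
               .shuffleGrouping("sentences");

        new LocalCluster().submitTopology("demo", new Config(), builder.createTopology());
    }
}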

Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves the MapReduce paradigm by dramatically improving its speed, while maintaining MapReduce’s ability to scale to petabytes of data. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third party data access applications developed for the broader Hadoop ecosystem.

Note: a Hortonworks initiative.

YARN, part of the core Hadoop project, is the architectural center of Hadoop. It allows multiple data processing engines, such as interactive SQL, real-time streaming, data science, and batch processing, to handle data stored in a single platform, unlocking an entirely new approach to analytics. It is the foundation of the new generation of Hadoop and is enabling organizations everywhere to realize a modern data architecture.

More details: https://hadoopecosystemtable.github.io/
