Wednesday, 12 August 2015

Apache Kafka


Introduction:

Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.  

Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.  

Fast
A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.
Scalable
Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines, which allows streams larger than any single machine could handle and allows clusters of coordinated consumers.
Durable
Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
Distributed by Design
Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.

 Key characteristics of Apache Kafka:
  • Constant-time (O(1)) performance even with increased data loads
  • Real-time focus – immediate consumption of produced messages
  • Distribution of message consumption over a cluster of machines
  • Consumers hold their own position in the message stream, which allows them to “roll back” in time and re-read old messages
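
To make the last point concrete, here is a minimal sketch of a consumer that keeps its own position in an append-only log. PartitionLog and Consumer are hypothetical names made up for this illustration; they are not part of Kafka's client API, and a real broker stores the log on disk rather than in a Python list.

```python
class PartitionLog:
    """Toy append-only log standing in for a single partition."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Append a message and return its offset (its position in the log)."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        """Return up to max_messages starting at offset; the log is not modified."""
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Toy consumer that tracks its own position in the log."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_messages=10):
        """Read the next batch and advance the consumer's own offset."""
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        """Move the position; seeking backwards replays old messages."""
        self.offset = offset


log = PartitionLog()
for event in ["page_view", "search", "click"]:
    log.append(event)

consumer = Consumer(log)
first_pass = consumer.poll()   # reads everything written so far
consumer.seek(1)               # "roll back" to an earlier offset
replayed = consumer.poll()     # re-reads from offset 1 onwards
```

Because reading never removes anything from the log, rewinding is just a matter of changing one number on the consumer side.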
Kafka is designed to support the following:
  • Millions of messages per second.
  • Persistent messaging with O(1) disk structures that provide constant time performance even with many TB of stored messages.
  • High-throughput: even with very modest hardware Kafka can support hundreds of thousands of messages per second.
  • Explicit support for partitioning messages over Kafka servers and distributing consumption over a cluster of consumer machines while maintaining per-partition ordering semantics.
  • Support for parallel data load into Hadoop.
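
The per-partition ordering point can be sketched as follows. This is only an illustration under stated assumptions: the partition count, the sample events, and the helpers choose_partition and events_for are invented for the example, and the CRC32 hash is a stand-in, not the partitioner Kafka itself uses.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # assumed partition count, for illustration only


def choose_partition(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition with a deterministic hash (not Kafka's own)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Each partition is an ordered list of (key, value) pairs.
partitions = defaultdict(list)

events = [
    ("alice", "login"),
    ("bob", "login"),
    ("alice", "search"),
    ("bob", "logout"),
    ("alice", "logout"),
]

for key, value in events:
    partitions[choose_partition(key)].append((key, value))


def events_for(key):
    """Read back one key's events from the single partition that holds them."""
    return [v for k, v in partitions[choose_partition(key)] if k == key]
```

All messages with the same key land in the same partition, so their relative order is preserved there. No ordering is implied between messages that end up in different partitions.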
Apache Kafka provides a publish-subscribe solution that can handle all activity stream data and processing on a consumer-scale web site. This kind of activity (page views, searches, and other user actions) is a key ingredient in many of the social features on the modern web. Because of the throughput requirements, this data is typically handled by “logging” and ad hoc log aggregation solutions. Such ad hoc solutions are a viable way to deliver logging data to an offline analysis system like Hadoop, but they are very limiting for building real-time processing. Kafka aims to unify offline and online processing by providing a mechanism for parallel load into Hadoop as well as the ability to partition real-time consumption over a cluster of machines.
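
As a rough picture of what partitioning consumption over a cluster of machines means, the sketch below hands out partitions to consumers in round-robin fashion. The function assign_partitions and the consumer names are hypothetical, and this is not Kafka's actual assignment algorithm; it only shows the idea that each partition is read by exactly one consumer in the group.

```python
def assign_partitions(partition_ids, consumer_ids):
    """Spread partitions over consumers round-robin (illustrative only)."""
    assignment = {consumer: [] for consumer in consumer_ids}
    for index, partition in enumerate(sorted(partition_ids)):
        owner = consumer_ids[index % len(consumer_ids)]
        assignment[owner].append(partition)
    return assignment


# Six partitions shared by three hypothetical consumer machines.
assignment = assign_partitions(range(6), ["consumer-a", "consumer-b", "consumer-c"])
```

Since every partition has a single owner, the consumers can work in parallel without seeing the same message twice.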

 

Cluster configuration
Steps to set up a clustered (multi-node, multi-broker) Kafka configuration:
  1. Download Apache Kafka to every node in your cluster. The steps below assume it is installed at $KAFKA_HOME
  2. Add a number of configuration files under $KAFKA_HOME/config, each representing the configuration of a single broker. Copy the contents of $KAFKA_HOME/config/server.properties into something like $KAFKA_HOME/config/server-X.properties
  3. For each configuration file:
    1. Set the broker.id property to an integer that is unique to the individual broker
    2. Set the zookeeper.connect property to the ensemble of nodes that your ZooKeeper instance is running on. NOTE: it’s good practice to add a directory name at the end of the ZooKeeper host and port list so that ZooKeeper can be shared with other applications, e.g. host:port/kafka
    3. For very data-heavy real-time applications, consider setting log.retention.hours=1 and log.cleaner.enable=true to limit how much log data accumulates on disk
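
The per-broker settings from step 3 can be sketched as a small script that produces the lines to change in each server-X.properties file. This is a sketch under assumptions: the ZooKeeper host names (zk1, zk2, zk3), the broker count of three, and the helper broker_properties are made up for the example. The output contains only the overrides discussed above, not a complete broker configuration, so it is meant to be merged into the copy of server.properties.

```python
# Hypothetical ZooKeeper ensemble; the trailing /kafka is the shared-directory
# suffix recommended in step 3.2.
ZOOKEEPER_CONNECT = "zk1:2181,zk2:2181,zk3:2181/kafka"


def broker_properties(broker_id, zookeeper_connect=ZOOKEEPER_CONNECT):
    """Return the per-broker overrides as properties-file text."""
    overrides = {
        "broker.id": broker_id,                  # unique integer per broker
        "zookeeper.connect": zookeeper_connect,  # same ensemble for all brokers
        "log.retention.hours": 1,                # optional, see step 3.3
        "log.cleaner.enable": "true",            # optional, see step 3.3
    }
    return "".join(f"{key}={value}\n" for key, value in overrides.items())


# One set of overrides per broker, keyed by the file name from step 2.
configs = {f"server-{i}.properties": broker_properties(i) for i in range(3)}

for filename, text in configs.items():
    print(f"# {filename}")
    print(text)
```

Only broker.id differs between the files here; everything else is shared across the cluster.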

 

 
