The operation of apps, web services and server applications, to name but a few, presents a variety of challenges for those running them. For example, one of the most common challenges is ensuring that data streams are transmitted without interruption and processed as quickly and efficiently as possible. The messaging and streaming application Apache Kafka is a piece of software that greatly simplifies this challenge. Originally developed as a message queuing service for LinkedIn, this open source software now provides a comprehensive platform for data storage, transfer and processing.

What is Kafka?

Apache Kafka is a platform-independent, open source application from the Apache Software Foundation that focuses on processing data streams. The project was originally launched in 2011 by LinkedIn, the company behind the professional social network of the same name, with the aim of developing a message queue. Since its release under the open source Apache 2.0 license, the software's capabilities have been greatly extended, transforming a simple message queue into a powerful streaming platform with a wide range of functions. It is used by well-known companies such as Netflix, Microsoft and Airbnb.

Founded by the original developers of Apache Kafka in 2014, Confluent delivers the most complete version of Apache Kafka with Confluent Platform. It extends the program with additional functions, some of which are open source while others are commercial.

What are Apache Kafka's core functions?

Apache Kafka is primarily designed to optimize the transmission and processing of data streams that would otherwise be transferred via a direct connection between the data source and the data receiver. Kafka acts as a messaging instance between sender and receiver, providing solutions to the common challenges encountered with this type of connection.

For example, the Apache platform provides a solution to the inability to cache data or messages when the receiver is unavailable (e.g. due to network problems). In addition, a properly configured Kafka queue prevents the sender from overloading the receiver, which happens whenever information is sent over a direct connection faster than it can be received and processed. Lastly, Kafka is also well suited to situations in which the target system receives a message but crashes while processing it: whereas the sender would normally assume that processing succeeded despite the crash, Apache Kafka reports the failure to the sender.
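The decoupling described above can be illustrated with a toy in-memory broker (a hypothetical sketch, not the real Kafka protocol or client API): the sender can always append messages, whether or not a receiver is online, and the receiver only advances its position after processing succeeds, so a crash mid-processing causes the same message to be redelivered instead of silently lost.

```python
class ToyBroker:
    """Minimal stand-in for a broker: buffers messages in an append-only log."""

    def __init__(self):
        self.log = []          # append-only message log
        self.committed = 0     # offset of the last successfully processed message

    def send(self, message):
        # The producer can always append, even if no consumer is connected.
        self.log.append(message)

    def poll(self):
        # The consumer reads from its last committed offset.
        if self.committed < len(self.log):
            return self.log[self.committed]
        return None

    def commit(self):
        # Only called after processing succeeded.
        self.committed += 1


broker = ToyBroker()
broker.send("order-1")
broker.send("order-2")

assert broker.poll() == "order-1"
# Simulate a crash during processing: commit() is never called,
# so the next poll() redelivers the same message.
assert broker.poll() == "order-1"
broker.commit()                    # processing succeeded this time
assert broker.poll() == "order-2"
```

Real Kafka consumers work on the same principle: they track consumed offsets and only commit them once processing is complete.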

Unlike pure message queuing services, Apache Kafka is fault tolerant, meaning it can continue to store and process messages even if individual components fail. Combined with its high scalability and its ability to distribute transmitted information across any number of systems (as a distributed transaction log), this makes Apache Kafka an excellent solution for all services that need to store and process data quickly while maintaining high availability.

An overview of the Apache Kafka architecture

Apache Kafka runs as a cluster on one or more servers that can span multiple data centers. The individual servers in the cluster, known as brokers, store incoming data streams and categorize them into topics. The data is divided into partitions, then replicated and distributed within the cluster and assigned a timestamp. As a result, the streaming platform ensures high availability and fast read access. Apache Kafka differentiates between normal topics and compacted topics. In normal topics, Kafka deletes messages as soon as the retention period or storage limit is exceeded. Compacted topics are subject to no such time or size limits; instead, Kafka retains at least the most recent message for each key.
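The difference between the two retention modes can be sketched with a small model (hypothetical code, not Kafka's actual implementation): a normal topic simply drops the oldest messages once the retention limit is reached, while a compacted topic keeps the latest value for every key, however old it is.

```python
def apply_retention(log, max_messages):
    """Normal topic: delete the oldest messages beyond the retention limit."""
    return log[-max_messages:]


def compact(log):
    """Compacted topic: keep only the latest message per key."""
    latest = {}
    for key, value in log:
        latest[key] = value          # later messages overwrite earlier ones
    return list(latest.items())


log = [("user1", "a@x.com"), ("user2", "b@x.com"), ("user1", "a@y.com")]

print(apply_retention(log, 2))  # [('user2', 'b@x.com'), ('user1', 'a@y.com')]
print(compact(log))             # [('user1', 'a@y.com'), ('user2', 'b@x.com')]
```

Note that retention discards user1's first message entirely, whereas compaction still knows user1's current value; this is why compacted topics are popular for change-data and state snapshots.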

Applications that write data to a Kafka cluster are called producers, while applications that read data from it are called consumers. The central component accessed by producers and consumers when processing data streams is a Java library called Kafka Streams. By supporting transactional writes, the library ensures that each message is delivered exactly once, with no duplicates. This is known as exactly-once delivery.

Note

The Kafka Streams Java library is the recommended standard solution for processing data in Kafka clusters. However, you can use Apache Kafka with other data stream processing systems as well.
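The exactly-once guarantee mentioned above rests in part on idempotent writes: each producer attaches a sequence number, and the broker rejects anything it has already seen. The following sketch (hypothetical code, not Kafka's transaction protocol) shows the core idea of deduplicating retries by sequence number.

```python
class IdempotentLog:
    """Accepts each (producer_id, sequence) pair at most once."""

    def __init__(self):
        self.entries = []
        self.last_seq = {}   # highest sequence number seen per producer

    def append(self, producer_id, seq, message):
        if self.last_seq.get(producer_id, -1) >= seq:
            return False     # duplicate retry: silently ignored
        self.last_seq[producer_id] = seq
        self.entries.append(message)
        return True


log = IdempotentLog()
log.append("p1", 0, "payment-42")
log.append("p1", 0, "payment-42")   # network retry of the same write
log.append("p1", 1, "payment-43")

assert log.entries == ["payment-42", "payment-43"]
```

Without the sequence check, the network retry would have written "payment-42" twice; that is exactly the duplicate that at-least-once systems produce and exactly-once systems avoid.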

The technical basics: Kafka's interfaces

Apache Kafka offers five basic interfaces that give applications access to the platform:

  • Kafka Producer: The Kafka Producer API allows applications to send data streams to the broker(s) in a Kafka cluster, where they are categorized and stored in the previously mentioned topics.
  • Kafka Consumer: The Kafka Consumer API gives consumers read access to the data stored in a cluster's topics.
  • Kafka Streams: The Kafka Streams API allows an application to act as a stream processor, converting incoming data streams into outgoing data streams.
  • Kafka Connect: The Kafka Connect API makes it possible to build reusable producers and consumers that connect Kafka topics to existing applications or database systems.
  • Kafka AdminClient: The Kafka AdminClient API makes it easy to manage and inspect Kafka clusters.
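One detail of the Producer API worth illustrating is how a record finds its partition within a topic: records with the same key are hashed to the same partition, which preserves per-key ordering. The sketch below is a simplified model (the real Java client uses murmur2 hashing; plain CRC32-mod is assumed here purely for illustration).

```python
import zlib

NUM_PARTITIONS = 3

def partition_for(key: bytes) -> int:
    """Records with the same key always land in the same partition,
    which preserves per-key ordering."""
    return zlib.crc32(key) % NUM_PARTITIONS

# One append-only log per partition, standing in for a topic.
topic = [[] for _ in range(NUM_PARTITIONS)]

for key, value in [(b"user1", "login"), (b"user2", "login"), (b"user1", "logout")]:
    topic[partition_for(key)].append((key, value))

# Both events for user1 sit in the same partition, in the order they were sent:
p = partition_for(b"user1")
assert [v for k, v in topic[p] if k == b"user1"] == ["login", "logout"]
```

This is also why the choice of message key matters in practice: Kafka guarantees ordering only within a partition, not across a whole topic.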

Communication between client applications and the individual servers in a Kafka cluster takes place via a simple, efficient and language-independent protocol based on TCP. A Java client is provided for Apache Kafka by default, but clients are also available in a variety of other languages, including PHP, Python, C/C++, Ruby, Perl and Go.

Use case scenarios for Apache Kafka

From the outset, Apache Kafka was designed for high read and write throughput. Combined with the previously mentioned APIs and its high flexibility, scalability and fault tolerance, this makes the open source software appealing for a variety of use cases. Apache Kafka is particularly well suited to the following applications:

  • Publishing and subscribing to data streams: Apache Kafka started out as a messaging system. Although the software's functions have since been extended, it remains well suited both for direct message transmission via the queuing system and for broadcast message transmission.
  • Processing data streams: Apache Kafka is a powerful tool for applications that need to react to specific events in real time and must therefore process data streams as quickly and effectively as possible.
  • Storing data streams: Apache Kafka can also be used as a fault-tolerant, distributed storage system, whether you need to store 50 kilobytes or 50 terabytes of consistent data on the server(s).
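A stream processor of the kind the second bullet describes consumes an input stream, updates state per key and continuously emits results. Here is a minimal sketch of a running word count, the canonical streaming example (plain Python generators, not the Kafka Streams API):

```python
from collections import defaultdict

def word_count(stream):
    """Consume a stream of text messages and emit running counts per word."""
    counts = defaultdict(int)
    for message in stream:
        for word in message.lower().split():
            counts[word] += 1
            yield word, counts[word]   # emit an updated (key, count) record

events = ["Kafka streams", "kafka stores data"]
print(list(word_count(events)))
# [('kafka', 1), ('streams', 1), ('kafka', 2), ('stores', 1), ('data', 1)]
```

The key property is that output is produced incrementally as each message arrives, rather than after the whole input has been collected; this is what distinguishes stream processing from batch processing.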

Naturally, all these elements can be combined as desired. As a full streaming platform, Apache Kafka can not only store data and make it available at any time, but also process it in real time and link it to any desired applications and systems.

An overview of common use cases for Apache Kafka:

  • Messaging system
  • Web analytics
  • Storage system
  • Data stream processor
  • Event sourcing
  • Log file analysis and management
  • Monitoring solutions
  • Trans­ac­tion log