The operation of apps, web services and server applications, to name but a few, presents a variety of challenges for those running them. For example, one of the most common challenges is ensuring that data streams are transmitted without interruption and processed as quickly and efficiently as possible. The messaging and streaming application Apache Kafka is a piece of software that greatly simplifies this challenge. Originally developed as a message queuing service for LinkedIn, this open source software now provides a comprehensive platform for data storage, transfer and processing.

What is Kafka?

Apache Kafka is a platform-independent, open source application from the Apache Software Foundation that focuses on processing data streams. The project was originally launched in 2011 by LinkedIn, the company behind the professional social network of the same name, with the aim of developing a message queue. Since its release under the open source Apache 2.0 license, the software's capabilities have been greatly extended, transforming a simple message queue into a powerful streaming platform with a wide range of functions. It is used by well-known companies such as Netflix, Microsoft and Airbnb.

Founded by the original developers of Apache Kafka in 2014, Confluent delivers the most complete version of Apache Kafka with Confluent Platform. It extends the program with additional functions, some of which are open source while others are commercial.

What are Apache Kafka's core functions?

Apache Kafka is primarily designed to optimize the transmission and processing of data streams that would otherwise be transferred via a direct connection between the data source and the data receiver. Kafka acts as a messaging instance between sender and receiver, providing solutions to the common challenges encountered with this type of connection.

For example, the Apache platform provides a solution to the inability to cache data or messages when the receiver is unavailable (e.g. due to network problems). In addition, a properly configured Kafka queue prevents the sender from overloading the receiver, which happens whenever information is sent over a direct connection faster than it can be received and processed. Lastly, Kafka is also well suited to situations in which the target system receives a message but crashes while processing it: whereas the sender would normally assume that processing succeeded despite the crash, Apache Kafka reports the failure to the sender.
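The decoupling described above can be illustrated with a toy in-memory broker (a hypothetical sketch, not the real Kafka protocol or client API): the sender can always append messages, whether or not a receiver is online, and the receiver only advances its position after processing succeeds, so a crash mid-processing causes the same message to be redelivered instead of silently lost.

```python
class ToyBroker:
    """Minimal stand-in for a broker: buffers messages in an append-only log."""

    def __init__(self):
        self.log = []          # append-only message log
        self.committed = 0     # offset of the last successfully processed message

    def send(self, message):
        # The producer can always append, even if no consumer is connected.
        self.log.append(message)

    def poll(self):
        # The consumer reads from its last committed offset.
        if self.committed < len(self.log):
            return self.log[self.committed]
        return None

    def commit(self):
        # Only called after processing succeeded.
        self.committed += 1


broker = ToyBroker()
broker.send("order-1")
broker.send("order-2")

assert broker.poll() == "order-1"
# Simulate a crash during processing: commit() is never called,
# so the next poll() redelivers the same message.
assert broker.poll() == "order-1"
broker.commit()                    # processing succeeded this time
assert broker.poll() == "order-2"
```

Real Kafka consumers work on the same principle: they track consumed offsets and only commit them once processing is complete.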

Unlike pure message queuing services, Apache Kafka is fault tolerant, meaning it can continue to store and process messages even if individual components fail. Combined with its high scalability and its ability to distribute transmitted information across any number of systems (as a distributed transaction log), this makes Apache Kafka an excellent solution for all services that need to store and process data quickly while maintaining high availability.

An overview of the Apache Kafka architecture

Apache Kafka runs as a cluster on one or more servers that can span multiple data centers. The individual servers in the cluster, known as brokers, store incoming data streams and categorize them into topics. The data is divided into partitions, then replicated and distributed within the cluster and assigned a timestamp. As a result, the streaming platform ensures high availability and fast read access. Apache Kafka differentiates between normal topics and compacted topics. In normal topics, Kafka deletes messages as soon as the retention period or storage limit is exceeded. Compacted topics are subject to no such time or size limits; instead, Kafka retains at least the most recent message for each key.
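The difference between the two retention modes can be sketched with a small model (hypothetical code, not Kafka's actual implementation): a normal topic simply drops the oldest messages once the retention limit is reached, while a compacted topic keeps the latest value for every key, however old it is.

```python
def apply_retention(log, max_messages):
    """Normal topic: delete the oldest messages beyond the retention limit."""
    return log[-max_messages:]


def compact(log):
    """Compacted topic: keep only the latest message per key."""
    latest = {}
    for key, value in log:
        latest[key] = value          # later messages overwrite earlier ones
    return list(latest.items())


log = [("user1", "a@x.com"), ("user2", "b@x.com"), ("user1", "a@y.com")]

print(apply_retention(log, 2))  # [('user2', 'b@x.com'), ('user1', 'a@y.com')]
print(compact(log))             # [('user1', 'a@y.com'), ('user2', 'b@x.com')]
```

Note that retention discards user1's first message entirely, whereas compaction still knows user1's current value; this is why compacted topics are popular for change-data and state snapshots.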

Applications that write data to a Kafka cluster are called producers, while applications that read data from it are called consumers. The central component accessed by producers and consumers when processing data streams is a Java library called Kafka Streams. By supporting transactional writes, the library ensures that each message is delivered exactly once, with no duplicates. This is known as exactly-once delivery.

Note

The Kafka Streams Java library is the recommended standard solution for processing data in Kafka clusters. However, you can use Apache Kafka with other data stream processing systems as well.
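The exactly-once guarantee mentioned above rests in part on idempotent writes: each producer attaches a sequence number, and the broker rejects anything it has already seen. The following sketch (hypothetical code, not Kafka's transaction protocol) shows the core idea of deduplicating retries by sequence number.

```python
class IdempotentLog:
    """Accepts each (producer_id, sequence) pair at most once."""

    def __init__(self):
        self.entries = []
        self.last_seq = {}   # highest sequence number seen per producer

    def append(self, producer_id, seq, message):
        if self.last_seq.get(producer_id, -1) >= seq:
            return False     # duplicate retry: silently ignored
        self.last_seq[producer_id] = seq
        self.entries.append(message)
        return True


log = IdempotentLog()
log.append("p1", 0, "payment-42")
log.append("p1", 0, "payment-42")   # network retry of the same write
log.append("p1", 1, "payment-43")

assert log.entries == ["payment-42", "payment-43"]
```

Without the sequence check, the network retry would have written "payment-42" twice; that is exactly the duplicate that at-least-once systems produce and exactly-once systems avoid.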

The technical basics: Kafka's interfaces

Apache Kafka offers five basic interfaces that give applications access to the platform:

  • Kafka Producer: The Kafka Producer API allows applications to send data streams to the broker(s) in a Kafka cluster, where they are categorized and stored in the previously mentioned topics.
  • Kafka Consumer: The Kafka Consumer API gives consumers read access to the data stored in a cluster's topics.
  • Kafka Streams: The Kafka Streams API allows an application to act as a stream processor, converting incoming data streams into outgoing data streams.
  • Kafka Connect: The Kafka Connect API makes it possible to build reusable producers and consumers that connect Kafka topics to existing applications or database systems.
  • Kafka AdminClient: The Kafka AdminClient API makes it easy to manage and inspect Kafka clusters.
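One detail of the Producer API worth illustrating is how a record finds its partition within a topic: records with the same key are hashed to the same partition, which preserves per-key ordering. The sketch below is a simplified model (the real Java client uses murmur2 hashing; plain CRC32-mod is assumed here purely for illustration).

```python
import zlib

NUM_PARTITIONS = 3

def partition_for(key: bytes) -> int:
    """Records with the same key always land in the same partition,
    which preserves per-key ordering."""
    return zlib.crc32(key) % NUM_PARTITIONS

# One append-only log per partition, standing in for a topic.
topic = [[] for _ in range(NUM_PARTITIONS)]

for key, value in [(b"user1", "login"), (b"user2", "login"), (b"user1", "logout")]:
    topic[partition_for(key)].append((key, value))

# Both events for user1 sit in the same partition, in the order they were sent:
p = partition_for(b"user1")
assert [v for k, v in topic[p] if k == b"user1"] == ["login", "logout"]
```

This is also why the choice of message key matters in practice: Kafka guarantees ordering only within a partition, not across a whole topic.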

Communication between client applications and the individual servers in a Kafka cluster takes place via a simple, efficient and language-independent protocol based on TCP. A Java client is provided for Apache Kafka by default, but clients are also available in a variety of other languages, including PHP, Python, C/C++, Ruby, Perl and Go.

Use case scenarios for Apache Kafka

From the outset, Apache Kafka was designed for high read and write throughput. Combined with the previously mentioned APIs and its high flexibility, scalability and fault tolerance, this makes the open source software appealing for a variety of use cases. Apache Kafka is particularly well suited to the following applications:

  • Publishing and subscribing to data streams: Apache Kafka started out as a messaging system. Although the software's functions have since been extended, it remains well suited both for direct message transmission via the queuing system and for broadcast message transmission.
  • Processing data streams: Apache Kafka is a powerful tool for applications that need to react to specific events in real time and must therefore process data streams as quickly and effectively as possible.
  • Storing data streams: Apache Kafka can also be used as a fault-tolerant, distributed storage system, whether you need to store 50 kilobytes or 50 terabytes of consistent data on the server(s).
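A stream processor of the kind the second bullet describes consumes an input stream, updates state per key and continuously emits results. Here is a minimal sketch of a running word count, the canonical streaming example (plain Python generators, not the Kafka Streams API):

```python
from collections import defaultdict

def word_count(stream):
    """Consume a stream of text messages and emit running counts per word."""
    counts = defaultdict(int)
    for message in stream:
        for word in message.lower().split():
            counts[word] += 1
            yield word, counts[word]   # emit an updated (key, count) record

events = ["Kafka streams", "kafka stores data"]
print(list(word_count(events)))
# [('kafka', 1), ('streams', 1), ('kafka', 2), ('stores', 1), ('data', 1)]
```

The key property is that output is produced incrementally as each message arrives, rather than after the whole input has been collected; this is what distinguishes stream processing from batch processing.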

Naturally, all these elements can be combined as desired. As a full streaming platform, Apache Kafka can not only store data and make it available at any time, but also process it in real time and link it to any desired applications and systems.

An overview of common use cases for Apache Kafka:

  • Messaging system
  • Web analytics
  • Storage system
  • Data stream processor
  • Event sourcing
  • Log file analysis and management
  • Monitoring solutions
  • Trans­ac­tion log