What Is Kafka Used For?
Kafka is a distributed streaming platform, which means it provides three key capabilities:
- Like a message queue or enterprise messaging system, it can publish and subscribe to streams of records
- It can store streams of records in a durable, fault-tolerant way
- It can process streams of records as they occur
Kafka is most commonly used for two broad classes of application:
- Building real-time streaming data pipelines that reliably move data between systems or applications
- Building real-time streaming applications that transform or react to streams of data
How does Kafka accomplish this?
Basic Concepts of Kafka
- Kafka runs as a cluster on one or more servers that can span multiple data centres
- A Kafka cluster stores streams of records in categories called ‘topics’
- Each record consists of a key, a value, and a timestamp
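A record can be pictured as nothing more than that triple. The following is a minimal illustrative model, not Kafka's actual client API (the `Record` class and its fields are invented here for illustration):

```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Record:
    """Toy model of a Kafka record: a key, a value, and a timestamp."""
    key: Optional[str]       # keys may be absent; when present they often drive partitioning
    value: str
    timestamp: float = field(default_factory=time.time)  # assigned when the record is created

r = Record(key="user-42", value="clicked checkout")
print(r.key, r.value)
```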
Kafka and Application Programming Interfaces (APIs)
There are four main APIs in Kafka:
- Producer API – allows an application to publish a stream of records to one or more Kafka topics
- Consumer API – allows an application to subscribe to one or more topics and process the stream of records produced to them
- Streams API – allows an application to act as a stream processor, consuming input streams from one or more topics and producing output streams to one or more topics, effectively transforming input streams into output streams
- Connector API – allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems
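The division of labour between the Producer and Consumer APIs can be sketched with a toy in-memory broker. This is an illustrative model only (the `MiniBroker` class and its method names are invented here), not the real client API:

```python
from collections import defaultdict

class MiniBroker:
    """Toy in-memory model of Kafka's producer/consumer split."""
    def __init__(self):
        self.topics = defaultdict(list)  # topic name -> append-only list of records

    # Producer API role: publish a record to a topic.
    def produce(self, topic, record):
        self.topics[topic].append(record)
        return len(self.topics[topic]) - 1  # the record's offset in the topic

    # Consumer API role: read records from a topic, starting at a given offset.
    def consume(self, topic, offset=0):
        return self.topics[topic][offset:]

broker = MiniBroker()
broker.produce("page-views", {"user": "alice", "page": "/home"})
broker.produce("page-views", {"user": "bob", "page": "/cart"})
print(broker.consume("page-views", offset=1))
```

Note that consuming does not remove records: a second consumer reading from offset 0 would still see everything.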
Communication between clients and servers uses a simple, high-performance, language-agnostic TCP protocol. The protocol is versioned and maintains backwards compatibility with older versions. A Java client is provided for Kafka, but clients are available in many other languages.
Records are published to a category or feed name called a ‘topic’. Topics in Kafka can always have multiple subscribers.
The Kafka cluster maintains a partitioned log for each topic.
Each partition is an ordered sequence of records that is continually appended to. Each record in a partition is assigned a sequential id number called an ‘offset’, which uniquely identifies it within the partition.
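These offset mechanics can be sketched as a toy append-only log. The `PartitionLog` class below is invented for illustration, not Kafka's API; the point is that offsets are sequential and the reader, not the log, decides where to read from:

```python
class PartitionLog:
    """Toy model of one topic partition: an ordered, append-only sequence of records."""
    def __init__(self):
        self._records = []

    def append(self, record):
        offset = len(self._records)   # offsets are sequential: 0, 1, 2, ...
        self._records.append(record)
        return offset

    def read(self, offset):
        """The consumer controls its own offset and may rewind to reprocess."""
        return self._records[offset:]

log = PartitionLog()
for msg in ["a", "b", "c"]:
    log.append(msg)
assert log.read(0) == ["a", "b", "c"]   # replay from the beginning
assert log.read(2) == ["c"]             # resume from a stored offset
```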
The cluster durably persists all published records, whether or not they have been consumed, using a configurable retention period. After that period expires, records are discarded to free up space. Because Kafka’s performance is effectively constant with respect to data size, storing data for a long time is not a problem.
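Retention is set with broker configuration (and can be overridden per topic). The keys below are real Kafka broker settings, shown with their defaults; treat the fragment as a sketch of where retention lives rather than a recommended configuration:

```properties
# server.properties: keep log data for 7 days before discarding (Kafka's default)
log.retention.hours=168
# Optional size-based limit per partition; -1 (the default) means no size limit
log.retention.bytes=-1
```

Individual topics can override the broker default with the topic-level `retention.ms` config.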
The only metadata retained per consumer is the offset, or position, of that consumer in the log. The consumer controls this offset: normally it advances linearly as records are read, but the consumer can also reset to an older offset to reprocess data.
The combination of these features means that Kafka consumers are cheap – they can move around with little impact on the cluster or on other consumers.
The partitions in the log serve several purposes:
- The log can scale beyond a size that fits on a single server; each individual partition must fit on the server that hosts it, but because a topic can have many partitions, it can handle an arbitrary amount of data
- They also act as the unit of parallelism
The partitions of the log are distributed across the servers in the Kafka cluster, with each server handling data and requests for a share of the partitions. For fault tolerance, each partition is replicated across a configurable number of servers.
Each partition has one server acting as the ‘leader’ and zero or more servers acting as ‘followers’. The leader handles all read and write requests for the partition, while the followers replicate the leader. If the leader fails, one of the followers automatically becomes the new leader.
Using ‘MirrorMaker’, Kafka messages can be replicated across multiple data centres or cloud regions. This can be used in active/passive scenarios for backup and recovery, or in active/active scenarios to place data closer to your users.
The producer decides which record to assign to which partition within a topic. This can be done in a round-robin fashion simply to balance load, or according to a semantic partition function (for example, based on a key in the record).
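Both strategies can be sketched in a few lines. This is an illustrative sketch: the partition count and function names are invented here, and the real Java client's default partitioner uses murmur2 hashing, but any stable hash taken modulo the partition count shows the idea that the same key always lands on the same partition:

```python
import hashlib
import itertools

NUM_PARTITIONS = 3  # illustrative; a real topic's partition count is set at creation

def partition_by_key(key: str) -> int:
    """Semantic partitioning: a stable hash means the same key -> same partition."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Round-robin alternative: spread keyless records evenly across partitions.
round_robin = itertools.cycle(range(NUM_PARTITIONS))

assert partition_by_key("user-42") == partition_by_key("user-42")  # stable mapping
print([next(round_robin) for _ in range(5)])  # [0, 1, 2, 0, 1]
```

Key-based partitioning matters because ordering is only guaranteed within a partition: routing all of one user's records to one partition keeps them in order.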
Consumers label themselves with a consumer group name, and each published record is delivered to one consumer instance within each subscribing group. Consumer instances can be in separate processes or on separate machines.
If all the consumer instances have the same consumer group, records are load balanced over the instances.
If the consumer instances have different consumer groups, each record is broadcast to all the consumer processes.
Topics generally have a small number of consumer groups, one per ‘logical subscriber’.
Consumption is implemented in Kafka by dividing the partitions in the log over the consumer instances, so that at any point in time each instance is the exclusive consumer of a ‘fair share’ of the partitions.
The maintenance of group membership is handled dynamically: if new instances join the group they take over some partitions from other members, and if an instance dies its partitions are distributed to the remaining instances.
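This ‘fair share’ division can be sketched with a simple round-robin assignment. In real Kafka the group coordinator performs this with pluggable assignment strategies; the `assign_partitions` function below is invented for illustration:

```python
def assign_partitions(partitions, consumers):
    """Round-robin split: each partition goes to exactly one consumer in the group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 6 partitions over 2 consumers: each gets an exclusive fair share of 3.
print(assign_partitions(list(range(6)), ["c1", "c2"]))
# {'c1': [0, 2, 4], 'c2': [1, 3, 5]}

# If a third consumer joins, a rebalance redistributes the partitions.
print(assign_partitions(list(range(6)), ["c1", "c2", "c3"]))
# {'c1': [0, 3], 'c2': [1, 4], 'c3': [2, 5]}
```

This also shows why there is no point running more consumers in a group than there are partitions: the extras would be assigned nothing.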
Kafka provides a total order over records within a partition, but not between different partitions in a topic.
Kafka can also be used as a multi-tenant solution. Multi-tenancy is enabled by configuring which topics can produce or consume data, and administrators can define and enforce quotas on the broker resources used by clients.
Kafka offers the following guarantees:
- Messages sent by a producer to a particular topic partition will be appended in the order they were sent
- A consumer instance sees records in the order they are stored in the log
- A topic with replication factor N will tolerate up to N-1 server failures without losing any committed records
Message Broker Kafka Overview
Messaging traditionally has two main models: queuing and publish-subscribe.
Queuing – a pool of consumers reads from a server, and each record goes to one of them. Queuing allows the processing of data to be divided over multiple consumer instances, but once a record is read it is gone.
Publish-subscribe – each record is broadcast to all consumers. This allows data to be sent to many processes, but because every record goes to every subscriber, there is no way to scale the processing.
In Kafka, the consumer group generalises both of these concepts: as with a queue, processing can be divided over the members of a group, and as with publish-subscribe, records can be broadcast to multiple groups. A distinct advantage of Kafka is that every topic has both of these properties. Kafka also has stronger ordering guarantees than traditional messaging systems: by using the partition, rather than a single queue, as the unit of parallelism, Kafka provides both ordering guarantees and load balancing across a pool of consumer processes.
Kafka – a storage system
Kafka is also a particularly good storage system. Data written to Kafka is written to disk and replicated for fault tolerance. Producers can wait for acknowledgment, so a write is not considered complete until it is fully replicated and guaranteed to persist even if the server written to fails.
Kafka’s disk structures also scale well, performing the same with large amounts of data as with small. You can think of Kafka as a special-purpose distributed file system dedicated to low-latency commit log storage, replication, and propagation, and above all, high performance.
Kafka Stream Processing
Reading, writing, and storing streams of data is not enough; Kafka also enables real-time processing of streams.
A stream processor in Kafka takes continual streams of data from input topics, performs some processing on that input, and produces continual streams of data to output topics. The producer and consumer APIs allow simple, direct processing, but Kafka also provides a fully integrated Streams API for more complex transformations. This helps with hard problems such as handling data that arrives out of order.
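The Streams API is a Java library, but the consume-transform-produce shape of a stream processor can be modelled with a plain generator. This is a conceptual sketch only (the function and topic names are invented; the ‘transformation’ here is just uppercasing):

```python
def stream_processor(input_stream):
    """Consume an input stream, transform each record, produce an output stream."""
    for record in input_stream:
        yield record.upper()          # the transformation step

# The input 'topic' is an unbounded-in-principle stream; here, a finite example.
input_topic = iter(["page view", "add to cart", "checkout"])
output_topic = list(stream_processor(input_topic))
print(output_topic)  # ['PAGE VIEW', 'ADD TO CART', 'CHECKOUT']
```

Because the processor is lazy, records flow through one at a time as they arrive, which is the defining property of stream processing as opposed to batch processing.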
As a streaming platform, Kafka combines messaging, storage, and stream processing. It allows historical data to be stored and processed, and it allows processing of future messages that arrive after you subscribe, which means streaming applications can treat past and future data in the same way. Its stream processing facilities allow data to be transformed as it arrives. This combination is what makes Kafka so important as a platform for streaming applications and streaming data pipelines.
This concludes the Message Broker Kafka overview. Hopefully you now have a clear understanding of what Kafka is used for.