Why Kafka?
Because it decouples source systems from target systems: sources only write to Kafka, targets only read from Kafka, and the two never have to talk to each other directly.
- It was created at LinkedIn, later became an open-source project, and is now maintained mainly by Confluent, along with Cloudera and IBM.
- Horizontal scalability: it can scale to hundreds of brokers.
- Distributed, resilient, fault-tolerant architecture.
- High performance (latency under 10 ms).
Kafka Topic:
- Topics categorize different kinds of messages, e.g., bookings, payments, orders, etc.
- A topic is similar to a table in a database (but without all the constraints).
- Topics are split into partitions.
- You cannot query topics; you use Kafka producers to send data and Kafka consumers to read it (a topic-creation sketch follows this list).
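As a concrete illustration, here is a minimal sketch that creates a topic with the Java AdminClient. The broker address localhost:9092 and the topic name "orders" are assumptions for the example:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Properties;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed local broker

        try (AdminClient admin = AdminClient.create(props)) {
            // "orders" topic with 3 partitions, each copied to 2 brokers
            NewTopic orders = new NewTopic("orders", 3, (short) 2);
            admin.createTopics(List.of(orders)).all().get(); // block until the broker confirms
        }
    }
}
```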
Kafka Partitions:
- A partition is like a queue.
- Each partition is ordered; ordering is guaranteed only within a partition, not across partitions.
- Each message in a partition gets an incremental id, called an offset.
- Data is kept for a limited amount of time (one week by default; configurable).
- Data in a partition is immutable (you cannot change it once written).
- Data is assigned to a partition essentially at random unless a key is provided (see the sketch below).
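To see partitions and offsets in action, a small sketch (again assuming a local broker and the "orders" topic from above) can print the partition and offset that each write landed on:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class PartitionAndOffsetDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 5; i++) {
                // no key -> the partitioner spreads messages across partitions
                RecordMetadata meta =
                    producer.send(new ProducerRecord<>("orders", "message-" + i)).get();
                // offsets grow by one within each partition
                System.out.printf("partition=%d offset=%d%n", meta.partition(), meta.offset());
            }
        }
    }
}
```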
Kafka Brokers:
- A Kafka server is also called a broker.
- A Kafka cluster is composed of multiple brokers.
- Broker IDs must be integers.
- Each broker hosts some of the topic partitions, but not all partitions of a topic, because Kafka is distributed.
- After connecting to any one broker (called a bootstrap broker), you are connected to the rest of the brokers in that cluster.
Kafka Broker Discovery:
- Every Kafka broker can act as a bootstrap server.
- Once you reach a single broker, it returns metadata describing all the other brokers, topics, and partitions (an example follows).
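In practice, a client is usually given more than one bootstrap broker, purely so the first connection succeeds even if one of them is down; the broker names below are placeholders:

```java
import org.apache.kafka.clients.CommonClientConfigs;

import java.util.Properties;

public class BootstrapConfig {
    static Properties clientProps() {
        Properties props = new Properties();
        // List several brokers so bootstrapping still works if one is down;
        // after the first connection, the client discovers the whole cluster
        // from that broker's metadata.
        props.put(CommonClientConfigs.BOOTSTRAP_SERVERS_CONFIG,
                  "broker1:9092,broker2:9092,broker3:9092");
        return props;
    }
}
```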
Kafka Topic Replication Factor:
- The replication factor is the number of brokers each partition is copied to; the topic-creation sketch above used a replication factor of 2.
Concept of Leader in a Partition:
- At any given time, only one broker can be the leader for a partition.
- Only the leader receives and serves data for that partition.
- The other brokers hosting that partition only synchronize the data.
- So each partition has one leader and multiple ISRs (in-sync replicas).
- ZooKeeper helps decide which broker becomes the leader and which replicas are ISRs (see the sketch below).
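A minimal sketch of how you might inspect each partition's leader and ISR with the Java AdminClient, assuming Kafka clients 3.1+ (for allTopicNames), a local broker, and the "orders" topic:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

import java.util.List;
import java.util.Properties;

public class ShowLeaderAndIsr {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc =
                admin.describeTopics(List.of("orders")).allTopicNames().get().get("orders");
            desc.partitions().forEach(p ->
                // one leader per partition; the ISR lists the replicas that are caught up
                System.out.printf("partition=%d leader=%d isr=%s%n",
                                  p.partition(), p.leader().id(), p.isr()));
        }
    }
}
```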
Zookeeper:
- ZooKeeper manages the Kafka brokers (it keeps a list of them).
- ZooKeeper helps with leader election for partitions.
- It sends notifications to Kafka whenever anything changes (a topic is created or deleted, a broker joins, a broker goes down, and so on).
- Kafka 2.x cannot work without ZooKeeper.
- Kafka 3.x can work without ZooKeeper, using Kafka Raft (KRaft) instead; ZooKeeper support is kept for backward compatibility.
- Kafka 4.x removes ZooKeeper entirely.
- The reason for removing ZooKeeper is its scaling issues once Kafka clusters have more than 100,000 partitions.
- ZooKeeper by design operates with an odd number of servers (1, 3, 5, 7) and usually no more than 7.
Should we use ZooKeeper with Kafka?
- If you are managing Kafka brokers on a version before Kafka 4.0, yes, you have to use it (until you migrate to KRaft).
Kafka Producer:
- Producers write data to topics (which are made up of partitions).
- Producers automatically know which broker and partition to write to; you only name the topic.
- In case of a broker failure, producers recover automatically; this is built into the client.
- Producers can decide whether they want acknowledgment of their data writes (see the sketch after this list):
acks=0 : the producer does not wait for acknowledgment (possible data loss)
acks=1 : the producer waits for the leader's acknowledgment (limited data loss); the default before Kafka 3.0
acks=all : leader plus all in-sync replicas must acknowledge (no data loss); the default since Kafka 3.0
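A minimal producer sketch with acks=all, assuming a local broker and a "payments" topic:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class AcksProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // "all": wait until the leader and all in-sync replicas have the record
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("payments", "payment-accepted"));
            producer.flush(); // make sure the record actually left the client
        }
    }
}
```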
Producer Message Key :
- The producer can choose to send a key with the message. The key can be a string, a number, binary, etc.
- If key=null, messages are sent round robin across the topic's partitions (partition 0, then 1, then 2, ...); newer clients batch this with a sticky partitioner.
- If a key is sent, all messages with that key go to the same partition (see the sketch below).
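A sketch of keyed writes, assuming a local broker and a "bookings" topic; the key "user-42" is just an illustration:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class KeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key -> same partition, so these two events stay in order.
            producer.send(new ProducerRecord<>("bookings", "user-42", "booking created"));
            producer.send(new ProducerRecord<>("bookings", "user-42", "booking confirmed"));
        }
    }
}
```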
Anatomy of a Kafka Message:
- Key (binary; can be null)
- Value (binary; can be null)
- Compression type (none, gzip, snappy, lz4, zstd)
- Headers (optional key-value pairs)
- Partition + offset (assigned once the message is stored)
- Timestamp (set by the user or by the system)
Kafka Message Serializer:
Kafka only accepts bytes as input from producers and sends bytes as output to consumers. That is why we have to serialize objects/data into bytes (example below).
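For instance, with the built-in serializers, a producer configured like this turns an Integer key and a String value into bytes before anything is sent (a config sketch, not a complete producer):

```java
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.IntegerSerializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class SerializerConfig {
    static Properties producerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // The producer converts the Integer key and String value to bytes
        // on the client side, before the record reaches the broker.
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, IntegerSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        return props;
    }
}
```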
Kafka Message Key Hashing:
Key hashing is the process of mapping a key to a partition.
In the default Kafka partitioner, the keys are hashed using the murmur2 algorithm.
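Sketched in code, the default partitioner's key-to-partition mapping boils down to one line; Utils.murmur2 and Utils.toPositive are real helpers in the Kafka clients library:

```java
import org.apache.kafka.common.utils.Utils;

import java.nio.charset.StandardCharsets;

public class DefaultPartitioning {
    // What the default partitioner does for a non-null key.
    static int partitionFor(byte[] keyBytes, int numPartitions) {
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    public static void main(String[] args) {
        byte[] key = "user-42".getBytes(StandardCharsets.UTF_8);
        // The same key always maps to the same partition (here, out of 3).
        System.out.println(partitionFor(key, 3));
    }
}
```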
Kafka Consumer:
- Consumers read data from a topic using a pull model (brokers do not push messages to consumers).
- Consumers know which broker to read from (just like producers).
- In case of broker failures, consumers know how to recover (just like producers).
- Data is read in order within each partition.
- One consumer can also pull data from multiple partitions (see the sketch below).
Consumer Deserialization:
In this step the data, which is in bytes, has to be deserialized back into objects, using a deserializer that matches the producer's serializer.
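Putting the consumer and deserialization pieces together, a minimal sketch, assuming a local broker, the "orders" topic, and a hypothetical group id "orders-app":

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class OrdersConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-app");
        // Deserializers turn the raw bytes back into Strings.
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                // pull model: the consumer asks the broker for new records
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                                      r.partition(), r.offset(), r.key(), r.value());
                }
            }
        }
    }
}
```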
Consumer Groups:
- A consumer group is a set of consumers that cooperate to read a topic.
- Within a consumer group, a consumer can read data from multiple partitions, but a partition cannot be consumed by more than one consumer from the same group.
- If you have more consumers than partitions, some consumers will be inactive, since a partition cannot be shared by multiple consumers in the same group.
- We can have multiple consumer groups reading from the same partitions independently (see the sketch below).
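A sketch of that behavior: two consumers started with the same group.id split the topic's partitions between them (group and topic names are the same assumptions as above):

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class GroupDemo {
    public static void main(String[] args) {
        // Two consumers in the same group: the broker divides the topic's
        // partitions between them; no partition is assigned to both.
        for (String name : List.of("consumer-A", "consumer-B")) {
            new Thread(() -> run(name)).start();
        }
    }

    static void run(String name) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-app"); // same group for both
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                consumer.poll(Duration.ofSeconds(1));
                // With 3 partitions, one consumer owns 2 and the other owns 1.
                System.out.println(name + " owns " + consumer.assignment());
            }
        }
    }
}
```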
Consumer Offsets:
- Kafka stores the offsets at which a consumer group has been reading.
- The committed offsets live in a topic named __consumer_offsets
(the double underscore prefix means it is an internal Kafka topic)
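A minimal sketch of committing offsets manually, so the position written to __consumer_offsets is under the application's control (group and topic names assumed as before):

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-app");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Commit offsets ourselves instead of on the auto-commit interval.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                // ... process records ...
                // Writes the group's position to __consumer_offsets, so a
                // restarted consumer resumes from here instead of re-reading.
                consumer.commitSync();
            }
        }
    }
}
```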