Why Kafka?
Because it decouples source systems (data streams) from target systems:
data:image/s3,"s3://crabby-images/bc270/bc27061d9a67309556894c86110c982dfe2787e9" alt=""
With Kafka in the middle, the above diagram can be redrawn as:
data:image/s3,"s3://crabby-images/59e18/59e1867470ea210b6fa5b9d7d38d31921b659034" alt=""
- It was created by LinkedIn, later became an open-source project, and is now maintained mainly by Confluent, Cloudera, and IBM.
- Horizontal scalability:
- It can scale to hundreds of brokers.
- Distributed, resilient architecture, fault tolerant.
- High performance (latency of less than 10 ms).
Kafka Topic:
- A topic is used to categorize different kinds of messages, e.g., Bookings, Payments, Orders, etc.
- It is similar to a table in a database (without all the constraints).
- Topics are split into partitions.
- You cannot query topics. You have to use Kafka producers to send data and Kafka consumers to read it.
Kafka Partitions:
- It's like a queue, except that reading a message does not remove it.
- Each partition is ordered. Order is guaranteed within a partition only.
- Each message in a partition gets an incremental ID, called the offset.
- Data is kept for a limited amount of time (1 week by default, configurable).
- Data in a partition is immutable (you can't change it).
- Data is assigned to a partition randomly unless a key is provided.
data:image/s3,"s3://crabby-images/06fca/06fca456f5300e9b79978f91028f492b80e9192a" alt=""
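The partition properties above can be sketched with a toy in-memory model. This is only an illustration, not how Kafka stores data on disk; the `Partition` class and its method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Toy append-only log: records are appended, never modified in place."""
    records: list = field(default_factory=list)

    def append(self, value: bytes) -> int:
        """Append a record and return its offset (its position in the log)."""
        self.records.append(value)
        return len(self.records) - 1

    def read(self, offset: int) -> bytes:
        """Return the record at the given offset without removing it."""
        return self.records[offset]


partition = Partition()
first = partition.append(b"booking-created")   # offset 0
second = partition.append(b"booking-paid")     # offset 1
```

Because offsets are just positions in one partition's log, they say nothing about ordering across different partitions of the same topic.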
Kafka Brokers:
- A Kafka server is also called a broker.
- A Kafka cluster is composed of multiple brokers.
- Each broker is identified by an ID, which must be an integer.
- Each broker contains certain topic partitions but not all partitions of a topic, because Kafka is distributed.
- After connecting to any one broker (called a bootstrap broker), you are connected to all the other brokers in that cluster.
data:image/s3,"s3://crabby-images/8edd1/8edd12f7703f00efd4961b550832723d140b1496" alt=""
Kafka Broker Discovery:
- Every Kafka broker can be a bootstrap server.
- Once you get hold of a single broker, you can get the details (metadata) of other brokers.
data:image/s3,"s3://crabby-images/bf1b8/bf1b824e4d60b10dc1ed326762475368c5d13786" alt=""
Kafka Topic Replication Factor:
- The replication factor is the number of servers (brokers) on which each partition is replicated (copied).
data:image/s3,"s3://crabby-images/c63f0/c63f0fbffe0fe4b3d8f66aaace747498fded1500" alt=""
Concept of Leader in a Partition:
- At a given time only one broker can be a leader for a partition.
- Only the leader will receive and send data for a partition.
- The other brokers for that partition will only synchronize the data.
- So each partition has one leader and multiple ISRs (in-sync replicas).
- ZooKeeper helps decide which broker is the leader and which are ISRs.
data:image/s3,"s3://crabby-images/d25eb/d25eb939b99df41928d92dea697eb3afae4dbfbc" alt=""
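The leader and ISR idea can be sketched as a small simulation. This is a simplification under stated assumptions: the function `elect_leader` is hypothetical, and real leader election in Kafka involves more state than "first in-sync replica that is alive".

```python
def elect_leader(replicas, in_sync, alive):
    """Return the first replica that is both in sync and alive, or None."""
    for broker_id in replicas:
        if broker_id in in_sync and broker_id in alive:
            return broker_id
    return None


# One partition with replication factor 3, hosted on brokers 101, 102 and 103.
replicas = [101, 102, 103]
in_sync = {101, 102, 103}

leader_before = elect_leader(replicas, in_sync, alive={101, 102, 103})
# Broker 101 goes down: another in-sync replica takes over as leader.
leader_after = elect_leader(replicas, in_sync, alive={102, 103})
```

With a replication factor of 3, the partition in this sketch stays available as long as at least one in-sync replica is alive.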
ZooKeeper:
- ZooKeeper manages Kafka brokers (it keeps a list of them).
- ZooKeeper helps in choosing the leader for each partition.
- It sends notifications to Kafka brokers whenever something changes (for example, a topic is created or deleted, or a broker comes up or goes down).
- Kafka 2.x cannot work without ZooKeeper.
- Kafka 3.x can work without ZooKeeper by using Kafka Raft (KRaft) instead. ZooKeeper is still supported for backward compatibility.
- Kafka 4.x does not have ZooKeeper.
- ZooKeeper is being removed because of its scaling issues when a Kafka cluster has more than 100,000 partitions.
- ZooKeeper by design operates with an odd number of servers (1, 3, 5, 7), and usually never more than 7.
data:image/s3,"s3://crabby-images/a1e75/a1e75ed50db0a1bc84eccaa6ce051508ce670fe8" alt=""
Should we use ZooKeeper with Kafka?
- If you are managing Kafka brokers on a version before Kafka 4.0, yes, you generally have to use it.
Kafka Producer:
- Producers write data to topics (which are made up of partitions).
- Producers automatically know which broker and partition to write to.
- In case of a broker failure, producers recover automatically. They are programmed that way.
- A producer can choose whether it wants acknowledgment of its writes to a topic:
acks=0 : the producer does not wait for acknowledgment (possible data loss)
acks=1 : the producer waits for the leader's acknowledgment (limited data loss); this was the default before Kafka 3.0
acks=all : the leader and all in-sync replicas must acknowledge (no data loss); this is the default since Kafka 3.0
data:image/s3,"s3://crabby-images/49eb4/49eb4cc7d77bf97106237c3ffafb1edc3a28a007" alt=""
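As a rough sketch, the acknowledgment level is just one entry in the producer's configuration. `bootstrap.servers` and `acks` are standard Kafka producer configuration names; the broker address `localhost:9092` and the helper `producer_config` are assumptions made for this example. Client libraries that accept Kafka-style configuration keys would typically be handed a dictionary like this.

```python
VALID_ACKS = {"0", "1", "all"}


def producer_config(acks: str, bootstrap_servers: str = "localhost:9092") -> dict:
    """Build a Kafka-style producer configuration with the chosen acks level."""
    if acks not in VALID_ACKS:
        raise ValueError(f"acks must be one of {sorted(VALID_ACKS)}, got {acks!r}")
    return {"bootstrap.servers": bootstrap_servers, "acks": acks}


# Favor durability: wait for the leader and all in-sync replicas.
safe_config = producer_config("all")
# Favor throughput: do not wait for any acknowledgment.
fast_config = producer_config("0")
```

The trade-off is between latency and safety: the more acknowledgments the producer waits for, the slower each write, but the smaller the chance of losing data.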
Producer Message Key:
- A producer can choose to send a key with the message. The key can be a string, number, binary, etc.
- If key=null, messages are sent in a round-robin fashion (broker 101, then 102, 103…).
- If a key is sent, all messages with that key are sent to the same partition.
data:image/s3,"s3://crabby-images/0d0ad/0d0ada70172b19caa83e1921c3115bd00e0cc6f4" alt=""
Anatomy of a Kafka Message:
data:image/s3,"s3://crabby-images/839be/839be3c529744c58c752ae284cfdc8ac0deb9809" alt=""
Kafka Message Serializer:
Kafka only accepts bytes as input from producers and sends bytes as output to consumers. That is why we have to serialize objects/data into bytes.
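A minimal sketch of what a serializer does, using only the Python standard library. The function names are hypothetical, and the 4-byte big-endian integer layout is an assumption chosen for this example; the consumer has to use whatever layout the producer used.

```python
import struct


def serialize_string(value: str) -> bytes:
    """Encode a string as UTF-8 bytes."""
    return value.encode("utf-8")


def serialize_int(value: int) -> bytes:
    """Encode an integer as a 4-byte big-endian signed value."""
    return struct.pack(">i", value)


key_bytes = serialize_int(123)                  # b"\x00\x00\x00{"
value_bytes = serialize_string("hello world")   # b"hello world"
```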
Kafka Message Key Hashing:
Key hashing is the process of mapping a key to a partition.
In the default Kafka partitioner, the keys are hashed using the murmur2 algorithm.
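The mapping from key to partition can be sketched as "hash the key bytes, force the result to be non-negative, take it modulo the number of partitions". The `murmur2` function below is a Python transcription of the 32-bit murmur2 variant as I understand it from Kafka's Java client, so treat it as illustrative rather than byte-for-byte authoritative. The helper name `partition_for_key` is hypothetical.

```python
MASK_32 = 0xFFFFFFFF


def murmur2(data: bytes) -> int:
    """32-bit murmur2 hash, returned as an unsigned 32-bit integer."""
    length = len(data)
    seed = 0x9747B28C
    m = 0x5BD1E995
    r = 24

    h = (seed ^ length) & MASK_32

    # Mix the input four bytes at a time (little-endian).
    for i in range(length // 4):
        k = int.from_bytes(data[i * 4:i * 4 + 4], "little")
        k = (k * m) & MASK_32
        k ^= k >> r
        k = (k * m) & MASK_32
        h = (h * m) & MASK_32
        h ^= k

    # Mix in the remaining 1 to 3 bytes, if any.
    tail = length & ~3
    remaining = length % 4
    if remaining == 3:
        h ^= data[tail + 2] << 16
    if remaining >= 2:
        h ^= data[tail + 1] << 8
    if remaining >= 1:
        h ^= data[tail]
        h = (h * m) & MASK_32

    # Final avalanche.
    h ^= h >> 13
    h = (h * m) & MASK_32
    h ^= h >> 15
    return h


def partition_for_key(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition: non-negative hash modulo partition count."""
    return (murmur2(key) & 0x7FFFFFFF) % num_partitions


p1 = partition_for_key(b"user-42", 3)
p2 = partition_for_key(b"user-42", 3)   # same key, same partition
```

Because the partition count is part of the formula, adding partitions to a topic can change which partition a given key maps to.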
Kafka Consumer:
- Consumers read data from a topic using a pull model (Kafka does not push messages to consumers).
- Consumers know which broker to read from (just like producers).
- In case of broker failures, consumers know how to recover (just like producers).
- Data is read in order within each partition.
- One consumer can pull data from multiple partitions as well:
data:image/s3,"s3://crabby-images/51292/51292410e7fc2a3e1bbcd87d40b3c396281a92e2" alt=""
Consumer Deserialization:
In this step the data which is in bytes has to be converted back to objects.
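A minimal sketch of deserialization, assuming the producer wrote the value as UTF-8 encoded JSON. The function name `deserialize_json` and the sample payload are hypothetical.

```python
import json


def deserialize_json(data: bytes) -> dict:
    """Decode UTF-8 bytes and parse them as a JSON object."""
    return json.loads(data.decode("utf-8"))


raw = b'{"order_id": 7, "status": "paid"}'
order = deserialize_json(raw)   # {"order_id": 7, "status": "paid"}
```

The consumer must know which format the producer used. Changing the data format of an existing topic breaks consumers that still expect the old one.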
Consumer Groups:
- It's a group of consumers.
- Within a consumer group, a consumer can read data from multiple partitions, but a partition cannot be consumed by more than one consumer of the same group.
data:image/s3,"s3://crabby-images/48686/486867e20a81d66ce48ebf4f0a789613a20ca94c" alt=""
- If you have more consumers than partitions, some consumers have to be inactive, since a partition can't be shared by multiple consumers of the same group.
data:image/s3,"s3://crabby-images/f65a8/f65a8873dff63055bd77fe9a8d37677786702c43" alt=""
- We can have multiple consumer groups reading from the same partitions.
data:image/s3,"s3://crabby-images/61559/615595d40312b8575d851a3c3f26617309434968" alt=""
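The consumer group rules above can be sketched with a simple round-robin assignment. This is an illustration only: the function `assign_partitions` is hypothetical, and Kafka's real partition assignment strategies are more involved.

```python
def assign_partitions(partitions, consumers):
    """Give every partition exactly one owner within a single consumer group."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


three_partitions = [0, 1, 2]

# Fewer consumers than partitions: one consumer reads two partitions.
two_consumers = assign_partitions(three_partitions, ["consumer-1", "consumer-2"])

# More consumers than partitions: consumer-4 gets nothing and stays inactive.
four_consumers = assign_partitions(
    three_partitions, ["consumer-1", "consumer-2", "consumer-3", "consumer-4"]
)
```

Running the same function once per group models several consumer groups reading the same partitions independently of each other.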
Consumer Offsets:
- Kafka stores the offsets up to which each consumer group has read.
- The committed offsets are stored in a topic named __consumer_offsets
(the two leading underscores mean it's an internal Kafka topic)
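Why committed offsets matter can be sketched with a dictionary standing in for the `__consumer_offsets` topic. The helper names, the group name `billing`, and the topic name `payments` are hypothetical. Starting from offset 0 when nothing has been committed is also an assumption of this sketch; in a real consumer that behavior is configurable.

```python
# (group, topic, partition) -> next offset to read
committed = {}


def commit(group, topic, partition, next_offset):
    """Record where this group should continue reading from."""
    committed[(group, topic, partition)] = next_offset


def resume_position(group, topic, partition):
    """Return the committed position, or 0 if the group never committed."""
    return committed.get((group, topic, partition), 0)


log = ["m0", "m1", "m2", "m3"]   # one partition of the topic

# The consumer processes two messages, commits, then crashes.
start = resume_position("billing", "payments", 0)
processed = log[start:start + 2]
commit("billing", "payments", 0, start + len(processed))

# After a restart the group continues where it left off.
after_restart = log[resume_position("billing", "payments", 0):]
```

Offsets are tracked per consumer group, so a second group reading the same partition has its own independent position.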