Why Kafka?
Because it decouples source systems from target systems: sources only write to Kafka, targets only read from Kafka, and the two never have to talk to each other directly.
- It was created at LinkedIn, later became an open-source project, and is now maintained mainly by Confluent, along with Cloudera and IBM.
- Horizontal scalability: it can scale to hundreds of brokers.
- Distributed, resilient, fault-tolerant architecture.
- High performance (latency under 10 ms).
Kafka Topic:
- Topics categorize different kinds of messages, e.g., bookings, payments, orders, etc.
- A topic is similar to a table in a database (but without all the constraints).
- Topics are split into partitions.
- You cannot query topics; you use Kafka producers to send data and Kafka consumers to read it (a topic-creation sketch follows this list).
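As a concrete illustration, here is a minimal sketch that creates a topic with the Java AdminClient. The broker address localhost:9092 and the topic name "orders" are assumptions for the example:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Properties;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed local broker

        try (AdminClient admin = AdminClient.create(props)) {
            // "orders" topic with 3 partitions, each copied to 2 brokers
            NewTopic orders = new NewTopic("orders", 3, (short) 2);
            admin.createTopics(List.of(orders)).all().get(); // block until the broker confirms
        }
    }
}
```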
Kafka Partitions:
- A partition is like a queue.
- Each partition is ordered; ordering is guaranteed only within a partition, not across partitions.
- Each message in a partition gets an incremental id, called an offset.
- Data is kept for a limited amount of time (one week by default; configurable).
- Data in a partition is immutable (you cannot change it once written).
- Data is assigned to a partition essentially at random unless a key is provided (see the sketch below).
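To see partitions and offsets in action, a small sketch (again assuming a local broker and the "orders" topic from above) can print the partition and offset that each write landed on:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class PartitionAndOffsetDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 5; i++) {
                // no key -> the partitioner spreads messages across partitions
                RecordMetadata meta =
                    producer.send(new ProducerRecord<>("orders", "message-" + i)).get();
                // offsets grow by one within each partition
                System.out.printf("partition=%d offset=%d%n", meta.partition(), meta.offset());
            }
        }
    }
}
```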
Kafka Brokers:
- A Kafka server is also called a broker.
- A Kafka cluster is composed of multiple brokers.
- Broker IDs must be integers.
- Each broker hosts some of the topic partitions, but not all partitions of a topic, because Kafka is distributed.
- After connecting to any one broker (called a bootstrap broker), you are connected to the rest of the brokers in that cluster.
Kafka Broker Discovery:
- Every Kafka broker can act as a bootstrap server.
- Once you reach a single broker, it returns metadata describing all the other brokers, topics, and partitions (an example follows).
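In practice, a client is usually given more than one bootstrap broker, purely so the first connection succeeds even if one of them is down; the broker names below are placeholders:

```java
import org.apache.kafka.clients.CommonClientConfigs;

import java.util.Properties;

public class BootstrapConfig {
    static Properties clientProps() {
        Properties props = new Properties();
        // List several brokers so bootstrapping still works if one is down;
        // after the first connection, the client discovers the whole cluster
        // from that broker's metadata.
        props.put(CommonClientConfigs.BOOTSTRAP_SERVERS_CONFIG,
                  "broker1:9092,broker2:9092,broker3:9092");
        return props;
    }
}
```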
Kafka Topic Replication Factor:
- The replication factor is the number of brokers each partition is copied to; the topic-creation sketch above used a replication factor of 2.
Concept of Leader in a Partition:
- At any given time, only one broker can be the leader for a partition.
- Only the leader receives and serves data for that partition.
- The other brokers hosting that partition only synchronize the data.
- So each partition has one leader and multiple ISRs (in-sync replicas).
- ZooKeeper helps decide which broker becomes the leader and which replicas are ISRs (see the sketch below).
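A minimal sketch of how you might inspect each partition's leader and ISR with the Java AdminClient, assuming Kafka clients 3.1+ (for allTopicNames), a local broker, and the "orders" topic:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

import java.util.List;
import java.util.Properties;

public class ShowLeaderAndIsr {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc =
                admin.describeTopics(List.of("orders")).allTopicNames().get().get("orders");
            desc.partitions().forEach(p ->
                // one leader per partition; the ISR lists the replicas that are caught up
                System.out.printf("partition=%d leader=%d isr=%s%n",
                                  p.partition(), p.leader().id(), p.isr()));
        }
    }
}
```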
Zookeeper:
- ZooKeeper manages the Kafka brokers (it keeps a list of them).
- ZooKeeper helps with leader election for partitions.
- It sends notifications to Kafka whenever anything changes (a topic is created or deleted, a broker joins, a broker goes down, and so on).
- Kafka 2.x cannot work without ZooKeeper.
- Kafka 3.x can work without ZooKeeper, using Kafka Raft (KRaft) instead; ZooKeeper support is kept for backward compatibility.
- Kafka 4.x removes ZooKeeper entirely.
- The reason for removing ZooKeeper is its scaling issues once Kafka clusters have more than 100,000 partitions.
- ZooKeeper by design operates with an odd number of servers (1, 3, 5, 7) and usually no more than 7.
Should we use ZooKeeper with Kafka?
- If you are managing Kafka brokers on a version before Kafka 4.0, yes, you have to use it (until you migrate to KRaft).
Kafka Producer:
- Producers write data to topics (which are made up of partitions).
- Producers automatically know which broker and partition to write to; you only name the topic.
- In case of a broker failure, producers recover automatically; this is built into the client.
- Producers can decide whether they want acknowledgment of their data writes (see the sketch after this list):
acks=0 : the producer does not wait for acknowledgment (possible data loss)
acks=1 : the producer waits for the leader's acknowledgment (limited data loss); the default before Kafka 3.0
acks=all : leader plus all in-sync replicas must acknowledge (no data loss); the default since Kafka 3.0
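A minimal producer sketch with acks=all, assuming a local broker and a "payments" topic:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class AcksProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // "all": wait until the leader and all in-sync replicas have the record
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("payments", "payment-accepted"));
            producer.flush(); // make sure the record actually left the client
        }
    }
}
```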
Producer Message Key :
- The producer can choose to send a key with the message. The key can be a string, a number, binary, etc.
- If key=null, messages are sent round robin across the topic's partitions (partition 0, then 1, then 2, ...); newer clients batch this with a sticky partitioner.
- If a key is sent, all messages with that key go to the same partition (see the sketch below).
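A sketch of keyed writes, assuming a local broker and a "bookings" topic; the key "user-42" is just an illustration:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class KeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key -> same partition, so these two events stay in order.
            producer.send(new ProducerRecord<>("bookings", "user-42", "booking created"));
            producer.send(new ProducerRecord<>("bookings", "user-42", "booking confirmed"));
        }
    }
}
```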
Anatomy of a Kafka Message:
- Key (binary; can be null)
- Value (binary; can be null)
- Compression type (none, gzip, snappy, lz4, zstd)
- Headers (optional key-value pairs)
- Partition + offset (assigned once the message is stored)
- Timestamp (set by the user or by the system)
Kafka Message Serializer:
Kafka only accepts bytes as input from producers and sends bytes as output to consumers. That is why we have to serialize objects/data into bytes (example below).
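For instance, with the built-in serializers, a producer configured like this turns an Integer key and a String value into bytes before anything is sent (a config sketch, not a complete producer):

```java
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.IntegerSerializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class SerializerConfig {
    static Properties producerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // The producer converts the Integer key and String value to bytes
        // on the client side, before the record reaches the broker.
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, IntegerSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        return props;
    }
}
```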
Kafka Message Key Hashing:
Key hashing is the process of mapping a key to a partition.
In the default Kafka partitioner, the keys are hashed using the murmur2 algorithm.
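Sketched in code, the default partitioner's key-to-partition mapping boils down to one line; Utils.murmur2 and Utils.toPositive are real helpers in the Kafka clients library:

```java
import org.apache.kafka.common.utils.Utils;

import java.nio.charset.StandardCharsets;

public class DefaultPartitioning {
    // What the default partitioner does for a non-null key.
    static int partitionFor(byte[] keyBytes, int numPartitions) {
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    public static void main(String[] args) {
        byte[] key = "user-42".getBytes(StandardCharsets.UTF_8);
        // The same key always maps to the same partition (here, out of 3).
        System.out.println(partitionFor(key, 3));
    }
}
```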
Kafka Consumer:
- Consumers read data from a topic using a pull model (brokers do not push messages to consumers).
- Consumers know which broker to read from (just like producers).
- In case of broker failures, consumers know how to recover (just like producers).
- Data is read in order within each partition.
- One consumer can also pull data from multiple partitions (see the sketch below).
Consumer Deserialization:
In this step the data, which is in bytes, has to be deserialized back into objects, using a deserializer that matches the producer's serializer.
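Putting the consumer and deserialization pieces together, a minimal sketch, assuming a local broker, the "orders" topic, and a hypothetical group id "orders-app":

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class OrdersConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-app");
        // Deserializers turn the raw bytes back into Strings.
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                // pull model: the consumer asks the broker for new records
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                                      r.partition(), r.offset(), r.key(), r.value());
                }
            }
        }
    }
}
```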
Consumer Groups:
- A consumer group is a set of consumers that cooperate to read a topic.
- Within a consumer group, a consumer can read data from multiple partitions, but a partition cannot be consumed by more than one consumer from the same group.
- If you have more consumers than partitions, some consumers will be inactive, since a partition cannot be shared by multiple consumers in the same group.
- We can have multiple consumer groups reading from the same partitions independently (see the sketch below).
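A sketch of that behavior: two consumers started with the same group.id split the topic's partitions between them (group and topic names are the same assumptions as above):

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class GroupDemo {
    public static void main(String[] args) {
        // Two consumers in the same group: the broker divides the topic's
        // partitions between them; no partition is assigned to both.
        for (String name : List.of("consumer-A", "consumer-B")) {
            new Thread(() -> run(name)).start();
        }
    }

    static void run(String name) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-app"); // same group for both
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                consumer.poll(Duration.ofSeconds(1));
                // With 3 partitions, one consumer owns 2 and the other owns 1.
                System.out.println(name + " owns " + consumer.assignment());
            }
        }
    }
}
```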
Consumer Offsets:
- Kafka stores the offsets at which a consumer group has been reading.
- The committed offsets live in a topic named __consumer_offsets
(the double underscore prefix means it is an internal Kafka topic)
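A minimal sketch of committing offsets manually, so the position written to __consumer_offsets is under the application's control (group and topic names assumed as before):

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-app");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Commit offsets ourselves instead of on the auto-commit interval.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                // ... process records ...
                // Writes the group's position to __consumer_offsets, so a
                // restarted consumer resumes from here instead of re-reading.
                consumer.commitSync();
            }
        }
    }
}
```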