Research about Kafka

Here is some research about Kafka features and configurations.

References: https://kafka.apache.org/43/configuration/broker-configs/

1. Architecture

Visual: Kafka Architecture

2. What is a broker?

Broker: a broker is a server that stores the partitions of Kafka topics and serves producer and consumer requests.

2.1. What fields are needed to configure a Kafka broker?

| Configuration | Purpose |
| --- | --- |
| `broker.id` | Unique ID for each broker |
| `listeners` | Network address/port the broker listens on |
| `advertised.listeners` | Address clients use to connect |
| `log.dirs` | Directories where Kafka stores data |
| `num.partitions` | Default number of partitions for new topics |
| `default.replication.factor` | Default replication factor |
| `offsets.topic.replication.factor` | Replication factor for the consumer offsets topic |
| `log.retention.hours` | How long messages are kept before deletion |
| `log.segment.bytes` | Maximum size of each log segment file |
| `zookeeper.connect` | ZooKeeper connection string (ZooKeeper-based clusters only; ZooKeeper mode was removed in Kafka 4.0) |
| `process.roles` | Roles of the node in KRaft mode (`broker`, `controller`, or both) |
| `controller.quorum.voters` | Controller nodes that form the quorum in KRaft mode |
| `inter.broker.listener.name` | Listener used for traffic between brokers |
| `auto.create.topics.enable` | Whether missing topics are created automatically |
| `delete.topic.enable` | Whether topics can be deleted |
| `message.max.bytes` | Maximum record batch size the broker accepts |
| `replica.fetch.max.bytes` | Maximum bytes a replica fetches per partition in one request |

2.2. What does an example configuration file look like?

broker.id=1

# Listen on all interfaces; advertise the address that clients can reach
listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://192.168.1.10:9092

log.dirs=/var/lib/kafka/logs

num.partitions=3
default.replication.factor=3

# 168 hours = 7 days; 1073741824 bytes = 1 GiB
log.retention.hours=168
log.segment.bytes=1073741824

auto.create.topics.enable=true
delete.topic.enable=true

# 1048576 bytes = 1 MiB
message.max.bytes=1048576

2.3. What is important about these configurations?

broker.id

broker.id=1

- Unique identifier for each broker.
- No two brokers may share the same ID.
- In KRaft mode, `node.id` plays this role.

listeners

listeners=PLAINTEXT://0.0.0.0:9092

Defines where Kafka listens for connections.

Format: PROTOCOL://HOST:PORT (several listeners can be given as a comma-separated list)

Example protocols:

- PLAINTEXT
- SSL
- SASL_PLAINTEXT
- SASL_SSL
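
To make the PROTOCOL://HOST:PORT format concrete, here is a minimal sketch that splits a `listeners` value into its parts. The function name `parse_listeners` is hypothetical and this is not how the broker itself parses the setting; it only illustrates the structure of the string.

```python
def parse_listeners(value):
    """Split a listeners string into (protocol, host, port) tuples."""
    parsed = []
    for entry in value.split(","):
        # "PLAINTEXT://0.0.0.0:9092" -> "PLAINTEXT" and "0.0.0.0:9092"
        protocol, _, address = entry.strip().partition("://")
        # Split on the last colon so the port is always the final part
        host, _, port = address.rpartition(":")
        parsed.append((protocol, host, int(port)))
    return parsed


print(parse_listeners("PLAINTEXT://0.0.0.0:9092,SSL://0.0.0.0:9093"))
```

An empty host (for example `PLAINTEXT://:9092`) comes back as an empty string in this sketch.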

advertised.listeners

advertised.listeners=PLAINTEXT://kafka.example.com:9092

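This is the address the broker sends back to clients in its metadata; clients use it for every connection after the initial bootstrap.

Very important in:

- Docker
- Kubernetes
- Cloud deployments

If configured incorrectly, clients reach the bootstrap address but cannot connect to the broker afterwards.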

log.dirs

log.dirs=/data/kafka-logs

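Location where Kafka stores:

- topic data (log segments and their indexes)
- partitions (one directory per partition)
- offsets (the internal `__consumer_offsets` topic)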

num.partitions

num.partitions=3

Default number of partitions for newly created topics.

More partitions:

- higher parallelism
- better throughput

But:

- more overhead (more open files and more metadata to manage)

log.retention.hours

log.retention.hours=168

How long Kafka keeps messages.

Example: 168 hours = 7 days

After expiration, old log segments are deleted.
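
Retention works on whole log segments, not on single messages. The sketch below is a simplified model of that idea: a closed segment becomes eligible for deletion once its newest record is older than the retention window. The segment names and timestamps are made up for illustration, and the active segment is left out of this sketch.

```python
RETENTION_HOURS = 168
HOUR_MS = 60 * 60 * 1000
DAY_MS = 24 * HOUR_MS


def expired_segments(segments, now_ms, retention_hours=RETENTION_HOURS):
    """Return names of closed segments whose newest record is past retention."""
    cutoff = now_ms - retention_hours * HOUR_MS
    return [name for name, newest_ts in segments if newest_ts < cutoff]


now = 10 * DAY_MS  # pretend "now" is day 10
segments = [
    ("segment-a", 1 * DAY_MS),  # newest record written on day 1
    ("segment-b", 2 * DAY_MS),  # newest record written on day 2
    ("segment-c", 8 * DAY_MS),  # newest record written on day 8
]

print(RETENTION_HOURS / 24)              # 7.0 days
print(expired_segments(segments, now))   # ['segment-a', 'segment-b']
```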

default.replication.factor

default.replication.factor=3

How many copies of each partition Kafka keeps by default.

Example:

- replication factor 3
- each partition is stored on 3 brokers

Improves fault tolerance.

auto.create.topics.enable

auto.create.topics.enable=false

If true:

- Kafka automatically creates missing topics when a client references them.

Production systems often disable this.

message.max.bytes

message.max.bytes=10485760

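Maximum size of a record batch (after compression) that the broker accepts.

Example:

10485760 bytes = 10 MiB

Producer and consumer limits must be sized to match (for example the producer's `max.request.size` and the consumer's `max.partition.fetch.bytes`).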

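KRaft Mode (Modern Kafka)

KRaft has been production-ready for new clusters since Kafka 3.3, and Kafka 4.0 removed ZooKeeper mode entirely.

Important configs:

process.roles

process.roles=broker,controller

Defines the role of the node:

- broker
- controller
- or both (combined mode)

controller.quorum.voters

controller.quorum.voters=1@node1:9093,2@node2:9093

Lists the controller nodes as comma-separated id@host:port entries. A majority of the voters must be available for the quorum to make progress, so an odd number of controllers (for example 3) is the usual choice.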
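
The id@host:port format and the majority rule of a quorum can be sketched as follows. The helper names are hypothetical; the arithmetic is the standard majority rule, not code taken from Kafka.

```python
def parse_voters(value):
    """Turn '1@node1:9093,2@node2:9093' into {1: ('node1', 9093), ...}."""
    voters = {}
    for entry in value.split(","):
        node_id, _, address = entry.strip().partition("@")
        host, _, port = address.rpartition(":")
        voters[int(node_id)] = (host, int(port))
    return voters


def quorum_size(voter_count):
    """Smallest number of voters that forms a majority."""
    return voter_count // 2 + 1


def tolerated_failures(voter_count):
    """How many voters can be down while a majority is still available."""
    return voter_count - quorum_size(voter_count)


voters = parse_voters("1@node1:9093,2@node2:9093")
print(voters)
print(tolerated_failures(len(voters)))  # 0: two voters cannot lose either one
print(tolerated_failures(3))            # 1: three voters survive one failure
```

Under this rule the two-voter example above tolerates no controller failure, while three voters tolerate one.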

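2.4. What is KRaft mode and how do brokers and controllers replace ZooKeeper?

In the past, Kafka depended on ZooKeeper to store cluster metadata and to elect the controller. In KRaft mode, Kafka nodes handle this themselves through two roles:

| Component | Responsibility |
| --- | --- |
| Broker | Stores data and handles producer/consumer requests |
| Controller | Manages cluster metadata and coordination |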

Broker Responsibilities

- Store messages
- Handle reads/writes
- Manage partitions
- Replicate topic data
- Serve producers and consumers

Controller Responsibilities

- Elect partition leaders
- Monitor broker health
- Manage cluster metadata
- Coordinate replicas
- Handle failover

2.5. What is the difference between listeners and advertised.listeners?

listeners defines where the broker listens for connections.
advertised.listeners defines the address shared with clients for connecting to the broker.

3. What is a topic?

What: a topic is a named stream of messages/events.
A topic can be configured for:
- retention
- partitioning
- replication
- compression
- cleanup behavior
- message size
- performance
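
Several of these settings are chosen when the topic is created. The sketch below only builds the argument list for the `kafka-topics.sh` tool and does not run it. The topic name `orders`, the values, and the helper name are assumptions for illustration; `retention.ms`, `cleanup.policy`, `compression.type` and `max.message.bytes` are topic-level configuration names.

```python
def build_create_topic_command(topic, partitions, replication_factor, configs,
                               bootstrap_server="localhost:9092"):
    """Build (but do not run) a kafka-topics.sh command that creates a topic."""
    command = [
        "kafka-topics.sh", "--bootstrap-server", bootstrap_server,
        "--create", "--topic", topic,
        "--partitions", str(partitions),
        "--replication-factor", str(replication_factor),
    ]
    for key, value in configs.items():
        command += ["--config", f"{key}={value}"]
    return command


command = build_create_topic_command(
    "orders",  # hypothetical topic name
    partitions=3,
    replication_factor=3,
    configs={
        "retention.ms": 7 * 24 * 60 * 60 * 1000,  # retention: 7 days
        "cleanup.policy": "delete",               # cleanup behavior
        "compression.type": "producer",           # keep the producer's codec
        "max.message.bytes": 1048576,             # message size: 1 MiB
    },
)
print(" ".join(command))
```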

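4. What is a partition?

What: a partition is an ordered, append-only log that holds one part of a topic's messages; a topic is split into one or more partitions spread across brokers.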
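
A common way to spread messages over partitions is to hash the message key, so that every message with the same key lands in the same partition. The sketch below shows only the idea: it uses CRC32 as a stand-in hash (Kafka's default partitioner uses a murmur2 hash), so the partition numbers it produces will not match a real cluster.

```python
import zlib


def choose_partition(key, num_partitions):
    """Map a key to a partition; the same key always maps to the same one."""
    # CRC32 is a stand-in here, not the hash Kafka actually uses
    return zlib.crc32(key.encode("utf-8")) % num_partitions


for key in ["user-1", "user-2", "user-1"]:
    print(key, "->", choose_partition(key, 3))
```

Both `user-1` messages go to the same partition, which is what keeps them in order relative to each other.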

5. What is replication?

What: Kafka copies each partition to multiple brokers, so the data survives the loss of a broker.
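
The effect of the replication factor can be shown with a simplified placement model. The round-robin placement below is an assumption made for illustration; Kafka's real replica assignment is more involved. The broker IDs are hypothetical.

```python
def assign_replicas(num_partitions, brokers, replication_factor):
    """Place each partition's replicas on consecutive brokers (simplified)."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    return {
        partition: [
            brokers[(partition + offset) % len(brokers)]
            for offset in range(replication_factor)
        ]
        for partition in range(num_partitions)
    }


def surviving_partitions(assignment, failed_brokers):
    """Partitions that still have at least one replica on a healthy broker."""
    failed = set(failed_brokers)
    return [p for p, replicas in assignment.items() if set(replicas) - failed]


replicated = assign_replicas(3, brokers=[1, 2, 3], replication_factor=3)
unreplicated = assign_replicas(3, brokers=[1, 2, 3], replication_factor=1)

print(surviving_partitions(replicated, failed_brokers=[1, 2]))  # [0, 1, 2]
print(surviving_partitions(unreplicated, failed_brokers=[1]))   # [1, 2]
```

With replication factor 3 every partition still has a copy after two brokers fail; with replication factor 1 a single failure already loses a partition.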

6. What is a consumer group?

What: a consumer group is a set of consumers that work together to read a topic.
A group usually reads a single topic, but subscribing to several different topics is also fine.

7. What is Kafka Connect?

What: Kafka Connect is a framework used to move data between Kafka and external systems without writing custom producer/consumer code.
It helps integrate Kafka with:
- databases
- data warehouses
- search engines
- cloud storage
- message queues
- SaaS tools

Visual: Traditional Kafka Consumer vs Kafka Connect Approach
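
With Kafka Connect, the integration is described as configuration instead of code. The sketch below builds the JSON body that would be sent to a Connect worker's REST API (`POST /connectors`) to start the file source connector that ships with Kafka. The connector name, file path and topic are hypothetical, the sketch does not send the request, and it assumes the file connector plugin is available on the worker.

```python
import json


def file_source_connector(name, path, topic):
    """Build the request body for a FileStreamSource connector."""
    return {
        "name": name,
        "config": {
            "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
            "tasks.max": "1",
            "file": path,    # file whose lines are read
            "topic": topic,  # Kafka topic the lines are written to
        },
    }


body = file_source_connector("demo-file-source", "/tmp/input.txt", "demo-topic")
print(json.dumps(body, indent=2))
```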

8. What is Kafka Streams?

What: Kafka Streams is a client library for building applications and microservices that process, analyze, and transform data stored in Apache Kafka in real time.
It lets you:
- read data from Kafka topics
- process/transform data
- write results to another topic

Visual: Kafka Streams
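
The read, process, write flow can be pictured with plain Python lists standing in for topics. This is only a model of the idea and not the Kafka Streams API, which is a Java library; the record fields and the function name are made up for illustration.

```python
def run_topology(input_records, keep, transform):
    """Read records, drop those that fail `keep`, transform and write the rest."""
    output_records = []
    for record in input_records:                      # read from the input topic
        if keep(record):                              # process: filter
            output_records.append(transform(record))  # transform, then write
    return output_records


orders = [
    {"id": 1, "amount": 5},
    {"id": 2, "amount": 250},
    {"id": 3, "amount": 900},
]
large_orders = run_topology(
    orders,
    keep=lambda order: order["amount"] >= 100,
    transform=lambda order: {**order, "tier": "large"},
)
print(large_orders)
```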

9. Consumers and the number of partitions of a topic

Visual: Kafka Partitions and Consumer Group
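
Within a consumer group each partition is read by exactly one consumer, so the partition count caps the useful number of consumers. The sketch below uses a simple round-robin split as an assumption; Kafka's built-in assignors are more sophisticated, but the limit is the same. The consumer names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        assignment[consumers[index % len(consumers)]].append(partition)
    return assignment


# Fewer consumers than partitions: some consumers read several partitions
print(assign_partitions([0, 1, 2], ["c1", "c2"]))
# More consumers than partitions: the extra consumer stays idle
print(assign_partitions([0, 1, 2], ["c1", "c2", "c3", "c4"]))
```

In the second call `c4` receives no partition, which is why adding consumers beyond the partition count does not increase throughput.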

10. Admin Configs in Kafka

What: In Apache Kafka, Admin Configs are the configuration settings used by administrative tools and clients to manage the cluster.
They are mainly used by the AdminClient to manage:
- brokers
- topics
- partitions
- ACLs
- consumer groups
- cluster operations

11. MirrorMaker Configs

What: In Apache Kafka, MirrorMaker (especially MirrorMaker 2 / MM2) is used for replicating data between Kafka clusters.
Example: Cluster A -> Cluster B

12. System Properties in Kafka

What: System properties are JVM-level or startup settings that control how Kafka runs at the process level. They are usually passed through environment variables read by the Kafka start scripts.
Example:
- KAFKA_HEAP_OPTS="-Xmx2G -Xms2G"
- KAFKA_OPTS="-Djava.security.auth.login.config=/etc/kafka/jaas.conf"

13. Tiered Storage Configs

What: Tiered Storage in Apache Kafka is used to separate compute from storage by dividing data into local and remote layers.

14. Configuration Providers

What: Configuration Providers let Kafka fetch config values dynamically from external systems instead of hardcoding them.

15. Kafka Producer

Ensuring message order: send messages that must stay in order to the same partition (for example by giving them the same key).
Decoupling: splitting one service into producers and consumers ⇒ each side can scale independently.
Producer

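acks=0: fire and forget; the producer does not wait for any acknowledgement.
acks=1: the producer waits only until the partition leader has written the message; it does not wait for the follower replicas to sync.
acks=all (or -1): the producer waits until all in-sync replicas have received the message.
Idempotence: the producer attaches its producer ID and a sequence number to every batch ⇒ when a retry resends a batch whose sequence number Kafka has already seen, Kafka rejects the duplicate.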
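
The producer ID plus sequence number check can be modeled in a few lines. This is a simplified sketch with hypothetical names: it only rejects sequence numbers that were already accepted, and it ignores details of the real protocol such as per-batch sequences and the handling of gaps.

```python
class PartitionLog:
    """Toy partition log that drops duplicate writes from the same producer."""

    def __init__(self):
        self.records = []
        self.last_sequence = {}  # producer_id -> last accepted sequence number

    def append(self, producer_id, sequence, value):
        """Append the value unless this sequence number was already accepted."""
        if sequence <= self.last_sequence.get(producer_id, -1):
            return False  # duplicate caused by a retry: reject it
        self.records.append(value)
        self.last_sequence[producer_id] = sequence
        return True


log = PartitionLog()
print(log.append(producer_id=7, sequence=0, value="order-created"))  # True
print(log.append(producer_id=7, sequence=1, value="order-paid"))     # True
print(log.append(producer_id=7, sequence=1, value="order-paid"))     # False
print(log.records)  # ['order-created', 'order-paid']
```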

16. Kafka Consumer

Idea: consumers pull messages from the broker (pull-based model).
In a consumer group:
- Each partition is assigned to exactly one consumer in the group.
- If a consumer goes down (it fails its health check) → its partitions are reassigned to another consumer.
At most once
- Commit the offset as soon as the message is received.
- If processing then fails, the message is lost.
At least once
- Commit the offset after the message is processed.
- The message can be delivered more than once.
Notes: Handle duplicates by controlling the processing ⇒ store a transaction ID or message ID in the database -> detect and skip duplicates downstream.
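
Duplicate handling on the consumer side can be sketched as below. A Python set stands in for the database table of processed message IDs, and the function and variable names are hypothetical. In a real system the ID would need to be stored in the same transaction as the processing result.

```python
def process_at_least_once(messages, handler, processed_ids):
    """Process each message once, even if it is delivered several times."""
    for message_id, payload in messages:
        if message_id in processed_ids:
            continue  # duplicate delivery: already processed, skip it
        handler(payload)
        processed_ids.add(message_id)  # remember the ID after processing


handled = []
seen_ids = set()  # stand-in for a database table of processed message IDs
deliveries = [(1, "debit 10"), (2, "credit 5"), (1, "debit 10")]  # ID 1 is redelivered

process_at_least_once(deliveries, handled.append, seen_ids)
print(handled)  # ['debit 10', 'credit 5']
```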

June 19, 2026

Nguyễn Đức An
