Unlocking Real-Time Data with Change Data Capture (CDC)
In this guide, we will cover CDC, its importance, and the setup of a CDC stack using Kafka, Debezium, and other services. Additionally, we will configure a PostgreSQL connector using the Confluent Control Center web UI to capture changes from a PostgreSQL database and publish them to Kafka.
Understanding Change Data Capture (CDC) in Data Engineering
Change Data Capture (CDC) is a design pattern used in data engineering to track changes in databases in real-time and stream these changes to other systems, such as data lakes, data warehouses, or microservices. This technology plays a pivotal role in distributed data systems, where real-time replication of data and events is necessary.
What is CDC?
CDC monitors and records changes to a database and delivers them to a downstream service (e.g., Kafka or a data warehouse). These changes can include data inserts, updates, and deletes. The main idea behind CDC is to replicate the state of a source database by capturing and broadcasting these changes, enabling synchronisation between systems without having to rely on bulk imports or other traditional ETL processes.
In a typical scenario, a CDC tool captures changes and streams them in real-time, allowing other systems to process and respond instantly, ensuring that consumers of the data remain updated with the latest changes from the source database.
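As a rough illustration, a single change event (in the format used by Debezium, the CDC tool we use later in this guide) carries the row's state before and after the change, the source table, and an operation code. The table and values below are made up purely for illustration:

{
  "before": { "txn_id": 2, "txn_status": "PENDING" },
  "after":  { "txn_id": 2, "txn_status": "FAILED" },
  "source": { "db": "test-db", "schema": "public", "table": "transactions" },
  "op": "u",
  "ts_ms": 1718000000000
}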
Why is CDC Important in Data Engineering?
Real-Time Data Replication: CDC enables the synchronization of changes in real-time between systems. This is vital for applications like real-time analytics, fraud detection, and operational intelligence.
Avoids Full Data Loads: Instead of loading entire datasets periodically, CDC captures and processes only the deltas, saving bandwidth and system resources.
Decoupling of Systems: CDC allows for decoupling source and target systems, enabling them to evolve independently while maintaining data synchronization.
Event-Driven Architectures: By leveraging CDC, organizations can implement event-driven architectures where changes in the data trigger specific actions across multiple systems.
Benefits of CDC
Reduced Latency: Data is available in near real-time in downstream systems, improving responsiveness for analytics or decision-making processes.
Scalability: CDC efficiently scales with high-volume transactions since it processes only changes instead of full datasets.
Data Consistency: It ensures data consistency across distributed systems without locking databases for batch ETL operations.
Better System Performance: Log-based CDC has minimal impact on the source system, since it reads incremental changes from the transaction log rather than running heavy queries against the tables.
Prerequisites for Deploying the CDC Stack
To set up this CDC stack, you need the following prerequisites:
1. Docker and Docker Compose
We will be using Docker to containerise the services required for this CDC stack (Kafka, Debezium, Zookeeper, etc.). Docker helps in packaging these services along with their dependencies into isolated, lightweight containers, making it easier to deploy and manage. This allows us to quickly set up, tear down, and redeploy the CDC stack across various environments without worrying about dependency conflicts or installation issues.
Install Docker: Follow the official Docker installation guides for Linux, Mac, or Windows.
Docker Compose: You’ll use Docker Compose to start and manage the services like Kafka, Debezium, and Zookeeper in a single command.
2. Basic Understanding of Kafka and Docker
- Familiarity with Docker containers and Kafka event streaming is recommended.
Docker-Compose CDC Stack Walkthrough
The provided Docker-Compose file sets up a CDC (Change Data Capture) stack using Kafka, Debezium, Schema Registry, and Control Center. Each service has a distinct role in enabling the capture, processing, and monitoring of data changes.
1. Zookeeper
Service: zookeeper
Purpose: Zookeeper manages distributed services and coordinates between Kafka brokers. Kafka relies on Zookeeper for configuration management, leader election, and other cluster coordination tasks.
Why is it used?: Zookeeper helps Kafka maintain state across a distributed cluster, providing a reliable way for Kafka brokers to store metadata and configurations.
ports:
- "2181:2181" # Zookeeper uses port 2181 for client communication
2. Kafka Broker
Service: kafka
Purpose: Kafka is a distributed event streaming platform used for building real-time data pipelines. In the CDC stack, Kafka receives data changes from Debezium and stores them as topics, which can then be consumed by downstream services.
Why is it used?: Kafka is essential for publishing the changes captured from the source database. It provides fault tolerance and scalability for data pipelines.
ports:
- "9092:9092" # Standard Kafka broker port
3. Debezium
Service: debezium
Purpose: Debezium is a CDC tool that captures real-time changes from databases (like MySQL, PostgreSQL, etc.) and publishes them to Kafka topics. Debezium acts as a Kafka Connect module, monitoring database changes and streaming them to Kafka.
Why is it used?: Debezium allows capturing and streaming row-level changes from the database. It integrates well with Kafka and enables the real-time syncing of databases without complex ETL scripts.
depends_on: [kafka, schema-registry] # Ensures Kafka and Schema Registry are running before starting Debezium
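A sketch of the Debezium service using the debezium/connect image and its documented environment variables; the internal topic names and the exposed REST port are illustrative assumptions:

debezium:
  image: debezium/connect:2.4
  container_name: debezium
  depends_on: [kafka, schema-registry]
  ports:
    - "8083:8083"                     # Kafka Connect REST API
  environment:
    BOOTSTRAP_SERVERS: kafka:29092    # broker address inside the Docker network
    GROUP_ID: 1                       # Connect worker group id
    CONFIG_STORAGE_TOPIC: connect_configs
    OFFSET_STORAGE_TOPIC: connect_offsets
    STATUS_STORAGE_TOPIC: connect_statuses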
4. Schema Registry
Service: schema-registry
Purpose: The Confluent Schema Registry stores and retrieves schemas for Kafka messages, ensuring that the data format remains consistent across producers and consumers. It supports Avro, JSON Schema, and Protobuf serialization.
Why is it used?: In a CDC stack, maintaining the structure of the data (schema) is crucial. The Schema Registry allows for the management of evolving data structures, ensuring backward and forward compatibility.
ports:
- "8081:8081" # Schema Registry uses port 8081
5. Control Center
Service: control-center
Purpose: The Confluent Control Center provides a web UI for monitoring and managing Kafka clusters, connectors, and topics. It offers features like monitoring the health of Kafka connectors, viewing topic metrics, and schema evolution.
Why is it used?: Control Center gives a visual interface to monitor Kafka, Debezium, and Schema Registry activities. It simplifies the management and monitoring of the CDC stack, making it easier to understand the flow of data.
ports:
- "9021:9021" # Control Center runs on port 9021
Deploy the services using Docker Compose
Navigate to the directory containing the docker-compose.yaml and execute the command below to start the services configured in our docker-compose file.
docker-compose up -d
To check that our containers are up and running, execute the command below, which lists all running containers.
docker ps
Alternatively, you can check the containers in the Docker Desktop UI.
Setting Up the PostgreSQL Database
Once the CDC stack is deployed, the next step is to set up a Postgres instance. Feel free to either start a standalone Postgres instance on your local machine, use a cloud service such as AWS RDS, or spin up a Docker container for the Postgres database; this part is a DIY task.
Next Steps:
Configure PostgreSQL Database:
Ensure you have a running PostgreSQL instance, either locally or in a container. The PostgreSQL instance should have logical replication enabled.
In PostgreSQL, the wal_level setting needs to be set to logical in order to support Change Data Capture (CDC) with tools like Debezium. The wal_level setting controls how much information is written to the Write-Ahead Log (WAL), and logical is required for logical replication, which is how CDC captures changes in the database. Modify the postgresql.conf file to include:

wal_level = logical
If using a Docker container, you can set wal_level by passing server options after the image name (these are forwarded to the postgres server process):

docker run -d \
  --name my-postgres \
  -e POSTGRES_USER=myuser \
  -e POSTGRES_PASSWORD=mypassword \
  -e POSTGRES_DB=mydatabase \
  -p 5432:5432 \
  postgres:latest \
  -c wal_level=logical \
  -c max_replication_slots=4 \
  -c max_wal_senders=4
Ensure the changes were applied:
Run the query below in your Postgres database:

select * from pg_settings where name = 'wal_level';
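Alternatively, instead of editing postgresql.conf by hand, a superuser can set the parameter from SQL; note that wal_level only takes effect after the server is restarted:

ALTER SYSTEM SET wal_level = 'logical';
-- restart the PostgreSQL server for the change to take effect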
Deploying the PostgreSQL Connector
Once you have both your CDC stack and your Postgres database up and running with all the required changes, the next step is to deploy the PostgreSQL connector.
Steps to Configure the PostgreSQL Connector:
Open Confluent Control Center:
- Open your web browser and navigate to http://localhost:9021 to access the Control Center.
Navigate to the Connect UI:
Click on the "Connect" tab in the Control Center's navigation, then click on your listed connect instance (connect-default). This UI allows you to configure Kafka Connect connectors.
Add a New Connector:
Click on "Add Connector", then click on “Upload Connector Config File”
Connector config file:
{ "connector.class": "io.debezium.connector.postgresql.PostgresConnector", "tasks.max": "1", "database.hostname": "{HOSTNAME}", "database.port": "5432", "database.user": "{USER}", "database.password": "{PASSWORD}", "database.dbname": "{DATABASE_NAME}", "database.server.name": "{UNIQUE_INDENTIFIER_NAME}", "plugin.name": "pgoutput", "topic.prefix": "{PREFIX}", "table.include.list": "public.{tablename1,tablename2,...}", "key.converter.schemas.enable": "false", "value.converter.schemas.enable": "false", "key.converter": "org.apache.kafka.connect.json.JsonConverter", "value.converter": "org.apache.kafka.connect.json.JsonConverter", "slot.name": "{MEANINGFULL_SLOT_NAME}" }
Once the file upload is complete, you should see a screen like the one below:
Click Launch!
We are now ready to capture changes from our PostgreSQL database.
Monitoring the CDC Pipeline
The Confluent Control Center provides a convenient way to monitor your Kafka-based CDC pipeline. You can track the status of your connectors, monitor Kafka topics, and view detailed metrics.
Connector Health: You can check the status of the PostgreSQL connector under the "Connect" section.
Kafka Topics: View and monitor the topics that Debezium writes to in the "Topics" section.
Real-Time Metrics: Get insights into consumer activity, lag, and producer performance in the Control Center UI.
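Outside of Control Center, you can also tail a change topic directly with the console consumer that ships with Kafka. The topic name below is an assumption based on Debezium's usual {topic.prefix}.{schema}.{table} naming, and the command is run inside the kafka container:

docker exec -it kafka kafka-console-consumer \
  --bootstrap-server localhost:9092 \
  --topic {PREFIX}.public.transactions \
  --from-beginning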
Make changes to the monitored table
In this tutorial, the table “transactions” under the public schema inside test-db is being monitored for changes.
Let’s update an existing record with “txn_id” = 2 and “cust_id” = 9, and set the attribute “txn_status” to “FAILED”.
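A minimal SQL statement for this update, assuming the column names described above:

UPDATE public.transactions
SET txn_status = 'FAILED'
WHERE txn_id = 2 AND cust_id = 9;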
Once the record is updated successfully, navigate to the Control Center web UI and, under the Topics section, find the topic named after the table (“transactions” here).
Under the Messages tab you will see a new message published to this topic. Here the attribute “txn_status” is set to “FAILED” for the record we updated, and at message line number 23 the operation value is “u”, indicating an UPDATE operation was performed on this record.
Similarly, “c” indicates record creation and “d” indicates deletion.
Let’s delete the record where the transaction failed.
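Again as a sketch, assuming the same table and key column:

DELETE FROM public.transactions
WHERE txn_id = 2;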
A new change was captured and published to our Kafka topic. The operation in this message indicates the deletion of the record whose primary key, “txn_id”, is 2; hence all other attributes are set to 0.
As you've now learned about Change Data Capture (CDC) and successfully set up a CDC stack using Kafka, Debezium, and PostgreSQL, you're well-equipped to explore further. You can now experiment with different operations like inserts, updates, and deletes on your database to see how they propagate to Kafka in real-time. Additionally, feel free to adjust various parameter values in your CDC stack, such as replication slots, snapshot modes, and Kafka topic configurations, to fine-tune the performance and behavior of your pipeline.
Conclusion
Change Data Capture (CDC) is a vital pattern for enabling real-time data pipelines in modern architectures. By capturing and replicating changes from databases to event streaming platforms like Kafka, CDC facilitates efficient data movement without burdening the source system. The Docker-Compose file showcases a practical CDC stack composed of Zookeeper, Kafka, Debezium, Schema Registry, and Control Center, each playing an essential role in ensuring a seamless and scalable CDC implementation.
With this foundational knowledge, you can start creating your own CDC pipelines tailored to specific use cases—whether it’s for analytics, data synchronisation, or powering downstream applications like Apache Spark, Flink, Elasticsearch, and data lakes. The flexibility and power of CDC make it an essential tool for modern data engineering and event-driven architectures. Happy experimenting!