Change Data Capture (CDC): What it is and How it Works
What is Change Data Capture?
Change Data Capture is a software process that
identifies and tracks changes to data in a database. CDC provides real-time or
near-real-time movement of data by moving and processing data continuously as
new database events occur.
In high-velocity data environments where time-sensitive
decisions are made, Change Data Capture is an excellent fit to achieve
low-latency, reliable, and scalable data replication. Change Data Capture
is also ideal for zero-downtime migrations to the cloud.
With over 80% of companies planning on
implementing multi-cloud strategies by 2025, picking the right change data capture method for your business is more
critical than ever given the need to replicate data across multiple
environments.
SQL Server offers a built-in CDC functionality, as depicted in the diagram below.
The source table contains the data
that’s being inserted, updated, and deleted by applications. The SQL
Server transaction log keeps a record of the changes. If CDC is
enabled, a SQL Server agent reads inserts, updates, and
deletes from the transaction log. These changes are added to a separate change
table, where query functions are used to capture the changes so
they can be delivered to a target data warehouse or other destination via a
data integration process like ETL.
What are the benefits and drawbacks of the built-in SQL
Server Change Data Capture feature?
Pros:
- Built-in feature, no external
tools needed
- Uses familiar T-SQL language
Cons:
- Adding a change table adds
overhead to the source database
- Querying change tables adds
latency
- Change tables need to be cleaned
up on a regular basis
- Since a change table is created
for each CDC-enabled table, it can be challenging to consolidate changes
from multiple tables (and databases)
Note: See Monitoring changes in SQL Server using
change data capture for more information about CDC.
The additional columns include
- __$start_lsn and __$end_lsn that
show the commit log sequence number (LSN) assigned by the SQL Server
Engine to the recorded change
- __$seqval that shows the order of that change related
to other changes in the same transaction,
- __$operation that shows
the operation type of the change, where 1 = delete, 2 = insert, 3 = update
(before change), and 4 = update (after change)
- __$update_mask that is a bit mask defined for each captured column, identifying the updating columns
This detailed information makes it easier to monitor the database changes for security or auditing purposes, or incrementally load these changes from the OLTP source to the target OLAP data warehouse, using T-SQL or ETL methods.
How do you configure Change Data Capture in SQL Server?
CDC must be enabled both at the database and table
level in SQL Server. Database-level CDC functionality is enabled by
setting is_cdc_enabled to 1 in the sys.databases table.
Invoke the following commands to check the current
state of database-level CDC functionality:
SELECT
database_id,
name,
is_cdc_enabled
FROM sys.databases
GO
Invoke the following commands to enable database-level
CDC. These commands use SQL Server’s built-in sys.sp_cdc_enabled_dbd stored
procedure to perform the necessary updates:
IF NOT EXISTS (SELECT 1 FROM sys.databases WHERE name = 'CDCDB')
BEGIN
USE master
CREATE DATABASE
[CDCDB];
PRINT 'db Created...'
END
USE [CDCDB]
GO
IF NOT EXISTS (SELECT 1 FROM sys.tables WHERE name = 'User')
BEGIN
PRINT 'CREATING USER TABLE...'
CREATE TABLE [USER](
[UserId] INT IDENTITY(1,1) NOT NULL,
[FirstName] NVARCHAR(50) NOT NULL,
[LastName] NVARCHAR(50) NOT NULL)
END
USE [CDCDB]
GO
EXEC
sys.sp_cdc_enable_db
Once CDC is enabled on the table, a number of system tables
will be created under the CDC schema of the database to store the CDC related
information. These tables include the following
- CDC.captured_columns table that contains the list of
captured column
- CDC.change_tables table that contains the list of tables
that are enabled for capture
- CDC.ddl_history table that records the history of all
the DDL changes since capture data enabled
- CDC.index_columns table that contains all the indexes
that are associated with change table
- CDC.lsn_time_mapping table that is used to map the LSN
number with the time and finally one change table for each CDC enabled
table that is used to capture the DML changes on the source table, as
shown below:
… and the SQL Agent jobs associated to CDC enabled table, the capture and cleanup jobs, will be created like below:
Set up Table-Level CDC
In addition to setting up database-level CDC, you must
also enable CDC on a per-table basis.
This section describes how to set up and verify
table-level CDC on the CDCDB database’s USER table.
Invoke the following commands to set up CDC on
the USER table. Note the source_schema parameter must be assigned the name of the schema
to which the table (USER) belongs, in this case dbo:
USE CDCDB
GO
EXEC
sys.sp_cdc_enable_table
@source_schema = N'dbo',
@source_name = N'User',
@role_name = NULL;
Verify that CDC is enabled for the USER table
by checking that is_tracked_by_cdc is set to 1 for the database in sys.tables.
Invoke the following command to return all tables
listed in sys.tables:
USE CDCDB
GO
SELECT
object_id,
SCHEMA_NAME(Schema_id) As [Schema Name],
name As [Table Name],
is_tracked_by_cdc
FROM sys.tables
GO
This process results in the creation of a system table
prefixed with cdc. called cdc.Dbo_user_CT that can be seen in Object Explorer and further
verified by querying the table.
Invoke the following command to verify that cdc.Dbo_user_CT can be queried and that a row is returned:
USE CDCDB
GO
SELECT * from cdc.Dbo_user_CT
GO
Disabling CDC
The Change Data Capture can be easily disabled on a specific table using the sys.sp_cdc_disable_table system stored procedure, as shown below:
… or disabled completely at the database level, without the need to disable it on CDC enabled tables one by one, using the sys.sp_cdc_disable_db system stored procedure, as shown below:
Useful queries to work with the CDC.
1.To review the
status of the Log of the databases with the CDC activated.
select name,
log_reuse_wait_desc, is_cdc_enabled from sys.databases
2. On the databases that the CDC has active, we can consult which tables are included in the CDC.
select
name,type,type_desc,is_tracked_by_cdc from sys.tables where is_tracked_by_cdc =
1
3.To review the CDC configuration status.
EXECUTE sys.sp_cdc_help_jobs
4. Change CDC
settings.
EXECUTE
sys.sp_cdc_change_job
@job_type = N'cleanup',
@retention = 30;
5. To review the
operation of our CDC by database and the errors we can review the following
system views.
select * from
sys.dm_cdc_log_scan_sessions
- Here we will
see all the sessions that have occurred, and in the first row the
aggregate since the CDC was started.
select * from sys.dm_cdc_errors
- Disable CDC
completely.
Use 'Database_name'
GO
EXEC sys.sp_cdc_disable_db
GO
6. CDC.fn_cdc_get_all_changes function can be queried by providing the @from_lsn, @to_lsn, @row_filter_option parameters, that will retrieve all the DML changes information, as shown below: