This is the first part of our deep dive into Postgres logical replication. In this part, we will focus on
the REPLICA IDENTITY
property of a table and how it affects logical replication.
Postgres tables require a replica identity to be configured in order to capture the changes made to the table. Replica identity specifies the type of information written to the write-ahead log with respect to what the previous values were.
By default, replica identity will use the table's primary keys as the identifiers.
Our goal with this blog post is to demystify the concept of replica identity and explain why you would want to alter your replica identity in certain situations. We also discuss performance implications and other factors to consider.
The key components to logical replication:
Notes:
Component | Description |
---|---|
Write-ahead logs | Postgres uses write-ahead logs (WAL) internally to record every database transaction. You can control how much information is written by toggling WAL_LEVEL , which we will cover in another blog post. |
Publication | A publication is a collection of tables that you want to replicate. Changes made to those tables are then captured and sent to subscribers. You can create a publication with the CREATE PUBLICATION command. |
You can change a table's replica identity by running a command like this:
ALTER TABLE table REPLICA IDENTITY value;
The valid settings are:
DEFAULT
USING INDEX index_name
(must be unique)FULL
NOTHING
For this section, we are using this as the example.
CREATE TABLE customers (
id integer DEFAULT nextval('inventory.customers_id_seq'::regclass) PRIMARY KEY,
first_name character varying(255) NOT NULL,
last_name character varying(255) NOT NULL,
email character varying(255) NOT NULL UNIQUE
);
CREATE UNIQUE INDEX customers_email_key ON inventory.customers(email text_ops);
INSERT INTO customers (id, first_name, last_name, email)
VALUES (1001, 'Sally', 'Thomas', '[email protected]');
UPDATE customers SET first_name = 'Sally #2', last_name = 'Thomas #2' where id = 1001;
The default replica identity is the primary key(s) of the table. If you are trying to replicate a table that does not
have primary key(s), you can either alter the table to add keys (see our
guide here), or you can modify the replica identity
to be FULL
.
Kafka message produced by Debezium
Partition Key:
{"payload":{"id":1001}}
Message:
{
"payload": {
"before": null,
"after": {
"id": 1001,
"first_name": "Sally #2",
"last_name": "Thomas #2",
"email": "[email protected]"
},
"op": "u",
"ts_ms": 1716575453417,
"transaction": null
}
}
The USING INDEX
replica identity will capture the columns of the index you specify. The index must be unique and its columns must be NOT NULL
.
Kafka message produced by Debezium
Partition Key:
{"email":"[email protected]"}
Message:
{
"payload": {
"before": null,
"after": {
"id": 1001,
"first_name": "Sally #2",
"last_name": "Thomas #2",
"email": "[email protected]"
},
"op": "u",
"ts_ms": 1716575622471,
"transaction": null
}
}
The FULL
replica identity will capture all the columns of the table.
Kafka message produced by Debezium
Partition Key:
{"payload":{"id":1001}}
Message:
{
"payload": {
"before": {
"id": 1001,
"first_name": "Sally",
"last_name": "Thomas",
"email": "[email protected]"
},
"after": {
"id": 1001,
"first_name": "Sally #2",
"last_name": "Thomas #2",
"email": "[email protected]"
},
"op": "u",
"ts_ms": 1716575379815,
"transaction": null
}
}
The NOTHING
replica identity will not capture any columns of the table. This will cause an error when you try to update
a table with this replica identity.
ERROR: cannot update table "customers" because it does not have a replica identity and publishes updates
The main use case for changing replica identity is to change from DEFAULT → FULL
. You would typically do this for two
reasons:
Scenario 1: you don't want to deal with TOAST columns
In a nutshell, TOAST columns are large columns where the values will not be written to WAL if the values did not change. This then means that your downstream application needs to understand the TOASTED value placeholder and handle it accordingly.
For Artie customers, we automatically detect and handle this by adding conditional update clauses.
If you’re not using Artie, you can handle TOAST columns by doing something like this:
UPDATE target t SET t.toast_column = s.toast_column FROM staging as s WHERE t.id = s.id;
To handle a TOAST column, we leverage CASE statements.
UPDATE target t
SET t.toast_column = CASE
WHEN ( s.toast_column != 'TOASTED_VALUE_PLACEHOLDER' )
THEN s.toast_column ELSE t.toast_column
END
FROM staging AS s
WHERE t.id = s.id;
Scenario 2: you need the previous row value
One common reason we change replica identity to FULL
at Artie is that we need the previous values in order to replicate a deleted row to a downstream table that has cluster keys specified.
For example, you may have a table called accounts
, and in Snowflake the table is clustered by DATE_TRUNC('day', created_at)
. We will need the created_at
for the previous row if we want to replicate a DELETE
event.
Workaround
If you need created_at
and don't want to change the replica identity, you can add created_at
to be part of the table's primary key.
ALTER TABLE customers DROP CONSTRAINT customers_pkey;
ALTER TABLE customers ADD PRIMARY KEY (id, created_at);
This is a great workaround for tables that have a lot of TOAST columns.
TL;DR -
DEFAULT → FULL
on a table-by-table basis is mostly fine.FULL
replica identity.By setting replica identity to FULL
:
The actual incremental load depends on:
INSERT
, UPDATE
, DELETE
events. INSERT
doesn't have previous data, so it would not
increase load.Click here to see benchmark data from Xata.