Preventing WAL Growth on Postgres DB Running on AWS RDS

Software Engineering
Updated on June 10, 2024

From working with Postgres on AWS RDS, we have noticed that idle or dev databases occasionally experience replication slot overflow, which causes issues for customers. It has also happened to us while testing CDC pipelines. We wrote this guide to explain why this happens and how to prevent replication slot overflow.

Artie Cloud customers can avoid replication slot overflow for low traffic databases by enabling heartbeats on the dashboard.

For anyone using Postgres on AWS RDS, it’s possible you’ve run into the scenario of an idle or unused dev database suddenly running out of storage because of replication slot overflow. All of a sudden you’re up in the middle of the night trying to understand why your database is down and why its storage is full.

What is causing replication slots to overflow?

Any CDC (change data capture) setup on Postgres starts with a replication slot. In Postgres, change events are recorded and stored in the database’s write-ahead log (WAL). The WAL exists to ensure data integrity by logging the records that describe every change, and it also gives downstream applications a reliable stream of CDC events to subscribe to and consume.
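For reference, this is roughly what creating and inspecting a logical replication slot looks like. The slot name and output plugin below are illustrative; a tool like Debezium typically creates the slot for you, commonly using the built-in pgoutput plugin.

-- Create a logical replication slot (slot name and plugin are illustrative)
SELECT * FROM pg_create_logical_replication_slot('cdc_slot', 'pgoutput');
-- List existing replication slots
SELECT slot_name, plugin, slot_type, active FROM pg_replication_slots;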

Replication slot overflow happens when the WAL accumulates and grows, consumes all of your database’s storage, and causes your database to go down. But why does this happen to idle/unused databases?

The underlying cause is AWS heartbeats. AWS RDS periodically writes a heartbeat to an internal table every 5 minutes, whether the database is being actively used or not. These heartbeats land in the WAL, and the default WAL segment size on RDS is 64MB. AWS writes these heartbeats for several reasons, including monitoring the health of its databases.

For an active database, heartbeats do not cause an issue because the WAL is almost constantly being drained as new CDC logs are processed by downstream consumers. However, for idle databases with no CDC logs, heartbeats cause the WAL to accumulate by 64MB every 5 minutes, or roughly 18.4GB per day! Most CDC applications don’t observe these heartbeats, so they never get processed, the replication slot’s position never advances, and Postgres can’t recycle the old WAL segments. After enough time has passed, the replication slot overflows and your database goes down.
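You can watch this happen by checking how much WAL each replication slot is holding back. A minimal query for this (pg_current_wal_lsn and pg_wal_lsn_diff are standard Postgres functions; the wal_status column requires Postgres 13+):

-- How much WAL is each replication slot retaining?
SELECT slot_name, active, wal_status,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;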

How do we prevent WAL growth for a PostgreSQL database running on AWS RDS?

Note that heartbeats are only necessary for low-traffic and idle databases (dev or test); they are not needed for active databases.

If you are using Debezium to read CDC logs from the WAL, you can turn on its heartbeat feature (configured via the heartbeat.interval.ms and heartbeat.action.query connector properties). First create a heartbeat table in the idle/low-traffic database; Debezium will then periodically ping the database with an update against it, which creates a CDC event and prevents WAL growth.

-- Create a heartbeat table
CREATE TABLE test_heartbeat_table (id text PRIMARY KEY, ts timestamp);
-- Insert one row so subsequent updates will create a CDC event
INSERT INTO test_heartbeat_table (id, ts) VALUES ('1', NOW());

When heartbeats are enabled, Artie periodically issues this command to advance the replication slot:

UPDATE test_heartbeat_table SET ts = now() WHERE id = '1';

Other best practices to consider

  • Monitor your instance’s FreeStorageSpace metric and set up alerts
  • Enable storage autoscaling
  • Check for long-running queries (see the query below)
  • Query pg_locks to identify if any locks are being held for a long time
  • Set statement_timeout and idle_in_transaction_session_timeout to prevent long-running queries
  • Configure max_slot_wal_keep_size (Postgres 13+) to cap how much WAL a replication slot can retain (you can verify the value in effect with the snippet after this list).
    • If units are not specified, the value is interpreted in megabytes.
    • If a slot exceeds this limit, it is invalidated (marked as lost) so the retained WAL can be removed, protecting the database from running out of storage.
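On RDS these parameters are set through the DB parameter group rather than ALTER SYSTEM, but it’s easy to check what is currently in effect:

-- Verify the timeouts and WAL retention cap currently in effect
SHOW statement_timeout;
SHOW idle_in_transaction_session_timeout;
SHOW max_slot_wal_keep_size;  -- -1 means unlimited (parameter available in Postgres 13+)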

Identify long-running queries:

SELECT
  pid,
  now() - pg_stat_activity.query_start AS duration,
  query,
  state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '1 minute';
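If this surfaces a runaway query, it can be cancelled with pg_cancel_backend (or, as a last resort, pg_terminate_backend); the pid below is illustrative:

-- Cancel the running query for a given backend pid
SELECT pg_cancel_backend(12345);
-- Terminate the whole backend if cancelling isn't enough
SELECT pg_terminate_backend(12345);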

Identify any locks:

SELECT * FROM pg_locks;
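SELECT * FROM pg_locks only lists raw lock entries; joining against pg_stat_activity shows which session holds each lock and how long its transaction has been open (a minimal sketch):

-- Show locks alongside the holding session and its transaction age
SELECT l.locktype, l.relation::regclass AS relation, l.mode, l.granted,
       a.pid, a.state, now() - a.xact_start AS xact_age, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
ORDER BY xact_age DESC NULLS LAST;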
Author: Jacqueline Cheong, Co-founder & CEO