Skip to main content

Documentation Index

Fetch the complete documentation index at: https://artie.com/docs/llms.txt

Use this file to discover all available pages before exploring further.

Replication slot too large and not decreasing

A growing replication slot means PostgreSQL is retaining WAL segments that have not yet been consumed. If the slot size keeps increasing and never decreases, the retained WAL can eventually exhaust disk space.

Symptoms

  • Replication slot size is growing continuously
  • WAL disk usage is increasing or triggering storage alerts
  • The retained_wal value from the diagnostic query below keeps climbing

Diagnosing the issue

Check your replication slot size and status:
SELECT
  slot_name,
  wal_status,
  active,
  pg_size_pretty(
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
  ) AS retained_wal
FROM pg_replication_slots;
Check for long-running transactions that may be holding back the slot LSN:
SELECT
  pid,
  now() - pg_stat_activity.query_start AS duration,
  query,
  state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes'
ORDER BY duration DESC;

Common causes and resolutions

Long-running or idle-in-transaction sessions prevent PostgreSQL from advancing the replication slot past their transaction boundary.Resolution: Terminate the blocking session and consider setting idle_in_transaction_session_timeout to automatically kill idle transactions.
-- Terminate a specific session by PID
SELECT pg_terminate_backend(<pid>);
On idle databases (especially AWS RDS), WAL segments accumulate because there are no changes for the replication slot to consume. RDS writes internal heartbeats to rdsadmin every 5 minutes, generating ~18 GB of WAL per day on an otherwise idle instance.Resolution: Enable heartbeats in Artie to periodically advance the replication slot. See Enabling heartbeats for setup instructions.For RDS-specific details, see Preventing WAL growth on RDS.
If the Artie pipeline is paused, stopped, or in an error state, it will not consume from the replication slot, causing WAL to accumulate.Resolution: Check the pipeline status in the Artie dashboard and resume or re-deploy it.
By default, max_slot_wal_keep_size is set to -1 (unlimited), meaning PostgreSQL will retain WAL indefinitely for a slot. This can lead to unbounded disk growth.Resolution: Set max_slot_wal_keep_size to a reasonable value to cap WAL retention. Note that if the limit is reached, PostgreSQL will invalidate the slot (see Replication slot lost below).
SHOW max_slot_wal_keep_size;

Replication slot lost

A lost replication slot means the slot was either dropped or invalidated, so the pipeline can no longer stream changes from where it left off.

Symptoms

  • Pipeline errors indicating the replication slot does not exist
  • Errors referencing WAL segments that have been removed
  • pg_replication_slots returns no rows for your slot, or shows wal_status = 'lost'

Diagnosing the issue

Check whether the slot still exists and its status:
SELECT slot_name, wal_status, active, restart_lsn
FROM pg_replication_slots;
Check the current max_slot_wal_keep_size setting:
SHOW max_slot_wal_keep_size;

Common causes and resolutions

This is often a consequence of the slot growing too large (see Replication slot too large above). When max_slot_wal_keep_size is configured, PostgreSQL will invalidate any slot whose retained WAL exceeds the limit.Resolution: Re-deploy the pipeline in Artie to recreate the replication slot, then trigger a backfill to re-sync your data. To prevent this from happening again, enable heartbeats to keep the slot advancing. See Enabling heartbeats.
Someone manually dropped the replication slot using pg_drop_replication_slot().Resolution: Re-deploy the pipeline in Artie to recreate the slot, then trigger a backfill.
During a failover event (especially on Amazon Aurora), replication slots on the old primary are not automatically carried over to the new primary.Resolution: Re-deploy the pipeline in Artie to create a new replication slot on the new primary, then trigger a backfill to re-sync data.
Some managed PostgreSQL providers impose their own WAL retention limits that can cause slot invalidation independently of max_slot_wal_keep_size.Resolution: Check your provider’s documentation for WAL retention policies. Re-deploy the pipeline and trigger a backfill to recover. Enable heartbeats to keep the slot active and prevent future invalidation. See Enabling heartbeats.

Setting max_slot_wal_keep_size

max_slot_wal_keep_size caps how much WAL Postgres retains for a replication slot before invalidating it. Treat this as a break-glass mechanism - if the limit is hit, the slot is invalidated and a full backfill is required to recover.
Setting this value too low will cause frequent slot invalidation and repeated backfills, which is far more disruptive than retaining extra WAL. Err on the side of setting it higher.
When choosing a value, consider:
  1. Your database size - larger databases generate more WAL and need more headroom.
  2. Historical slot size - check how large your slot has grown during normal operations and past incidents (pipeline pauses, long-running transactions, etc.).
  3. Set it high enough that you don’t have to think about it - aim for at least 3-5x your observed peak slot size.
-- Check current slot size
SELECT
  slot_name,
  pg_size_pretty(
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
  ) AS retained_wal
FROM pg_replication_slots;
Pair this with heartbeats to keep the slot advancing during idle periods.