Troubleshooting

Rows skipped due to message size limits

Artie’s replication pipeline enforces a maximum message size per row to protect the underlying Kafka broker from out-of-memory conditions. The default limit is 2MB per CDC event. This limit can be increased on a case-by-case basis — contact the Artie team if your workload requires a higher threshold.

A CDC event includes both the before and after image of a row, which means a row’s on-disk size can effectively be doubled in transit. A 1.1MB row at rest may produce a 2.2MB CDC event that exceeds the limit.
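The doubling effect can be sketched as simple arithmetic. The snippet below is only a rough illustration: it assumes the event is about twice the row size (ignoring envelope metadata and encoding overhead), treats 2MB as 2,000,000 bytes, and uses hypothetical helper names. It is not Artie's actual size check.

```python
# Rough sketch: estimate CDC event size from a row's size at rest.
# Assumption: event ~= before image + after image, i.e. about 2x the row.
# Real events also carry metadata, so treat this as a lower bound.

LIMIT_BYTES = 2_000_000  # default 2MB limit, taken here as decimal megabytes


def estimated_event_bytes(row_bytes: int) -> int:
    """Return the approximate CDC payload size for an update to this row."""
    return 2 * row_bytes


def exceeds_limit(row_bytes: int, limit_bytes: int = LIMIT_BYTES) -> bool:
    """True if the estimated event would be larger than the limit."""
    return estimated_event_bytes(row_bytes) > limit_bytes


if __name__ == "__main__":
    for row_bytes in (900_000, 1_100_000):
        print(row_bytes, estimated_event_bytes(row_bytes), exceeds_limit(row_bytes))
```

Under this model, any row larger than roughly half the limit is worth flagging as at risk.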

Rows whose CDC payload exceeds this limit are skipped — replication continues for all other rows. This behavior differs from tools like Debezium, which throw an unrecoverable exception and halt the entire connector when a row exceeds the limit. Artie surfaces skipped rows via the row.skipped webhook event — see Webhooks to set up real-time alerts.

How to identify at-risk tables

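Run this query against your source PostgreSQL database to find tables with large rows. It samples up to 10,000 rows per table, so treat the result as an estimate rather than a true maximum:

-- A table name cannot be built dynamically in FROM, so each per-table query runs through query_to_xml().
SELECT schemaname, tablename, pg_size_pretty(max_bytes) AS max_row_size
FROM (
  SELECT schemaname, tablename,
         (xpath('/row/max_bytes/text()', query_to_xml(format(
           'SELECT max(pg_column_size(t.*)) AS max_bytes FROM (SELECT * FROM %I.%I LIMIT 10000) t',
           schemaname, tablename), false, true, '')))[1]::text::bigint AS max_bytes
  FROM pg_tables WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
) AS sized
ORDER BY max_bytes DESC NULLS LAST;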

Mitigations

Transparent message compression — automatically kicks in for rows exceeding 500KB. Enable at the pipeline level in the UI or via Terraform (v2.3.34+). See compression docs.
Exclude large columns — exclude binary blobs or encrypted payloads not needed in the destination. See column exclusion guide.
Webhook notifications — receive real-time alerts when a row is skipped, including table name and primary key. See Webhooks.

Replication slot too large and not decreasing

A growing replication slot means PostgreSQL is retaining WAL segments that have not yet been consumed. If the slot size keeps increasing and never decreases, the retained WAL can eventually exhaust disk space.

Symptoms

Replication slot size is growing continuously
WAL disk usage is increasing or triggering storage alerts
The retained_wal value from the diagnostic query below keeps climbing

Diagnosing the issue

Check your replication slot size and status:

SELECT
  slot_name,
  wal_status,
  active,
  pg_size_pretty(
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
  ) AS retained_wal
FROM pg_replication_slots;

Check for long-running transactions that may be holding back the slot LSN:

SELECT
  pid,
  now() - pg_stat_activity.query_start AS duration,
  query,
  state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes'
ORDER BY duration DESC;

Common causes and resolutions

Long-running transactions

Long-running or idle-in-transaction sessions prevent PostgreSQL from advancing the replication slot past their transaction boundary.

Resolution: Terminate the blocking session and consider setting idle_in_transaction_session_timeout to automatically kill idle transactions.

-- Terminate a specific session by PID
SELECT pg_terminate_backend(<pid>);

Idle database without heartbeats enabled

On idle databases (especially AWS RDS), WAL accumulates because the replicated tables produce no changes, so the replication slot has nothing to consume and never advances, while background writes keep generating WAL. RDS writes internal heartbeats to rdsadmin every 5 minutes, generating ~18 GB of WAL per day on an otherwise idle instance.

Resolution: Enable heartbeats in Artie to periodically advance the replication slot. See Enabling heartbeats for setup instructions. For RDS-specific details, see Preventing WAL growth on RDS.
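The ~18 GB figure is consistent with simple arithmetic if each 5-minute heartbeat closes out one full WAL segment. The sketch below assumes a 64 MB segment size, which is not stated above (check SHOW wal_segment_size; on your instance), and the function name is illustrative.

```python
# Back-of-the-envelope WAL growth on an otherwise idle instance.
# Assumptions (not from the text above): each heartbeat forces one WAL
# segment switch, and segments are 64 MB.

MINUTES_PER_DAY = 24 * 60


def idle_wal_gb_per_day(heartbeat_minutes: int = 5, segment_mb: int = 64) -> float:
    """WAL retained per day, in GB (1 GB = 1024 MB), if the slot never advances."""
    segments_per_day = MINUTES_PER_DAY // heartbeat_minutes  # 288 at 5 minutes
    return segments_per_day * segment_mb / 1024


if __name__ == "__main__":
    print(idle_wal_gb_per_day())  # 288 segments * 64 MB = 18432 MB = 18.0 GB
```

With a smaller segment size the daily growth shrinks proportionally, so the exact number depends on your instance configuration.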

Pipeline is paused or unhealthy

If the Artie pipeline is paused, stopped, or in an error state, it will not consume from the replication slot, causing WAL to accumulate.

Resolution: Check the pipeline status in the Artie dashboard and resume or re-deploy it.

max_slot_wal_keep_size is not configured

By default, max_slot_wal_keep_size is set to -1 (unlimited), meaning PostgreSQL will retain WAL indefinitely for a slot. This can lead to unbounded disk growth.

Resolution: Set max_slot_wal_keep_size to a reasonable value to cap WAL retention. Note that if the limit is reached, PostgreSQL will invalidate the slot (see Replication slot lost below).

SHOW max_slot_wal_keep_size;

Replication slot lost

A lost replication slot means the slot was either dropped or invalidated, so the pipeline can no longer stream changes from where it left off.

Symptoms

Pipeline errors indicating the replication slot does not exist
Errors referencing WAL segments that have been removed
pg_replication_slots returns no rows for your slot, or shows wal_status = 'lost'

Diagnosing the issue

Check whether the slot still exists and its status:

SELECT slot_name, wal_status, active, restart_lsn
FROM pg_replication_slots;

Check the current max_slot_wal_keep_size setting:

SHOW max_slot_wal_keep_size;

Common causes and resolutions

Slot invalidated by max_slot_wal_keep_size

This is often a consequence of the slot growing too large (see Replication slot too large above). When max_slot_wal_keep_size is configured, PostgreSQL will invalidate any slot whose retained WAL exceeds the limit.

Resolution: Re-deploy the pipeline in Artie to recreate the replication slot, then trigger a backfill to re-sync your data. To prevent this from happening again, enable heartbeats to keep the slot advancing. See Enabling heartbeats.

Manual slot deletion

Someone manually dropped the replication slot using pg_drop_replication_slot().

Resolution: Re-deploy the pipeline in Artie to recreate the slot, then trigger a backfill.

Database failover

During a failover event (especially on Amazon Aurora), replication slots on the old primary are not automatically carried over to the new primary.

Resolution: Re-deploy the pipeline in Artie to create a new replication slot on the new primary, then trigger a backfill to re-sync data.

Provider-specific WAL retention limits

Some managed PostgreSQL providers impose their own WAL retention limits that can cause slot invalidation independently of max_slot_wal_keep_size.

Resolution: Check your provider’s documentation for WAL retention policies. Re-deploy the pipeline and trigger a backfill to recover. Enable heartbeats to keep the slot active and prevent future invalidation. See Enabling heartbeats.

Connection terminated by administrator command

Artie logs may show a connection error like:

FATAL: terminating connection due to administrator command (SQLSTATE 57P01)

This means the connection was terminated from outside Artie, typically by a call to pg_terminate_backend() on Artie’s session or by a database shutdown. The impact depends on the connection type:

Replication: Artie will automatically reconnect and resume from where it left off. No data is lost, but repeated terminations cause unnecessary restarts and latency spikes.
Backfills: Backfills do not recover automatically. A terminated backfill connection causes the backfill job to fail, and the backfill must be manually re-triggered from the Artie dashboard.

A separate connection reaper is terminating Artie's session

The most common cause is a connection management tool (PgBouncer, an in-house idle connection reaper, or a monitoring script) configured to terminate long-lived or idle-looking sessions. Replication connections are persistent by design: they appear idle between commits, which makes them a common false-positive target for reapers.

Resolution: Identify the tool or script calling pg_terminate_backend() and exclude Artie’s replication user or walsender backend type from its termination logic.
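As an illustration of that exclusion, here is a minimal sketch of the decision an in-house reaper might make per session. The dictionary keys mirror the usename and backend_type columns of pg_stat_activity; the user name "artie", the idle_seconds field, and the 600-second threshold are all hypothetical placeholders for whatever your tool uses.

```python
# Sketch of a reaper filter that never targets replication connections.
# Keys mirror pg_stat_activity columns; "artie" is a placeholder user name.

PROTECTED_USERS = {"artie"}              # hypothetical replication user
PROTECTED_BACKEND_TYPES = {"walsender"}  # backend type of replication senders


def should_terminate(session: dict, max_idle_seconds: int = 600) -> bool:
    """Return True only for idle sessions that are safe to terminate."""
    if session.get("backend_type") in PROTECTED_BACKEND_TYPES:
        return False
    if session.get("usename") in PROTECTED_USERS:
        return False
    return session.get("idle_seconds", 0) > max_idle_seconds


if __name__ == "__main__":
    replication = {"usename": "artie", "backend_type": "walsender", "idle_seconds": 7200}
    stale_app = {"usename": "app", "backend_type": "client backend", "idle_seconds": 7200}
    print(should_terminate(replication), should_terminate(stale_app))
```

Checking both the user and the backend type also protects backfill connections, which run as regular client backends under the same user.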

Database restart or failover

A database restart (planned maintenance, provider-initiated restart, or failover) will terminate all active connections with this error.

Resolution: No action needed for replication. Artie will reconnect automatically. For backfills, re-trigger them from the Artie dashboard once the database is back up.

Setting max_slot_wal_keep_size

max_slot_wal_keep_size caps how much WAL PostgreSQL retains for a replication slot before invalidating it. Treat this as a break-glass mechanism: if the limit is hit, the slot is invalidated and a full backfill is required to recover.

Setting this value too low will cause frequent slot invalidation and repeated backfills, which is far more disruptive than retaining extra WAL. Err on the side of setting it higher.

When choosing a value, consider:

Your database size - larger databases generate more WAL and need more headroom.
Historical slot size - check how large your slot has grown during normal operations and past incidents (pipeline pauses, long-running transactions, etc.).
Set it high enough that you don’t have to think about it - aim for at least 3-5x your observed peak slot size.

-- Check current slot size
SELECT
  slot_name,
  pg_size_pretty(
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
  ) AS retained_wal
FROM pg_replication_slots;

Pair this with heartbeats to keep the slot advancing during idle periods.
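The 3-5x guideline can be turned into a quick calculation. The sketch below is illustrative only: the function name is hypothetical, and the peak value is whatever you have observed from the retained_wal query over time.

```python
# Sketch: derive a candidate range for max_slot_wal_keep_size from the
# observed peak slot size, using the 3-5x guideline.


def keep_size_range_gb(peak_slot_gb: float) -> tuple[float, float]:
    """Return (low, high) candidate values in GB: 3x and 5x the observed peak."""
    if peak_slot_gb <= 0:
        raise ValueError("peak_slot_gb must be positive")
    return (3 * peak_slot_gb, 5 * peak_slot_gb)


if __name__ == "__main__":
    low, high = keep_size_range_gb(20)
    print(f"Set max_slot_wal_keep_size between {low:.0f} GB and {high:.0f} GB")
```

Whatever value you pick still has to fit within the free disk space on the instance, since retained WAL counts against storage.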
