June 19, 2025

CDC for Tables Without Primary Keys

Some tables are weird. No primary key (PK), maybe just a unique index or some composite hack someone added in 2017. Until now, those were off-limits for replication.

You can now override PK requirements by specifying a unique index — including composite indexes. Artie will respect the exact column order to ensure optimal performance.

Why index-based PK overrides matter:

Not every table has a clean PK. Some use unique indexes or composite keys that aren’t formally declared as PKs. Until now, these tables were difficult (or impossible) to replicate. This change addresses one of the most common blockers for CDC at scale.

What’s changed:

  • PK override: Define row identity with a unique index
  • Use composite keys — even if unofficial or unenforced
  • Preserve the exact index column order: it affects how changes are captured and how queries perform during replication (e.g., an index on email, account_id, created_at; see the sketch below)

This unlocks flexible replication for legacy systems, denormalized tables, and high-volume sources — without compromising performance.
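
To make the column-order point concrete, here’s a minimal sketch of how row identity can be derived from a unique index when no PK exists. This is illustrative only (not Artie’s actual implementation), and the index columns are hypothetical:

```python
# Sketch: building a composite row identity from a unique index (no PK).
# The index columns and their order are hypothetical examples.

INDEX_COLUMNS = ["email", "account_id", "created_at"]  # must match the index's exact order

def row_identity(row: dict) -> tuple:
    """Build the composite key in the same order as the unique index.

    Keeping the index's column order means lookups against the source index
    stay efficient, and all change events for the same row collapse to the
    same key during replication.
    """
    return tuple(row[col] for col in INDEX_COLUMNS)

event = {"email": "a@example.com", "account_id": 42, "created_at": "2017-03-01", "plan": "pro"}
print(row_identity(event))  # ('a@example.com', 42, '2017-03-01')
```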

When to use index-based keys:

  • Your table lacks a formal PK, but has a unique constraint or index
  • You rely on composite keys to identify rows
  • You’re dealing with legacy systems or data models that weren’t built with CDC in mind

How to enable key overrides:

Reach out to enable key overrides — we’ll help define your index logic and validate it during setup.

June 17, 2025

Backfill Tuning: Picking the Right Batch Size

You can now control how many rows Artie processes at a time during backfills. The default is now 25,000 rows per chunk (up from 5,000), but you can tune this based on performance vs. load tradeoffs.

Why backfill batch size matters:

Backfills aren’t one-size-fits-all. Some teams want speed. Others are sensitive to database load and tiptoeing around a production DB at 2am. Until now, everyone got the same batch size of 5,000 rows per chunk. Now you can tune backfills to match your style:

  • The default is now 25,000 rows: we benchmarked a range of batch sizes, and 25,000 won out
  • You have control: adjust the batch size to fit your environment

How to tune batch size based on your workload:

  • Speed up backfills: larger chunks mean fewer queries and higher throughput, though overly large chunks can backfire, so it’s about finding the balance (see the sketch below)
  • Reduce DB load: smaller chunks keep individual queries fast and lower the impact on your source
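
To illustrate the tradeoff, here’s a rough sketch of a keyset-paginated backfill where the batch size controls how many rows each query pulls. This is not Artie’s backfill code; the table, column, and helper names are hypothetical:

```python
# Sketch: chunked backfill with a tunable batch size (keyset pagination).
# The "orders" table, "id" column, and injected execute/write helpers are hypothetical.

def backfill(execute, write, batch_size=25_000, last_id=0):
    """Copy the source table in chunks of batch_size rows, keyed on a unique column."""
    while True:
        rows = execute(
            "SELECT * FROM orders WHERE id > %s ORDER BY id LIMIT %s",
            (last_id, batch_size),
        )
        if not rows:
            break                     # backfill complete
        write(rows)                   # larger chunks = fewer round trips, heavier queries
        last_id = rows[-1]["id"]      # resume point for the next chunk
```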

Need help tuning batch size?

If you’re unsure what batch size is right for your workload, reach out — we’ll help you tune it.

Read Once and Write to Multiple Destinations

You can now sync data from a single database to multiple destinations — all from the same connector.

Why this matters:

  • Reduce load on production databases by avoiding duplicate reads and minimizing replication slot overhead
  • Fan out to multiple tools, e.g., write to both Snowflake and Redshift (see the sketch below)
  • Support diverse use cases in parallel: analytics, ML, real-time alerting
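
The pattern behind this is read-once, write-many: consume the change stream a single time and hand each event to every destination. Here’s a minimal sketch (the reader and writer interfaces are hypothetical, not Artie’s API):

```python
# Sketch: read once from the source, write to every configured destination.
# The change_stream iterable and writer objects are hypothetical.

def fan_out(change_stream, writers):
    """Consume each change event once and deliver it to all destination writers."""
    for event in change_stream:      # single read from the source (one replication slot)
        for writer in writers:       # e.g., a Snowflake writer and a Redshift writer
            writer.write(event)

# Usage sketch:
# fan_out(postgres_changes, [snowflake_writer, redshift_writer])
```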

This feature is designed for organizations that:

  • Operate across multiple data platforms 
  • Serve many internal teams with different tools
  • Need to scale data infrastructure without increasing operational burden

If you’re planning a multi-destination architecture, we’d be happy to help — just reach out.

June 2, 2025

Iceberg Support Using S3 Tables

This launch adds something big: support for Apache Iceberg using S3 Tables.

Artie customers can now:

  • Stream high-volume datasets into Iceberg-backed tables stored on S3
  • Use S3 Tables’ fully managed catalog, compaction, and snapshot management
  • Query efficiently with Spark SQL (via EMR + Apache Livy) without wrestling with cluster glue
  • Get up to 3x faster query performance thanks to automatic background compaction

Why is Iceberg a big deal? Because it solves what’s frustrating and limiting about traditional S3-based data lakes. Hive tables are rigid and brittle, with no snapshotting or time travel. Delta Lake is powerful but tied to the Databricks ecosystem. Plain S3 file storage? No metadata layer, no transactions, no query optimizations.

Instead, Iceberg gives you a fully open, cloud-native table format with smooth schema evolution, hidden partitioning, snapshot isolation, and time-travel queries – all with broad engine support (Spark, Trino, Flink, Presto, Hive).

We’re excited about this because it means Artie customers can confidently move massive data volumes without needing to hand-build the plumbing – Iceberg and S3 Tables handle schema changes, partitioning, compaction, and snapshot management behind the scenes, so the system scales cleanly without brittle, custom workflows.
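
Once a pipeline is landing data in an Iceberg table on S3 Tables, querying it from Spark SQL looks like ordinary SQL, including Iceberg’s standard time-travel clause. This is a sketch that assumes a Spark session already configured with an Iceberg catalog pointing at your S3 Tables bucket; the catalog, namespace, and table names are placeholders:

```python
# Sketch: querying an Iceberg table from Spark SQL (e.g., on EMR via Livy).
# Assumes the session is already wired to an Iceberg catalog backed by S3 Tables;
# my_catalog.analytics.orders is a placeholder name.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-example").getOrCreate()

# Plain query against the Iceberg table
spark.sql("SELECT COUNT(*) FROM my_catalog.analytics.orders").show()

# Time travel: read the table as of an earlier snapshot timestamp
spark.sql(
    "SELECT * FROM my_catalog.analytics.orders TIMESTAMP AS OF '2025-06-01 00:00:00' LIMIT 10"
).show()
```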

📚 Want to set up Iceberg-backed pipelines? Docs to get started: https://artie.com/docs/destinations/iceberg/s3tables

May 14, 2025

S3 Iceberg Destination (Beta)

S3 Iceberg is now available in beta! This new destination uses AWS’s recently released S3 Tables support, allowing you to replicate directly into Apache Iceberg tables backed by S3. It’s a big unlock for teams building modern lakehouse architectures on open standards.

Column Inclusion Rules

You can now define an explicit allowlist of columns to replicate - ideal for PII or other sensitive data. This expands our column-level controls alongside column exclusion and hashing. Only the fields you specify get replicated. Everything else stays out.
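
Conceptually it’s a strict allowlist filter over each record; here’s a minimal sketch (the column names are hypothetical, and this isn’t Artie’s config format):

```python
# Sketch: column allowlisting - only explicitly listed fields are kept.
# The column names below are hypothetical examples.

ALLOWED_COLUMNS = {"id", "account_id", "created_at", "status"}  # note: no email, ssn, etc.

def filter_row(row: dict) -> dict:
    """Drop every field that isn't explicitly allowlisted."""
    return {col: val for col, val in row.items() if col in ALLOWED_COLUMNS}

row = {"id": 1, "email": "a@example.com", "ssn": "000-00-0000", "status": "active"}
print(filter_row(row))  # {'id': 1, 'status': 'active'}
```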

Autopilot for New Tables

Stop manually hunting for new tables in your source DB. Autopilot finds and syncs them for you - zero config required. Turn it on via:

Deployment → Destination Settings → Advanced Settings → “Auto-replicate new tables”

Data Quality: Rows Affected Checks

To further enhance the data integrity built into our pipeline, we’ve added another guardrail: verifying the number of rows affected during each database operation.

For example, during merge steps (such as in Snowflake), we confirm ROWS_LOADED from copy commands and validate the totals for inserted, updated, or deleted rows. This reinforces the robustness of our data replication process and gives us another way to catch issues early and ensure replication integrity.
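
In spirit, the guardrail is a simple reconciliation of expected vs. reported counts. Here’s a minimal sketch (illustrative only, with hypothetical names, not the actual implementation):

```python
# Sketch: verify that warehouse-reported row counts match what we expected to apply.
# The function name and result shape are hypothetical.

def check_rows_affected(expected: int, inserted: int, updated: int, deleted: int) -> None:
    """Raise if the reported totals don't add up to the rows we sent."""
    reported = inserted + updated + deleted
    if reported != expected:
        raise RuntimeError(
            f"Row count mismatch: sent {expected} rows, warehouse reported {reported} "
            f"(inserted={inserted}, updated={updated}, deleted={deleted})"
        )

# e.g., after a merge step, using counts parsed from the warehouse's result:
# check_rows_affected(expected=len(batch), inserted=res.inserted, updated=res.updated, deleted=res.deleted)
```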

Read Once, Write Many

We recently launched the ability to read once and write to multiple destinations. This means you no longer need multiple replication slots on your source database.

For example, by reading data just once from your Postgres instance and simultaneously replicating it to Snowflake and Redshift, you reduce database overhead and simplify replication architecture.

Multi-Data Plane Support

Artie now supports hosting pipelines across multiple data planes, whether you’re on our cloud or using your own (BYOC) infrastructure.

For example, run one pipeline from Postgres to Snowflake in AWS US-East-1 and another from MySQL to Snowflake in AWS US-West-2.

Oracle Fan-in

With our Oracle Fan-in feature, you can now easily replicate data from thousands of Oracle sources - without painful manual setups or infrastructure overload. Fan-in reduces your Kafka topic sprawl, lowers infrastructure costs, and simplifies real-world, complex data replication.