June 19, 2025

CDC for Tables Without Primary Keys

Some tables are weird. No primary key (PK), maybe just a unique index or some composite hack someone added in 2017. Until now, those were off-limits for replication.

You can now override PK requirements by specifying a unique index — including composite indexes. Artie will respect the exact column order to ensure optimal performance.

Why index-based PK overrides matter:

Not every table has a clean PK. Some use unique indexes or composite keys that aren’t formally declared as PKs. Until now, these tables were difficult (or impossible) to replicate. This change addresses one of the most common blockers for CDC at scale.

What’s changed:

  • PK override: Define row identity with a unique index
  • Use composite keys — even if unofficial or unenforced
  • Preserve the exact index column order: it affects how changes are captured and how queries perform during replication (e.g., an index on email, account_id, created_at; see the sketch below)

This unlocks flexible replication for legacy systems, denormalized tables, and high-volume sources — without compromising performance.
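
To make the column-order point concrete, here’s a minimal sketch of how row identity can be derived from a unique index when no PK exists. This is illustrative only (not Artie’s actual implementation), and the index columns are hypothetical:

```python
# Sketch: building a composite row identity from a unique index (no PK).
# The index columns and their order are hypothetical examples.

INDEX_COLUMNS = ["email", "account_id", "created_at"]  # must match the index's exact order

def row_identity(row: dict) -> tuple:
    """Build the composite key in the same order as the unique index.

    Keeping the index's column order means lookups against the source index
    stay efficient, and all change events for the same row collapse to the
    same key during replication.
    """
    return tuple(row[col] for col in INDEX_COLUMNS)

event = {"email": "a@example.com", "account_id": 42, "created_at": "2017-03-01", "plan": "pro"}
print(row_identity(event))  # ('a@example.com', 42, '2017-03-01')
```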

When to use index-based keys:

  • Your table lacks a formal PK, but has a unique constraint or index
  • You rely on composite keys to identify rows
  • You’re dealing with legacy systems or data models that weren’t built with CDC in mind

How to enable key overrides:

Reach out to enable key overrides — we’ll help define your index logic and validate it during setup.

June 17, 2025

Backfill Tuning: Picking the Right Batch Size

You can now control how many rows Artie processes at a time during backfills. The default is now 25,000 rows per chunk (up from 5,000), but you can tune this based on performance vs. load tradeoffs.

Why backfill batch size matters:

Backfills aren’t one-size-fits-all. Some teams want speed. Others are sensitive to database load and tiptoeing around a production DB at 2am. Until now, everyone got the same batch size of 5,000 rows per chunk. Now you can tune backfills to match your style:

  • The default is now 25,000 rows: we benchmarked a range of batch sizes, and 25,000 won out
  • You have control: adjust the batch size to fit your environment

How to tune batch size based on your workload:

  • Speed up backfills: larger chunks mean fewer queries and higher throughput, though overly large chunks can backfire, so it’s about finding the balance (see the sketch below)
  • Reduce DB load: smaller chunks keep individual queries fast and lower the impact on your source
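
To illustrate the tradeoff, here’s a rough sketch of a keyset-paginated backfill where the batch size controls how many rows each query pulls. This is not Artie’s backfill code; the table, column, and helper names are hypothetical:

```python
# Sketch: chunked backfill with a tunable batch size (keyset pagination).
# The "orders" table, "id" column, and injected execute/write helpers are hypothetical.

def backfill(execute, write, batch_size=25_000, last_id=0):
    """Copy the source table in chunks of batch_size rows, keyed on a unique column."""
    while True:
        rows = execute(
            "SELECT * FROM orders WHERE id > %s ORDER BY id LIMIT %s",
            (last_id, batch_size),
        )
        if not rows:
            break                     # backfill complete
        write(rows)                   # larger chunks = fewer round trips, heavier queries
        last_id = rows[-1]["id"]      # resume point for the next chunk
```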

Need help tuning batch size?

If you’re unsure what batch size is right for your workload, reach out — we’ll help you tune it.

Read Once and Write to Multiple Destinations

You can now sync data from a single database to multiple destinations — all from the same connector.

Why this matters:

  • Reduce load on production databases by avoiding duplicate reads and minimizing replication slot overhead
  • Fan out to multiple tools, e.g., write to both Snowflake and Redshift (see the sketch below)
  • Support diverse use cases in parallel: analytics, ML, real-time alerting
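
The pattern behind this is read-once, write-many: consume the change stream a single time and hand each event to every destination. Here’s a minimal sketch (the reader and writer interfaces are hypothetical, not Artie’s API):

```python
# Sketch: read once from the source, write to every configured destination.
# The change_stream iterable and writer objects are hypothetical.

def fan_out(change_stream, writers):
    """Consume each change event once and deliver it to all destination writers."""
    for event in change_stream:      # single read from the source (one replication slot)
        for writer in writers:       # e.g., a Snowflake writer and a Redshift writer
            writer.write(event)

# Usage sketch:
# fan_out(postgres_changes, [snowflake_writer, redshift_writer])
```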

This feature is designed for organizations that:

  • Operate across multiple data platforms 
  • Serve many internal teams with different tools
  • Need to scale data infrastructure without increasing operational burden

If you’re planning a multi-destination architecture, we’d be happy to help — just reach out.

June 2, 2025

Iceberg Support Using S3 Tables

This launch adds something big: support for Apache Iceberg using S3 Tables.

Artie customers can now:

  • Stream high-volume datasets into Iceberg-backed tables stored on S3
  • Use S3 Tables’ fully managed catalog, compaction, and snapshot management
  • Query efficiently with Spark SQL (via EMR + Apache Livy) without wrestling with cluster glue
  • Get up to 3x faster query performance thanks to automatic background compaction

Why is Iceberg a big deal? Because it solves what’s frustrating and limiting about traditional S3-based data lakes. Hive tables are rigid and brittle, with no snapshotting or time travel. Delta Lake is powerful but tied to the Databricks ecosystem. Plain S3 file storage? No metadata layer, no transactions, no query optimizations.

Instead, Iceberg gives you a fully open, cloud-native table format with smooth schema evolution, hidden partitioning, snapshot isolation, and time-travel queries – all with broad engine support (Spark, Trino, Flink, Presto, Hive).

We’re excited about this because it means Artie customers can confidently move massive data volumes without needing to hand-build the plumbing – Iceberg and S3 Tables handle schema changes, partitioning, compaction, and snapshot management behind the scenes, so the system scales cleanly without brittle, custom workflows.
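
Once a pipeline is landing data in an Iceberg table on S3 Tables, querying it from Spark SQL looks like ordinary SQL, including Iceberg’s standard time-travel clause. This is a sketch that assumes a Spark session already configured with an Iceberg catalog pointing at your S3 Tables bucket; the catalog, namespace, and table names are placeholders:

```python
# Sketch: querying an Iceberg table from Spark SQL (e.g., on EMR via Livy).
# Assumes the session is already wired to an Iceberg catalog backed by S3 Tables;
# my_catalog.analytics.orders is a placeholder name.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-example").getOrCreate()

# Plain query against the Iceberg table
spark.sql("SELECT COUNT(*) FROM my_catalog.analytics.orders").show()

# Time travel: read the table as of an earlier snapshot timestamp
spark.sql(
    "SELECT * FROM my_catalog.analytics.orders TIMESTAMP AS OF '2025-06-01 00:00:00' LIMIT 10"
).show()
```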

📚 Want to set up Iceberg-backed pipelines? Docs to get started: https://artie.com/docs/destinations/iceberg/s3tables

May 14, 2025

S3 Iceberg Destination (Beta)

S3 Iceberg is now available in beta! This new destination uses AWS’s recently released S3 Tables support, allowing you to replicate directly into Apache Iceberg tables backed by S3. It’s a big unlock for teams building modern lakehouse architectures on open standards.

Column Inclusion Rules

You can now define an explicit allowlist of columns to replicate - ideal for PII or other sensitive data. This expands our column-level controls alongside column exclusion and hashing. Only the fields you specify get replicated. Everything else stays out.
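
Conceptually it’s a strict allowlist filter over each record; here’s a minimal sketch (the column names are hypothetical, and this isn’t Artie’s config format):

```python
# Sketch: column allowlisting - only explicitly listed fields are kept.
# The column names below are hypothetical examples.

ALLOWED_COLUMNS = {"id", "account_id", "created_at", "status"}  # note: no email, ssn, etc.

def filter_row(row: dict) -> dict:
    """Drop every field that isn't explicitly allowlisted."""
    return {col: val for col, val in row.items() if col in ALLOWED_COLUMNS}

row = {"id": 1, "email": "a@example.com", "ssn": "000-00-0000", "status": "active"}
print(filter_row(row))  # {'id': 1, 'status': 'active'}
```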

Autopilot for New Tables

Stop manually hunting for new tables in your source DB. Autopilot finds and syncs them for you - zero config required. Turn it on via:

Deployment → Destination Settings → Advanced Settings → “Auto-replicate new tables”

Data Quality: Rows Affected Checks

To further enhance the data integrity built into our pipeline, we’ve added another guardrail: verifying the number of rows affected during each database operation.

For example, during merge steps (such as in Snowflake), we confirm ROWS_LOADED from copy commands and validate the totals for inserted, updated, or deleted rows. This reinforces the robustness of our data replication process and gives us another way to catch issues early and ensure replication integrity.
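
In spirit, the guardrail is a simple reconciliation of expected vs. reported counts. Here’s a minimal sketch (illustrative only, with hypothetical names, not the actual implementation):

```python
# Sketch: verify that warehouse-reported row counts match what we expected to apply.
# The function name and result shape are hypothetical.

def check_rows_affected(expected: int, inserted: int, updated: int, deleted: int) -> None:
    """Raise if the reported totals don't add up to the rows we sent."""
    reported = inserted + updated + deleted
    if reported != expected:
        raise RuntimeError(
            f"Row count mismatch: sent {expected} rows, warehouse reported {reported} "
            f"(inserted={inserted}, updated={updated}, deleted={deleted})"
        )

# e.g., after a merge step, using counts parsed from the warehouse's result:
# check_rows_affected(expected=len(batch), inserted=res.inserted, updated=res.updated, deleted=res.deleted)
```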

Read Once, Write Many

We recently launched the ability to read once and write to multiple destinations. This means you no longer need multiple replication slots on your source database.

For example, by reading data just once from your Postgres instance and simultaneously replicating it to Snowflake and Redshift, you reduce database overhead and simplify replication architecture.

Multi-Data Plane Support

Artie now supports hosting pipelines across multiple data planes, whether you’re on our cloud or using your own (BYOC) infrastructure.

For example, run one pipeline from Postgres to Snowflake in AWS US-East-1 and another from MySQL to Snowflake in AWS US-West-2.

Oracle Fan-in

With our Oracle Fan-in feature, you can now easily replicate data from thousands of Oracle sources - without painful manual setups or infrastructure overload. Fan-in reduces your Kafka topic sprawl, lowers infrastructure costs, and simplifies real-world, complex data replication.