Multi-Step Merge: Unlocking 3x Replication Throughput for Large Tables

Robin Tang
Updated on
March 19, 2025
Product spotlight

Merge operations are notoriously expensive because they require scanning and rewriting large volumes of data, which results in higher latency for large tables. One way to keep up is to scale up the compute layer, adding more compute resources to scan and rewrite data files faster. However, this drives up compute costs and increases your total cost of ownership.

An alternative is to increase the size of each merge so that more data is processed in a single operation. This strategy spreads the inherent overhead of each merge across more data and enables us to replicate up to 3x more data without scaling up the compute layer.

What is "multi-step merge"?

Multi-step merge (MSM) is a way for us to iteratively land data into an intermediary staging table. This lets us build up a large enough staging table before invoking a merge against the target table.

Customers could previously configure our merge frequency by specifying the following variables, and we would trigger a merge based on whichever came first:

  1. Flush time (in seconds)
  2. Number of rows [1]
  3. Data size [2]
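
The "whichever came first" rule above can be sketched as a simple check over three thresholds. This is a minimal illustration, not Artie's implementation: the `FlushPolicy` class, its field names, and the threshold values are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class FlushPolicy:
    """Hypothetical thresholds; names and units are illustrative, not Artie's settings."""
    flush_interval_seconds: float
    max_rows: int
    max_bytes: int


def should_flush(policy: FlushPolicy, seconds_since_flush: float,
                 buffered_rows: int, buffered_bytes: int) -> bool:
    """Return True as soon as any one threshold is reached (whichever comes first)."""
    return (
        seconds_since_flush >= policy.flush_interval_seconds
        or buffered_rows >= policy.max_rows
        or buffered_bytes >= policy.max_bytes
    )


# Made-up thresholds: 30 seconds, 50,000 rows, or 50 MiB.
policy = FlushPolicy(flush_interval_seconds=30, max_rows=50_000,
                     max_bytes=50 * 1024 * 1024)

# The row limit is hit long before the timer expires, so we flush.
print(should_flush(policy, seconds_since_flush=5,
                   buffered_rows=60_000, buffered_bytes=1_024))  # True

# No threshold reached yet, so we keep buffering.
print(should_flush(policy, seconds_since_flush=5,
                   buffered_rows=100, buffered_bytes=1_024))     # False
```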

Customers with MSM enabled can now also specify a flush count, which controls how many times Artie lands data into the staging table before merging into the target table.
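
To make the role of flush count concrete, here is a toy in-memory model. It is a sketch under stated assumptions rather than a description of Artie's internals: the `MultiStepMerger` class and its attributes are hypothetical, and the tables are plain dictionaries keyed by primary key, which collapses repeated keys as a simplification.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, str]  # (primary key, value); a hypothetical row shape


class MultiStepMerger:
    """Toy model: land batches in a staging table, merge after `flush_count` flushes."""

    def __init__(self, flush_count: int) -> None:
        if flush_count < 1:
            raise ValueError("flush_count must be at least 1")
        self.flush_count = flush_count
        self.staging: Dict[int, str] = {}  # stands in for the staging table
        self.target: Dict[int, str] = {}   # stands in for the target table
        self.flushes_since_merge = 0
        self.merges = 0

    def flush(self, batch: List[Row]) -> None:
        # Land the batch in staging; a later row for a key overwrites an earlier one.
        for key, value in batch:
            self.staging[key] = value
        self.flushes_since_merge += 1
        if self.flushes_since_merge >= self.flush_count:
            self.merge()

    def merge(self) -> None:
        # The expensive step: one merge applies everything accumulated in staging.
        self.target.update(self.staging)
        self.staging.clear()
        self.flushes_since_merge = 0
        self.merges += 1


msm = MultiStepMerger(flush_count=3)
for i in range(6):
    msm.flush([(i, f"v{i}"), (0, f"latest-{i}")])

print(msm.merges)       # 2
print(len(msm.target))  # 6
```

With a flush count of 3, six flushes trigger only two merges against the target, where a flush count of 1 would trigger six.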

When and why should you use MSM?

MSM is ideal for large tables that receive a high volume of changes (at least a few billion per month). By aggregating these changes into larger merge batches, MSM spreads the inherent per-merge overhead across more data.
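
The overhead argument can be illustrated with a deliberately simple cost model. Everything here is an assumption made for the example: the `total_merge_cost` function, the split into a fixed cost per merge and a cost per flush, and the numbers, which are arbitrary units and not measurements.

```python
def total_merge_cost(num_flushes: int, flush_count: int,
                     fixed_cost_per_merge: float, cost_per_flush: float) -> float:
    """Cost of `num_flushes` flushes when a merge runs every `flush_count` flushes.

    Units are arbitrary; both cost parameters are hypothetical.
    """
    merges = -(-num_flushes // flush_count)  # ceiling division
    return merges * fixed_cost_per_merge + num_flushes * cost_per_flush


# Merge after every flush versus after every fourth flush (invented numbers).
baseline = total_merge_cost(12, flush_count=1,
                            fixed_cost_per_merge=10.0, cost_per_flush=2.0)
batched = total_merge_cost(12, flush_count=4,
                           fixed_cost_per_merge=10.0, cost_per_flush=2.0)
print(baseline, batched)  # 144.0 54.0
```

In this model the data-dependent work stays the same and only the number of times the fixed cost is paid goes down. The actual gain depends on how large that fixed cost is relative to the rest of the work.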

Key benefits are:

  1. Increased throughput. MSM can increase replication throughput by up to 3x by reducing the number of merge operations.
  2. Cost efficiency. You avoid the need to scale up compute resources.
  3. Reduced ingestion lag. The increased throughput helps keep up with incoming writes and minimizes latency between your source and destination.

How you can enable MSM with Artie

We are doing a phased rollout. If you'd like early access to the feature, please get in touch with us at [email protected].

The setting is available under the advanced settings for tables and deployments.

[1], [2]: If we are merging, this value is de-duplicated. If we are appending (via history mode), it is not de-duplicated.
