
How it works

Flushing is a critical part of Artie’s data pipeline that determines when and how data gets written to your destination.

1. Data buffering

  • Artie’s reading process will read changes from your source database and publish them to Kafka
  • Artie’s writing process will read messages from Kafka and write them to your destination
  • Messages are temporarily stored in memory and deduplicated based on primary key(s) or unique index
  • Multiple changes to the same record are merged to reduce write volume
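
To picture the deduplication step, here is a minimal sketch of an in-memory buffer keyed by primary key, where a later change to the same row replaces the earlier one. The type and field names are assumptions for illustration, not Artie’s actual implementation:

```go
package main

import "fmt"

// ChangeEvent is a simplified stand-in for a CDC message read from Kafka;
// the real message format carries more metadata than this.
type ChangeEvent struct {
	PrimaryKey string         // value of the primary key (or unique index)
	Payload    map[string]any // column values after the change
}

// Buffer deduplicates events by primary key: if the same row changes twice
// before a flush, only the latest version is kept, reducing write volume.
type Buffer struct {
	rows map[string]ChangeEvent
}

func NewBuffer() *Buffer {
	return &Buffer{rows: make(map[string]ChangeEvent)}
}

func (b *Buffer) Add(e ChangeEvent) {
	b.rows[e.PrimaryKey] = e // a later change overwrites the earlier one
}

func (b *Buffer) Len() int { return len(b.rows) }

func main() {
	b := NewBuffer()
	b.Add(ChangeEvent{PrimaryKey: "42", Payload: map[string]any{"status": "pending"}})
	b.Add(ChangeEvent{PrimaryKey: "42", Payload: map[string]any{"status": "shipped"}})
	fmt.Println(b.Len()) // 1: two changes to row 42 merged into a single write
}
```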

2. Flush trigger evaluation

  • Artie continuously monitors three flush conditions
  • When any condition is met, a flush is triggered
  • Reading from Kafka pauses during the flush operation

3. Data loading

  • Buffered data is written to your destination in an optimized batch
  • After completion, Artie will commit the offset and resume reading from Kafka
  • The cycle repeats for continuous data flow
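
The ordering of buffer, flush, commit, and resume can be pictured roughly as below. This is a sketch under assumptions: `Consumer`, `Message`, and `writeBatch` are hypothetical stand-ins rather than Artie’s actual types, and only the time and row-count rules are checked inline to keep it short:

```go
package flushcycle

import "time"

// Minimal stand-ins for the pieces of the pipeline; these names are
// assumptions for illustration.
type Event struct {
	PrimaryKey string
	Payload    map[string]any
}

type Message struct {
	Offset int64
	Event  Event
}

// Consumer abstracts the Kafka reader.
type Consumer interface {
	FetchMessage() (Message, error)
	CommitOffset(offset int64) error
}

// Run shows one cycle: buffer until a flush condition is met, pause
// reading, write the batch, commit the offset, then resume.
func Run(c Consumer, writeBatch func(map[string]Event) error, maxInterval time.Duration, maxRows int) error {
	buffer := make(map[string]Event) // deduplicated by primary key
	lastFlush := time.Now()
	var lastOffset int64

	for {
		msg, err := c.FetchMessage()
		if err != nil {
			return err
		}
		buffer[msg.Event.PrimaryKey] = msg.Event
		lastOffset = msg.Offset

		// Only time and row count are checked here; the byte-size rule
		// (see Conditions below) would be evaluated the same way.
		if time.Since(lastFlush) >= maxInterval || len(buffer) >= maxRows {
			// Reading is paused: no FetchMessage call happens until the
			// batch lands in the destination and the offset is committed.
			if err := writeBatch(buffer); err != nil {
				return err
			}
			if err := c.CommitOffset(lastOffset); err != nil {
				return err
			}
			buffer = make(map[string]Event) // resume with an empty buffer
			lastFlush = time.Now()
		}
	}
}
```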

Conditions

Artie evaluates three conditions to determine when to flush data. Any one of these conditions will trigger a flush:

Time elapsed

Maximum time in seconds — Ensures data freshness even during low-volume periods

Message count

Number of deduplicated messages — Based on unique primary keys or unique index.

Byte size

Total bytes of deduplicated data — Actual payload size after deduplication
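
The three conditions can be pictured as a single check that returns whichever rule fired. This is a sketch with assumed field names and reason strings, not Artie’s configuration schema; the returned string plays the role of the flush reason surfaced in the analytics portal:

```go
package flushrules

import "time"

// FlushRules holds the three thresholds; meeting any one triggers a flush.
// Field names and reason strings are illustrative.
type FlushRules struct {
	MaxInterval time.Duration // time elapsed since the last flush
	MaxMessages int           // deduplicated rows (unique primary keys)
	MaxBytes    int64         // payload bytes after deduplication
}

// Evaluate returns the first condition that is met, or "" if the buffer
// can keep accumulating.
func (r FlushRules) Evaluate(sinceLastFlush time.Duration, dedupedRows int, dedupedBytes int64) string {
	switch {
	case sinceLastFlush >= r.MaxInterval:
		return "time elapsed"
	case dedupedRows >= r.MaxMessages:
		return "message count"
	case dedupedBytes >= r.MaxBytes:
		return "byte size"
	}
	return ""
}
```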

Setting optimal rules

The right flush configuration depends on your destination type, data volume, and latency requirements.
For transactional databases like PostgreSQL, MySQL, or SQL Server:

Recommended approach

Smaller, frequent flushes work well because:
  • Row-based storage handles individual record operations efficiently
  • Native UPSERT/MERGE operations minimize overhead
Example configuration:
  • Messages: 1,000-5,000 records
  • Bytes: 10-50 MB
  • Time: 30-60 seconds
For analytical databases like Snowflake, Databricks, BigQuery, or Redshift:
Setting the flush rules too low can hinder throughput and cause latency spikes:
  • Fixed overhead costs: Each flush has connection/metadata overhead that dominates processing time with small batches
  • Inefficient resource usage: OLAP systems are designed for large parallel operations, not frequent micro-operations
  • Storage and query degradation: Many small files hurt compression, increase metadata lookups, and trigger excessive compaction
  • Recommendation: For OLAP destinations, set higher row/byte limits and rely on time-based triggers

Recommended approach

Larger, less frequent flushes are optimal because:
  • Columnar storage benefits from batch processing
  • Reduced metadata overhead and better compression
  • More efficient query performance with fewer small files
Example configuration:
  • Messages: 25,000-500,000 records
  • Bytes: 50-500 MB
  • Time: 3-15 minutes
Note: We also offer multi-step merge, which can be enabled for tables with high write throughput where you want extremely large flush batches (1 GB+).
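
To make the two sets of recommendations above concrete, here is a sketch that captures them as per-destination presets. The `FlushPreset` struct and the specific values (chosen from within the recommended ranges) are illustrative only; actual flush rules are configured through Artie, not in code:

```go
package presets

import "time"

// FlushPreset mirrors the three flush rules; the struct and the notion of
// "presets" are assumptions for illustration.
type FlushPreset struct {
	MaxMessages int
	MaxBytes    int64
	MaxInterval time.Duration
}

var (
	// Transactional destinations (PostgreSQL, MySQL, SQL Server):
	// smaller, more frequent batches.
	oltpPreset = FlushPreset{
		MaxMessages: 5_000,
		MaxBytes:    50 << 20, // 50 MB
		MaxInterval: 60 * time.Second,
	}

	// Analytical destinations (Snowflake, Databricks, BigQuery, Redshift):
	// larger batches, leaning on the time-based trigger.
	olapPreset = FlushPreset{
		MaxMessages: 100_000,
		MaxBytes:    500 << 20, // 500 MB
		MaxInterval: 10 * time.Minute,
	}
)
```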

Best practices

Start conservative

Begin with smaller flush values and increase based on observed performance and destination capabilities.

Validate through flush metrics

As you experiment and fine-tune the flush rules, you can see which rule triggered each flush as the reason in the “Flush Count” graph in the analytics portal.

Monitor and adjust

Track flush frequency, batch sizes, and end-to-end latency to optimize over time.

Consider your SLA

The time threshold should align with your data freshness requirements and business SLAs.

Advanced

Flush reason in the analytics portal