Soft Partitioning

Soft partitioning is a logical partitioning strategy implemented at the ingestion layer rather than relying on native database partitioning features. Instead of storing all rows for a given entity in a single large table, we split data into time-based “partition” tables and expose them through a unified view.

Currently Supported Destinations

  • Redshift
  • Snowflake (coming soon)

Why Soft Partitioning?

While some databases support native partitioning, others have limitations that make it less effective for high-throughput ingestion workloads:

PostgreSQL

  • Supports native partitioning with good performance
  • Well-suited for both ingestion and query workloads

BigQuery

  • Supports native partitioning and clustering, but partition pruning is most effective for query workloads rather than high-throughput ingestion.
  • Partitioning in BigQuery is typically based on a timestamp or date column, but frequent ingestion of small batches can lead to many partitions and higher costs.
  • Soft partitioning can help optimize ingestion performance and cost by controlling partition granularity and data routing at the ingestion layer.
  • For instructions on native partitioning, see our BigQuery Partitioning guide.
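
For reference, native time partitioning in BigQuery is declared on the table itself. A minimal sketch (the dataset name and column types are illustrative, not generated by Artie):
CREATE TABLE analytics.user_events (
  uuid STRING NOT NULL,
  event_name STRING NOT NULL,
  user_id INT64 NOT NULL,
  created_at TIMESTAMP NOT NULL,
  properties JSON
)
PARTITION BY DATE(created_at)  -- one partition per day of created_at
CLUSTER BY user_id;            -- co-locate rows for the same user within each partition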

Snowflake

  • Has cluster keys that physically co-locate related rows in micro-partitions
  • Re-clustering is asynchronous and can be expensive (it consumes serverless credits outside of any warehouse)
  • Query pruning is less predictable
  • More optimized for query performance than ingestion throughput
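
Clustering in Snowflake is a table property, and re-clustering happens in the background. A minimal sketch (the clustering expression is illustrative):
ALTER TABLE user_events CLUSTER BY (TO_DATE(created_at));  -- co-locate rows by day
-- Automatic Clustering then re-sorts micro-partitions asynchronously,
-- consuming serverless credits independently of any running warehouse.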

Redshift

  • Has distribution keys (to control data placement across nodes) and sort keys (to allow pruning)
  • More useful for joins and analytical queries
  • Not well-suited to high-throughput ingestion that also needs efficient pruning
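
For context, distribution and sort keys in Redshift are fixed at table creation time. A minimal sketch using a user_events-style table (column types are illustrative; Redshift has no native UUID or JSONB types):
CREATE TABLE user_events (
  uuid VARCHAR(36) NOT NULL,
  event_name VARCHAR(255) NOT NULL,
  user_id BIGINT NOT NULL,
  created_at TIMESTAMP NOT NULL,
  properties SUPER
)
DISTKEY (user_id)      -- controls row placement across nodes
SORTKEY (created_at);  -- enables zone-map pruning on time filters
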
Soft partitioning aims to bridge these gaps by delivering:
  • Predictable ingestion performance
  • Explicit partition control
  • Simpler data management
  • Freedom from underlying database partitioning constraints

How It Works

Consider a table user_events with the following columns:
CREATE TABLE user_events (
  uuid UUID NOT NULL,
  event_name VARCHAR(255) NOT NULL,
  user_id BIGINT NOT NULL,
  created_at TIMESTAMP NOT NULL,
  properties JSONB,
  PRIMARY KEY (uuid, created_at)
);

1. Dynamic Table Creation

The partition scheme can be by month or day, and is keyed off the created_at column. Artie will dynamically create tables for each partition period. For example, with monthly partitions:
user_events_2025_08
user_events_2025_09
user_events_2025_10
...etc
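
Each partition table shares the schema of the base table. On Redshift, the generated DDL is conceptually equivalent to the sketch below (Artie emits the actual statements; the LIKE form is illustrative):
CREATE TABLE IF NOT EXISTS user_events_2025_09 (LIKE user_events_default);
CREATE TABLE IF NOT EXISTS user_events_2025_10 (LIKE user_events_default);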

2. Automatic Data Routing

As new data arrives, Artie writes each row to the appropriate partition table based on its created_at timestamp. Future partitions (e.g., next 2-3 months) are pre-created and added to the unified view.
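
Conceptually, each batch is split by the month of created_at and written to the matching partition table. A simplified sketch of one per-partition write on Redshift (the staging table name and exact MERGE shape are illustrative, not Artie's actual statements):
MERGE INTO user_events_2025_10
USING staging_user_events_2025_10 AS s  -- batch rows whose created_at falls in October 2025
ON user_events_2025_10.uuid = s.uuid
  AND user_events_2025_10.created_at = s.created_at
WHEN MATCHED THEN UPDATE SET
  event_name = s.event_name,
  user_id = s.user_id,
  properties = s.properties
WHEN NOT MATCHED THEN INSERT (uuid, event_name, user_id, created_at, properties)
  VALUES (s.uuid, s.event_name, s.user_id, s.created_at, s.properties);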

3. Unified View

Artie automatically maintains a view for user_events that unions all partitions and updates the definition as new partitions are created:
CREATE OR REPLACE VIEW user_events AS (
  SELECT
    uuid,
    event_name,
    user_id,
    created_at,
    properties,
    'user_events_default' AS artie_full_partition_name
  FROM user_events_default
  UNION ALL
  SELECT
    uuid,
    event_name,
    user_id,
    created_at,
    properties,
    'user_events_2025_08' AS artie_full_partition_name
  FROM user_events_2025_08
  UNION ALL
  SELECT
    uuid,
    event_name,
    user_id,
    created_at,
    properties,
    'user_events_2025_09' AS artie_full_partition_name
  FROM user_events_2025_09
  UNION ALL
  SELECT
    uuid,
    event_name,
    user_id,
    created_at,
    properties,
    'user_events_2025_10' AS artie_full_partition_name
  FROM user_events_2025_10
);
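
Queries treat user_events as a single table, and the artie_full_partition_name column records which partition each row came from. For example:
SELECT event_name, COUNT(*) AS events
FROM user_events
WHERE artie_full_partition_name = 'user_events_2025_10'  -- restrict to the October partition
GROUP BY event_name;
Whether the planner skips the other UNION ALL branches for a constant predicate like this depends on the destination, so filtering on created_at as well can help it prune.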

4. Compaction

As the number of partitions grows, Artie compacts old partitions by merging their data back into the _default table. You can configure how many previous partitions are retained before compaction is triggered.
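
A compaction pass for one expired partition is conceptually equivalent to the sketch below (Artie runs this and the corresponding view update automatically):
-- Fold the expired partition back into the default table
INSERT INTO user_events_default
SELECT uuid, event_name, user_id, created_at, properties
FROM user_events_2025_08;

-- Drop the partition table; the user_events view is then recreated without it
DROP TABLE user_events_2025_08;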

Target Workloads

While soft partitioning can work for any dataset with a timestamp column, it’s particularly effective for:

Time-Series, Append-Only Data

  • Event streaming
  • IoT sensor data
  • Transaction logs
  • Application metrics

High-Volume Ingestion Pipelines

  • Where the cost of updating large tables is high
  • Merge workloads that often target recent slices of data
  • Scenarios requiring predictable ingestion performance

Benefits