Being an avid data warehouse user, I was often frustrated by the data latency between the data warehouse and the production data that sits in transactional databases.
Depending on where I worked, the data lag varied from multiple hours to days, with the lag exacerbated by the volume of data.
Given that the data warehouse is a platform, there are many use cases that can be enabled or empowered by a shorter data lag. I'll go over a few examples below.
Operations-heavy companies typically have constantly changing business processes, and ops people often leverage a variety of no-code tools such as Zapier, Typeform, Retool and others to keep up with process changes.
These tools can stack on top of each other and reference data in the data warehouse. The efficacy of the solutions built from these no-code tools is then partially impacted by the data replication lag.
Within lifecycle marketing, it is common to purchase marketing automation tools such as Iterable, Braze, Klaviyo, etc.
Each of these tools has its own version of what a user model and its events should look like, so that teams can create templates like:
Hello {{first_name}}!
Examples of additional user attributes that may be sent:

- Paid marketing. Once the customer requests a ride, we'd like to send as many customer traits as possible to destinations like Google and Facebook so their algorithms can find more look-alikes.
- Nurture campaigns. When a customer signs up on our website, we'd like to put them in a drip campaign that welcomes and onboards the customer. We'd like to reference dynamic fields like product iterations. Did they do anything more than just sign up? Did they play around with our platform?
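To make this concrete, here is a small, hypothetical sketch of the kind of user-traits payload a team might sync from the warehouse to one of these tools. The field names and the `send_traits` helper are illustrative assumptions, not any specific vendor's API.

```python
# Hypothetical user-traits payload assembled from warehouse data and pushed
# to a marketing destination. Field names and send_traits are illustrative,
# not a specific tool's API.
user_traits = {
    "user_id": "u_123",
    "first_name": "Jane",                # powers templates like "Hello {{first_name}}!"
    "signed_up_at": "2024-05-01T17:03:00Z",
    "requested_first_ride": True,        # look-alike signal for paid marketing
    "completed_onboarding": False,       # drives the nurture/drip branch
    "product_iteration": "v2-checkout",  # dynamic field referenced in campaigns
}

def send_traits(destination: str, traits: dict) -> None:
    # In practice this would call the destination's user-update API
    # (e.g. an HTTP POST); stubbed out here for illustration.
    print(f"update user at {destination}: {traits}")

send_traits("iterable", user_traits)
```

The fresher the warehouse copy of these attributes, the more accurately the downstream campaigns and look-alike audiences reflect what the customer just did.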
Every company has a subset of tables that are critical to the business. These tables are typically used by multiple teams and are the source of truth for the company.
Having them replicated to data warehouses effortlessly and in real time is a value multiplier.
| Industry | Critical Tables |
|---|---|
| E-commerce | Orders, Customers, Products |
| SaaS | Users, Accounts, Subscriptions |
| Marketplaces | Orders, Customers, Products |
| Real Estate | Inventory, Offers, Customers |
As many have chimed in here, it's not that real-time replication isn't useful. It is extremely valuable, but it's often far too complex for any one engineering team to dedicate resources to setting up.
It's also hard to maintain, and streaming pipeline errors are extremely unforgiving. A typical data engineering team has many other pipelines to look after, so it's difficult to justify this level of investment.
Typically, companies are solving this problem today in one of the following ways:

1. Running periodic full data dumps from the production database into the warehouse.
2. Incremental syncing, i.e. running `SELECT * FROM table WHERE updated_at > last_synced_at` and syncing only the deltas.
3. Buying a third-party replication tool.
Each of these has obvious drawbacks:

1. Data dumps are extremely resource intensive and can take hours to complete.
2. Incremental syncing is error-prone, cannot record deletes, and can also be slow.
3. Third-party tools are expensive, limited in their capabilities, not set up for scale, and can be difficult to use.
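As an illustration of the second approach and its blind spot, here is a minimal sketch of an incremental sync job. The `rides` table, the connection objects, and the `upsert` helper are assumptions for the example, not any particular company's setup.

```python
# Minimal sketch of incremental syncing via updated_at watermarks.
# The rides table, the connection objects, and the upsert helper are
# illustrative assumptions.
from datetime import datetime, timezone

def sync_deltas(source, warehouse, last_synced_at: datetime) -> datetime:
    cutoff = datetime.now(timezone.utc)
    changed_rows = source.execute(
        "SELECT * FROM rides WHERE updated_at > %s AND updated_at <= %s",
        (last_synced_at, cutoff),
    ).fetchall()
    # Upsert the changed rows into the warehouse copy of the table.
    warehouse.upsert("rides", changed_rows)
    # Blind spot: rows hard-deleted in the source never appear in this
    # result set, so the warehouse copy silently drifts out of sync.
    return cutoff  # becomes last_synced_at for the next run
```

Each run also scans the table by updated_at and re-copies every touched row, which is why these jobs tend to run on a schedule measured in hours rather than seconds.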
As technologists, we believe that approaching zero replication lag between OLTP and OLAP databases should be the norm and widely accessible.
Artie enables OLTP data to be streamed continuously to the data warehouse, reducing replication lag from hours or days down to seconds. This allows companies to unlock new use cases and empowers their teams to make better decisions.
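To give a feel for the general pattern, here is an illustrative sketch of consuming change data capture (CDC) events and applying them to a destination table. It is not Artie's actual implementation; the Debezium-style event shape ("op", "before", "after") is an assumption, and a real warehouse sink would run a MERGE/upsert rather than a dict update.

```python
# Illustrative sketch of applying CDC change events to a destination table.
# The Debezium-style event shape ("op", "before", "after") is an assumption;
# a real warehouse sink would run a MERGE/upsert instead of a dict update.
import json

warehouse_rides: dict[int, dict] = {}  # stand-in for the destination table, keyed by primary key

def apply_change_event(raw_event: str) -> None:
    event = json.loads(raw_event)
    op = event["op"]  # "c" = insert, "u" = update, "d" = delete, "r" = snapshot read
    if op in ("c", "u", "r"):
        row = event["after"]
        warehouse_rides[row["id"]] = row  # upsert
    elif op == "d":
        warehouse_rides.pop(event["before"]["id"], None)  # deletes are captured too

apply_change_event('{"op": "c", "after": {"id": 1, "status": "requested"}}')
apply_change_event('{"op": "u", "after": {"id": 1, "status": "completed"}}')
apply_change_event('{"op": "d", "before": {"id": 1, "status": "completed"}}')
print(warehouse_rides)  # {} -- the delete reached the destination
```

Because every insert, update, and delete flows through as an event, the destination stays in lockstep with the source without full dumps or updated_at bookkeeping.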
This is how Artie works under the hood:
To support this workload, Artie Transfer has the following features built in:
As we mentioned before, a big reason for low adoption of CDC replication is that it’s complex and requires a ton of engineering investment. We've worked tirelessly on making the onboarding experience seamless and intuitive. Simply enter your source details, highlight the tables you want to sync, enter your destination details, and we will spin up all the infrastructure and handle backfills. We built Artie so teams can set up CDC pipelines in minutes.