How to synchronize with a data warehouse

Hi there,

for in-depth data analysis we’re using a data warehouse which mirrors the ERPNext database using Meltano ETL (https://meltano.com/). Meltano supports full table, incremental and log-based synchronization.
Full table does not work as our database is too large, because of this we opted for incremental sync. This method utilizes the modified column on the DocType’s tables to run the incremental sync.

We’re facing two issues with this method

  • Frappe sometimes uses hard-delete which causes orphans in the warehouse
  • Frappe does not update the modified date in some cases, for instance if a document is renamed, which causes inconsistency

Does anyone of you also use database replication into a data warehouse? If yes, how do you keep data in sync?

Looking forward for your ideas
Patrick

Hi @patrick_aisler ,

For a Data Warehouse, I’m doing the following:

  1. MariaDB Replication to a different VPS device. This is purely for performance reasons. I don’t want my ETL queries targeting the Production ERP database, and potentially harming performance.

  2. ETL Tool (homemade) is reading from the MariaDB Replica, and writing to the data warehouse tables.

Deletes are being detected by comparing business key values in DW, versus the incoming IDs during an ETL run. If a business_key exists in the DW, but not the ERP? Then I assume the ERP deleted it.

Renaming is always problematic in ETL. Luckily, we rarely use the Rename function in the ERP. So it’s not something I’m trying to solve for (yet). Someday I solve by having the “Rename” function write to a new table, so I can keep track of Before vs. After. Then teach the ETL tool to leverage that information.

I’m assuming all Creation or Modified dates are unreliable, and not putting any particular DW logic around them.

2 Likes