
I'm working on a tool to replicate data from several different source DBs (all used by the same company, but from different vendors and on different versions) to one centralized analysis platform (for the purpose of this question: just another database). We don't need history, only snapshots of selected tables/schemas in the source databases. These are all production databases with a lot of updates in each window between replications.

The choice we have to make is whether to do incremental updates via transaction logs or just read the data from the actual tables. We can't change anything in the source databases, so things like adding a last_replicated timestamp are out of the question.

When searching for best practices, I found a lot of tools, but not much information about which method is most efficient in which circumstances. My gut says that simply taking a full snapshot and throwing away the older copy once the newer one is done is the easiest way to do it, and that implementing incremental updates through the transaction logs of several different vendors is probably a pain and not that much faster per run. This would of course depend on the number of updates on the source databases. But I can't find anything to back this up.
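
To make the snapshot approach concrete, this is roughly the shape I have in mind (just a minimal sketch: the connection strings, schema/table names, and the sqlalchemy/pandas choice are placeholders, not a finished design):

```python
# Rough sketch of the "full snapshot and swap" idea, using SQLAlchemy + pandas
# purely for illustration; connection strings and table names are made up.
import pandas as pd
from sqlalchemy import create_engine, text

source = create_engine("postgresql://user:pass@source-host/proddb")  # one of the source DBs
target = create_engine("postgresql://user:pass@analysis-host/dwh")   # central analysis DB


def snapshot_table(schema: str, table: str, chunk_rows: int = 50_000) -> None:
    """Copy the current contents of one source table into a staging table on the
    target, then swap it into place so readers never see a half-loaded snapshot."""
    staging = f"{table}__staging"

    # 1. Stream the source table into the staging table in chunks
    #    (no quoting/injection handling here -- sketch only).
    with source.connect() as src:
        chunks = pd.read_sql_query(
            text(f"SELECT * FROM {schema}.{table}"), src, chunksize=chunk_rows
        )
        for i, chunk in enumerate(chunks):
            chunk.to_sql(
                staging, target, schema=schema,
                if_exists="replace" if i == 0 else "append", index=False,
            )

    # 2. Swap the staging table into place and drop the previous snapshot.
    with target.begin() as tgt:
        tgt.execute(text(f"DROP TABLE IF EXISTS {schema}.{table}"))
        tgt.execute(text(f"ALTER TABLE {schema}.{staging} RENAME TO {table}"))


snapshot_table("sales", "orders")
```

The point of loading into a staging table and renaming at the end is that the previous snapshot stays queryable until the new one is complete.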

Can anyone point me to relevant info?

(I am not looking for tools, since this functionality would be part of a suite we are building in-house.)

venite
  • Which database system(s) do you need to manage here? – J.D. Oct 07 '21 at 13:43
  • Definitely Oracle, Postgres, Greenplum, MS SQL, MySQL and MariaDB. Others might be added later. It's a big company where, historically, different departments were relatively free to make their own choices about their IT stack. – venite Oct 11 '21 at 08:31

1 Answer


Unfortunately, what you need to support is just too broad a selection to get a specific answer here. For example, I've never even heard of Greenplum, and I'd wager 99% of the other users here haven't either.

It will require a lot of work to set this up, and you may run into cases where replicating from some of the lesser-known systems simply isn't possible, or at least not via traditional database features. In those cases you may end up writing a lot of application code instead.

Anyway, my advice would be to centralize onto an established enterprise database system. For example, Microsoft SQL Server has a multitude of ways to pull data in from heterogeneous sources; at its simplest, you can set up a Linked Server to almost any other data source. I'm sure Oracle offers similar flexibility for connecting to heterogeneous data sources. With less established systems, you'll likely find it harder to set up a proper means of ingesting the data from all the different sources you need to support.

And to answer your question: it will almost always be faster, and put less load on your servers, to update the data incrementally, generally via the transaction log or similar means, rather than scheduling routine snapshots of the data. But if the data is small and/or doesn't change often (so the snapshot schedule can run less frequently), that's a viable option too.
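
To illustrate the difference in shape (hedging heavily, since every vendor exposes its log differently: SQL Server CDC, Postgres logical decoding, the MySQL binlog, and so on): with log-based capture, the work per run is proportional to the number of changes rather than the size of the tables. The apply side of an incremental run might look roughly like the sketch below; `read_changes` is a hypothetical wrapper around whichever vendor mechanism you use, and the connection string and table names are made up.

```python
# Sketch of the apply side of an incremental (log-based) run. The change source
# is deliberately abstract: read_changes() stands in for whatever the vendor
# offers (SQL Server CDC, Postgres logical decoding, the MySQL binlog, ...).
from dataclasses import dataclass
from typing import Iterable, Optional
from sqlalchemy import create_engine, text

target = create_engine("postgresql://user:pass@analysis-host/dwh")  # placeholder


@dataclass
class Change:
    op: str              # "insert", "update" or "delete"
    key: dict            # primary-key columns and their values
    row: Optional[dict]  # full row image for insert/update, None for delete


def apply_changes(table: str, changes: Iterable[Change]) -> None:
    """Apply a batch of change events to the replicated copy of one table.
    Work per run is proportional to the number of changes, not the table size."""
    with target.begin() as conn:
        for c in changes:
            key_filter = " AND ".join(f"{col} = :{col}" for col in c.key)
            # Delete-then-insert keeps the upsert vendor-neutral for this sketch.
            conn.execute(text(f"DELETE FROM {table} WHERE {key_filter}"), c.key)
            if c.op != "delete":
                cols = ", ".join(c.row)
                params = ", ".join(f":{col}" for col in c.row)
                conn.execute(
                    text(f"INSERT INTO {table} ({cols}) VALUES ({params})"), c.row
                )


# apply_changes("orders", read_changes("orders", since=last_position))
# read_changes() and last_position are hypothetical -- they depend on the vendor.
```

Whether that extra machinery is worth it comes down to the ratio of changed rows to total rows per replication window on your particular sources.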

J.D.
  • Thank you for taking the time to answer! I suspected there wouldn't be a general answer - that's probably why it was so hard to google.

    I hadn't heard of Greenplum before I started working here either. Unfortunately I can't do anything about the stack. But fortunately I actually like writing lots of code :)

    – venite Oct 14 '21 at 18:35