Informix to Postgres, continuous data replication algorithm

Question

The master server is Informix, version varies from 9.40 to the latest, database is unlogged by design that can't be changed. Slave server is the latest PostgreSQL. Master and slave are separate machines, network latency is unpredictable. Master schema is statically defined, well known and does not change, so it's only the data that needs to be replicated. In the master, there are three types of tables:

Numeric data tables, usually one date column, one time column and 15-300 int columns keyed by 2-3 primary keys. The data is never changed, only added once in a set interval (15, 30, or 60 minutes) and deleted when the retention point is reached. Replication data set can be up to 80,000 rows but usually is in the range of hundreds. This data needs to be replicated one way, master to slave. There is about 30 tables of this type and they need to be replicated all at once and as fast as possible, typically in under one minute after new interval set has been committed to the master.
Mixed data tables, with date, time, int, and string types, 30-100 columns, again 2-3 primary keys. This data is also never changed, added continuously and is deleted when the retention point is reached. The data set is up to 100,000 rows per hour. One way replication is needed, master to slave. There are a few tables like that, less than 5 usually.
Mixed data tables, with int and string types, less than 10 columns, 2-3 primary keys. The data largely stays intact, with occasional additions, edits or deletions. The usual replication set size is unpredictable, but probably will be in low hundreds of rows. This data needs to be replicated both ways, as fast as possible. There are a few tables of this type, and they need to be synched independently.

I've been looking for an existing tool that could do what I need, but it looks like there is none that is open source. I'm probably going to write one for my needs, and I'm looking for advice from DB gurus on how to approach this task.

In my estimate, there's probably no single algorithm that would cover all the use cases so I may be in fact looking for two or three algorithms. Here's what I found so far:

Fire trigger on master changes, record row OIDs (does Informix have them?) to temp table, dump the changed rows to a file, transfer it and load up. Question: how to buffer the trigger? The master DB is unlogged (no transactions), so trigger will fire upon each INSERT. Additional strain on the master, not good.
Add a cron job on the slave that will pull latest date/time keys from the master, and if the data is newer, pull it. Problem: although the update interval is defined, in reality it's based on the data source clock (not master DB clock) which is guaranteed to vary from slave server clock. More of it, there can be several data sources, each with varying clocks, and the data needs to be replicated ASAP. The only way here that I see is to constantly poll the master from the slave, hoping that by the time the poll comes in, the data is all committed (no transactions, remember?). Kludgy, slow, not good.
Add Informix as foreign data wrapper in the Postgres and run queries directly instead of bothering with replication. Pros: simplicity. Cons: Informix connector seems to be in alpha stage, and the whole approach is an unknown factor at best.

I've been researching this topic for some time, and it seems that the core of the problem is the lack of transactions on the master side. If the master DB was logged, it would be much easier to replicate it, but without transactions the task suddenly becomes much more complicated. For one, how do I ensure that there are no dupes? Another one, how to avoid update loops in type 3 tables? Considering all that, how to make replication as fast-reacting as possible? I mean the delay between data update and sync start here, data transfer is another topic altogether.

Any input is appreciated.

Wow, so in Informix "unlogged" means "transactionless", not just "not crash safe" like it does in Pg. That makes things a lot harder. I was going to suggest that you look at porting trigger-based replication tools like PGQ and Londiste from Skytools, or maybe Bucardo, to support an Informix backend. That won't work if you don't have transactions, though. — Craig Ringer, Jul 13 '13 at 3:46
Personally I think I'd port the 1st DB over to PostgreSQL ;-) . Particularly given the work we've been doing on logical streaming replication (wiki.postgresql.org/wiki/BDR_User_Guide, still very alpha). — Craig Ringer, Jul 13 '13 at 3:48
@CraigRinger Unfortunately I can't port the master DB to Postgres, although I'd love to. Master(s) are legacy reporting systems that I need to aggregate the data from, and being legacy they can't be touched. I can add users, triggers and stored procs but doing anything big like schema changes is off the table. — Alex Tokarev, Jul 14 '13 at 19:11
Good luck, then. I suspect you're in a situation where there are too many overlapping and conflicting requirements - "change nothing", "unlogged", and "fast, reliable replication across two DB vendors". I think one or more of your requirements will have to give. — Craig Ringer, Jul 14 '13 at 23:31
@CraigRinger Well you got the picture right; I absolutely can not change the master in any significant way, because despite being legacy those systems are still actively used (and sold!) and they're made by a different vendor that just won't change a thing. David and Goliath situation, with me being on the tiny side. That said, if anything has to give it's "fast". I can live with relatively slow replication as long as it's 100% reliable. — Alex Tokarev, Jul 14 '13 at 23:37

Chris Travers · Answer 1 · 2013-11-01 08:07:05Z

If you can't change the master in any significant way you are going to have a heck of a time with any sort of replication. Your basic problem is that you have no real way to handle replicating changes in real time without tracking which changes have been replicated, and if you can't change the master, you can't add that. So the short answer is that replication is not a solution which can work for you. Given some of Informix's other features I would think twice about going about this as continuous replication.

This leads to other approaches. The big unknown factors are that networks may not be reliable enough to just link the databases. This could lead to transactions hanging while waiting for data off a high latency connection to all kinds of other problems. You might be able to get this to work with an odbc fdw and an informix provider or with DBI-Link and DBD::Informix, but this strikes me as a problem in your current environment. You could use these in a cron job to populate a second PostgreSQL server closer to your own location periodically, however and so I would not write the approach entirely off.

One way or another it seems to me you need to get a copy of the data to your PostgreSQL server. You may want to do an ETL job to import the data periodically. You may want to use a secondary postgresql server and FDW's or DBI-Link to pull in the data. But this is not likely to be real-time, it is not likely to be continuous.

The tl;dr is that your environment isn't really set up to do this. For my money I would recommend an ETL approach and accept that your slave will not be in sync with the master.

I'm not very familiar with the terminology; what do you mean by "doing an ETL job" to import the data? Please elaborate. — Alex Tokarev, Nov 1 '13 at 23:52
ETL is Export-Transform-Load, where you export data from one system, transform it to the other system and then load it. It is typically used when trying to connect transaction-processing systems with reporting systems as it gives you much more freedom in reporting design. — Chris Travers, Nov 2 '13 at 4:24

asked	1 year ago
viewed	253 times
active	1 year ago

current community

your communities

more stack exchange communities

Informix to Postgres, continuous data replication algorithm

1 Answer 1

Your Answer

Not the answer you're looking for? Browse other questions tagged postgresql informix database-replication or ask your own question.

Hot Network Questions

current community

your communities

more stack exchange communities

Informix to Postgres, continuous data replication algorithm

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged postgresql informix database-replication or ask your own question.

Related

Hot Network Questions