Depending on how many different data sets there are, one option would be to partition the tables per-dataset.
When a dataset is updated, `BEGIN` a new transaction, `TRUNCATE` the table, `COPY` the new data into it, and `COMMIT`. PostgreSQL has an optimisation where `COPY`ing into a table that's been `TRUNCATE`d in the same transaction does much less I/O if you're using `wal_level = minimal` (the default before PostgreSQL 10; newer versions default to `replica`, so you may need to set it explicitly).
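As a minimal sketch of that reload pattern (the table and file names here are illustrative, not from the original question):

```sql
-- Reload one dataset in a single transaction. With wal_level = minimal,
-- COPY into a table TRUNCATEd in the same transaction skips most WAL I/O.
BEGIN;
TRUNCATE TABLE dataset_123;
COPY dataset_123 FROM '/path/to/dataset_123.csv' WITH (FORMAT csv, HEADER true);
COMMIT;
```

Readers of the old table see the previous contents until the `COMMIT`, since `TRUNCATE` takes effect atomically with the transaction.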
If you cannot partition and truncate (say, because you're dealing with tens or hundreds of thousands of data sets, where there'd simply be too many tables), you'll instead want to tune autovacuum to run as aggressively as it can, make sure you have good indexes on any columns you delete by, and be prepared for somewhat ordinary performance.
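One way to do that tuning is with per-table autovacuum storage parameters. A sketch, with a hypothetical table name and illustrative thresholds rather than recommendations:

```sql
-- Make autovacuum fire much sooner on a high-churn table and not throttle itself.
ALTER TABLE dataset_rows SET (
    autovacuum_vacuum_scale_factor = 0.01,  -- vacuum after ~1% of rows change
    autovacuum_vacuum_cost_delay   = 0     -- don't pause between I/O batches
);

-- Index the column the deletes filter on, so DELETE doesn't seq-scan:
CREATE INDEX CONCURRENTLY IF NOT EXISTS dataset_rows_dataset_id_idx
    ON dataset_rows (dataset_id);
```

Per-table settings like these override the global `autovacuum_*` GUCs only for the tables that need them, which is usually safer than cranking the cluster-wide defaults.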
If you don't need crash safety (you don't mind your tables being empty after a system crash) you can also create your tables as `UNLOGGED`, which will save you a huge amount of I/O cost.
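For example (hypothetical table definition; only the `UNLOGGED` keyword matters here):

```sql
-- Unlogged tables skip WAL entirely: much cheaper writes, but the table
-- is emptied after a crash and is not replicated to standbys.
CREATE UNLOGGED TABLE dataset_123 (
    id      bigint PRIMARY KEY,
    payload jsonb
);

-- An existing table can also be switched (PostgreSQL 9.5 and later):
ALTER TABLE dataset_123 SET UNLOGGED;
```

Note that `ALTER TABLE ... SET LOGGED` to convert back rewrites the whole table, so flipping back and forth is not free.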
If you don't mind having to restore the whole setup from a backup after a system crash, you can go a step further and also set `fsync=off`, which basically says to PostgreSQL "don't bother with crash safety, I have good backups, I don't care if my data is permanently and totally unrecoverable after a crash, and I'm happy to re-`initdb` before I can use my database again".
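In `postgresql.conf` that looks like the fragment below; `full_page_writes` is a companion non-durability setting from the same family (it is not mentioned above, but the docs group it with `fsync`), included here as an assumption about a typical throwaway setup:

```
# postgresql.conf -- only acceptable if the whole cluster is disposable
# or restorable from backup; a crash can corrupt data files.
fsync = off             # don't force writes to durable storage
full_page_writes = off  # torn-page protection is pointless without fsync
```

Restart (or reload, for these settings a restart is safest) after changing them, and never carry this configuration into a database you care about.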
I wrote some more about this in a similar thread on Stack Overflow about optimising PostgreSQL for fast testing; that mentions host OS tuning, separating WAL onto a different disk if you're not using `unlogged` tables, checkpointer adjustments, etc.
There's also some info in the Pg docs for fast data loading and non-durable settings.