Depending on how many different data sets there are, one option would be to partition the tables per-dataset.
When a dataset is updated, `BEGIN` a new transaction, `TRUNCATE` the table, `COPY` the new data into it, and `COMMIT`. PostgreSQL has an optimisation where `COPY`ing into a table that's been `TRUNCATE`d in the same transaction does much less I/O if you're using `wal_level = minimal` (the default before PostgreSQL 10; newer versions default to `replica`, so you may need to set it explicitly).
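As a minimal sketch of that reload pattern (the table and file names here are illustrative, not from the original question):

```sql
-- Reload one dataset in a single transaction. With wal_level = minimal,
-- COPY into a table TRUNCATEd in the same transaction skips most WAL I/O.
BEGIN;
TRUNCATE TABLE dataset_123;
COPY dataset_123 FROM '/path/to/dataset_123.csv' WITH (FORMAT csv, HEADER true);
COMMIT;
```

Readers of the old table see the previous contents until the `COMMIT`, since `TRUNCATE` takes effect atomically with the transaction.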
If you cannot partition and truncate (say, because you're dealing with tens or hundreds of thousands of data sets, where there'd simply be too many tables), you'll instead want to tune autovacuum to run as aggressively as it can, make sure you have good indexes on any columns you delete by, and be prepared for somewhat ordinary performance.
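One way to do that tuning is with per-table autovacuum storage parameters. A sketch, with a hypothetical table name and illustrative thresholds rather than recommendations:

```sql
-- Make autovacuum fire much sooner on a high-churn table and not throttle itself.
ALTER TABLE dataset_rows SET (
    autovacuum_vacuum_scale_factor = 0.01,  -- vacuum after ~1% of rows change
    autovacuum_vacuum_cost_delay   = 0     -- don't pause between I/O batches
);

-- Index the column the deletes filter on, so DELETE doesn't seq-scan:
CREATE INDEX CONCURRENTLY IF NOT EXISTS dataset_rows_dataset_id_idx
    ON dataset_rows (dataset_id);
```

Per-table settings like these override the global `autovacuum_*` GUCs only for the tables that need them, which is usually safer than cranking the cluster-wide defaults.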
If you don't need crash safety (you don't mind your tables being empty after a system crash) you can also create your tables as `UNLOGGED`, which will save you a huge amount of I/O cost.
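For example (hypothetical table definition; only the `UNLOGGED` keyword matters here):

```sql
-- Unlogged tables skip WAL entirely: much cheaper writes, but the table
-- is emptied after a crash and is not replicated to standbys.
CREATE UNLOGGED TABLE dataset_123 (
    id      bigint PRIMARY KEY,
    payload jsonb
);

-- An existing table can also be switched (PostgreSQL 9.5 and later):
ALTER TABLE dataset_123 SET UNLOGGED;
```

Note that `ALTER TABLE ... SET LOGGED` to convert back rewrites the whole table, so flipping back and forth is not free.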
If you don't mind having to restore the whole setup from a backup after a system crash, you can go a step further and also set `fsync=off`, which basically says to PostgreSQL "don't bother with crash safety, I have good backups, I don't care if my data is permanently and totally unrecoverable after a crash, and I'm happy to re-`initdb` before I can use my database again".
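In `postgresql.conf` that looks like the fragment below; `full_page_writes` is a companion non-durability setting from the same family (it is not mentioned above, but the docs group it with `fsync`), included here as an assumption about a typical throwaway setup:

```
# postgresql.conf -- only acceptable if the whole cluster is disposable
# or restorable from backup; a crash can corrupt data files.
fsync = off             # don't force writes to durable storage
full_page_writes = off  # torn-page protection is pointless without fsync
```

Restart (or reload, for these settings a restart is safest) after changing them, and never carry this configuration into a database you care about.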
I wrote some more about this in a similar thread on Stack Overflow about optimising PostgreSQL for fast testing; that mentions host OS tuning, separating WAL onto a different disk if you're not using `unlogged` tables, checkpointer adjustments, etc.
There's also some info in the Pg docs for fast data loading and non-durable settings.