
I have around 5 million rows in a postgres table. I'd like to know how many rows match start_time >= NOW(), but despite having an index on start_time the query is extremely slow (in the order of several hours).

EXPLAIN SELECT COUNT(*) FROM core_event WHERE start_time >= NOW();
 Aggregate  (cost=449217.81..449217.82 rows=1 width=0)
   ->  Index Scan using core_event_start_time on core_event  (cost=0.00..447750.83 rows=586791 width=0)
         Index Cond: (start_time >= now())

Here's the schema information for the table:

 id          | integer                  | not null default nextval('core_event_id_seq'::regclass)
 source      | character varying(100)   | not null
 external_id | character varying(100)   | 
 title       | character varying(250)   | not null
 location    | geometry                 | not null
 start_time  | timestamp with time zone | 
 stop_time   | timestamp with time zone | 
 thumb       | character varying(300)   | 
 image       | character varying(100)   | 
 image_thumb | character varying(100)   | 
 address     | character varying(300)   | 
 description | text                     | 
 venue_name  | character varying(100)   | 
 website     | character varying(300)   | 
 city_id     | integer                  | 
 category    | character varying(100)   | 
 phone       | character varying(50)    | 
 place_id    | integer                  | 
 image_url   | character varying(300)   | 
 event_type  | character varying(200)   | 
 hidden      | boolean                  | not null
 views       | integer                  | not null
 added       | timestamp with time zone | 

I have indexes on the following fields:

city_id
external_id (unique)
location
location_id
place_id
start_time

Is there any easy way for me to speed up the query (eg. a partial index), or am I going to have to resort to partitioning the data by date?

@StefanNch - Ouch –  MatBailie Jan 12 '12 at 17:14
Does that really apply? As I read the wiki link, @StefanNch's link is relevant only when the count is not restricted to a field used in an index. –  user806549 Jan 12 '12 at 17:16
Are your statistics up to date? –  user806549 Jan 12 '12 at 17:16
@StefanNch, I'm applying a WHERE clause, so that shouldn't be too relevant. –  Ben Dowling Jan 12 '12 at 17:31

2 Answers

Try adding a partial index like the following:

CREATE INDEX core_event_start_time_recent_idx ON core_event (start_time)
WHERE start_time >= '2011-01-12 0:0'::timestamptz;

This will create a comparatively small index. Index creation will take some time, but queries like this one will be much faster thereafter.

SELECT count(*) FROM core_event WHERE start_time >= now();

The effectiveness of this index for queries against now() will degrade slowly over time, depending on how many new rows come in. Update (= drop & create) the index with a more recent timestamp occasionally, at off hours.
You could automate this with a plpgsql function that you call via cronjob or pgAgent.
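A hedged sketch of such a maintenance function (the function name is illustrative, not from the original answer; dynamic SQL is needed because a partial-index predicate must be immutable, so the volatile now() cannot appear in it directly):

```sql
-- Hypothetical maintenance function: drop and recreate the partial index
-- with a fresh cutoff timestamp. Schedule it off-hours via cron or pgAgent.
CREATE OR REPLACE FUNCTION refresh_recent_start_time_idx() RETURNS void AS
$$
BEGIN
    DROP INDEX IF EXISTS core_event_start_time_recent_idx;
    -- now() is volatile and cannot appear in an index predicate,
    -- so interpolate the current timestamp as a constant literal.
    EXECUTE 'CREATE INDEX core_event_start_time_recent_idx
             ON core_event (start_time)
             WHERE start_time >= ' || quote_literal(now());
END;
$$ LANGUAGE plpgsql;
```

Each run shrinks the index back down to only the rows at or after the moment of the rebuild.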


You might also check whether running CLUSTER on the table improves things (if it doesn't conflict with other requirements in your db):

CLUSTER core_event USING core_event_start_time;

Yes, cluster on the full index, not the partial one. This will take a while and needs an exclusive lock, because it effectively rewrites the table. It also effectively vacuums the table fully. Read about it in the manual.
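To gauge whether CLUSTER is even worth the exclusive lock, you can first check how well the heap is already physically ordered by start_time via the planner's correlation statistic (standard pg_stats usage; values near 1 or -1 mean the table is already nearly ordered on that column):

```sql
-- correlation close to +1 or -1 => rows already stored in start_time order,
-- so CLUSTER would gain little
SELECT tablename, attname, correlation
FROM   pg_stats
WHERE  tablename = 'core_event'
AND    attname   = 'start_time';
```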

You may also want to increase the statistics target for core_event.start_time:

ALTER TABLE core_event ALTER COLUMN start_time SET STATISTICS 1000; -- example value

The default is just 100. Then:

ANALYZE core_event;

Of course, all the usual performance advice applies, too.

I added the index, but from EXPLAIN ANALYZE it doesn't look like it's using it. It's doing a Bitmap Index Scan on core_event_start_time –  Ben Dowling Jan 12 '12 at 18:54
@BenDowling: That's odd. I have a couple of indexes just like that and I verified it works for me (on versions 8.4, 9.0 and 9.1). It should work for any timestamp after the one in the WHERE clause of the index, provided most rows have a start_time older than that timestamp (or the index won't be any faster than a sequential table scan). In any case it would be chosen over a full index on start_time over all rows. Something does not add up here. Are you showing the full picture? How many rows (approx.) are before and after now() in your table? –  Erwin Brandstetter Jan 12 '12 at 23:36
I'm using 8.4, and have around 4 million rows total, with 700,000 >= NOW(). I've just run ANALYZE core_event and re-run EXPLAIN, and it's still not using the partial index. –  Ben Dowling Jan 13 '12 at 8:23
@BenDowling: With that many rows > now(), I can imagine the query planner choosing a sequential scan, because most of the data pages have to be read anyway. But the full index core_event_start_time should never be chosen over the partial index core_event_start_time_recent_idx when a matching condition is given. –  Erwin Brandstetter Jan 13 '12 at 15:32
It's clustered around start_time; could that be causing it to use that index instead? –  Ben Dowling Jan 21 '12 at 7:51

Do most of these columns get populated for each row? If so, the amount of disk PostgreSQL has to read to check rows for visibility, even after consulting the index, will be fairly large. Try, for example, creating a separate table that holds only id and start_time:

create table core_event_start_time as select id, start_time from core_event;
alter table core_event_start_time add primary key(id);
alter table core_event_start_time add foreign key(id) references core_event(id);
create index on core_event_start_time(start_time);

Now see how long it takes to count IDs in core_event_start_time only. Of course, this approach will take up more buffer cache at the expense of space for your actual core_event table...

If it helps, you can add a trigger onto core_event to keep the auxiliary table updated.
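A hedged sketch of such a trigger (function and trigger names are illustrative; the syntax matches the asker's PostgreSQL 8.4):

```sql
-- Hypothetical trigger to mirror changes into the auxiliary table
CREATE OR REPLACE FUNCTION sync_core_event_start_time() RETURNS trigger AS
$$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO core_event_start_time (id, start_time)
        VALUES (NEW.id, NEW.start_time);
    ELSIF TG_OP = 'UPDATE' THEN
        UPDATE core_event_start_time
        SET    start_time = NEW.start_time
        WHERE  id = NEW.id;
    ELSIF TG_OP = 'DELETE' THEN
        DELETE FROM core_event_start_time WHERE id = OLD.id;
        RETURN OLD;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER core_event_sync
AFTER INSERT OR UPDATE OR DELETE ON core_event
FOR EACH ROW EXECUTE PROCEDURE sync_core_event_start_time();
```

With this in place the auxiliary table stays consistent without any application changes.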

(PostgreSQL 9.2 will introduce "index-only scans", which may help with this sort of situation, but that's in the future.)

