I have built a small inventory system using PostgreSQL and psycopg2. Everything works great, except when I want to create aggregated summaries/reports of the content: I get really bad performance due to count()'ing and sorting.
The DB schema is as follows:
CREATE TABLE hosts (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255)
);

CREATE TABLE items (
    id SERIAL PRIMARY KEY,
    description TEXT
);

CREATE TABLE host_item (
    id SERIAL PRIMARY KEY,
    host INTEGER REFERENCES hosts(id) ON DELETE CASCADE ON UPDATE CASCADE,
    item INTEGER REFERENCES items(id) ON DELETE CASCADE ON UPDATE CASCADE
);
There are some other fields as well, but those are not relevant.
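The join columns on host_item are indexed. The second query plan further down refers to an index named idx_host_item_host; the item-side index shown here is an assumed counterpart with a placeholder name of my own:

CREATE INDEX idx_host_item_host ON host_item (host);
-- assumed analogous index for the items report (name is a placeholder)
CREATE INDEX idx_host_item_item ON host_item (item);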
I want to extract 2 different reports:

- A list of all hosts with the number of items per host, ordered from highest to lowest count
- A list of all items with the number of hosts per item, ordered from highest to lowest count
I have used two queries for this purpose:
Items with host count:
SELECT i.id, i.description, COUNT(hi.id) AS count
FROM items AS i
LEFT JOIN host_item AS hi
    ON (i.id = hi.item)
GROUP BY i.id
ORDER BY count DESC
LIMIT 10;
Hosts with item count:
SELECT h.id, h.name, COUNT(hi.id) AS count
FROM hosts AS h
LEFT JOIN host_item AS hi
    ON (h.id = hi.host)
GROUP BY h.id
ORDER BY count DESC
LIMIT 10;
Problem is: the queries run for 5-6 seconds before returning any data. As this is a web-based application, 6 seconds is just not acceptable. The database is heavily populated, with approximately 50k hosts, 1,000 items and 400,000 host/item relations, and it will likely grow significantly when (or perhaps if) the application comes into use.
After playing around, I found that removing the "ORDER BY count DESC" part made both queries execute instantly, without any delay whatsoever (less than 20 ms to finish).
Is there any way I can optimize these queries so that I can get the result sorted without the delay? I have tried different indexes, but seeing as the count is computed, it does not seem possible to utilize an index for it. I have read that count()'ing in PostgreSQL is slow, but it's the sorting that is causing me problems...
My current workaround is to run the queries above as an hourly job, putting the result into a new table with an index on the count column for quick lookup.
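For completeness, a minimal sketch of that workaround for the items report (the summary table and index names are placeholders of my own, not part of the schema above):

-- Hourly job: rebuild the summary table from the live data.
BEGIN;
DROP TABLE IF EXISTS item_host_counts;
CREATE TABLE item_host_counts AS
    SELECT i.id, i.description, COUNT(hi.id) AS count
    FROM items AS i
    LEFT JOIN host_item AS hi
        ON (i.id = hi.item)
    GROUP BY i.id;
-- Index the precomputed count so ORDER BY ... LIMIT becomes a quick index read.
CREATE INDEX idx_item_host_counts_count ON item_host_counts (count DESC);
COMMIT;

-- The web application then reads the report from the summary table:
SELECT id, description, count
FROM item_host_counts
ORDER BY count DESC
LIMIT 10;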
I use PostgreSQL 9.2.
Update: Query plans as ordered :)
EXPLAIN ANALYZE
SELECT h.id, h.name, COUNT(hi.id) AS count
FROM hosts AS h
LEFT JOIN host_item AS hi
ON (h.id=hi.host)
GROUP BY h.id
ORDER BY count DESC
LIMIT 10;
Limit (cost=699028.97..699028.99 rows=10 width=21) (actual time=5427.422..5427.424 rows=10 loops=1)
-> Sort (cost=699028.97..699166.44 rows=54990 width=21) (actual time=5427.415..5427.416 rows=10 loops=1)
Sort Key: (count(hi.id))
Sort Method: top-N heapsort Memory: 25kB
-> GroupAggregate (cost=613177.95..697840.66 rows=54990 width=21) (actual time=3317.320..5416.440 rows=54990 loops=1)
-> Merge Left Join (cost=613177.95..679024.94 rows=3653163 width=21) (actual time=3317.267..5025.999 rows=3653163 loops=1)
Merge Cond: (h.id = hi.host)
-> Index Scan using hosts_pkey on hosts h (cost=0.00..1779.16 rows=54990 width=17) (actual time=0.012..15.693 rows=54990 loops=1)
-> Materialize (cost=613177.95..631443.77 rows=3653163 width=8) (actual time=3317.245..4370.865 rows=3653163 loops=1)
-> Sort (cost=613177.95..622310.86 rows=3653163 width=8) (actual time=3317.199..3975.417 rows=3653163 loops=1)
Sort Key: hi.host
Sort Method: external merge Disk: 64288kB
-> Seq Scan on host_item hi (cost=0.00..65124.63 rows=3653163 width=8) (actual time=0.006..643.257 rows=3653163 loops=1)
Total runtime: 5438.248 ms
EXPLAIN ANALYZE
SELECT h.id, h.name, COUNT(hi.id) AS count
FROM hosts AS h
LEFT JOIN host_item AS hi
ON (h.id=hi.host)
GROUP BY h.id
LIMIT 10;
Limit (cost=0.00..417.03 rows=10 width=21) (actual time=0.136..0.849 rows=10 loops=1)
-> GroupAggregate (cost=0.00..2293261.13 rows=54990 width=21) (actual time=0.134..0.845 rows=10 loops=1)
-> Merge Left Join (cost=0.00..2274445.41 rows=3653163 width=21) (actual time=0.040..0.704 rows=581 loops=1)
Merge Cond: (h.id = hi.host)
-> Index Scan using hosts_pkey on hosts h (cost=0.00..1779.16 rows=54990 width=17) (actual time=0.015..0.021 rows=11 loops=1)
-> Index Scan Backward using idx_host_item_host on host_item hi (cost=0.00..2226864.24 rows=3653163 width=8) (actual time=0.005..0.438 rows=581 loops=1)
Total runtime: 1.143 ms
Update: All the answers to this question are really good for learning and understanding how Postgres works. There does not seem to be any definite solution to this problem, but I really appreciate all the excellent answers you have provided, and I will use them in my future work with PostgreSQL. Thanks a lot, guys!
Can you post the EXPLAIN ANALYZE output of both the original query and the query with the ORDER BY clause removed? – willglynn Oct 16 '12 at 16:24

Please run EXPLAIN (ANALYZE, BUFFERS) on the query/queries. Paste each plan into explain.depesz.com and link to it here, so we can see what's actually happening. Asking people to optimise queries without the plan and statistics estimates is not going to get you much more than educated guesswork. With PostgreSQL 9.2's index-only scans and the relatively small data set I don't see any reason these should be so slow. It'd also help to explain what computer you're running the queries on, what the storage is (RAID10? SSD? Single HDD?), RAM, and non-default postgresql.conf settings. – Craig Ringer Oct 17 '12 at 0:22

Changed postgresql.conf settings? (You should have), etc? See immediately prior comment. – Craig Ringer Oct 17 '12 at 8:03