PostgreSQL query runs faster with index scan, but engine chooses hash join

Question

The query:

SELECT "replays_game".*
FROM "replays_game"
INNER JOIN
 "replays_playeringame" ON "replays_game"."id" = "replays_playeringame"."game_id"
WHERE "replays_playeringame"."player_id" = 50027

If I set SET enable_seqscan = off, then it does the fast thing, which is:

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..27349.80 rows=3395 width=72) (actual time=28.726..65.056 rows=3398 loops=1)
   ->  Index Scan using replays_playeringame_player_id on replays_playeringame  (cost=0.00..8934.43 rows=3395 width=4) (actual time=0.019..2.412 rows=3398 loops=1)
         Index Cond: (player_id = 50027)
   ->  Index Scan using replays_game_pkey on replays_game  (cost=0.00..5.41 rows=1 width=72) (actual time=0.017..0.017 rows=1 loops=3398)
         Index Cond: (id = replays_playeringame.game_id)
 Total runtime: 65.437 ms

But without the dreaded enable_seqscan, it chooses to do a slower thing:

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=7330.18..18145.24 rows=3395 width=72) (actual time=92.380..535.422 rows=3398 loops=1)
   Hash Cond: (replays_playeringame.game_id = replays_game.id)
   ->  Index Scan using replays_playeringame_player_id on replays_playeringame  (cost=0.00..8934.43 rows=3395 width=4) (actual time=0.020..2.899 rows=3398 loops=1)
         Index Cond: (player_id = 50027)
   ->  Hash  (cost=3668.08..3668.08 rows=151208 width=72) (actual time=90.842..90.842 rows=151208 loops=1)
         Buckets: 1024  Batches: 32 (originally 16)  Memory Usage: 1025kB
         ->  Seq Scan on replays_game  (cost=0.00..3668.08 rows=151208 width=72) (actual time=0.020..29.061 rows=151208 loops=1)
 Total runtime: 535.821 ms

Here are the relevant indexes:

Index "public.replays_game_pkey"
 Column |  Type   | Definition
--------+---------+------------
 id     | integer | id
primary key, btree, for table "public.replays_game"

Index "public.replays_playeringame_player_id"
  Column   |  Type   | Definition
-----------+---------+------------
 player_id | integer | player_id
btree, for table "public.replays_playeringame"

So my question is, what am I doing wrong that Postgres is mis-estimating the relative costs of the two ways of joining? I see in the cost estimates that it thinks the hash-join will be faster. And its estimate of the cost of the index-join is off by a factor of 500.

How can I give Postgres more of a clue? I did run a VACUUM ANALYZE immediately before running all of the above.

Interestingly, if I run this query for a player with a smaller # of games, Postgres chooses to do the index-scan + nested-loop. So something about the large # of games tickles this undesired behavior where relative estimated cost is out of line with actual estimated cost.

Finally, should I be using Postgres at all? I don't wish to become an expert in database tuning, so I'm looking for a database that will perform reasonably well with a conscientious developer's level of attention, as opposed to a dedicated DBA. I am afraid that if I stick with Postgres I will have a steady stream of issues like this that will force me to become a Postgres expert, and perhaps another DB will be more forgiving of a more casual approach.

A Postgres expert (RhodiumToad) reviewed my full database settings (http://pastebin.com/77QuiQSp) and recommended set cpu_tuple_cost = 0.1. That gave a dramatic speedup: http://pastebin.com/nTHvSHVd

Alternatively, switching to MySQL also solved the problem pretty nicely. I have a default installation of MySQL and Postgres on my OS X box, and MySQL is 2x faster, comparing queries that are "warmed up" by repeatedly executing the query. On "cold" queries, i.e. the first time a given query is executed, MySQL is 5 to 150 times faster. The performance of cold queries is pretty important for my particular application.

The big question, as far as I'm concerned, is still outstanding -- will Postgres require more fiddling and configuration to run well than MySQL? For example, consider that none of the suggestions offered by the commenters here worked.

I take it player_id is not a unique column in the replays_playeringame table? What's the distribution of values like? — Neil, May 17 '12 at 21:12
What are the settings for {work_mem, shared_buffers, effective_cache_size} ? How many NULL values are there for "public.replays_playeringame_player_id" ? How many unique FK's ? (oh, that's what Neil said) — wildplasser, May 17 '12 at 21:12
@Neil, player_id is not unique within replays_playeringame. replays_playeringame is a many-to-many join table, with indexed fields player_id and game_id. The distribution of values is uneven -- the most active player is in 3,000 games, while the median player is probably in 2 games. — dsjoerg, May 17 '12 at 21:19
128M is small (for a current machine) for effective_cache_size. Try setting it to (total_memory - memory_used_by_all_processes) You did not alter the other cost constants? Also note that the row estimates are almost correct. — wildplasser, May 17 '12 at 21:28
"I'm afraid I will have a steady stream of issues like this": you will have that with any database. Don't believe anyone who claims that a DBMS does not need attention or that it always runs every statement with the ultimately best execution plan. — a_horse_with_no_name, May 17 '12 at 22:34

sayap · Answer 1 · 2012-05-17 22:29:39Z

My guess is that you are using the default random_page_cost = 4, which is way too high, making index scan too costly.

I try to reconstruct the 2 tables with this script:

CREATE TABLE replays_game (
    id integer NOT NULL,
    PRIMARY KEY (id)
);

CREATE TABLE replays_playeringame (
    player_id integer NOT NULL,
    game_id integer NOT NULL,
    PRIMARY KEY (player_id, game_id),
    CONSTRAINT replays_playeringame_game_fkey
        FOREIGN KEY (game_id) REFERENCES replays_game (id)
);

CREATE INDEX ix_replays_playeringame_game_id
    ON replays_playeringame (game_id);

-- 150k games
INSERT INTO replays_game
SELECT generate_series(1, 150000);

-- ~150k players, ~2 games each
INSERT INTO replays_playeringame
select trunc(random() * 149999 + 1), generate_series(1, 150000);

INSERT INTO replays_playeringame
SELECT *
FROM
    (
        SELECT
            trunc(random() * 149999 + 1) as player_id,
            generate_series(1, 150000) as game_id
    ) AS t
WHERE
    NOT EXISTS (
        SELECT 1
        FROM replays_playeringame
        WHERE
            t.player_id = replays_playeringame.player_id
            AND t.game_id = replays_playeringame.game_id
    )
;

-- the heavy player with 3000 games
INSERT INTO replays_playeringame
select 999999, generate_series(1, 3000);

With the default value of 4:

game=# set random_page_cost = 4;
SET
game=# explain analyse SELECT "replays_game".*
FROM "replays_game"
INNER JOIN "replays_playeringame" ON "replays_game"."id" = "replays_playeringame"."game_id"
WHERE "replays_playeringame"."player_id" = 999999;
                                                                     QUERY PLAN                                                                      
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1483.54..4802.54 rows=3000 width=4) (actual time=3.640..110.212 rows=3000 loops=1)
   Hash Cond: (replays_game.id = replays_playeringame.game_id)
   ->  Seq Scan on replays_game  (cost=0.00..2164.00 rows=150000 width=4) (actual time=0.012..34.261 rows=150000 loops=1)
   ->  Hash  (cost=1446.04..1446.04 rows=3000 width=4) (actual time=3.598..3.598 rows=3000 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 106kB
         ->  Bitmap Heap Scan on replays_playeringame  (cost=67.54..1446.04 rows=3000 width=4) (actual time=0.586..2.041 rows=3000 loops=1)
               Recheck Cond: (player_id = 999999)
               ->  Bitmap Index Scan on replays_playeringame_pkey  (cost=0.00..66.79 rows=3000 width=0) (actual time=0.560..0.560 rows=3000 loops=1)
                     Index Cond: (player_id = 999999)
 Total runtime: 110.621 ms

After lowering it to 2:

game=# set random_page_cost = 2;
SET
game=# explain analyse SELECT "replays_game".*
FROM "replays_game"
INNER JOIN "replays_playeringame" ON "replays_game"."id" = "replays_playeringame"."game_id"
WHERE "replays_playeringame"."player_id" = 999999;
                                                                  QUERY PLAN                                                                   
-----------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=45.52..4444.86 rows=3000 width=4) (actual time=0.418..27.741 rows=3000 loops=1)
   ->  Bitmap Heap Scan on replays_playeringame  (cost=45.52..1424.02 rows=3000 width=4) (actual time=0.406..1.502 rows=3000 loops=1)
         Recheck Cond: (player_id = 999999)
         ->  Bitmap Index Scan on replays_playeringame_pkey  (cost=0.00..44.77 rows=3000 width=0) (actual time=0.388..0.388 rows=3000 loops=1)
               Index Cond: (player_id = 999999)
   ->  Index Scan using replays_game_pkey on replays_game  (cost=0.00..0.99 rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=3000)
         Index Cond: (id = replays_playeringame.game_id)
 Total runtime: 28.542 ms
(8 rows)

If using SSD, I would lower it further to 1.1.

As for your last question, I really think you should stick with postgresql. I have experience with postgresql and mssql, and I need to put in triple the effort into the later for it to perform half as well as the former.

If anyone is interested, I once encountered a similar issue, posted to the postgresql mailing list, and a long discussion was spawned: mail-archive.com/[email protected]/msg41005.html — sayap, May 17 '12 at 22:46
that discussion was effectively throttled by Tom Lane stating that effective_cache_size also affects the decision to use index scans. BTW: nice test setup. But I am afraid your strategy succeeds for the wrong reasons. — wildplasser, May 17 '12 at 23:13
Also note that the ratio between estimated and actual time is still wrong by a factor of ~40. You are just lucky because your OS has more buffers than the optimiser assumed. (this will alter once the tables get bigger (or the memory pressure higher) — wildplasser, May 17 '12 at 23:31
@wildplasser: There is no estimated time in the EXPLAIN ANALYSE output. — sayap, May 18 '12 at 4:25
I ran some tests, and setting random_page_cost to {4,2,1}. Results were indeed like this. Until I scaled-up the data by a factor of ten (both total size and the hard-hitter): now the chosen plan was actually the best, even for random_page_cost=4. My explanation of this is that the original query is executing fully in core, and that the disk-I/O cost factors loose grip on reality for such a small footprint. (they do affect the query plan, obviously) BTW: setting work_mem to a lower value has a similar effect on the generated plan. a lower fraction of hard-hitters yields better plans. — wildplasser, May 18 '12 at 12:17

wildplasser · Answer 2 · 2013-11-23 15:09:05Z

I ran sayap's testbed-code (Thanks!) , with the following modifications:

code is run four times with random_page_cost set to 8,4,2,1; in that order. (the cpc=8 is intended to prime the disk-buffer-cache)
The test is repeated with a reduced (1/2,1/4,1/8) fraction of the hard-hitters (respectively: 3K, 1K5,750 and 375 hardhitters; the rest of the records is kept unchanged.
These 4*4 tests are repeated with a lower setting (64K, the minimum) for work_mem.

After this run, I did the same run, but scaled up tenfold: with 1M5 records (30K hard-hitters)

Currently, I am running the same test with a hundred-fold scale-up, but the initialisation is rather slow...

Results The entries in the cells are the total time in msec plus a string that denotes the chosen queryplan. (only a handfull of plans occur)

Original 3K / 150K  work_mem=16M

rpc     |       3K      |       1K5     |       750     |       375
--------+---------------+---------------+---------------+------------
8*      | 50.8  H.BBi.HS| 44.3  H.BBi.HS| 38.5  H.BBi.HS| 41.0  H.BBi.HS
4       | 43.6  H.BBi.HS| 48.6  H.BBi.HS| 4.34  NBBi    | 1.33  NBBi
2       | 6.92  NBBi    | 3.51  NBBi    | 4.61  NBBi    | 1.24  NBBi
1       | 6.43  NII     | 3.49  NII     | 4.19  NII     | 1.18  NII


Original 3K / 150K work_mem=64K

rpc     |       3K      |       1K5     |       750     |       375
--------+---------------+---------------+---------------+------------
8*      | 74.2  H.BBi.HS| 69.6  NBBi    | 62.4  H.BBi.HS| 66.9  H.BBi.HS
4       | 6.67  NBBi    | 8.53  NBBi    | 1.91  NBBi    | 2.32  NBBi
2       | 6.66  NBBi    | 3.6   NBBi    | 1.77  NBBi    | 0.93  NBBi
1       | 7.81  NII     | 3.26  NII     | 1.67  NII     | 0.86  NII


Scaled 10*: 30K / 1M5  work_mem=16M

rpc     |       30K     |       15K     |       7k5     |       3k75
--------+---------------+---------------+---------------+------------
8*      | 623   H.BBi.HS| 556   H.BBi.HS| 531   H.BBi.HS| 14.9  NBBi
4       | 56.4  M.I.sBBi| 54.3  NBBi    | 27.1  NBBi    | 19.1  NBBi
2       | 71.0  NBBi    | 18.9  NBBi    | 9.7   NBBi    | 9.7   NBBi
1       | 79.0  NII     | 35.7  NII     | 17.7  NII     | 9.3   NII


Scaled 10*: 30K / 1M5  work_mem=64K

rpc     |       30K     |       15K     |       7k5     |       3k75
--------+---------------+---------------+---------------+------------
8*      | 729   H.BBi.HS| 722   H.BBi.HS| 723   H.BBi.HS| 19.6  NBBi
4       | 55.5  M.I.sBBi| 41.5  NBBi    | 19.3  NBBi    | 13.3  NBBi
2       | 70.5  NBBi    | 41.0  NBBi    | 26.3  NBBi    | 10.7  NBBi
1       | 69.7  NII     | 38.5  NII     | 20.0  NII     | 9.0   NII

Scaled 100*: 300K / 15M  work_mem=16M

rpc     |       300k    |       150K    |       75k     |       37k5
--------+---------------+---------------+---------------+---------------
8*      |7314   H.BBi.HS|9422   H.BBi.HS|6175   H.BBi.HS| 122   N.BBi.I
4       | 569   M.I.sBBi| 199   M.I.sBBi| 142   M.I.sBBi| 105   N.BBi.I
2       | 527   M.I.sBBi| 372   N.BBi.I | 198   N.BBi.I | 110   N.BBi.I
1       | 694   NII     | 362   NII     | 190   NII     | 107   NII

Scaled 100*: 300K / 15M  work_mem=64K

rpc     |       300k    |       150k    |       75k     |       37k5
--------+---------------+---------------+---------------+------------
8*      |22800 H.BBi.HS |21920 H.BBi.HS | 20630 N.BBi.I |19669  H.BBi.HS
4       |22095 H.BBi.HS |  284 M.I.msBBi| 205   B.BBi.I |  116  N.BBi.I
2       |  528 M.I.msBBi|  399  N.BBi.I | 211   N.BBi.I |  110  N.BBi.I
1       |  718 NII      |  364  NII     | 200   NII     |  105  NII

[8*] Note: the RandomPageCost=8 runs were only intended as a prerun to prime the disk buffer cache; the results should be ignored.

Legend for node types:
N := Nested loop
M := Merge join
H := Hash (or Hash join)
B := Bitmap heap scan
Bi := Bitmap index scan
S := Seq scan
s := sort
m := materialise

Preliminary conclusion:

"the working set" for the original query is too small: all of it fits in core, resulting in the cost of page fetches to be grossly overestimated. Setting RPC to 2 (or 1) "solves" this problem, but once the query is scaled-up, the page-costs become dominant, and RPC=4 becomes comparable or even better.
Setting work_mem to a lower value is another way to make the optimiser shift to index-scans (instead of hash+bitmap-scans). The differences I found are smaller than what Sayap reported. Maybe I have more effective_cache_size, or he forgot to prime the cache?
The optimiser is known to have problems with "skewed" distributions (and "skewed" or "peaked" multidimentional distributions) The testruns with 1/4 and 1/8 of the initial 3K/150K hardhitters show that this effect vanishes once the "peak" flattens out.
Something happens at the 2% boundary: the 3000/150000 gererate different (worse) plans, than those with <2% hardhitters. Could this be the granularity of the histograms ?

Massaging the optimizer - your specialty! :) So much information here! — Erwin Brandstetter, May 22 '12 at 11:25
The hardest part was getting all the ASCII-art right ;-) BTW, I really should do another run, dropping the cache in between tests. (in stead of priming it) — wildplasser, May 22 '12 at 11:29

Neil · Answer 3 · 2012-05-17 23:02:51Z

up vote 1 down vote

You might get a better execution plan using a multiple column (player_id, game_id) index on the replays_playeringame table. This avoids having to use a random page seek to look up the game id(s) for the player id.

answered May 17 '12 at 23:02

Neil

31.8k72746

I tried this, however it did not change the execution plan. – dsjoerg May 18 '12 at 16:59

add a comment |

asked	3 years ago
viewed	6399 times
active	3 days ago

current community

your communities

more stack exchange communities

PostgreSQL query runs faster with index scan, but engine chooses hash join

3 Answers 3

Your Answer

Not the answer you're looking for? Browse other questions tagged postgresql indexing query-optimization postgresql-performance or ask your own question.

Linked

Hot Network Questions

current community

your communities

more stack exchange communities

PostgreSQL query runs faster with index scan, but engine chooses hash join

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged postgresql indexing query-optimization postgresql-performance or ask your own question.

Linked

Related

Hot Network Questions