I have a simple schema and query, but am experiencing consistently awful performance with certain parameters.

Schema

CREATE TABLE barcodes (
  id    integer PRIMARY KEY,
  value citext  NOT NULL
);
CREATE INDEX index_barcodes_on_value ON barcodes (value);

CREATE TABLE locations (
  id         integer PRIMARY KEY,
  barcode_id integer NOT NULL REFERENCES barcodes(id)
);
CREATE INDEX index_locations_on_barcode_id ON locations (barcode_id);

Query

EXPLAIN ANALYZE
SELECT *
FROM locations
JOIN barcodes ON locations.barcode_id = barcodes.id
ORDER BY barcodes.value ASC
LIMIT 50;

Analysis

Limit  (cost=0.71..3564.01 rows=50 width=34) (actual time=0.043..683.025 rows=50 loops=1)
  ->  Nested Loop  (cost=0.71..4090955.00 rows=57404 width=34) (actual time=0.043..683.017 rows=50 loops=1)
        ->  Index Scan using index_barcodes_on_value on barcodes  (cost=0.42..26865.99 rows=496422 width=15) (actual time=0.023..218.775 rows=372138 loops=1)
        ->  Index Scan using index_locations_on_barcode_id on locations  (cost=0.29..5.32 rows=287 width=8) (actual time=0.001..0.001 rows=0 loops=372138)
              Index Cond: (barcode_id = barcodes.id)
Planning time: 0.167 ms
Execution time: 683.078 ms

500+ ms for this number of rows (500,000 barcodes and 60,000 locations) doesn't make sense. Can I do anything to improve the performance?

Note:

Even stranger, the execution time depends on the data. In drafting this question I attempted to include seeded random data, but the seeded data turns out to be performant:

Seed

INSERT INTO barcodes (id, value) SELECT seed.id, gen_random_uuid() FROM generate_series(1,500000) AS seed(id);
INSERT INTO locations (id, barcode_id) SELECT seed.id, (RANDOM() * 500000)  FROM generate_series(1,60000) AS seed(id);

Analysis

Limit  (cost=0.71..3602.63 rows=50 width=86) (actual time=0.089..1.123 rows=50 loops=1)
  ->  Nested Loop  (cost=0.71..4330662.42 rows=60116 width=86) (actual time=0.088..1.115 rows=50 loops=1)
        ->  Index Scan using index_barcodes_on_value on barcodes  (cost=0.42..44972.42 rows=500000 width=41) (actual time=0.006..0.319 rows=376 loops=1)
        ->  Index Scan using index_locations_on_barcode_id on locations  (cost=0.29..5.56 rows=301 width=8) (actual time=0.002..0.002 rows=0 loops=376)
              Index Cond: (barcode_id = barcodes.id)
Planning time: 0.213 ms
Execution time: 1.152 ms

I'm looking for help diagnosing the slow performance of the query. I can reproduce the problem - I just cannot replicate it with random seed data.

I am running PostgreSQL 9.6.2. Running VACUUM FULL on all the tables in the query didn't do anything.

  • What plan do you get if you don't include the LIMIT 50? Commented Mar 29, 2017 at 5:16

2 Answers

The reason for the poor performance is that the barcodes occurring earlier in the alphabet have been preferentially depleted from the locations table.

It is easy to reproduce this with your random data seeds by just populating locations with twice as many records, and then deleting the "first" half of them:

INSERT INTO locations (id, barcode_id) SELECT seed.id, (RANDOM() * 500000)
    FROM generate_series(1,120000) AS seed(id);

-- Delete the 60,000 locations whose barcodes sort first
DELETE FROM locations WHERE id IN (
    SELECT d.id FROM locations d
        JOIN barcodes ON d.barcode_id = barcodes.id
        ORDER BY barcodes.value ASC LIMIT 60000
);

There is not much hope for getting PostgreSQL to understand this weird cross-table relationship.

You might be able to get somewhat better performance by doing something like

SET LOCAL enable_nestloop TO off;
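
Note that SET LOCAL only takes effect inside a transaction block, so you would run it as, for example:

BEGIN;
SET LOCAL enable_nestloop TO off;   -- lasts only until COMMIT/ROLLBACK
SELECT *
FROM locations
JOIN barcodes ON locations.barcode_id = barcodes.id
ORDER BY barcodes.value ASC
LIMIT 50;
COMMIT;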

But the real solution is probably to rethink how you organize your data, or what you are trying to do with it. Usually barcodes serve as their own primary keys. Why do you need a separate table just to map a barcode to a primary key? And why do you need to know the lowest barcodes that exist somewhere (anywhere) on a regular basis? It sounds like an XY problem.
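
For example, if nothing else references barcodes.id, one possible restructuring (just a sketch under that assumption) is to store the barcode value directly on locations:

-- Sketch: keep the citext barcode on locations and drop the surrogate lookup table
CREATE TABLE locations (
  id      integer PRIMARY KEY,
  barcode citext  NOT NULL
);
CREATE INDEX index_locations_on_barcode ON locations (barcode);

-- The problem query then becomes a single-table, index-ordered scan:
SELECT *
FROM locations
ORDER BY barcode ASC
LIMIT 50;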

  • Would any of this matter if he has an index on barcodes(value) and he runs VACUUM FULL afterward? He says he's done that and saw no speed-up. Commented Mar 30, 2017 at 6:33
  • @EvanCarroll No, it wouldn't matter. There already is an index on barcodes(value), and the problem is with the joint cross-table distribution of the data itself, not the physical layout of the data. Commented Mar 30, 2017 at 15:38

It seems that your plan is saying that it starts at the top of your index_barcodes_on_value index and then does a lookup into index_locations_on_barcode_id to see if there's a match. One might think that in order to return 50 rows (out of 500,000) this would be the way to go, but in fact it has to repeat this operation 372,138 times before it finds 50 matches.

With a random data distribution it's quite understandable that you'd find 50 matches before going through 70% of the table (it takes only 376 tries), so the query is faster.
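
One rough way to see this skew in your real data (a sketch, using the 372,138 figure from your plan) is to count how many of the lowest-sorted barcodes have no location at all:

-- How many of the 372,138 alphabetically-first barcodes have no location?
SELECT count(*)
FROM (
    SELECT id
    FROM barcodes
    ORDER BY value ASC
    LIMIT 372138
) AS first_barcodes
LEFT JOIN locations ON locations.barcode_id = first_barcodes.id
WHERE locations.id IS NULL;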

So you'd probably like the query planner to choose a better join strategy; maybe a hash join would be best for this particular query, but it's hard to know whether there's a general-purpose 'solution' that would work in all situations.

If the surrogate keys in the barcodes table are roughly sorted in the same order as the values, then you could improve things by adding WHERE barcodes.id >= (SELECT MIN(barcode_id) FROM locations) to your query, as sketched below, but that depends on a relationship between the values which may not exist.
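
That rewrite would look something like this (it only helps if low ids correspond to low values):

SELECT *
FROM locations
JOIN barcodes ON locations.barcode_id = barcodes.id
WHERE barcodes.id >= (SELECT MIN(barcode_id) FROM locations)
ORDER BY barcodes.value ASC
LIMIT 50;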
