
I have to move the database of a licensed application out to an external DB server. The DB has to be Postgres, and I cannot change the SELECT query the application issues (I cannot change the source code).

The table (it has to be a single table) holds around 6.5M rows and has unique values in its main column (prefix).

All requests are reads; there are no inserts/updates/deletes. There are ~200k SELECTs per day, with peaks of 15 TPS.

The select query is:

SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table 
WHERE '00436641997142' LIKE prefix 
AND company = 0  and ((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0  or "Day" is null )  
ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC 
LIMIT 1;

EXPLAIN ANALYZE shows the following:

Limit  (cost=406433.75..406433.75 rows=1 width=113) (actual time=1721.360..1721.361 rows=1 loops=1)
  ->  Sort  (cost=406433.75..406436.72 rows=1188 width=113) (actual time=1721.358..1721.358 rows=1 loops=1)
        Sort Key: ("position"((prefix)::text, '%'::text)), (char_length(prefix)) DESC
        Sort Method: quicksort  Memory: 25kB
        ->  Seq Scan on table  (cost=0.00..406427.81 rows=1188 width=113) (actual time=1621.159..1721.345 rows=1 loops=1)
              Filter: ((company = 0) AND ('00381691997142'::text ~~ (prefix)::text) AND ((strpos(("Day")::text, (to_char(now(), 'ID'::text))::text) > 0) OR ("Day" IS NULL)) AND (((('now'::cstring)::time with time zone >= (timefrom)::time with time zone) AN (...)
              Rows Removed by Filter: 6417130
Planning time: 0.165 ms
Execution time: 1721.404 ms

The slowest part of the query is:

 SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table 
 WHERE '00436641997142' LIKE prefix 

which alone takes ~1.6 s (I tested only this part of the query).

This part of the query, tested separately, gives the following plan:

Seq Scan on table  (cost=0.00..181819.07 rows=32086 width=113) (actual time=1488.359..1580.607 rows=1 loops=1)
  Filter: ('004366491997142'::text ~~ (prefix)::text)
  Rows Removed by Filter: 6417130
Planning time: 0.061 ms
Execution time: 1580.637 ms

About the data itself: values in the "prefix" column share the same first several digits (the first 5), and the rest are different, unique values.

Postgres version is 9.5. I've changed the following Postgres settings:

random_page_cost = 40
effective_cache_size = 4GB
shared_buffers = 4GB
work_mem = 1GB

I have tried several index types (unique, GIN, GiST, hash), but in all cases the indexes are not used (as shown in the explain above) and the resulting speed is the same. I've also run the following, with no visible improvement:

vacuum analyze verbose table

Please recommend DB settings and/or an index configuration that would speed up the execution time of this query.

Current hardware is an i5 with an SSD and 16 GB RAM on Win7, but I have the option to buy stronger hardware. As I understand it, for workloads where reads dominate (no inserts/updates), faster CPU cores are much more important than the number of cores or disk speed; please confirm.

Add-on 1: After adding 9 indexes, no index is used either.

Add-on 2: 1) I found the reason the index is not used: the operand order in the LIKE part of the query. If the query were:

SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table WHERE prefix like '00436641997142%'
AND company = 0  and 
((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0  or "Day" is null )
 ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC LIMIT 1

it uses the index.

Notice the difference. The query which does not use the index:

... WHERE '00436641997142' like prefix ...

and the query which uses the index correctly:

... WHERE prefix like '00436641997142%' ...

Since I cannot change the query itself, any idea how to overcome this? I can change the data and the Postgres settings, but not the query itself.
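To illustrate the difference (a sketch: the index name is mine, and varchar_pattern_ops would become bpchar_pattern_ops for a character-type column):

CREATE INDEX prefix_pattern_idx ON erm_table (prefix varchar_pattern_ops);

-- Indexable: the column is on the left and the pattern is a left-anchored
-- constant, so the planner can turn LIKE into a range scan on the B-tree.
SELECT prefix FROM erm_table WHERE prefix LIKE '00436641997142%';

-- Not indexable: the column itself holds the pattern; no range over prefix
-- can be derived from the constant on the left side of LIKE.
SELECT prefix FROM erm_table WHERE '00436641997142' LIKE prefix;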

2) Also, I installed Postgres 9.6 in order to use parallel sequential scans. In this case, a parallel scan is used only if the last part of the query is omitted. So, the query:

SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table WHERE '00436641997142' LIKE prefix 
AND company = 0  and 
((current_time between timefrom and timeto) or (timefrom is null and timeto is null))
 ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC LIMIT 1

uses parallel mode.

Any idea how to force the original query (which I cannot change):

SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM erm_table WHERE '00436641997142' LIKE prefix 
AND company = 0  and 
((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0  or "Day" is null )
 ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC LIMIT 1

to use a parallel sequential scan?
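For reference, here is a hedged sketch of 9.6 settings that make the planner more willing to choose a parallel plan, plus a catalog query to check whether any function used in the WHERE clause is marked parallel-restricted or parallel-unsafe, which keeps a plan serial; the values are illustrative only:

-- Make parallel plans look cheaper to the planner (illustrative values):
SET max_parallel_workers_per_gather = 4;
SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;

-- 9.6 marks parallel safety per function: 's' = safe, 'r' = restricted, 'u' = unsafe.
SELECT proname, proparallel
FROM pg_proc
WHERE proname IN ('now', 'to_char', 'strpos', 'char_length');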

Is there any reason why you are using LIKE instead of =? – Abelisto Aug 8 at 21:05

I cannot change it; it is in source code to which I don't have access. The query with "=" instead of "like" is 3x faster, but I cannot change it. – positive Aug 8 at 21:06

random_page_cost = 40: why that high? On an SSD ... – wildplasser Aug 8 at 21:08

The default was 4, and I increased it to see whether there is any difference. Basically the response time is the same as for 4 or 8. – positive Aug 8 at 21:10

And how many rows does it return? Try to optimize this part of the query. I don't know of any way to optimize conditions like const LIKE column (without changing the query/data structure/application logic). – Abelisto Aug 8 at 21:55
Accepted answer (awarded the +50 bounty):

It's hard to make an index for queries of the form string LIKE pattern because the wildcards (% and _) can appear anywhere in the pattern.

I can suggest one risky solution:

  1. Slightly redesign the table to make it indexable. Add two more columns, prefix_low and prefix_high, of fixed width; for example char(32), or any length sufficient for the task. Also add one smallint column for the prefix length. Fill them with the lowest and highest values matching the prefix, and with the prefix length. For example:

    select rpad(rtrim('00436641997142%','%'), 32, '0') AS prefix_low, rpad(rtrim('00436641997142%','%'), 32, '9') AS prefix_high, length(rtrim('00436641997142%','%')) AS prefix_length;
    
                prefix_low            |            prefix_high           | prefix_length
    ----------------------------------+----------------------------------+---------------
     00436641997142000000000000000000 | 00436641997142999999999999999999 |            14
    
  2. Make an index on these values:

    CREATE INDEX table_prefix_low_high_idx ON table (prefix_low, prefix_high);
    
  3. Check the modified queries against the table:

    SELECT prefix, changeprefix, deletelast, outgroup, tariff 
    FROM table 
    WHERE '00436641997142' BETWEEN prefix_low AND prefix_high
      AND company = 0  
      AND ((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0  or "Day" is null )
    ORDER BY prefix_length DESC 
    LIMIT 1
    

    Check how well it works with the indexes and try to tune it: add or remove an index on prefix_length, add it to the BETWEEN index, and so on.

  4. Now you need to rewrite the queries sent to the database. Install PgBouncer and the pgbouncer-rr patch. It allows you to rewrite queries on the fly with simple Python code, as in this example (the module is then registered in pgbouncer.ini; see the snippet after this list):

    import re
    
    def rewrite_query(username, query):
        # Match the application's fixed query shape and capture the dialled number.
        q1 = r"^SELECT .+ WHERE '(?P<id>\d+)' LIKE prefix\b.+ LIMIT 1"
        m = re.match(q1, query, re.DOTALL)
        if not m:
            return query  # nothing to do with other queries
        # Rebuild the query against the indexable prefix_low/prefix_high columns;
        # the remaining filters (company, time window, "Day") must be carried over too.
        new_query = ("SELECT prefix, changeprefix, deletelast, outgroup, tariff "
                     "FROM table "
                     "WHERE '{0}' BETWEEN prefix_low AND prefix_high "
                     "ORDER BY prefix_length DESC LIMIT 1").format(m.group('id'))
        return new_query
    
  5. Run PgBouncer and connect it to the DB. Try to issue the different queries your application sends and check how they get rewritten. Because you are dealing with text, you will have to tweak the regexps to match all required queries and rewrite them properly.

  6. When the proxy is ready and debugged, reconnect your application to PgBouncer.
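As mentioned in step 4, the rewrite module is registered with pgbouncer-rr through pgbouncer.ini. To the best of my knowledge the setting is called rewrite_query_py_module_file; the file name and path here are assumptions:

    [pgbouncer]
    ; assumed path to the Python module that defines rewrite_query()
    rewrite_query_py_module_file = ./rewrite_query.py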

Pro:

  • no changes to the application
  • no changes to the basic structure of the DB

Contra:

  • extra maintenance: you need triggers to keep the new columns filled with up-to-date data (see the sketch below)
  • extra tools to support
  • the rewrite uses regexps, so it is closely tied to the particular queries issued by your app. You need to run it for some time and build robust rewrite rules.
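The maintenance trigger could look roughly like this; a sketch only, reusing the columns from step 1 (the function and trigger names, and the 32-character width, are assumptions):

    CREATE OR REPLACE FUNCTION fill_prefix_bounds() RETURNS trigger AS $$
    BEGIN
        -- Derive the indexable bounds and the length from the pattern in "prefix".
        NEW.prefix_low    := rpad(rtrim(NEW.prefix, '%'), 32, '0');
        NEW.prefix_high   := rpad(rtrim(NEW.prefix, '%'), 32, '9');
        NEW.prefix_length := length(rtrim(NEW.prefix, '%'));
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;
    
    CREATE TRIGGER prefix_bounds_trg
        BEFORE INSERT OR UPDATE ON table
        FOR EACH ROW EXECUTE PROCEDURE fill_prefix_bounds();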

Further development: hijack the parsed query tree in PostgreSQL itself: https://wiki.postgresql.org/wiki/Query_Parsing

Many thanks for the idea and the answer; this will help me a lot. I will implement and test it; basically the idea is perfect. Thanks. – positive Aug 16 at 9:00

Instead of creating an additional column for the prefix length, you can create an index on the expression: CREATE INDEX table_prefix_length_idx ON table (char_length(prefix)) – dev7532 Aug 16 at 15:47

If I understand your problem correctly, creating a proxy server which rewrites queries could be a solution here.

Here is an example from another question.

Then you could change "LIKE" to "=" in your query, and it would run a lot faster.


You should change your index by adding the proper operator class. According to the documentation:

The operator classes text_pattern_ops, varchar_pattern_ops, and bpchar_pattern_ops support B-tree indexes on the types text, varchar, and char respectively. The difference from the default operator classes is that the values are compared strictly character by character rather than according to the locale-specific collation rules. This makes these operator classes suitable for use by queries involving pattern matching expressions (LIKE or POSIX regular expressions) when the database does not use the standard "C" locale. As an example, you might index a varchar column like this:

CREATE INDEX test_index ON test_table (col varchar_pattern_ops);

I've just added it with CREATE INDEX test_index ON erm_table (prefix bpchar_pattern_ops); since the prefix field is of character type. But the query speed is the same, ~1.7 s. – positive Aug 8 at 21:40

Have you run ANALYZE afterwards? Is it the right operator class? – Borys Aug 8 at 21:45

Have you dropped the old one? Set SET enable_seqscan = OFF; for testing. – Borys Aug 8 at 21:50

I've posted the EXPLAIN ANALYZE in the initial post; basically, the result is the same, and a seq scan is used although it is set to off (strange?). – positive Aug 8 at 22:00

OK, now I have read the whole query. The problem could be somewhere else; expressions like (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0 are black boxes for the optimizer. Maybe, just for verification, try to add a partial index with that in the WHERE clause: ((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0 or "Day" is null) – Borys Aug 8 at 22:30
