Why is PostgreSQL (9.1) not using index for simple equality select?

Question

My table lead has an index:

\d lead
...
Indexes:
    "lead_pkey" PRIMARY KEY, btree (id)
    "lead_account__c" btree (account__c)
    ...
    "lead_email" btree (email)
    "lead_id_prefix" btree (id text_pattern_ops)

Why doesn't PG (9.1) use the index for this straightforward equality selection? Emails are almost all unique.

db=> explain select * from lead where email = 'blah';
                         QUERY PLAN
------------------------------------------------------------
 Seq Scan on lead  (cost=0.00..319599.38 rows=1 width=5108)
   Filter: (email = 'blah'::text)
(2 rows)

Other index-hitting queries seem to be OK (though I don't know why this one doesn't just use the pkey index):

db=> explain select * from lead where id = '';
                                  QUERY PLAN
------------------------------------------------------------------------------
 Index Scan using lead_id_prefix on lead  (cost=0.00..8.57 rows=1 width=5108)
   Index Cond: (id = ''::text)
(2 rows)

db=> explain select * from lead where account__c = '';
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Index Scan using lead_account__c on lead  (cost=0.00..201.05 rows=49 width=5108)
   Index Cond: (account__c = ''::text)
(2 rows)

At first I thought it might be due to too few distinct values of email. For instance, if the stats claimed that email is 'blah' for most of the table, then a seq scan would be faster. But that's not the case:

db=> select count(*), count(distinct email) from lead;
 count  | count
--------+--------
 749148 | 733416
(1 row)
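
As a rough sanity check on those two counts, the average number of rows per distinct email can be computed directly. This is only a sketch of the arithmetic, using the figures from the output above; it comes out at roughly 1.02, which agrees with the rows=1 estimate in the plans.

```sql
-- Average rows per distinct email, from the counts reported above.
-- The cast to numeric avoids integer division.
SELECT 749148::numeric / 733416 AS avg_rows_per_email;
```

With so few rows per value, an index scan on email would normally be expected to look far cheaper to the planner than a scan of the whole table.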

Even if I force seq scans to be off, the planner behaves as if it has no other choice:

db=> set enable_seqscan = off;
SET
db=> show enable_seqscan;
 enable_seqscan
----------------
 off
(1 row)

db=> explain select * from lead where email = '[email protected]';
                            QUERY PLAN
---------------------------------------------------------------------------
 Seq Scan on lead  (cost=10000000000.00..10000319599.38 rows=1 width=5108)
   Filter: (email = '[email protected]'::text)
(2 rows)
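
A note on reading that cost: enable_seqscan = off does not forbid sequential scans, it only adds a very large constant penalty to their estimated cost. The startup cost of 10000000000.00 shown above is that penalty. The subtraction below is a sketch, assuming the penalty is exactly the 1e10 that appears as the startup cost.

```sql
-- Remove the assumed 1e10 penalty from the penalized upper cost.
-- Compare the result with the upper cost of the first EXPLAIN (319599.38).
SELECT 10000319599.38 - 10000000000.00 AS seq_scan_cost_without_penalty;
```

Because the penalized seq scan still wins, the planner apparently did not consider any index path for this query at all, rather than considering lead_email and judging it too expensive.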

Also tried EXPLAIN ANALYZE:

db=> explain analyze select * from lead where email = '[email protected]';
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Seq Scan on lead  (cost=10000000000.00..10000319732.76 rows=1 width=5102) (actual time=77845.244..77845.244 rows=0 loops=1)
   Filter: (email = '[email protected]'::text)
 Total runtime: 77857.215 ms
(3 rows)

Here is the \d output (sorry, have to obscure the column names, and cropped to fit in SO's limits; see uncropped version at http://pastebin.com/ve3gzJpY):

                                 Table "lead"
                   Column                   |            Type             | Modifiers 
--------------------------------------------+-----------------------------+-----------
 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | real                        | 
 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | text                        | 
 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | boolean                     | 
 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | text                        | 
 ...
 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | text                        | 
 email                                      | text                        | 
 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | boolean                     | 
 ...
 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | text                        | 
 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | text                        | 
 account__c                                 | text                        | 
 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | text                        | 
 ...
 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | text                        | 
 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | text                        | 
 id                                         | text                        | not null
 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | real                        | 
 ...
 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | timestamp without time zone | 
 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | real                        | 
Indexes:
    "lead_pkey" PRIMARY KEY, btree (id)
    "lead_account__c" btree (account__c)
    "lead_XXXXXXXXXXXXXXXXXXXXXX" btree (XXXXXXXXXXXXXXXXXXXXXX)
    "lead_XXXXXXXXXXXXXXXXXXXXXX" btree (XXXXXXXXXXXXXXXXXXXXXX)
    "lead_XXXXXXXXXXXXXXXXXXXXXX" btree (XXXXXXXXXXXXXXXXXXXXXX)
    "lead_email" btree (email)
    "lead_id_prefix" btree (id text_pattern_ops)

Here is pg_dump --schema-only -t lead (again see uncropped at http://pastebin.com/ve3gzJpY, with unique column names as well in case it helps reproducibility):

--
-- PostgreSQL database dump
--

SET statement_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = on;
SET check_function_bodies = false;
SET client_min_messages = warning;

SET default_tablespace = '';

SET default_with_oids = false;

--
-- Name: lead; Type: TABLE; Schema: public; Owner: pod; Tablespace: 
--

CREATE TABLE lead (
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX real,
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX text,
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX boolean,
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX text,
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX text,
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX date,
    ...
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX text,
    account__c text,
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX text,
    ...
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX text,
    id text NOT NULL,
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX real,
    ...
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX timestamp without time zone,
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX real
);


ALTER TABLE lead OWNER TO pod;

--
-- Name: lead_pkey; Type: CONSTRAINT; Schema: public; Owner: pod; Tablespace: 
--

ALTER TABLE ONLY lead
    ADD CONSTRAINT lead_pkey PRIMARY KEY (id);


--
-- Name: lead_account__c; Type: INDEX; Schema: public; Owner: pod; Tablespace: 
--

CREATE INDEX lead_account__c ON lead USING btree (account__c);


--
-- Name: lead_XXXXXXXXXXXXXXXXXXXX; Type: INDEX; Schema: public; Owner: pod; Tablespace: 
--

CREATE INDEX lead_XXXXXXXXXXXXXXXXXXXX ON lead USING btree (XXXXXXXXXXXXXXXXXXXX);


--
-- Name: lead_XXXXXXXXXXXXXXXXXXXX; Type: INDEX; Schema: public; Owner: pod; Tablespace: 
--

CREATE INDEX lead_XXXXXXXXXXXXXXXXXXXX ON lead USING btree (XXXXXXXXXXXXXXXXXXXX);


--
-- Name: lead_XXXXXXXXXXXXXXXXXXXX; Type: INDEX; Schema: public; Owner: pod; Tablespace: 
--

CREATE INDEX lead_XXXXXXXXXXXXXXXXXXXX ON lead USING btree (XXXXXXXXXXXXXXXXXXXX);


--
-- Name: lead_email; Type: INDEX; Schema: public; Owner: pod; Tablespace: 
--

CREATE INDEX lead_email ON lead USING btree (email);


--
-- Name: lead_id_prefix; Type: INDEX; Schema: public; Owner: pod; Tablespace: 
--

CREATE INDEX lead_id_prefix ON lead USING btree (id text_pattern_ops);


--
-- PostgreSQL database dump complete
--

Some PG catalog incantation:

db=> select * from pg_index where indexrelid = 'lead_email'::regclass;
 indexrelid | indrelid  | indnatts | indisunique | indisprimary | indisexclusion | indimmediate | indisclustered | indisvalid | indcheckxmin | indisready | indkey | indcollation | indclass | indoption | indexprs | indpred
------------+-----------+----------+-------------+--------------+----------------+--------------+----------------+------------+--------------+------------+--------+--------------+----------+-----------+----------+---------
  215251995 | 101034456 |        1 | f           | f            | f              | t            | f              | t          | t            | t          | 101    | 100          | 10043    | 0         | ¤        | ¤
(1 row)
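
One thing worth checking in this row: indisvalid and indisready are both t, but indcheckxmin is also t. According to the pg_index catalog documentation, when indcheckxmin is true, queries must not use the index until the xmin of this pg_index row is below their TransactionXmin horizon, because the table may contain broken HOT chains. That is a possible explanation here, not a confirmed one. The queries below are a sketch for inspecting it on 9.1; the column names procpid and current_query are the 9.1 names in pg_stat_activity (later releases renamed them).

```sql
-- Flags that decide whether the planner may use each index on lead,
-- plus the xmin of the pg_index row itself.
SELECT c.relname AS index_name,
       i.indisvalid,
       i.indisready,
       i.indcheckxmin,
       i.xmin AS pg_index_row_xmin
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE i.indrelid = 'lead'::regclass;

-- Oldest open transactions; a long-running one can hold the xmin horizon back.
SELECT procpid, xact_start, current_query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 5;
```

If this is the cause, the index would become usable again without any schema change once the old transactions have finished.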

Some locale info:

db=> show lc_collate;
 lc_collate  
-------------
 en_US.UTF-8
(1 row)

db=> show lc_ctype;
  lc_ctype   
-------------
 en_US.UTF-8
(1 row)

I searched over a good number of past SO questions but none were about a simple equality query like this one.

Weird ... simple equality shouldn't need a text_pattern_ops index, so this is hard to explain. Can you reproduce this in a small sample? If so, post to sqlfiddle.com and link here. — Craig Ringer, Apr 12 '13 at 9:24
Please show the full table definition (preferably via pg_dump). — Peter Eisentraut, Apr 12 '13 at 12:15
@PeterEisentraut Updated the question with \d and pg_dump schema. — Yang, Apr 12 '13 at 17:09
@CraigRinger I will try my best to repro but that may take a while - there's a lot of data in this table containing sensitive customer information. — Yang, Apr 12 '13 at 17:10
Your column name mangling wasn't very well executed: you ended up with all columns being named the same. This makes it hard to reproduce your scenario, because the CREATE INDEX statements refer to ambiguous names (not to mention that CREATE TABLE itself fails because of duplicate column names). It would be a lot better to use a different name for each column. Also, what are the lc_collate and lc_ctype settings? Those might be important for a reproducer. (The index has indcollation=100, which means "the default collation".) Anyway, a nondefault collation would show up as a "modifier". — alvherre, Apr 12 '13 at 17:44

Mike Sherrill 'Cat Recall' · Answer 1 · 2013-04-12 17:38:23Z


CREATE INDEX lead_id_prefix ON lead USING btree (id text_pattern_ops);

The use of text_pattern_ops seems odd here. If your ID is some kind of integer, I'd try dropping this index as a test. (I wouldn't hesitate to drop this index on a development server.) Since you have another btree index on "lead.id", I'd expect dropping this index to cajole the optimizer into using the other index on "lead.id".

If that proves to be true, then I'll try to dig deeper into the causes.


Note that the problem query is on column email, not on id. — alvherre, Apr 12 '13 at 17:45
Ha! I read right past "where email = 'blah'" to "where id = ''", and "id" stuck in my head! — Mike Sherrill 'Cat Recall', Apr 12 '13 at 20:13


Chris Travers · Answer 2 · 2013-05-16 04:50:00Z

To troubleshoot problems like this, it is imperative that you run VACUUM ANALYSE on the table between troubleshooting steps to see what works. Otherwise you may not know exactly what changed where. So run it first, try the query again, and see if that fixes the problem.

The next steps to try (run VACUUM ANALYSE and a test case between each one) are:

ALTER TABLE lead ALTER COLUMN email SET STATISTICS 1000;

Maybe that will fix it. Maybe not.

If that doesn't fix it, take a close look at the pg_stats view:

SELECT * FROM pg_stats WHERE tablename = 'lead';

Please read the following thoroughly and see whether anything looks amiss in pg_stats:

http://www.postgresql.org/docs/9.0/static/planner-stats.html

EDIT: To be very clear, vacuum analyse is not the whole of the troubleshooting. However it MUST be run in between troubleshooting steps because otherwise you can't be sure the planner is taking into account correct data.
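
The steps above can be sketched as one session. This is only an outline under the assumption that the test case is the EXPLAIN from the question; the pg_stats columns selected at the end are the ones most likely to matter for an equality filter.

```sql
-- Step 0: refresh statistics, then re-check the plan.
VACUUM ANALYZE lead;
EXPLAIN SELECT * FROM lead WHERE email = 'blah';

-- Step 1: raise the statistics target for email, re-analyze, re-check.
ALTER TABLE lead ALTER COLUMN email SET STATISTICS 1000;
VACUUM ANALYZE lead;
EXPLAIN SELECT * FROM lead WHERE email = 'blah';

-- Step 2: look at what the planner believes about the email column.
SELECT attname, null_frac, n_distinct,
       most_common_vals, most_common_freqs, correlation
FROM pg_stats
WHERE tablename = 'lead'
  AND attname = 'email';
```

For a nearly unique column, n_distinct is usually stored as a negative number close to -1, meaning a fraction of the row count rather than an absolute count. A small positive value, or a single entry dominating most_common_freqs, would be a sign that the statistics are misleading the planner.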

The problem had gone away as mysteriously as it came, but it's worth noting that I already did try running VACUUM ANALYZE - a lot. — Yang, May 15 '13 at 21:19
The point is to run it between troubleshooting steps to make sure that the analyzer is making use of current settings etc. — Chris Travers, May 16 '13 at 3:15

