
I have a table with about 10 million rows and an index on a date field. When I try to extract the unique values of the indexed field, Postgres runs a sequential scan even though the result set has only 26 items. Why is the optimiser picking this plan, and what can I do to avoid it?

From other answers I suspect this is as much related to the query as to the index.

explain select "labelDate" from pages group by "labelDate";
                              QUERY PLAN
-----------------------------------------------------------------------
 HashAggregate  (cost=524616.78..524617.04 rows=26 width=4)
   Group Key: "labelDate"
   ->  Seq Scan on pages  (cost=0.00..499082.42 rows=10213742 width=4)
(3 rows)

Table structure:

http=# \d pages
                                       Table "public.pages"
     Column      |          Type          |        Modifiers
-----------------+------------------------+----------------------------------
 pageid          | integer                | not null default nextval('...
 createDate      | integer                | not null
 archive         | character varying(16)  | not null
 label           | character varying(32)  | not null
 wptid           | character varying(64)  | not null
 wptrun          | integer                | not null
 url             | text                   |
 urlShort        | character varying(255) |
 startedDateTime | integer                |
 renderStart     | integer                |
 onContentLoaded | integer                |
 onLoad          | integer                |
 PageSpeed       | integer                |
 rank            | integer                |
 reqTotal        | integer                | not null
 reqHTML         | integer                | not null
 reqJS           | integer                | not null
 reqCSS          | integer                | not null
 reqImg          | integer                | not null
 reqFlash        | integer                | not null
 reqJSON         | integer                | not null
 reqOther        | integer                | not null
 bytesTotal      | integer                | not null
 bytesHTML       | integer                | not null
 bytesJS         | integer                | not null
 bytesCSS        | integer                | not null
 bytesImg        | integer                | not null
 bytesFlash      | integer                | not null
 bytesJSON       | integer                | not null
 bytesOther      | integer                | not null
 numDomains      | integer                | not null
 labelDate       | date                   |
 TTFB            | integer                |
 reqGIF          | smallint               | not null
 reqJPG          | smallint               | not null
 reqPNG          | smallint               | not null
 reqFont         | smallint               | not null
 bytesGIF        | integer                | not null
 bytesJPG        | integer                | not null
 bytesPNG        | integer                | not null
 bytesFont       | integer                | not null
 maxageMore      | smallint               | not null
 maxage365       | smallint               | not null
 maxage30        | smallint               | not null
 maxage1         | smallint               | not null
 maxage0         | smallint               | not null
 maxageNull      | smallint               | not null
 numDomElements  | integer                | not null
 numCompressed   | smallint               | not null
 numHTTPS        | smallint               | not null
 numGlibs        | smallint               | not null
 numErrors       | smallint               | not null
 numRedirects    | smallint               | not null
 maxDomainReqs   | smallint               | not null
 bytesHTMLDoc    | integer                | not null
 fullyLoaded     | integer                |
 cdn             | character varying(64)  |
 SpeedIndex      | integer                |
 visualComplete  | integer                |
 gzipTotal       | integer                | not null
 gzipSavings     | integer                | not null
 siteid          | numeric                |
Indexes:
    "pages_pkey" PRIMARY KEY, btree (pageid)
    "pages_date_url" UNIQUE CONSTRAINT, btree ("urlShort", "labelDate")
    "idx_pages_cdn" btree (cdn)
    "idx_pages_labeldate" btree ("labelDate") CLUSTER
    "idx_pages_urlshort" btree ("urlShort")
Triggers:
    pages_label_date BEFORE INSERT OR UPDATE ON pages
      FOR EACH ROW EXECUTE PROCEDURE fix_label_date()
    
Can you add table structures? –  Tom V Jun 30 at 12:51
    
Do you need the whole structure? It's a very wide table effectively based on this: httparchive.org/downloads/httparchive_schema.sql labelDate is added as a normalised date of the label field for just this kind of query. –  Charlie Clark Jun 30 at 12:54
    
@ErwinBrandstetter the query is actually designed to obtain the range of dates from this denormalised table (this is then put in a lookup table and subsequent queries use the index). But even when adding a BETWEEN condition the sequential scan occurs. I'm using Postgres 9.4.1 on Mac OS. –  Charlie Clark Jul 1 at 6:48
    
I get that. What do you know about the possible result before you run the query? Can there be future dates? Can there be dates from 1900? Are dates sequential? If we can whip up a provisional lookup-table with a (not too big) superset of possible dates, we might have a simpler query. Failing that, emulating a loose index scan like ypercube provided is your best bet. –  Erwin Brandstetter Jul 1 at 7:05
    
Yes, the date range is effectively known (> 2010, < today). This is a head-banging data set. Getting the range would be something like select "labelDate" from pages where pageid in (select max(pageid) from pages) or pageid in (select min(pageid) from pages); Though seems just as slow. –  Charlie Clark Jul 1 at 7:23

3 Answers

Accepted answer

This is a known limitation of the Postgres optimizer. If the distinct values are few, as in your case, and you are on version 8.4 or later, there is a very fast workaround using a recursive query, described in the PostgreSQL wiki under Loose indexscan.

Your query could be rewritten as follows (LATERAL requires version 9.3 or later):

WITH RECURSIVE pa AS 
( ( SELECT "labelDate" FROM pages ORDER BY "labelDate" LIMIT 1 ) 
  UNION ALL
    SELECT n."labelDate" 
    FROM pa AS p
         , LATERAL 
              ( SELECT "labelDate" 
                FROM pages 
                WHERE "labelDate" > p."labelDate" 
                ORDER BY "labelDate" 
                LIMIT 1
              ) AS n
) 
SELECT "labelDate" 
FROM pa ;
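For a quick sanity check outside Postgres, the same loose-scan idea can be exercised with Python's bundled SQLite driver. This is a sketch, not the production query: SQLite also supports recursive CTEs but has no LATERAL, so the recursive step uses a correlated min() subquery instead, and the table contents are made-up data shaped like the question's (many rows per date, few distinct dates).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE pages ("labelDate" TEXT)')
# Many rows per date, only three distinct dates -- the shape that
# makes a loose scan beat a full GROUP BY.
rows = ([("2015-06-01",)] * 1000 + [("2015-06-15",)] * 1000
        + [("2015-07-01",)] * 1000)
conn.executemany('INSERT INTO pages VALUES (?)', rows)
conn.execute('CREATE INDEX idx_pages_labeldate ON pages ("labelDate")')

# Loose scan: start at min(), then repeatedly fetch the next-larger
# value until the correlated subquery comes back NULL.
distinct_dates = [r[0] for r in conn.execute('''
    WITH RECURSIVE pa AS (
        SELECT min("labelDate") AS d FROM pages
        UNION ALL
        SELECT (SELECT min("labelDate") FROM pages WHERE "labelDate" > pa.d)
        FROM pa
        WHERE pa.d IS NOT NULL
    )
    SELECT d FROM pa WHERE d IS NOT NULL
''')]
print(distinct_dates)  # ['2015-06-01', '2015-06-15', '2015-07-01']
```

The recursion emits one row per distinct value plus one terminating NULL row, which the outer WHERE filters away.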

Erwin Brandstetter has a thorough explanation and several variations of the query in this answer (on a related but different issue): Optimize GROUP BY query to retrieve latest record per user

    
Thanks very much. That describes things perfectly. It's an ugly query but always nice to learn these things! Query does indeed run much faster. –  Charlie Clark Jun 30 at 13:49
    
Nice one, v. elegant. I'm going to have to learn about LATERAL. +1 –  Vérace Jun 30 at 13:58
    
For just the one column in the result, the simpler variant with a correlated subquery might be the better choice here. Chapter 1b in the referenced answer. And it seems like the OP uses quoted, mixed-case identifiers: "labelDate". –  Erwin Brandstetter Jun 30 at 21:55

The best query very much depends on the data distribution.

You have many rows per date; that's been established. Since your case boils down to only 26 values in the result, all of these solutions will be blazingly fast as soon as the index is used. The partial index below will be a bit faster if you have many NULL values.
For more distinct values it would get more interesting.

There is no need to involve pageid at all (like you commented).

Index

All you need is a plain btree index on "labelDate".
With more than a few NULL values in the column, a partial index helps even more (and is smaller):

CREATE INDEX pages_labeldate_nonull_idx ON pages ("labelDate")
WHERE  "labelDate" IS NOT NULL;

Assuming this for the rest of my answer.

You later clarified:

0% NULL but only after fixing things up when importing.

The partial index may still make sense to rule out intermediary states of rows with NULL values. It would avoid needless updates to the index (and the resulting bloat).
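As a quick sanity check that the partial-index idea carries over, the CREATE INDEX above can be tried out directly in SQLite (which also supports partial indexes) via Python's bundled driver. TEXT stands in for the date type, and the data is made up; a predicate matching the index's WHERE clause lets the planner use it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE pages ("labelDate" TEXT)')
# Half the rows NULL: exactly the rows the partial index leaves out.
conn.executemany('INSERT INTO pages VALUES (?)',
                 [(None,)] * 50 + [("2015-06-01",)] * 50)
conn.execute('''CREATE INDEX pages_labeldate_nonull_idx ON pages ("labelDate")
                WHERE "labelDate" IS NOT NULL''')

# The IS NOT NULL predicate matches the index condition, so the planner
# can pick the (smaller) partial index.
plan = conn.execute('''EXPLAIN QUERY PLAN
    SELECT count(*) FROM pages WHERE "labelDate" IS NOT NULL''').fetchall()
print(plan[0][-1])  # e.g. a SEARCH/SCAN line naming pages_labeldate_nonull_idx

non_null = conn.execute(
    'SELECT count(*) FROM pages WHERE "labelDate" IS NOT NULL').fetchone()[0]
print(non_null)  # 50
```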

Query

Based on a provisional range

If your dates appear in a continuous range with not too many gaps, we can use the nature of the data type date to our advantage. There's only a finite, countable number of values between two given values. If the gaps are few, this will be fastest:

SELECT d."labelDate"
FROM   (
   SELECT generate_series(min("labelDate")
                        , max("labelDate")
                        , interval '1 day')::date AS "labelDate"
   FROM   pages
   ) d
WHERE EXISTS (SELECT 1 FROM pages p WHERE p."labelDate" = d."labelDate");

min() and max() can be picked from the index cheaply. If you know the minimum and / or maximum possible date, it gets a bit cheaper still. But only if your "knowledge" is accurate. Example:

SELECT d."labelDate"
FROM  (SELECT date '2011-1-1' + g AS "labelDate"
       FROM   generate_series(0, now()::date  - date '2011-1-1' - 1) g) d
WHERE  EXISTS (SELECT 1 FROM pages p WHERE p."labelDate" = d."labelDate");

Or, for an immutable interval:

SELECT d."labelDate"
FROM  (SELECT date '2011-1-1' + g AS "labelDate"
       FROM generate_series(0, 363) g) d
WHERE  EXISTS (SELECT 1 FROM pages p WHERE p."labelDate" = d."labelDate");
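The provisional-range approach can also be sketched outside Postgres. The version below runs against SQLite via Python's bundled driver: SQLite has no generate_series() or date type, so a recursive CTE enumerates the candidate days as ISO strings, and the data is invented (one gap in the range) to show the EXISTS filter at work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE pages ("labelDate" TEXT)')
# Two distinct dates with a one-day gap between them.
conn.executemany('INSERT INTO pages VALUES (?)',
                 [("2015-06-01",)] * 500 + [("2015-06-03",)] * 500)
conn.execute('CREATE INDEX idx_pages_labeldate ON pages ("labelDate")')

# Enumerate every day between min and max, then keep only the days
# that actually occur in the table.
present = [r[0] for r in conn.execute('''
    WITH RECURSIVE d("labelDate") AS (
        SELECT min("labelDate") FROM pages
        UNION ALL
        SELECT date("labelDate", '+1 day') FROM d
        WHERE "labelDate" < (SELECT max("labelDate") FROM pages)
    )
    SELECT "labelDate" FROM d
    WHERE EXISTS (SELECT 1 FROM pages p WHERE p."labelDate" = d."labelDate")
''')]
print(present)  # ['2015-06-01', '2015-06-03']
```

The generated series contains '2015-06-02' as well, but the EXISTS probe discards it because no row carries that date.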

Loose index scan

This performs very well with any distribution of dates (as long as we have many rows per date). Basically what @ypercube already provided. But there are some fine points and we need to make sure our favorite index can be used everywhere.

WITH RECURSIVE p AS (
   (
   SELECT "labelDate"
   FROM   pages
   WHERE  "labelDate" IS NOT NULL
   ORDER  BY "labelDate"
   LIMIT  1
   ) 
   UNION ALL
   SELECT (SELECT "labelDate" 
           FROM   pages 
           WHERE  "labelDate" > p."labelDate" 
           ORDER  BY "labelDate" 
           LIMIT  1)
   FROM   p
   WHERE  "labelDate" IS NOT NULL
   ) 
SELECT "labelDate" 
FROM   p
WHERE  "labelDate" IS NOT NULL;
  • The first CTE p is effectively the same as

    SELECT min("labelDate") FROM pages
    

    But the verbose form makes sure our partial index is used. Plus, this form is typically a bit faster in my experience (and in my tests). The added parentheses are required to allow a LIMIT in an individual SELECT of a UNION (ALL) query.

  • Since we are only fetching a single column, correlated subqueries in the recursive term of the rCTE should be a bit faster. This requires excluding rows that end up with NULL for "labelDate".


Asides

Unquoted, legal, lower-case identifiers make your life easier.
Reordering the columns in your table definition could buy you smaller on-disk storage.

    
Thank you very much for the detailed explanation. It provides a lot of insight into the process. Of course, the question behind my question is: why isn't the query planner smart enough to do this itself? Your answer contains some information about this. Regarding the non-NULL index I'm not sure if this is possible: this data is being imported from MySQL where the column does not exist (it is added in Postgres using a trigger). But I suspect I could write an equivalent functional index based on the original text field format like 'Jun 15 2015'. The import also explains the camelCase names. –  Charlie Clark Jul 2 at 7:00

From the PostgreSQL documentation:

CLUSTER can re-sort the table using either an index scan on the specified index, or (if the index is a b-tree) a sequential scan followed by sorting. It will attempt to choose the method that will be faster, based on planner cost parameters and available statistical information.

Your index on "labelDate" is a btree.

Reference:

http://www.postgresql.org/docs/9.1/static/sql-cluster.html

    
Even a condition such as `WHERE "labelDate" BETWEEN '2000-01-01' AND '2020-01-01'` still results in a sequential scan. –  Charlie Clark Jun 30 at 12:58
    
OK. As asked in the question, could you post the output of \d pages? –  Fabrizio Mazzoni Jun 30 at 12:59
    
Clustering at the moment (though the data was entered roughly in that order). That still doesn't really explain the query planner's decision not to use an index even with a WHERE clause. –  Charlie Clark Jun 30 at 13:16
    
Have you also tried disabling sequential scans for the session? set enable_seqscan=off. In any case the documentation is clear: if you cluster, it will perform a sequential scan. –  Fabrizio Mazzoni Jun 30 at 13:20
    
Yes, I tried disabling the sequential scan but it didn't make much difference. The speed of this query isn't actually crucial as I use it to create a lookup table which can then be used for JOINS in real queries. –  Charlie Clark Jun 30 at 13:23
