Sign up ×
Stack Overflow is a community of 4.7 million programmers, just like you, helping each other. Join them; it only takes a minute:

I'm using PostgreSQL 9.4, and I have a table with 13 million rows and with data roughly as follows:

  a  | b | u  | t 
-----+---+----+----
 foo | 1 |  1 | 10
 foo | 1 |  2 | 11
 foo | 1 |  2 | 11
 foo | 2 |  4 | 1
 foo | 3 |  5 | 2
 bar | 1 |  6 | 2
 bar | 2 |  7 | 2
 bar | 2 |  8 | 3
 bar | 3 |  9 | 4
 bar | 4 | 10 | 5
 bar | 5 | 11 | 6
 baz | 1 | 12 | 1
 baz | 1 | 13 | 2
 baz | 1 | 13 | 2
 baz | 1 | 13 | 3

There are indices on md5(a), on b, and on (md5(a), b). (In reality, a may contain values longer than 4k chars.) There is also a primary key column of type SERIAL which I have omitted above.

I'm trying to build a query which will return the following results:

  a  | b | u  | t  | z 
-----+---+----+----+---
 foo | 1 |  1 | 10 | 3
 foo | 1 |  2 | 11 | 3
 foo | 2 |  4 | 1  | 3
 foo | 3 |  5 | 2  | 3
 bar | 1 |  6 | 2  | 5
 bar | 2 |  7 | 2  | 5
 bar | 2 |  8 | 3  | 5
 bar | 3 |  9 | 4  | 5
 bar | 4 | 10 | 5  | 5
 bar | 5 | 11 | 6  | 5

In these results, all rows are deduplicated as if GROUP BY a, b, u, t were applied, z is a count of distinct values of b for every partition over a, and only rows with a z value greater than 2 are included.

I can get just the z filter working as follows:

SELECT a, COUNT(b) AS z from (SELECT DISTINCT a, b FROM t) AS foo GROUP BY a
  HAVING COUNT(b) > 2;

However, I'm stumped on combining this with the rest of the data in the table.

What's the most efficient way to do this?

share|improve this question
2  
Can you elaborate what should happen with u and t? You list them in the output but you don't indicate how they should be aggregated. – Patrick Mar 2 '15 at 6:15
    
I assume foo | 4 in the result must really be foo | 3 (matching the given data) and having bar | 2) twice is incorrect and the second instance should be removed? A proper table definition would be helpful (\d tbl in psql) - showing constraints, data types, indices, etc. – Erwin Brandstetter Mar 2 '15 at 9:20
    
@Patrick u and t should be aggregated using GROUP BY/DISTINCT semantics, the same as a and b. – Ray Mar 2 '15 at 12:54
    
@ErwinBrandstetter foo | 4 in the results was indeed a mistake, thanks. Fixed. However, bar | 2 twice is correct because the two bar | 2 rows have different u and t values, which I want to capture. – Ray Mar 2 '15 at 12:56
    
@Ray: Then why is there no row foo | 1 | 2 | 11 | 3 in the result? It differs on t. – Erwin Brandstetter Mar 2 '15 at 12:58
up vote 1 down vote accepted

Your first step can be simpler already:

SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
FROM   t
GROUP  BY 1
HAVING count(DISTINCT b) > 2;

For the integrated query I would take a slightly different approach (aiming for index support and an emulated "loose index scan"):

SELECT t.*, a.z
FROM  (
   SELECT md5(a) AS md5_a, b, count(*) OVER (PARTITION BY a) AS z
   FROM   t
   GROUP  BY 1, 2
   ) a
, LATERAL (
   SELECT *
   FROM   t
   WHERE  md5(a) = a.md5_a
   AND    b = a.b
   ORDER  BY u, t
   LIMIT  1
  ) t
WHERE a.z > 2;

Assuming (for lack of information) that you want the row with the smallest (u, t) for each set of duplicates on (a, b).

Working with md5(a) in place of a, since a can obviously be very long, and you already have an index on md5(a) etc.

The count(*) OVER ... works in a single query level because window functions are applied after aggregation:

Since your table is big, you need an efficient query. This should be among the fastest possible solutions - with adequate index support. Your index on (md5(a), b) is instrumental, but an index on (md5(a), b, u, t) would be even better for the second step of the query (the lateral join).

Detailed explanation (especially chapter 2. JOIN LATERAL):

For updated question: unique on (a, b, u, t)

That's actually a lot simpler:

SELECT DISTINCT ON (md5(t.a), b, u, t)
       t.a, t.b, t.u, t.t, a.z
FROM  (
   SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
   FROM   t
   GROUP  BY 1
   HAVING count(DISTINCT b) > 2
   ) a
JOIN   t ON md5(t.a) = md5_a;

I am avoiding to GROUP BY the big column again (which would be expensive).

About DISTINCT ON:

share|improve this answer
    
Thanks for your help! – Ray Mar 2 '15 at 15:57

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.