PostgreSQL distinct rows joined with a count of distinct values in one column

Question

I'm using PostgreSQL 9.4, and I have a table with 13 million rows and with data roughly as follows:

  a  | b | u  | t 
-----+---+----+----
 foo | 1 |  1 | 10
 foo | 1 |  2 | 11
 foo | 1 |  2 | 11
 foo | 2 |  4 | 1
 foo | 3 |  5 | 2
 bar | 1 |  6 | 2
 bar | 2 |  7 | 2
 bar | 2 |  8 | 3
 bar | 3 |  9 | 4
 bar | 4 | 10 | 5
 bar | 5 | 11 | 6
 baz | 1 | 12 | 1
 baz | 1 | 13 | 2
 baz | 1 | 13 | 2
 baz | 1 | 13 | 3

There are indices on md5(a), on b, and on (md5(a), b). (In reality, a may contain values longer than 4k chars.) There is also a primary key column of type SERIAL which I have omitted above.

I'm trying to build a query which will return the following results:

  a  | b | u  | t  | z 
-----+---+----+----+---
 foo | 1 |  1 | 10 | 3
 foo | 1 |  2 | 11 | 3
 foo | 2 |  4 | 1  | 3
 foo | 3 |  5 | 2  | 3
 bar | 1 |  6 | 2  | 5
 bar | 2 |  7 | 2  | 5
 bar | 2 |  8 | 3  | 5
 bar | 3 |  9 | 4  | 5
 bar | 4 | 10 | 5  | 5
 bar | 5 | 11 | 6  | 5

In these results, all rows are deduplicated as if GROUP BY a, b, u, t were applied, z is a count of distinct values of b for every partition over a, and only rows with a z value greater than 2 are included.

I can get just the z filter working as follows:

SELECT a, COUNT(b) AS z from (SELECT DISTINCT a, b FROM t) AS foo GROUP BY a
  HAVING COUNT(b) > 2;

However, I'm stumped on combining this with the rest of the data in the table.

What's the most efficient way to do this?

Can you elaborate what should happen with u and t? You list them in the output but you don't indicate how they should be aggregated. — Patrick, Mar 2 '15 at 6:15
I assume foo | 4 in the result must really be foo | 3 (matching the given data) and having bar | 2) twice is incorrect and the second instance should be removed? A proper table definition would be helpful (\d tbl in psql) - showing constraints, data types, indices, etc. — Erwin Brandstetter, Mar 2 '15 at 9:20
@Patrick u and t should be aggregated using GROUP BY/DISTINCT semantics, the same as a and b. — Ray, Mar 2 '15 at 12:54
@ErwinBrandstetter foo | 4 in the results was indeed a mistake, thanks. Fixed. However, bar | 2 twice is correct because the two bar | 2 rows have different u and t values, which I want to capture. — Ray, Mar 2 '15 at 12:56
@Ray: Then why is there no row foo | 1 | 2 | 11 | 3 in the result? It differs on t. — Erwin Brandstetter, Mar 2 '15 at 12:58

Erwin Brandstetter · Accepted Answer · 2015-03-02 15:14:01Z

Your first step can be simpler already:

SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
FROM   t
GROUP  BY 1
HAVING count(DISTINCT b) > 2;

For the integrated query I would take a slightly different approach (aiming for index support and an emulated "loose index scan"):

SELECT t.*, a.z
FROM  (
   SELECT md5(a) AS md5_a, b, count(*) OVER (PARTITION BY a) AS z
   FROM   t
   GROUP  BY 1, 2
   ) a
, LATERAL (
   SELECT *
   FROM   t
   WHERE  md5(a) = a.md5_a
   AND    b = a.b
   ORDER  BY u, t
   LIMIT  1
  ) t
WHERE a.z > 2;

Assuming (for lack of information) that you want the row with the smallest (u, t) for each set of duplicates on (a, b).

Working with md5(a) in place of a, since a can obviously be very long, and you already have an index on md5(a) etc.

The count(*) OVER ... works in a single query level because window functions are applied after aggregation:

Best way to get result count before LIMIT was applied in PHP/PostgreSQL

Since your table is big, you need an efficient query. This should be among the fastest possible solutions - with adequate index support. Your index on (md5(a), b) is instrumental, but an index on (md5(a), b, u, t) would be even better for the second step of the query (the lateral join).

Detailed explanation (especially chapter 2. JOIN LATERAL):

Optimize GROUP BY query to retrieve latest record per user

For updated question: unique on `(a, b, u, t)`

That's actually a lot simpler:

SELECT DISTINCT ON (md5(t.a), b, u, t)
       t.a, t.b, t.u, t.t, a.z
FROM  (
   SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
   FROM   t
   GROUP  BY 1
   HAVING count(DISTINCT b) > 2
   ) a
JOIN   t ON md5(t.a) = md5_a;

I am avoiding to GROUP BY the big column again (which would be expensive).

About DISTINCT ON:

Select first row in each GROUP BY group?

asked	10 months ago
viewed	160 times
active	10 months ago

current community

your communities

more stack exchange communities

PostgreSQL distinct rows joined with a count of distinct values in one column

1 Answer 1

For updated question: unique on `(a, b, u, t)`

Your Answer

Not the answer you're looking for? Browse other questions tagged sql postgresql count distinct aggregate-functions or ask your own question.

Linked

Hot Network Questions

current community

your communities

more stack exchange communities

PostgreSQL distinct rows joined with a count of distinct values in one column

1 Answer 1

For updated question: unique on (a, b, u, t)

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged sql postgresql count distinct aggregate-functions or ask your own question.

Linked

Related

Hot Network Questions

For updated question: unique on `(a, b, u, t)`