PostgreSQL: FULL OUTER JOIN on RANGE datatypes

Question

I have several tables with partially overlapping TSTZRANGE values and I need to JOIN them such that "breakpoints" are created for each UPPER and LOWER boundary for each range.

Some sample data:

CREATE TABLE foo (
    foo_id SERIAL PRIMARY KEY,
    foo TEXT,
    active_timespan TSTZRANGE
);
INSERT INTO 
    foo (foo, active_timespan)
VALUES
    ('One',TSTZRANGE('2015-01-01 00:00:00-00','2015-01-01 06:00:00-00')),
    ('Two',TSTZRANGE('2015-01-01 12:00:00-00','2015-01-01 18:00:00-00'));

CREATE TABLE bar (
    bar_id SERIAL PRIMARY KEY,
    bar TEXT,
    active_timespan TSTZRANGE
);
INSERT INTO 
    bar (bar, active_timespan)
VALUES
    ('Alpha',TSTZRANGE('2015-01-01 03:00:00-00','2015-01-01 09:00:00-00')),
    ('Bravo',TSTZRANGE('2015-01-01 15:00:00-00','2015-01-01 21:00:00-00'));

Desired Result:

active_timespan                                       | foo  | bar
------------------------------------------------------|------|------
'["2015-01-01 00:00:00-00","2015-01-01 03:00:00-00"]' | One  | Null
'["2015-01-01 03:00:00-00","2015-01-01 06:00:00-00"]' | One  | Alpha
'["2015-01-01 06:00:00-00","2015-01-01 09:00:00-00"]' | Null | Alpha
'["2015-01-01 12:00:00-00","2015-01-01 15:00:00-00"]' | Two  | Null
'["2015-01-01 15:00:00-00","2015-01-01 18:00:00-00"]' | Two  | Bravo
'["2015-01-01 18:00:00-00","2015-01-01 21:00:00-00"]' | Null | Bravo

Current approach:

I currently use an elaborate system of cte's where I select the data I want from each table into it's own cte and then UNION all the the distinct UPPER and LOWER range values from each query, leaving me with a list of TIMESTAMPTZ values. I then use a LEAD window function to make new TSTZRANGE values. Once I have all that complete I join the original cte to the new list of TSTZRANGE values.

Update

Current query pattern below:

WITH
    cte_foo AS (
        SELECT
            foo_id,
            foo,
            active_timespan
        FROM
            foo
        WHERE
            active_timespan && $1
    )
    , cte_bar AS (
        SELECT
            bar_id,
            bar,
            active_timespan
        FROM
            bar
        WHERE
            active_timespan && $1
    )
--  continue for each table to be joined... 
--  , cte_baz AS (
--      SELECT
--          baz_id,
--          baz,
--          active_timespan
--      FROM
--          baz
--      WHERE
--          active_timespan && $1
--  )
    , cte_times AS (
        SELECT DISTINCT
            UNNEST(
                ARRAY[
                    LOWER(cte_foo.active_timespan),
                    UPPER(cte_foo.active_timespan)
                ]
            ) AS breakpoint
        FROM
            cte_foo

        UNION
        SELECT DISTINCT
            UNNEST(
                ARRAY[
                    LOWER(cte_bar.active_timespan),
                    UPPER(cte_bar.active_timespan)
                ]
            ) AS breakpoint
        FROM
            cte_bar

--      continue for each table to be joined... 
--      UNION
--      SELECT DISTINCT
--          UNNEST(
--              ARRAY[
--                  LOWER(cte_baz.active_timespan),
--                  UPPER(cte_baz.active_timespan)
--              ]
--          ) AS breakpoint
--      FROM
--          cte_baz
    )
    , cte_timespans AS (
        SELECT
            TSTZRANGE(
                sub_times.breakpoint,
                LEAD(sub_times.breakpoint, 1, 'infinity'::TIMESTAMPTZ) OVER (ORDER BY sub_times.breakpoint)
            ) AS timespan
        FROM
            (
                SELECT DISTINCT
                    cte_times.breakpoint
                FROM
                    cte_times
            ) AS sub_times
        ORDER BY
            timespan
    )
SELECT
    cte_timespans.timespan,
    cte_foo.foo,
    cte_bar.bar
FROM
    cte_timespans
    LEFT OUTER JOIN cte_foo ON cte_timespans.timespan && cte_foo.active_timespan
    LEFT OUTER JOIN cte_bar ON cte_timespans.timespan && cte_bar.active_timespan

--  continue for each table to be joined...
--  LEFT OUTER JOIN cte_baz ON cte_timespans.timespan && cte_baz.active_timespan
ORDER BY
    cte_timelines.timespan

Performance is obviously poor, especially when I am joining several tables (over ten in some cases). Is this the best way to approach this situation?

Wishful thinking:

It would be great if PostgreSQL had some kind of RANGE JOIN - something along the lines of:

SELECT
    *
FROM
    foo
    FULL OUTER RANGE JOIN bar ON foo.active_timespan && bar.active_timespan

"Performance is obviously poor," I don't see any proof that in your question. Please read: wiki.postgresql.org/wiki/Slow_Query_Questions and supply the execution plan (use explain analyze) on your real world data. The range data type can also be indexed: postgresql.org/docs/current/static/… Did you try that? — user330315
– user330315, Commented Sep 22, 2015 at 13:04
@a_horse_with_no_name - the performance takes a hit due to the nature of the cte (described as "optimization fences" in the documentation). They also remove all indices after the initial SELECT. — losthorse
– losthorse, Commented Sep 22, 2015 at 13:10
There is no CTE in your question. Plus: the full outer join will return all rows from both tables - that can't really be optimized. The database has to read all rows from foo and all rows from bar. — user330315
– user330315, Commented Sep 22, 2015 at 13:11
If you want help with a slow query,post the real query you are using. — user330315
– user330315, Commented Sep 22, 2015 at 13:17
I don't see any indexes in your question? Is that a shortcoming in your question or in your setup? I also don't see a query plan. Consider instructions here: stackoverflow.com/tags/postgresql-performance/info — Erwin Brandstetter
– Erwin Brandstetter, Commented Sep 22, 2015 at 14:18

byMike · Accepted Answer · 2015-09-22 14:28:44Z

0

assumptions

changing data structure is not an option
timespans have a proper index

optimization goals

less data in cte queries
simplify query
"lazy" load data

process steps

generate a list of (unique) breakpoints
compile a list of ranges between breakpoints
load data matching the ranges

resulting query

WITH
  eventlist AS (
    SELECT DISTINCT
      UNNEST(
          ARRAY [
          LOWER( active_timespan ),
          UPPER( active_timespan )
          ]
      ) AS timestamp
    FROM foo WHERE active_timespan && $1
    UNION DISTINCT
    SELECT
      UNNEST(
          ARRAY [
          LOWER( active_timespan ),
          UPPER( active_timespan )
          ]
      ) AS timestamp
    FROM bar WHERE bar.active_timespan && $2
  ),
  durations AS (
    SELECT
      TSTZRANGE(
          timestamp,
          LEAD( timestamp, 1, 'infinity' ) OVER (ORDER BY timestamp ASC)
      ) AS active_span
    FROM eventlist
  )
SELECT
  durations.active_span,
  foo.foo,
  bar.bar
FROM durations
  LEFT JOIN foo ON active_span && foo.active_timespan
  LEFT JOIN bar ON active_span && bar.active_timespan
WHERE (foo.foo IS NOT NULL OR bar.bar IS NOT NULL)
  AND foo.active_timespan && $1
  AND bar.active_timespan && $2
;

answered Sep 22, 2015 at 14:28

byMike

3193 silver badges9 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

losthorse Over a year ago

This looks like it does the same thing as my current configuration except it hits the base tables twice (and the final table joins are written differently)... am I missing something?

byMike Over a year ago

the difference is, only the list timestamps are loaded into the memory first, not the whole tables --> less memory required. as I see now, the last two where statements are redundant. If there are proper indexes on all the ranges, this should work pretty fast.

Collectives™ on Stack Overflow

PostgreSQL: FULL OUTER JOIN on RANGE datatypes

Update

1 Answer 1

assumptions

optimization goals

process steps

resulting query

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

Update

1 Answer 1

assumptions

optimization goals

process steps

resulting query

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related