I have 2 tables in PostgreSQL 9.1 - flight_2012_09_12 containing approx 500,000 rows and position_2012_09_12 containing about 5.5 million rows. I'm running a simple join query and it's taking a long time to complete and despite the fact the tables aren't small I'm convinced there are some major gains to be made in the execution.
The query is:
SELECT f.departure, f.arrival,
p.callsign, p.flightkey, p.time, p.lat, p.lon, p.altitude_ft, p.speed
FROM position_2012_09_12 AS p
JOIN flight_2012_09_12 AS f
ON p.flightkey = f.flightkey
WHERE p.lon < 0
AND p.time BETWEEN '2012-9-12 0:0:0' AND '2012-9-12 23:0:0'
The output of explain analyze is:
Hash Join (cost=239891.03..470396.82 rows=4790498 width=51) (actual time=29203.830..45777.193 rows=4403717 loops=1)
Hash Cond: (f.flightkey = p.flightkey)
-> Seq Scan on flight_2012_09_12 f (cost=0.00..1934.31 rows=70631 width=12) (actual time=0.014..220.494 rows=70631 loops=1)
-> Hash (cost=158415.97..158415.97 rows=3916885 width=43) (actual time=29201.012..29201.012 rows=3950815 loops=1)
Buckets: 2048 Batches: 512 (originally 256) Memory Usage: 1025kB
-> Seq Scan on position_2012_09_12 p (cost=0.00..158415.97 rows=3916885 width=43) (actual time=0.006..14630.058 rows=3950815 loops=1)
Filter: ((lon < 0::double precision) AND ("time" >= '2012-09-12 00:00:00'::timestamp without time zone) AND ("time" <= '2012-09-12 23:00:00'::timestamp without time zone))
Total runtime: 58522.767 ms
I think the problem lies with the sequential scan on the position table but I can't figure out why it's there. The table structures with indexes are below:
Table "public.flight_2012_09_12"
Column | Type | Modifiers
--------------------+-----------------------------+-----------
callsign | character varying(8) |
flightkey | integer |
source | character varying(16) |
departure | character varying(4) |
arrival | character varying(4) |
original_etd | timestamp without time zone |
original_eta | timestamp without time zone |
enroute | boolean |
etd | timestamp without time zone |
eta | timestamp without time zone |
equipment | character varying(6) |
diverted | timestamp without time zone |
time | timestamp without time zone |
lat | double precision |
lon | double precision |
altitude | character varying(7) |
altitude_ft | integer |
speed | character varying(4) |
asdi_acid | character varying(4) |
enroute_eta | timestamp without time zone |
enroute_eta_source | character varying(1) |
Indexes:
"flight_2012_09_12_flightkey_idx" btree (flightkey)
"idx_2012_09_12_altitude_ft" btree (altitude_ft)
"idx_2012_09_12_arrival" btree (arrival)
"idx_2012_09_12_callsign" btree (callsign)
"idx_2012_09_12_departure" btree (departure)
"idx_2012_09_12_diverted" btree (diverted)
"idx_2012_09_12_enroute_eta" btree (enroute_eta)
"idx_2012_09_12_equipment" btree (equipment)
"idx_2012_09_12_etd" btree (etd)
"idx_2012_09_12_lat" btree (lat)
"idx_2012_09_12_lon" btree (lon)
"idx_2012_09_12_original_eta" btree (original_eta)
"idx_2012_09_12_original_etd" btree (original_etd)
"idx_2012_09_12_speed" btree (speed)
"idx_2012_09_12_time" btree ("time")
Table "public.position_2012_09_12"
Column | Type | Modifiers
-------------+-----------------------------+-----------
callsign | character varying(8) |
flightkey | integer |
time | timestamp without time zone |
lat | double precision |
lon | double precision |
altitude | character varying(7) |
altitude_ft | integer |
course | integer |
speed | character varying(4) |
trackerkey | integer |
the_geom | geometry |
Indexes:
"index_2012_09_12_altitude_ft" btree (altitude_ft)
"index_2012_09_12_callsign" btree (callsign)
"index_2012_09_12_course" btree (course)
"index_2012_09_12_flightkey" btree (flightkey)
"index_2012_09_12_speed" btree (speed)
"index_2012_09_12_time" btree ("time")
"position_2012_09_12_flightkey_idx" btree (flightkey)
"test_index" btree (lon)
"test_index_lat" btree (lat)
I can't think of any other way to rewrite the query and so I'm stumped at this point. If the current setup is as good as it gets so be it but it seems to me that it should be much faster than it currently is. Any help would be much appreciated.
BETWEEN
probably isn't doing what you want - it's going to include the first second/millisecond/nanosecond (what granularity does postgreSQL run at) of 11:00pm; always (almost) use exclusive upper-bounds (<
), especially with timestamps. Also, why only 11 (and not until just before the next day)? And'time'
is a non-descriptive name for a column - maybe use something likerecordedAt
? – Clockwork-Muse Dec 4 '12 at 23:41