I am storing large vectors (up to 1.4 million values) of doubles in a PostgreSQL table. The table's CREATE statement follows.
CREATE TABLE analysis.expression
(
    celfile_name character varying NOT NULL,
    core         double precision[],
    extended     double precision[],
    "full"       double precision[],
    probeset     double precision[],
    CONSTRAINT expression_pkey PRIMARY KEY (celfile_name)
)
WITH (
    OIDS=FALSE
);
ALTER TABLE analysis.expression ALTER COLUMN core SET STORAGE EXTERNAL;
ALTER TABLE analysis.expression ALTER COLUMN extended SET STORAGE EXTERNAL;
ALTER TABLE analysis.expression ALTER COLUMN "full" SET STORAGE EXTERNAL;
ALTER TABLE analysis.expression ALTER COLUMN probeset SET STORAGE EXTERNAL;
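(For the record, the storage settings can be verified against the catalogs; attstorage should read 'e' for EXTERNAL on the array columns.)

-- Sanity check of the per-column storage setting ('e' = EXTERNAL).
SELECT attname, attstorage
FROM   pg_attribute
WHERE  attrelid = 'analysis.expression'::regclass
AND    attnum > 0
AND    NOT attisdropped;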
Each entry in this table is written only once and possibly read many times at random indices. Element lookups don't seem to scale well as the vector length grows, even with STORAGE set to EXTERNAL: fetching a single element appears to be O(n) in the array length rather than O(1). This makes queries like the following, which select many individual values from the array, very slow (minutes to hours).
SELECT probeset[2], probeset[15], probeset[102], probeset[1007], probeset[10033], probeset[200101], probeset[1004000] FROM expression LIMIT 1000;
If enough individual indices are being pulled, it can even be slower than pulling the whole array.
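To give a concrete way of seeing the cost (a minimal sketch of how one can time it, not my exact workflow):

-- Run the problematic query under EXPLAIN ANALYZE to get a timing.
-- BUFFERS gives a rough idea of how many pages (including TOAST) are read.
EXPLAIN (ANALYZE, BUFFERS)
SELECT probeset[2], probeset[15], probeset[102], probeset[1007],
       probeset[10033], probeset[200101], probeset[1004000]
FROM   expression
LIMIT  1000;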
Is there any way to make such queries faster?
Edits
- I am using PostgreSQL 9.3.
All the queries I am running are simple SELECTs, possibly with a join, for example:
SELECT probeset[2], probeset[15], probeset[102], probeset[1007], probeset[10033], probeset[200101], probeset[1004000] FROM expression JOIN samples s USING (celfile_name) WHERE s.study = 'x';
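(The samples table is just per-file metadata; roughly this shape, with illustrative column names, not the real schema:)

-- Illustrative only: the real samples table has more metadata columns.
CREATE TABLE samples (
    celfile_name character varying PRIMARY KEY,
    study        character varying
);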
In one scenario the results of these queries are fed through prediction models, and the predicted probabilities get stored back into the DB in another table. In other cases, selected items are pulled from the arrays for downstream analysis.
Currently the longest single array is 1.4 million items; the others are shorter, with the smallest being 22 thousand and the average being ~100 thousand items.
- Ideally I would store the array data as a wide table, but with 1.4 million entries that isn't feasible (PostgreSQL caps tables at 1600 columns), and long tables (i.e. rows of celfile_name, index, value; sketched below) are much slower than PostgreSQL arrays when we want to pull a full array out of the DB. We do that to load our downstream data stores when we analyze the full dataset.
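For clarity, this is the long-table layout I am referring to (a sketch; table name, column names, and the celfile value are placeholders):

-- Long/narrow alternative: one row per array element.
CREATE TABLE analysis.expression_long (
    celfile_name character varying NOT NULL,
    idx          integer           NOT NULL,
    value        double precision,
    PRIMARY KEY (celfile_name, idx)
);

-- Pulling a handful of indices is index-friendly here...
SELECT idx, value
FROM   analysis.expression_long
WHERE  celfile_name = 'some_celfile'
AND    idx IN (2, 15, 102, 1007, 10033, 200101, 1004000);

-- ...but reassembling a full array this way is much slower for us than
-- reading the array column directly.
SELECT array_agg(value ORDER BY idx)
FROM   analysis.expression_long
WHERE  celfile_name = 'some_celfile';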