
I am storing large vectors (1.4 million values) of doubles in a PostgreSQL table. This table's create statement follows.

CREATE TABLE analysis.expression
(
  celfile_name character varying NOT NULL,
  core double precision[],
  extended double precision[],
  "full" double precision[],
  probeset double precision[],
  CONSTRAINT expression_pkey PRIMARY KEY (celfile_name)
)
WITH (
  OIDS=FALSE
);
ALTER TABLE analysis.expression ALTER COLUMN core SET STORAGE EXTERNAL;
ALTER TABLE analysis.expression ALTER COLUMN extended SET STORAGE EXTERNAL;
ALTER TABLE analysis.expression ALTER COLUMN "full" SET STORAGE EXTERNAL;
ALTER TABLE analysis.expression ALTER COLUMN probeset SET STORAGE EXTERNAL;
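
For what it's worth, the EXTERNAL setting can be verified against the system catalog (attstorage = 'e' means EXTERNAL):

SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'analysis.expression'::regclass
  AND attnum > 0
  AND NOT attisdropped;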

Each entry in this table is written only once and possibly read many times at random indices. PostgreSQL does not seem to scale well for these lookups as the vector length grows, even with STORAGE set to EXTERNAL; access appears to be O(n) in the array length. This makes queries like the following, where we select many individual values from the array, very slow (minutes to hours).

SELECT probeset[2], probeset[15], probeset[102], probeset[1007], probeset[10033], probeset[200101], probeset[1004000] FROM expression LIMIT 1000;

If enough individual indices are being pulled, it can even be slower than pulling the whole array.

Is there any way to make such queries faster?

Edits

  • I am using PostgreSQL 9.3.
  • All the queries I am running are simple SELECTs, possibly with a join, for example:

    SELECT probeset[2], probeset[15], probeset[102], probeset[1007], probeset[10033], probeset[200101], probeset[1004000] FROM expression JOIN samples s USING (celfile_name) WHERE s.study = 'x';
    

    In one scenario the results of these queries are fed through prediction models. The prediction probability gets stored back into the DB in another table. In other cases selected items are pulled from the arrays for downstream analysis.

  • Currently 1.4 million is the length of the longest single array; the others are shorter, with the smallest being 22 thousand items and the average being ~100 thousand items long.

  • Ideally I would store the array data as a wide table, but with 1.4 million entries this isn't feasible, and long tables (i.e. rows of celfile_name, index, value; see the sketch below) are much slower than PostgreSQL arrays when we want to pull a full array from the DB. We do this to load our downstream data stores for when we do analysis on the full dataset.
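
    For reference, the long-table layout I am comparing against would look roughly like the sketch below (table and column names are illustrative, not our actual schema):

    CREATE TABLE analysis.expression_long
    (
      celfile_name character varying NOT NULL,
      idx integer NOT NULL,              -- 1-based position in the original array
      value double precision,
      CONSTRAINT expression_long_pkey PRIMARY KEY (celfile_name, idx)
    );

    -- Pulling a handful of positions becomes a primary-key index scan:
    SELECT idx, value
    FROM analysis.expression_long
    WHERE celfile_name = 'some_celfile'  -- hypothetical value
      AND idx IN (2, 15, 102, 1007, 10033, 200101, 1004000);
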
I suspect that Postgres needs to read the entire array before doing any operations on it. You may be better off storing the values as rows in a table, even if the table representation is larger. – Gordon Linoff May 2 '15 at 5:08
You are actually not storing your large arrays in PostgreSQL, you are only storing references to external storage. Unless you use the data from the arrays in queries that combine it with data from other relations, the use of PostgreSQL will not be efficient. How do you use the data in PostgreSQL? Can you elaborate on what the data in the arrays represents (e.g. time-series?) and the relationship between the four arrays (f.i. is the data at the same index in each of the arrays related?)? Are all four arrays 1.4M values long or is that an aggregate size? – Patrick May 2 '15 at 8:19
It should be obvious for a performance question to provide the version of Postgres in use. Without having to be asked first. – Erwin Brandstetter May 2 '15 at 8:51
@GordonLinoff - I edited the question to explain why storing the values in PostgreSQL tables is not feasible. – Nixuz May 2 '15 at 15:38
@Patrick - I updated the question with more insight on the additional queries which will be run. In short, each array can vary in length, with the longest single array being 1.4 million items. The same index across arrays is not related. – Nixuz May 2 '15 at 15:43

You store your data in a structured data management container (i.e. PostgreSQL), but due to the nature of your data (large but irregularly sized collections of like data) you actually store the bulk of it outside the container. PostgreSQL is not good at retrieving individual elements from such irregular (and unpredictably large) arrays, as you have noticed; the fact that the arrays are stored externally is already testament to the fact that your requirements are not aligned with where PostgreSQL excels. It is very likely that there are much better solutions for storing and reading your arrays than PostgreSQL.

That the results from analyzing the arrays through prediction models are stored in tables in a PostgreSQL database hints at a hybrid solution: store your array data in some form that allows efficient access in the patterns that you need, then store the results in PostgreSQL for further processing.

Since you do not provide any details on the prediction models, it is impossible to be specific in this answer, but I hope this will help you on your way.

If your prediction models are written in some language for which a PostgreSQL driver is available, then store your data in some format that is suited to that language, do your predictions and write the results to a table in PostgreSQL. This would work for languages like C and C++ using libpq, for Java using JDBC, and for C#, Python, etc. using their respective PostgreSQL drivers.

If your prediction model is written in MATLAB, then store your arrays in a MATLAB format and connect to PostgreSQL for the results. If written in R, you can use the R extension for PostgreSQL.

The key here is that you should store the arrays in a form that allows for efficient use in your prediction models. Match your data storage to the prediction models, not the other way around.
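
To make the hybrid concrete: the PostgreSQL side then only needs a conventional, narrow results table, along the lines of the following sketch (table and column names are hypothetical, since the question does not show the actual results schema):

CREATE TABLE analysis.prediction_result
(
  celfile_name character varying NOT NULL,
  model_name character varying NOT NULL,  -- hypothetical identifier for the prediction model
  probability double precision,
  CONSTRAINT prediction_result_pkey PRIMARY KEY (celfile_name, model_name)
);

Such a table joins cheaply against the rest of your metadata (e.g. the samples table), which is the part of the workload PostgreSQL handles well.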

This is more or less what we do in our current solution. We are using PostgreSQL to store our meta information, and we have two downstream data stores for these large arrays of expression data. One store is a custom in-memory 2D key-value data store, which is fast for queries of 1 - 1000 array indices. The other store is large binary files which are essentially memory mapped as a large C-style 2D matrix, which scales better if we need to load a large number of indices for analysis. Ideally we could use PostgreSQL array access to replace the custom 2D key-value store. – Nixuz May 2 '15 at 16:32
