PostgreSQL importing thousands of columns as an array

Question

PROBLEM:

I have a CSV file with 16,382 columns with data that looks like this:

+-------------+-----------+------+-------+------+------+------+-----+-------+-----+
| PATIENT_ID  | DIAGNOSIS |  1   |   2   |  3   |  9   |  10  | 13  |  14   | ... |
+-------------+-----------+------+-------+------+------+------+-----+-------+-----+
| X764_130520 | 0         | 0.35 | 9.68  | 0.11 | 0.04 | 0.03 | 0   | 32.54 |     |
| X800_130701 | 0         | 2.24 | 32.04 | 0.13 | 0.34 | 0.04 | 0   | 39.47 | ... |
| X164_120423 | 6         | 3.12 | 24.08 | 0.04 | 0.1  | 0.08 | 0   | 73.47 | ... |
| X218_120425 | 6         | 2.48 | 20.62 | 0    | 0.22 | 0    | 0   | 59.06 | ... |
| ...         | ...       | ...  | ...   | ...  | ...  | ...  | ... | ...   |     |
| X824_130725 | 1         | 0.76 | 44.77 | 0.1  | 0.28 | 0.13 | 0   | 45.35 |     |
+-------------+-----------+------+-------+------+------+------+-----+-------+-----+

RAW FORMAT:

PATIENT_ID,DIAGNOSIS,1,2,3,9,10,13,14,15,16,18,19,20,21,22,23,24,25,...
X764_130520,0,0.35,9.68,0.11,0.04,0.03,0,32.54,0.13,49.73,33.34,0.77,..
X800_130701,0,2.24,32.04,0.13,0.34,0.04,0,39.47,0.51,44.92,...
X218_120425,6,2.48,20.62,0,0.22,0,0,59.06,0.11,86.48,62.63,1.09,110.83,...
X266_120430,6,2.66,19.95,0.05,0.13,0.1,0,39.72,0.32,41.78,40.78,1.03,...
X276_120430,6,2.45,14.64,0,0.06,0.1,0,60.77,0.34,88.92,59.66,...
X411_120503,6,3.33,28.23,0,0.19,0.04,0,54.92,0.52,64.29,53.62,0.7,...
X448_120507,6,2.48,22.98,0,0.08,0,0,46.58,0.25,67.85,57.76,...
X517_120515,6,3.5,30.19,0.07,0.07,0.03,0.02,51.38,0.25,67.89,56.2,1.07,...

Except for the first two columns, everything else afterward are 16,380 columns of floating points. The heading for these floating points are integer IDs of human genes and are not contiguous natural numbers.

WHAT I WANT:

I am using PostgreSQL 9.5.6 on 64-bit Linux based on Ubuntu 16.04.4

I want to put the 16,380 numbers into a REAL array of size 16,380 in PostgreSQL, but I am lost as to how to do this.

I want the table in PostgreSQL to look like this:

+------------+-----------+---------------+
| patient_id | diagnosis |    gene_id    |
+------------+-----------+---------------+
| ...        | ...       | {...,...,...} |
| ...        | ...       | {...,...,...} |
| ...        | ...       | {...,...,...} |
+------------+-----------+---------------+

I created the table above with this code:

CREATE TABLE rosmap_rnaseq_entrez (
    patient_id VARCHAR(20),
    diagnosis INTEGER,
    gene_id REAL[16380]
);

I just need to find a proper SQL command to import the CSV file into the table above.

Is hat the actual file you display or some display you chose? As always: your version of Postgres please. — Erwin Brandstetter, 13 hours ago
@ErwinBrandstetter My PostgreSQL is version 9.5.6 on 64-bit Linux based on Ubuntu 16.04. — Dobob, 13 hours ago
I don't understand the problem from looking at this. Is that the import file itself, or a view of it. — Evan Carroll, 13 hours ago
@EvanCarroll I updated it to include first few lines and first few columns of the raw file. — Dobob, 13 hours ago

Evan Carroll · Accepted Answer · 2017-03-10 20:06:15Z

Perl

Not golfing, and not trying to write the best perl script, here is a one-liner quick transform script that does what you want.

perl -wnlE'next if /^$/ or $. == 1; s/,\.+$//; my ($p,$d,$s) = (split /,/,$_,3); say join ",", $p, $d,qq["{$s}"]'

I'll break it down..

# skips empty lines and the first line (header)
next if /^$/ or $. == 1;

# replaces ,... at the end of every file (don't use ... in samples)
s/,\.+$//;

# patientid, diag, stuff to stuff in array.
my ($p, $d, $s) = (split /,/, $_, 3);

# join the first two and output the last one. wrapped in array.
say join ",", $p, $d, qq["{$s}"];

Output

X764_130520,0,"{0.35,9.68,0.11,0.04,0.03,0,32.54,0.13,49.73,33.34,0.77}"
X800_130701,0,"{2.24,32.04,0.13,0.34,0.04,0,39.47,0.51,44.92}"
X218_120425,6,"{2.48,20.62,0,0.22,0,0,59.06,0.11,86.48,62.63,1.09,110.83}"
X266_120430,6,"{2.66,19.95,0.05,0.13,0.1,0,39.72,0.32,41.78,40.78,1.03}"

Execution

perl -wnE'above transform' ./dat.txt  > out.txt
COPY rosmap_rnaseq_entrez FROM '/tmp/out.txt' CSV DELIMITER ',';

TABLE rosmap_rnaseq_entrez ;
 patient_id  | diagnosis |                          gene_id                           
-------------+-----------+------------------------------------------------------------
 X764_130520 |         0 | {0.35,9.68,0.11,0.04,0.03,0,32.54,0.13,49.73,33.34,0.77}
 X800_130701 |         0 | {2.24,32.04,0.13,0.34,0.04,0,39.47,0.51,44.92}
 X218_120425 |         6 | {2.48,20.62,0,0.22,0,0,59.06,0.11,86.48,62.63,1.09,110.83}
 X266_120430 |         6 | {2.66,19.95,0.05,0.13,0.1,0,39.72,0.32,41.78,40.78,1.03}
(4 rows)

Golf

Using this method you could even go for something more compact

perl -pwnlE'$_=sprintf(q[%s,%s,"{%s}"],split/,/,$_,3)'

@EvanCaroll The length of the array might be causing an error. I can store the shorter array literal, but when I run it on the actual array of size 16,380, I am getting "Error: malformed array literal: "{2.45,...} SQL state: 22P02 Detail: Unexpected "," character." — Dobob, 4 hours ago
It' is adding a new line just after the array ends, just before "}"" character, causing the error. — Dobob, 3 hours ago

Dudu Markovitz · Answer 2 · 2017-03-10 19:20:39Z

Couldn't resist :-)

<data.txt perl -pe 's/((.*?,){2})(.*)/\1"{\3}"/'

X764_130520,0,"{0.35,9.68,0.11,0.04,0.03,0,32.54,0.13,49.73,33.34,0.77,..}"
X800_130701,0,"{2.24,32.04,0.13,0.34,0.04,0,39.47,0.51,44.92,...}"
X218_120425,6,"{2.48,20.62,0,0.22,0,0,59.06,0.11,86.48,62.63,1.09,110.83,...}"
X266_120430,6,"{2.66,19.95,0.05,0.13,0.1,0,39.72,0.32,41.78,40.78,1.03,...}"
X276_120430,6,"{2.45,14.64,0,0.06,0.1,0,60.77,0.34,88.92,59.66,...}"
X411_120503,6,"{3.33,28.23,0,0.19,0.04,0,54.92,0.52,64.29,53.62,0.7,...}"
X448_120507,6,"{2.48,22.98,0,0.08,0,0,46.58,0.25,67.85,57.76,...}"
X517_120515,6,"{3.5,30.19,0.07,0.07,0.03,0.02,51.38,0.25,67.89,56.2,1.07,...}"

Erwin Brandstetter · Answer 3 · 2017-03-11 02:06:32Z

You don't need any auxiliary tools. This works on any platform.
COPY to a simple temporary ancillary table:

CREATE TEMP TABLE tmp(txt text);

COPY tmp FROM '/path/to/your_file.csv' (FORMAT csv, ENCODING 'utf8', DELIMITER E'\b');

I chose backspace as delimiter (E'\b') which never occurs. This way you get one text column.
Assuming UTF8 encoding. Adapt if necessary.
If you can't use SQL COPY, consider the psql meta command \copy instead:

Postgres client copy (\copy) command doesn't have access to a temporary table?

Create the target table by parsing the row in the next step:

CREATE TABLE target AS
SELECT split_part(txt, ',', 1) AS patient_id
     , split_part(txt, ',', 2) AS diagnosis
     , string_to_array(substring(txt, '^(?:[^,]*,){2}(.*)'), ',')::real[] AS arr
FROM   tmp;

Simple and fast.

The temporary table dies at the end of the session automatically.
The first line of integer values can be stored in real[], too.
About split_part():

Split comma separated column data into additional columns

The regular expression in substring(): '^(?:[^,]*,){2}(.*)' - it basically says: omit the first two fields and take the rest.
This assumes no white space between values else you might want to use trim(). Related answer with more explanation and links for regular expressions:

Filter strings with regex before casting to numeric

Since an array does not seem useful for your purpose, you might instead just keep the comma-separated list:

     substring(txt, '^(?:[^,]*,){2}(.*)') AS list

This is the cleanest solution. It would be interesting to know if it's faster than the other solutions, but my guess is that Perl is faster on string manipulation. — pietrop, 1 hour ago

asked	today
viewed	110 times
active	today

current community

your communities

more stack exchange communities

PostgreSQL importing thousands of columns as an array

3 Answers 3

Perl

Execution

Golf

Your Answer

Not the answer you're looking for? Browse other questions tagged postgresql postgresql-9.5 etl or ask your own question.

Linked

Hot Network Questions

current community

your communities

more stack exchange communities

PostgreSQL importing thousands of columns as an array

3 Answers 3

Perl

Execution

Golf

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged postgresql postgresql-9.5 etl or ask your own question.

Linked

Related

Hot Network Questions