Database Administrators Stack Exchange is a question and answer site for database professionals who wish to improve their database skills and learn from others in the community. Join them; it only takes a minute:

Sign up
Here's how it works:
  1. Anybody can ask a question
  2. Anybody can answer
  3. The best answers are voted up and rise to the top

PROBLEM:

I have a CSV file with 16,382 columns with data that looks like this:

+-------------+-----------+------+-------+------+------+------+-----+-------+-----+
| PATIENT_ID  | DIAGNOSIS |  1   |   2   |  3   |  9   |  10  | 13  |  14   | ... |
+-------------+-----------+------+-------+------+------+------+-----+-------+-----+
| X764_130520 | 0         | 0.35 | 9.68  | 0.11 | 0.04 | 0.03 | 0   | 32.54 |     |
| X800_130701 | 0         | 2.24 | 32.04 | 0.13 | 0.34 | 0.04 | 0   | 39.47 | ... |
| X164_120423 | 6         | 3.12 | 24.08 | 0.04 | 0.1  | 0.08 | 0   | 73.47 | ... |
| X218_120425 | 6         | 2.48 | 20.62 | 0    | 0.22 | 0    | 0   | 59.06 | ... |
| ...         | ...       | ...  | ...   | ...  | ...  | ...  | ... | ...   |     |
| X824_130725 | 1         | 0.76 | 44.77 | 0.1  | 0.28 | 0.13 | 0   | 45.35 |     |
+-------------+-----------+------+-------+------+------+------+-----+-------+-----+

RAW FORMAT:

PATIENT_ID,DIAGNOSIS,1,2,3,9,10,13,14,15,16,18,19,20,21,22,23,24,25,...
X764_130520,0,0.35,9.68,0.11,0.04,0.03,0,32.54,0.13,49.73,33.34,0.77,..
X800_130701,0,2.24,32.04,0.13,0.34,0.04,0,39.47,0.51,44.92,...
X218_120425,6,2.48,20.62,0,0.22,0,0,59.06,0.11,86.48,62.63,1.09,110.83,...
X266_120430,6,2.66,19.95,0.05,0.13,0.1,0,39.72,0.32,41.78,40.78,1.03,...
X276_120430,6,2.45,14.64,0,0.06,0.1,0,60.77,0.34,88.92,59.66,...
X411_120503,6,3.33,28.23,0,0.19,0.04,0,54.92,0.52,64.29,53.62,0.7,...
X448_120507,6,2.48,22.98,0,0.08,0,0,46.58,0.25,67.85,57.76,...
X517_120515,6,3.5,30.19,0.07,0.07,0.03,0.02,51.38,0.25,67.89,56.2,1.07,...

Except for the first two columns, everything else afterward are 16,380 columns of floating points. The heading for these floating points are integer IDs of human genes and are not contiguous natural numbers.

WHAT I WANT:

I am using PostgreSQL 9.5.6 on 64-bit Linux based on Ubuntu 16.04.4

I want to put the 16,380 numbers into a REAL array of size 16,380 in PostgreSQL, but I am lost as to how to do this.

I want the table in PostgreSQL to look like this:

+------------+-----------+---------------+
| patient_id | diagnosis |    gene_id    |
+------------+-----------+---------------+
| ...        | ...       | {...,...,...} |
| ...        | ...       | {...,...,...} |
| ...        | ...       | {...,...,...} |
+------------+-----------+---------------+

I created the table above with this code:

CREATE TABLE rosmap_rnaseq_entrez (
    patient_id VARCHAR(20),
    diagnosis INTEGER,
    gene_id REAL[16380]
);

I just need to find a proper SQL command to import the CSV file into the table above.

share|improve this question
2  
Is hat the actual file you display or some display you chose? As always: your version of Postgres please. – Erwin Brandstetter 13 hours ago
    
@ErwinBrandstetter My PostgreSQL is version 9.5.6 on 64-bit Linux based on Ubuntu 16.04. – Dobob 13 hours ago
    
I don't understand the problem from looking at this. Is that the import file itself, or a view of it. – Evan Carroll 13 hours ago
    
@EvanCarroll It is a CSV file. It is not imported yet. – Dobob 13 hours ago
1  
@EvanCarroll I updated it to include first few lines and first few columns of the raw file. – Dobob 13 hours ago
up vote 2 down vote accepted

Perl

Not golfing, and not trying to write the best perl script, here is a one-liner quick transform script that does what you want.

perl -wnlE'next if /^$/ or $. == 1; s/,\.+$//; my ($p,$d,$s) = (split /,/,$_,3); say join ",", $p, $d,qq["{$s}"]'

I'll break it down..

# skips empty lines and the first line (header)
next if /^$/ or $. == 1;

# replaces ,... at the end of every file (don't use ... in samples)
s/,\.+$//;

# patientid, diag, stuff to stuff in array.
my ($p, $d, $s) = (split /,/, $_, 3);

# join the first two and output the last one. wrapped in array.
say join ",", $p, $d, qq["{$s}"];

Output

X764_130520,0,"{0.35,9.68,0.11,0.04,0.03,0,32.54,0.13,49.73,33.34,0.77}"
X800_130701,0,"{2.24,32.04,0.13,0.34,0.04,0,39.47,0.51,44.92}"
X218_120425,6,"{2.48,20.62,0,0.22,0,0,59.06,0.11,86.48,62.63,1.09,110.83}"
X266_120430,6,"{2.66,19.95,0.05,0.13,0.1,0,39.72,0.32,41.78,40.78,1.03}"

Execution

perl -wnE'above transform' ./dat.txt  > out.txt
COPY rosmap_rnaseq_entrez FROM '/tmp/out.txt' CSV DELIMITER ',';

TABLE rosmap_rnaseq_entrez ;
 patient_id  | diagnosis |                          gene_id                           
-------------+-----------+------------------------------------------------------------
 X764_130520 |         0 | {0.35,9.68,0.11,0.04,0.03,0,32.54,0.13,49.73,33.34,0.77}
 X800_130701 |         0 | {2.24,32.04,0.13,0.34,0.04,0,39.47,0.51,44.92}
 X218_120425 |         6 | {2.48,20.62,0,0.22,0,0,59.06,0.11,86.48,62.63,1.09,110.83}
 X266_120430 |         6 | {2.66,19.95,0.05,0.13,0.1,0,39.72,0.32,41.78,40.78,1.03}
(4 rows)

Golf

Using this method you could even go for something more compact

perl -pwnlE'$_=sprintf(q[%s,%s,"{%s}"],split/,/,$_,3)'
share|improve this answer
    
@EvanCaroll The length of the array might be causing an error. I can store the shorter array literal, but when I run it on the actual array of size 16,380, I am getting "Error: malformed array literal: "{2.45,...} SQL state: 22P02 Detail: Unexpected "," character." – Dobob 4 hours ago
    
It' is adding a new line just after the array ends, just before "}"" character, causing the error. – Dobob 3 hours ago
    
nevermind, I wrote some python and fixed it. – Dobob 1 hour ago

Couldn't resist :-)

<data.txt perl -pe 's/((.*?,){2})(.*)/\1"{\3}"/'

X764_130520,0,"{0.35,9.68,0.11,0.04,0.03,0,32.54,0.13,49.73,33.34,0.77,..}"
X800_130701,0,"{2.24,32.04,0.13,0.34,0.04,0,39.47,0.51,44.92,...}"
X218_120425,6,"{2.48,20.62,0,0.22,0,0,59.06,0.11,86.48,62.63,1.09,110.83,...}"
X266_120430,6,"{2.66,19.95,0.05,0.13,0.1,0,39.72,0.32,41.78,40.78,1.03,...}"
X276_120430,6,"{2.45,14.64,0,0.06,0.1,0,60.77,0.34,88.92,59.66,...}"
X411_120503,6,"{3.33,28.23,0,0.19,0.04,0,54.92,0.52,64.29,53.62,0.7,...}"
X448_120507,6,"{2.48,22.98,0,0.08,0,0,46.58,0.25,67.85,57.76,...}"
X517_120515,6,"{3.5,30.19,0.07,0.07,0.03,0.02,51.38,0.25,67.89,56.2,1.07,...}"
share|improve this answer

You don't need any auxiliary tools. This works on any platform.
COPY to a simple temporary ancillary table:

CREATE TEMP TABLE tmp(txt text);

COPY tmp FROM '/path/to/your_file.csv' (FORMAT csv, ENCODING 'utf8', DELIMITER E'\b');

I chose backspace as delimiter (E'\b') which never occurs. This way you get one text column.
Assuming UTF8 encoding. Adapt if necessary.
If you can't use SQL COPY, consider the psql meta command \copy instead:

Create the target table by parsing the row in the next step:

CREATE TABLE target AS
SELECT split_part(txt, ',', 1) AS patient_id
     , split_part(txt, ',', 2) AS diagnosis
     , string_to_array(substring(txt, '^(?:[^,]*,){2}(.*)'), ',')::real[] AS arr
FROM   tmp;

Simple and fast.

The temporary table dies at the end of the session automatically.
The first line of integer values can be stored in real[], too.
About split_part():

The regular expression in substring(): '^(?:[^,]*,){2}(.*)' - it basically says: omit the first two fields and take the rest.
This assumes no white space between values else you might want to use trim(). Related answer with more explanation and links for regular expressions:

Since an array does not seem useful for your purpose, you might instead just keep the comma-separated list:

     substring(txt, '^(?:[^,]*,){2}(.*)') AS list
share|improve this answer
    
This is the cleanest solution. It would be interesting to know if it's faster than the other solutions, but my guess is that Perl is faster on string manipulation. – pietrop 1 hour ago

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.