
I have several billion rows of data in CSV files. Each row can have anything from 10 to 20 columns. I want to use COPY FROM to load the data into a table containing 20 columns. If a specific CSV row only contains 10 columns of data, then I expect COPY FROM to set the rest of the columns (for which the values are missing) to NULL. I specify DEFAULT NULL on every column in the CREATE TABLE statement.

MY QUESTION: Can this be done using COPY FROM?

EDIT: Greenplum (a database based upon PostgreSQL) has a switch named FILL MISSING FIELDS, which does what I describe (see their documentation here). What workarounds would you recommend for PostgreSQL?

Accepted answer:

Write a pre-processing script that either adds some extra commas to the lines that don't have enough columns, or transforms the CSV into TSV (tab-separated) and puts "\N" in the missing columns.
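For illustration, here is a minimal sketch of the comma-padding variant in Python; the 20-column width and the file names are assumptions. With COPY ... WITH (FORMAT csv), unquoted empty fields are read as NULL, so padding short rows with empty fields is enough:

# pad_csv.py -- sketch only: pad short CSV rows to the full column count.
import csv
import sys

NUM_COLUMNS = 20  # width of the target table (assumption)

def pad_rows(in_path, out_path, num_columns=NUM_COLUMNS):
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        for row in reader:
            # Append empty fields; COPY ... CSV reads unquoted empty fields as NULL.
            row += [""] * (num_columns - len(row))
            writer.writerow(row)

if __name__ == "__main__":
    pad_rows(sys.argv[1], sys.argv[2])

The TSV variant would instead write "\N" for the missing fields and load the result with COPY's default text format.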


I don't think you can make COPY FROM deal with a varying number of columns within the same file.

If it's always the same 10 columns that are missing, a workaround could be to first load everything into a staging table that has a single text column.

After that, you can use SQL to split the line and extract the columns, something like this:

INSERT INTO target_table (col1, col2, col3, col4, col5, ...)
SELECT columns[1], columns[2], ...
FROM (
  SELECT string_to_array(big_column, ',') AS columns
  FROM staging_table
) t
WHERE array_length(columns, 1) = 10

and then do a similar thing with array_length(columns, 1) = 20
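Putting the two steps together, here is a rough sketch of the whole staging-table flow using psycopg2. The table and column names follow the example above and are assumptions about your schema; the connection string and file name are placeholders:

# staging_load.py -- sketch only: drive the staging-table approach with psycopg2.
# staging_table, big_column, target_table and col1..col10 follow the example
# above; the DSN and file name are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()

# 1. Stage each raw line as a single text value.
cur.execute("CREATE TABLE staging_table (big_column text)")
with open("data.csv") as f:
    # Default text format (tab delimiter, \N for NULL); assumes the raw lines
    # contain no tabs or backslashes, which text-format COPY would interpret.
    cur.copy_expert("COPY staging_table (big_column) FROM STDIN", f)

# 2. Split the lines and route them by column count.
cur.execute("""
    INSERT INTO target_table (col1, col2, col3, col4, col5,
                              col6, col7, col8, col9, col10)
    SELECT columns[1], columns[2], columns[3], columns[4], columns[5],
           columns[6], columns[7], columns[8], columns[9], columns[10]
    FROM (SELECT string_to_array(big_column, ',') AS columns FROM staging_table) t
    WHERE array_length(columns, 1) = 10
""")
# ...and a second INSERT listing all 20 columns for array_length(columns, 1) = 20.

conn.commit()
cur.close()
conn.close()

Loading the staging table with COPY keeps this reasonably fast, but as noted in the comments below, the data is still written twice.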

This looks like a workable approach, but I am concerned about performance, as all the data has to be inserted into two tables. – David Jan 4 '11 at 10:50
    
I don't see a different way, unless you can change the creation process of the CSV files – a_horse_with_no_name Jan 4 '11 at 12:30

In the context of ETL and data warehousing, my suggestion would be to avoid the "shortcut" you are looking for.

ETL is a process, frequently implemented as ECCD (Extract, Clean, Conform, Deliver). You can treat those files as "extracted" and implement data cleaning and conforming as separate steps; you will need some extra disk space for that. All conformed files should have the "final" (all-columns) structure. Then deliver (COPY FROM) those conformed files.

This way you will also be able to document the ETL process and what happens to the missing fields in each step.

It is common practice to archive (on disk or DVD) the original customer files and the conformed versions for audit and debugging purposes.
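As an illustration of such a conform step, here is a small sketch that writes a conformed tab-separated file (missing fields as \N, which COPY's text format reads as NULL) together with a per-file audit summary. The 20-column width and the file names are assumptions, and the fields are assumed to contain no tabs or backslashes:

# conform_step.py -- sketch only: a separate "conform" step for an ECCD pipeline.
import csv
from collections import Counter

NUM_COLUMNS = 20  # final structure of the conformed files (assumption)

def conform(extracted_path, conformed_path, audit_path):
    widths = Counter()
    with open(extracted_path, newline="") as src, open(conformed_path, "w") as dst:
        for row in csv.reader(src):
            widths[len(row)] += 1
            # Pad to the final structure; \N is read as NULL by COPY's text format.
            row += ["\\N"] * (NUM_COLUMNS - len(row))
            dst.write("\t".join(row) + "\n")
    # Document what happened to the missing fields in this step.
    with open(audit_path, "w") as audit:
        for width, count in sorted(widths.items()):
            audit.write(f"{count} rows had {width} columns\n")

if __name__ == "__main__":
    conform("extracted.csv", "conformed.tsv", "conform_audit.txt")

The conformed file can then be delivered with a plain COPY ... FROM statement against the full target table.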

Thanks a lot. I was reluctant to have to go that one step further, but after reading this it seems the proper way to go anyhow :) – Smalcat Feb 19 '13 at 12:22

From the PostgreSQL manual:

COPY FROM will raise an error if any line of the input file contains more or fewer columns than are expected.

Read the first line of your CSV file to see how many columns you have to name in the COPY statement.
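For example, a small sketch that inspects the first line and names only that many columns in the COPY statement (columns left out of the column list receive their defaults, i.e. NULL here). The col1..col20 names and the file name are assumptions, and this only helps if every row in a given file has the same number of columns:

# copy_columns.py -- sketch only: name just the columns present in the file.
import csv

ALL_COLUMNS = [f"col{i}" for i in range(1, 21)]  # assumed target-table columns

def build_copy_sql(csv_path, table="target_table"):
    with open(csv_path, newline="") as f:
        first_row = next(csv.reader(f))
    names = ", ".join(ALL_COLUMNS[:len(first_row)])
    return f"COPY {table} ({names}) FROM STDIN WITH (FORMAT csv)"

print(build_copy_sql("data.csv"))
# e.g. COPY target_table (col1, ..., col10) FROM STDIN WITH (FORMAT csv)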

What workarounds would you recommend? – David Jan 4 '11 at 9:25
Write a script which will pre-process the files, e.g. check the number of delimiters and, if some are missing, add them. If you like writing scripts. – ETL Man Jan 25 '11 at 23:39
