
I have several billion rows of data in CSV files. Each row can have anything from 10 to 20 columns. I want to use COPY FROM to load the data into a table containing 20 columns. If a specific CSV row only contains 10 columns of data, then I expect COPY FROM to set the rest of the columns (for which the values are missing) to NULL. I specify DEFAULT NULL on every column in the CREATE TABLE statement.

MY QUESTION: Can this be done using COPY FROM?

EDIT: Greenplum (a database based upon PostgreSQL) has a switch named FILL MISSING FIELDS, which does what I describe (see their documentation here). What workarounds would you recommend for PostgreSQL?

Accepted answer:

Write a pre-processing script that either adds some extra commas to the lines that don't have enough columns, or transforms the CSV into TSV (tab-separated) and puts "\N" in the missing columns.
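For illustration, here is a minimal sketch of the comma-padding variant in Python; the 20-column width and the file names are assumptions. With COPY ... WITH (FORMAT csv), unquoted empty fields are read as NULL, so padding short rows with empty fields is enough:

# pad_csv.py -- sketch only: pad short CSV rows to the full column count.
import csv
import sys

NUM_COLUMNS = 20  # width of the target table (assumption)

def pad_rows(in_path, out_path, num_columns=NUM_COLUMNS):
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        for row in reader:
            # Append empty fields; COPY ... CSV reads unquoted empty fields as NULL.
            row += [""] * (num_columns - len(row))
            writer.writerow(row)

if __name__ == "__main__":
    pad_rows(sys.argv[1], sys.argv[2])

The TSV variant would instead write "\N" for the missing fields and load the result with COPY's default text format.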


I don't think you can make COPY FROM deal with a varying number of columns within the same file.

If it's always the same 10 columns that are missing, a workaround could be to first load everything into a staging table that has a single text column.

After that, you can use SQL to split the line and extract the columns, something like this:

INSERT INTO target_table (col1, col2, col3, col4, col5, ...)
SELECT columns[1], columns[2], ...
FROM (
  SELECT string_to_array(big_column, ',') AS columns
  FROM staging_table
) t
WHERE array_length(columns, 1) = 10

and then do a similar thing with array_length(columns, 1) = 20
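Putting the two steps together, here is a rough sketch of the whole staging-table flow using psycopg2. The table and column names follow the example above and are assumptions about your schema; the connection string and file name are placeholders:

# staging_load.py -- sketch only: drive the staging-table approach with psycopg2.
# staging_table, big_column, target_table and col1..col10 follow the example
# above; the DSN and file name are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()

# 1. Stage each raw line as a single text value.
cur.execute("CREATE TABLE staging_table (big_column text)")
with open("data.csv") as f:
    # Default text format (tab delimiter, \N for NULL); assumes the raw lines
    # contain no tabs or backslashes, which text-format COPY would interpret.
    cur.copy_expert("COPY staging_table (big_column) FROM STDIN", f)

# 2. Split the lines and route them by column count.
cur.execute("""
    INSERT INTO target_table (col1, col2, col3, col4, col5,
                              col6, col7, col8, col9, col10)
    SELECT columns[1], columns[2], columns[3], columns[4], columns[5],
           columns[6], columns[7], columns[8], columns[9], columns[10]
    FROM (SELECT string_to_array(big_column, ',') AS columns FROM staging_table) t
    WHERE array_length(columns, 1) = 10
""")
# ...and a second INSERT listing all 20 columns for array_length(columns, 1) = 20.

conn.commit()
cur.close()
conn.close()

Loading the staging table with COPY keeps this reasonably fast, but as noted in the comments below, the data is still written twice.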

This looks like a workable approach, but I am concerned about performance, as all the data has to be inserted into two tables. – David Jan 4 '11 at 10:50
    
I don't see a different way, unless you can change the creation process of the CSV files – a_horse_with_no_name Jan 4 '11 at 12:30

In the context of ETL and data warehousing, my suggestion would be to avoid the "shortcut" you are looking for.

ETL is a process, frequently implemented as ECCD (Extract, Clean, Conform, Deliver). You can treat those files as "extracted" and implement data cleaning and conforming as separate steps; you will need some extra disk space for that. All conformed files should have the "final" (all-columns) structure. Then deliver (COPY FROM) those conformed files.

This way you will also be able to document the ETL process and what happens to the missing fields in each step.

It is common practice to archive (on disk or DVD) the original customer files and the conformed versions for audit and debugging purposes.
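As an illustration of such a conform step, here is a small sketch that writes a conformed tab-separated file (missing fields as \N, which COPY's text format reads as NULL) together with a per-file audit summary. The 20-column width and the file names are assumptions, and the fields are assumed to contain no tabs or backslashes:

# conform_step.py -- sketch only: a separate "conform" step for an ECCD pipeline.
import csv
from collections import Counter

NUM_COLUMNS = 20  # final structure of the conformed files (assumption)

def conform(extracted_path, conformed_path, audit_path):
    widths = Counter()
    with open(extracted_path, newline="") as src, open(conformed_path, "w") as dst:
        for row in csv.reader(src):
            widths[len(row)] += 1
            # Pad to the final structure; \N is read as NULL by COPY's text format.
            row += ["\\N"] * (NUM_COLUMNS - len(row))
            dst.write("\t".join(row) + "\n")
    # Document what happened to the missing fields in this step.
    with open(audit_path, "w") as audit:
        for width, count in sorted(widths.items()):
            audit.write(f"{count} rows had {width} columns\n")

if __name__ == "__main__":
    conform("extracted.csv", "conformed.tsv", "conform_audit.txt")

The conformed file can then be delivered with a plain COPY ... FROM statement against the full target table.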

Thanks a lot. I was reluctant to have to go that one step further, but after reading this it seems the proper way to go anyhow :) – Smalcat Feb 19 '13 at 12:22

From the PostgreSQL manual:

COPY FROM will raise an error if any line of the input file contains more or fewer columns than are expected.

Read the first line of your CSV file to see how many columns you have to name in the COPY statement.
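For example, a small sketch that inspects the first line and names only that many columns in the COPY statement (columns left out of the column list receive their defaults, i.e. NULL here). The col1..col20 names and the file name are assumptions, and this only helps if every row in a given file has the same number of columns:

# copy_columns.py -- sketch only: name just the columns present in the file.
import csv

ALL_COLUMNS = [f"col{i}" for i in range(1, 21)]  # assumed target-table columns

def build_copy_sql(csv_path, table="target_table"):
    with open(csv_path, newline="") as f:
        first_row = next(csv.reader(f))
    names = ", ".join(ALL_COLUMNS[:len(first_row)])
    return f"COPY {table} ({names}) FROM STDIN WITH (FORMAT csv)"

print(build_copy_sql("data.csv"))
# e.g. COPY target_table (col1, ..., col10) FROM STDIN WITH (FORMAT csv)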

What workarounds would you recommend? – David Jan 4 '11 at 9:25
Write a script which will pre-process the files, e.g. check the number of delimiters and, if some are missing, add them. If you like writing scripts. – ETL Man Jan 25 '11 at 23:39
