I'm loading data from files and some of the files have text that is ISO-8859 encoded mixed-in with text that is UTF-8 encoded. postgres stops and shows an error when the ISO-8859 encoded lines are encountered. For small files, I can fix or edit out the few bad lines, but for very large files (hundreds of MB) that doesn't scale. So here is the question:

Can postgres skip the ISO-8859 encoded lines and just load the UTF-8 ones?

Here is the command I use to load files:

COPY words (word, source_id) FROM '/path/to/file.txt' with delimiter E'\t';

Here is the error postgres shows when a bad line is encountered:

ERROR:  invalid byte sequence for encoding "UTF8": 0xa2
CONTEXT:  COPY words, line 2268

Thanks for any advice.

Edit: There are also other errors (beside encoding) that stop the load. For example, when a string is longer than a column permits, etc. I could write a program to loop through the file loading each line individually and logging errors, I was just curious if postgres had something to handle this built-in. I edited the title to be more generic.

share|improve this question
Have you tried to filter the data through iconf before importing ? – wildplasser Dec 9 '12 at 17:30
You mean iconv, not iconf... right? That will work to convert the files, but I was curious if there was a way to make postgres load lines that have the correct encoding and skip those that do not. – Brad Tilley Dec 9 '12 at 17:44
Sorry, typo indeed. You can make iconv ignore the bad tokens. Otherwise you could make an hardcoded filter to catch and replace the "unparsable" bytesequences. (either single-byte 0x80-0xff, or multibyte with a bad firstbyte <--> number-of-subsequent-bytes-with-topbit-set combination) – wildplasser Dec 9 '12 at 17:56
feedback

Know someone who can answer? Share a link to this question via email, Google+, Twitter, or Facebook.

Your Answer

 
or
required, but never shown
discard

By posting your answer, you agree to the privacy policy and terms of service.

Browse other questions tagged or ask your own question.