1

I am writing a db data migration tool using SQLAlchemy's expression language as the basic tool.

My source database may be in UTF8, or it may be in SQL_ASCII. My target DB will always be in UTF8.

I'm using the psycopg2 driver in SQLAlchemy 0.6.6

My general migration process looks like this:

for t in target_tables:
    log.info("Migrating data from %s", t.fullname)
    source = self.source_md.tables[self.source_schema + "." + t.name]
    for row in source.select().execute():
        with sql_logging(logging.INFO):
            conn.execute(t.insert(), row)

If I don't set anything encoding-related on the engines, I get this when I iterate over the select() results:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

If I set use_native_unicode=True, encoding='utf-8' on the engines, I get this when I attempt to insert the new row:

sqlalchemy.exc.DataError: (DataError) invalid byte sequence for encoding "UTF8": 0xeb6d20
HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
 'INSERT INTO project_ghtests_survey000005.employees (first_name, employee_id) VALUES (%(first_name)s, %(employee_id)s)' {'first_name': 'Art\xebm', 'employee_id': '1234'}

Update details

To make inquires a bit quicker, here's the software stack in play:

  • source_db encoding: SQL_ASCII
  • target_db encoding: UTF8
  • python 2.7
  • sqlalchemy 0.6.6
  • psycopg2 2.2.2
  • PostgreSQL 8.2 server

2 Answers 2

1

It turns out that the solution was to set the connection client_encoding to 'latin1'

I accomplished this using a PoolListener like so:

class EncodingListener(PoolListener):

    def connect(self, dbapi_con, con_record):
        with closing(dbapi_con.cursor()) as cur:
            cur.execute('show client_encoding')
            encoding = cur.fetchone()[0]

        if encoding.upper() == 'UTF8':
            return

        dbapi_con.set_client_encoding('latin1')
0

Since UTF-8 is backwards compatible with UTF-8, why use SQL_ASCII isntead of UTF8?

I think your encoding problems are probably more along the lines of latin1 or similar encodings. Not with ASCII to UTF8.

2
  • What do you mean by that? The DB is already in SQL_ASCII, I don't have a lot of say in that. Commented Dec 15, 2010 at 17:49
  • 1
    @Chris R: I mean that it doesn't matter if you specify SQL_ASCII or UTF8 for the source database since UTF8 is backwards compatible with ASCII. Commented Dec 15, 2010 at 22:25

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.