MySQL-to-PostgreSQL migration script

Question

I'm working on a Python script to migrate a MySQL database into a PostgreSQL database with a different schema (different table structures, different data types, and so on).

I'm a sysadmin and unfortunately I don't code very often, so I have some doubts about this initial programming phase.

I began with the tables that are easy to migrate (almost the same structure), but very soon I will have to transfer tables that need more operations to be converted for compatibility.

My code currently looks like this:

#!/usr/bin/python

# Script Name: database-migration_msql-psql.py
# Description: Migrate mysql database a2
#              into postgresql database a3.
# Created By:  phphil.
# Date:        7 Oct 2015.
#

# ------------------------------
# Import standard libraries    |
# ------------------------------
#
import os
import sys
import mysql.connector
import psycopg2
from pprint import pprint
import MySQLdb

# ------------------------------
# Import internal snippets     |
# ------------------------------
#
from include.db_config import *
#from include.MySQLCursorDict import *

# ------------------------------
# Open database connections    |
# ------------------------------
#
# Mysql connection
try:
  cnx_msql = mysql.connector.connect( host=host_mysql, user=user_mysql, passwd=pswd_mysql, db=dbna_mysql )
except mysql.connector.Error as e:
  print "MYSQL: Unable to connect!", e.msg
  sys.exit(1)

# Postgresql connection
try:
  cnx_psql = psycopg2.connect(conn_string_psql)
except psycopg2.Error as e:
  print('PSQL: Unable to connect!\n{0}').format(e)
  sys.exit(1)

# Cursors initializations
cur_msql = cnx_msql.cursor(dictionary=True)
cur_psql = cnx_psql.cursor()


# ------------------------------
# A2.right > A3.permission     |
# ------------------------------
#
cur_msql.execute("SELECT resource_id, user_id, group_id, right_type_id, inheritance, mac FROM rights")

for row in cur_msql:

  ### check if owner_id is a user or a group
  if row['user_id'] == 0:
    row['user_id'] = row['group_id']
  else:
    pass

  try:
    cur_psql.execute("INSERT INTO permissions (resource_id, owner_id, level, inheritance, mac) \
                      VALUES (%(resource_id)s, %(user_id)s, %(right_type_id)s, %(inheritance)s, %(mac)s)", row)
  except psycopg2.Error as e:
    print "cannot execute that query!!", e.pgerror
    sys.exit("Some problem occured with that query! leaving early this lollapalooza script")


# -------------------------------
# A2.membership > A3.membership |
# -------------------------------
#
cur_msql.execute("SELECT group_id, user_id, update_date, update_user_id, status FROM memberships")

for row in cur_msql:
  try:
    cur_psql.execute("INSERT INTO memberships (group_id, user_id, update_date, update_member_id, approval_date, status) \
                      VALUES (%(group_id)s, %(user_id)s, %(update_date)s, %(update_user_id)s, %(update_date)s, %(status)s)", row)
  except psycopg2.Error as e:
    print "cannot execute that query!!", e.pgerror
    sys.exit("Some problem occured with that query! leaving early this lollapalooza script")


# -------------------------------
# A2.user_keys > A3.user_keys   |
# -------------------------------
#
cur_msql.execute("SELECT resource_id, user_key FROM user_keys")

for row in cur_msql:
  try:
    cur_psql.execute("INSERT INTO user_keys (resource_id, user_key) VALUES (%(resource_id)s, %(user_key)s)", row)
  except psycopg2.Error as e:
    print "cannot execute that query!!", e.pgerror
    sys.exit("Some problem occured with that query! leaving early this lollapalooza script")


# -----------------------------------------------------
# A2.user_password_history > A3.user_password_history |
# -----------------------------------------------------
#
cur_msql.execute("SELECT resource_id, timestamp, password FROM user_password_history")

for row in cur_msql:
  try:
    cur_psql.execute("INSERT INTO user_password_history (resource_id, timestamp, password) \
                      VALUES (%(resource_id)s, %(timestamp)s, %(password)s)", row)
  except psycopg2.Error as e:
    print "cannot execute that query!!", e.pgerror
    sys.exit("Some problem occured with that query! leaving early this lollapalooza script")


# -------------------------------
# A2.log > A3.log               |
# -------------------------------
# parent_id not used in A2, took by joining resources table. refered_to & detail fields too difficult to migrate, left empty in A3
#
cur_msql.execute("SELECT l.log_id, l.resource_id, l.community_id, r.parent_id, l.timestamp, l.user_id, l.action, l.type, l.subtype \
                  FROM log l, resources r \
                  WHERE l.resource_id=r.resource_id")

for row in cur_msql:
  try:
    cur_psql.execute("INSERT INTO log(log_id, resource_id, community_id, parent_id, timestamp, user_id, action, type, subtype, detail, refered_to) \
                          VALUES(%(log_id)s, %(resource_id)s, %(community_id)s, %(parent_id)s, %(timestamp)s, %(user_id)s, %(action)s, %(type)s, %(subtype)s, NULL, DEFAULT)", row)
  except psycopg2.Error as e:
    print "cannot execute that query!!", e.pgerror
    sys.exit("Some problem occured with that query! leaving early this lollapalooza script")


# -----------------------------------------------------
# A2.notification_settings > A3.notification_settings |
# -----------------------------------------------------
#
cur_msql.execute("SELECT resource_id, normal_medium, normal_delivery, normal_days, normal_time, high_medium, high_delivery, high_days, high_time, editable FROM notification_settings")

for row in cur_msql:
  try:
    cur_psql.execute("INSERT INTO notification_settings (resource_id, normal_channel, normal_delivery, normal_days, normal_time, high_channel, high_delivery, high_days, high_time, editable) \
                      VALUES (%(resource_id)s, %(normal_medium)s, %(normal_delivery)s, %(normal_days)s, %(normal_time)s, %(high_medium)s, %(high_delivery)s, %(high_days)s, %(high_time)s, %(editable)s)", row)
  except psycopg2.Error as e:
    print "cannot execute that query!!", e.pgerror
    sys.exit("Some problem occured with that query! leaving early this lollapalooza script")


# -----------------------------------------------------
# A2.resource_notification > A3.resource_notification |
# -----------------------------------------------------
#
cur_msql.execute("SELECT resource_id, notification_to FROM resources_notifications")

for row in cur_msql:
  try:
    cur_psql.execute("INSERT INTO resources_notifications (resource_id, notification_to) VALUES (%(resource_id)s, %(notification_to)s)", row)
  except psycopg2.Error as e:
    print "cannot execute that query!!", e.pgerror
    sys.exit("Some problem occured with that query! leaving early this lollapalooza script")


################ END OF SCRIPT ################
# ---------------------------------------------
# Finalizing stuff & closing db connections   |
# ---------------------------------------------
#

## Closing cursors
cur_msql.close()
cur_psql.close()

## Committing 
cnx_psql.commit()

## Closing database connections
cnx_msql.close()
cnx_psql.close()

As you will notice, in each section of the script the structure is almost the same:

Select data from a table of the source database (MySQL); the result is handled by a cursor with the dictionary flag, so each row is a Python dictionary.
After this, the rows are iterated in a for loop where, if needed, fields are cast or the row is adapted to the new table structure (see section: A2.right > A3.permission).
Still inside the for loop, each record is inserted into the destination database.

Questions/Doubts:

Do I need to create a class in order to abstract the redundant code? Or is it better to just create a function? Can someone post a short example? I have no idea how to proceed.
In both cases I see some problems in abstracting it, because the redundant code is inside a loop where I will have to do different operations depending on which table I'm iterating over.
I used to open and close cursors for each operation (script section); then I decided to open both cursors at the beginning of the script, use them until the end, and then close them. But now I've read this and I'm confused. What's better in your opinion: one cursor for each operation, or one cursor for the whole script?

I have rolled back Rev 7 → 4. Please see What to do when someone answers. — 200_success♦, 3 hours ago

Barry · Accepted Answer · 2015-10-12 13:40:39Z

This code is fine. As far as scripts go, I'm assuming this is something that is run once and then forgotten about. But of course, in the interest of code review...

Don't Repeat Yourself

You have seven blocks of code that look something like:

cur_msql.execute("SELECT ...")

for row in cur_msql:
  try:
    cur_psql.execute("INSERT INTO ...", row)
  except psycopg2.Error as e:
    print "cannot execute that query!!", e.pgerror
    sys.exit(...)

That's just crying out for a function:

def select_and_insert(select_query, insert_query):
    # Run the SELECT on MySQL, then insert each resulting row into PostgreSQL.
    cur_msql.execute(select_query)

    try:
        for row in cur_msql:
            cur_psql.execute(insert_query, row)
    except psycopg2.Error as e:
        print 'failed to execute query', e.pgerror
        sys.exit(...)

That way, you just have 7 calls to select_and_insert, and all you're writing is the various queries without having to repeat all the extra logic.

If statements

In Python, unlike some functional languages, not every statement needs to evaluate to something. So in this check:

if row['user_id'] == 0:
  row['user_id'] = row['group_id']
else:
  pass

The else: pass is redundant. You could simply have written the if on its own.

Of course, this looks like it breaks the don't repeat yourself idea of the function - but we can simply move this if statement into the SQL query:

select ..., case when user_id = 0 then group_id else user_id end as user_id, ...

No if necessary.
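As a quick check of the CASE approach, here is a minimal sketch. It runs against an in-memory SQLite database rather than MySQL, purely so that it is self-contained, and the sample rows and the owner_id alias are made up for illustration. It uses Python 3 syntax, unlike the Python 2 script in the question. MySQL and PostgreSQL spell the comparison with a single =.

```python
import sqlite3

# Stand-in for the MySQL source: an in-memory SQLite table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rights (resource_id INTEGER, user_id INTEGER, group_id INTEGER)"
)
conn.executemany(
    "INSERT INTO rights VALUES (?, ?, ?)",
    [(1, 42, 0),   # owned by a user
     (2, 0, 7)],   # owned by a group (user_id is 0)
)

# The CASE expression picks group_id whenever user_id is 0,
# so the Python loop no longer needs an if statement.
owners = conn.execute(
    "SELECT resource_id, "
    "       CASE WHEN user_id = 0 THEN group_id ELSE user_id END AS owner_id "
    "FROM rights ORDER BY resource_id"
).fetchall()

print(owners)  # [(1, 42), (2, 7)]
conn.close()
```

The same SELECT text could then be passed straight to select_and_insert, with the INSERT reading the aliased column.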

Efficiency

Running one insert statement per row is very inefficient, since you can run one insert statement for all the rows. Now that we've refactored this, we only have to change it in one function. Check out this answer for how we might go about doing such a thing and what the performance implications are. We're talking about an order of magnitude.
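To make the idea concrete, here is a rough sketch of a multi-row INSERT built in batches. It is not taken from the linked answer. The helper name insert_in_batches, the batch size, and the sample rows are hypothetical, and SQLite stands in for PostgreSQL so the example runs on its own (SQLite uses ? placeholders where psycopg2 uses %s). For psycopg2 itself, the documentation notes that executemany is not faster than calling execute in a loop, and later releases provide psycopg2.extras.execute_values for this kind of job.

```python
import sqlite3


def insert_in_batches(cursor, table, columns, rows, batch_size=100):
    """Insert rows using one multi-row INSERT statement per batch.

    Hypothetical helper: table and columns are interpolated into the SQL
    text, so they must be trusted identifiers, never user input.
    Returns the number of INSERT statements that were executed.
    """
    one_row = "(" + ", ".join("?" for _ in columns) + ")"
    statements = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        sql = "INSERT INTO {0} ({1}) VALUES {2}".format(
            table, ", ".join(columns), ", ".join(one_row for _ in batch))
        # Flatten [(a, b), (c, d)] into [a, b, c, d] to match the placeholders.
        params = [value for row in batch for value in row]
        cursor.execute(sql, params)
        statements += 1
    return statements


# Stand-in for the PostgreSQL target, filled with made-up rows.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE user_keys (resource_id INTEGER, user_key TEXT)")
rows = [(i, "key-%d" % i) for i in range(250)]

statements = insert_in_batches(cur, "user_keys", ["resource_id", "user_key"], rows)
conn.commit()
count = cur.execute("SELECT COUNT(*) FROM user_keys").fetchone()[0]
print(statements, count)  # 3 250
```

Here 250 rows go in with 3 statements instead of 250. The batch size is kept modest because databases cap the number of bound parameters per statement.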

Thank you for the useful tips. I've edited my question to add a new difficulty I met while trying to implement this performance fix – lese, 11 hours ago
Forget it, I was trying to fix it in the wrong section; the problem was in the MySQL query, and now it works : ) As for the functional programming, I'm still not sure I want to implement it. I'm scared it will make the work more complex by having to fix/transform data directly in SQL, as in your example – lese, 11 hours ago

SuperBiasedMan · Answer 2 · 2015-10-12 13:38:50Z

Yes, you'd be better off with a function. There's no real need for a class, but a function lets you pass in the command strings that each part needs. It could look like this:

def psql_command(msql, psql, select_command, insert_command):
    # Read every row from the MySQL cursor and insert it through the PostgreSQL cursor.
    msql.execute(select_command)
    for row in msql:
        try:
            psql.execute(insert_command, row)
        except psycopg2.Error as e:
            print "cannot execute that query!!", e.pgerror
            sys.exit("Some problem occured with that query! leaving early this lollapalooza script")

psql_command(cur_msql, cur_psql, "SELECT group_id, user_id, update_date, update_user_id, status FROM memberships",
             "INSERT INTO memberships (group_id, user_id, update_date, update_member_id, approval_date, status) \
              VALUES (%(group_id)s, %(user_id)s, %(update_date)s, %(update_user_id)s, %(update_date)s, %(status)s)")

Of course, now you can call each of the commands even more easily if you build a list of the command pairs and loop over them, like this:

commands = [("SELECT group_id, user_id, update_date, update_user_id, status FROM memberships",
             "INSERT INTO memberships (group_id, user_id, update_date, update_member_id, approval_date, status) \
              VALUES (%(group_id)s, %(user_id)s, %(update_date)s, %(update_user_id)s, %(update_date)s, %(status)s)"),
            ("SELECT resource_id, user_key FROM user_keys",
             "INSERT INTO user_keys (resource_id, user_key) VALUES (%(resource_id)s, %(user_key)s)"),
            ("SELECT resource_id, timestamp, password FROM user_password_history", 
             "INSERT INTO user_password_history (resource_id, timestamp, password) \
              VALUES (%(resource_id)s, %(timestamp)s, %(password)s)"),
            ("SELECT l.log_id, l.resource_id, l.community_id, r.parent_id, l.timestamp, l.user_id, l.action, l.type, l.subtype \
              FROM log l, resources r \
              WHERE l.resource_id=r.resource_id",
             "INSERT INTO log(log_id, resource_id, community_id, parent_id, timestamp, user_id, action, type, subtype, detail, refered_to) \
              VALUES(%(log_id)s, %(resource_id)s, %(community_id)s, %(parent_id)s, %(timestamp)s, %(user_id)s, %(action)s, %(type)s, %(subtype)s, NULL, DEFAULT)"),]

# Distinct loop variable names keep the psql_command function from being shadowed.
for select_command, insert_command in commands:
    psql_command(cur_msql, cur_psql, select_command, insert_command)

This reduces your line count dramatically and makes it easier to change how the code works. As for your one operation that doesn't fit this pattern, I'd personally make it a separate function. When you have more exceptions like it, the question to ask is whether it's more work to make one function fit multiple conditions or to write a specific function for the new case, even if that repeats some code.
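One possible middle ground between those two options, sketched here as an assumption rather than as part of the answer above, is to keep a single generic function and give it an optional per-row hook. The names copy_rows and group_as_owner are hypothetical, the tables are cut down to a few columns, and SQLite stands in for both databases so the sketch is runnable (SQLite writes named placeholders as :name where psycopg2 uses %(name)s). It uses Python 3 syntax.

```python
import sqlite3


def copy_rows(src_cur, dst_cur, select_sql, insert_sql, transform=None):
    """Copy rows from source to destination, optionally adjusting each row."""
    src_cur.execute(select_sql)
    names = [column[0] for column in src_cur.description]
    copied = 0
    for values in src_cur.fetchall():
        row = dict(zip(names, values))    # same shape as a dictionary cursor row
        if transform is not None:
            row = transform(row)          # table-specific logic lives here
        dst_cur.execute(insert_sql, row)  # extra keys in row are ignored
        copied += 1
    return copied


def group_as_owner(row):
    """Special case for rights: a user_id of 0 means the owner is a group."""
    if row["user_id"] == 0:
        row["user_id"] = row["group_id"]
    return row


# Stand-ins for the two databases, with made-up sample rows.
src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
src.execute("CREATE TABLE rights (resource_id INTEGER, user_id INTEGER, group_id INTEGER)")
src.executemany("INSERT INTO rights VALUES (?, ?, ?)", [(1, 42, 0), (2, 0, 7)])
dst.execute("CREATE TABLE permissions (resource_id INTEGER, owner_id INTEGER)")

copied = copy_rows(
    src.cursor(), dst.cursor(),
    "SELECT resource_id, user_id, group_id FROM rights ORDER BY resource_id",
    "INSERT INTO permissions (resource_id, owner_id) VALUES (:resource_id, :user_id)",
    transform=group_as_owner,
)
dst.commit()

permissions = dst.execute(
    "SELECT resource_id, owner_id FROM permissions ORDER BY resource_id"
).fetchall()
print(copied, permissions)  # 2 [(1, 42), (2, 7)]
```

Tables that need no adjustment simply omit the transform argument, so the common path stays as short as before.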

200_success · Answer 3 · 2015-10-13 20:33:38Z

First of all, consider using a different migration method altogether. A Foreign Data Wrapper would let you query MySQL tables as if they were part of the PostgreSQL database.

You are trying to do two things at once: transfer the data, and change the schema. For clarity, consider doing them as separate steps. One way to do it would be INSERT INTO TEMPORARY TABLE tmp_memberships … to copy the data into PostgreSQL verbatim (which you can skip if using a Foreign Data Wrapper), followed by

INSERT INTO memberships (group_id, user_id, update_date, update_member_id, approval_date, status)
                 SELECT  group_id, user_id, update_date, update_user_id, update_date, status
                     FROM tmp_memberships;

… to rename and add a column.
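The two-step idea can be tried out end to end with a small sketch. SQLite is used here as a stand-in for PostgreSQL so the example is self-contained, the column types are simplified guesses, and the single sample row is made up. The staging table mirrors the old layout and the reshaping happens in one INSERT ... SELECT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Final table with the new column names (types simplified for the sketch).
cur.execute("""CREATE TABLE memberships (
    group_id INTEGER, user_id INTEGER, update_date TEXT,
    update_member_id INTEGER, approval_date TEXT, status TEXT)""")

# Step 1: a staging table that mirrors the old layout, loaded verbatim.
cur.execute("""CREATE TEMPORARY TABLE tmp_memberships (
    group_id INTEGER, user_id INTEGER, update_date TEXT,
    update_user_id INTEGER, status TEXT)""")
cur.execute(
    "INSERT INTO tmp_memberships VALUES (?, ?, ?, ?, ?)",
    (7, 42, "2015-10-07", 1, "active"),
)

# Step 2: reshape inside the database. update_user_id is renamed and
# update_date is reused to fill the new approval_date column.
cur.execute("""INSERT INTO memberships
    (group_id, user_id, update_date, update_member_id, approval_date, status)
    SELECT group_id, user_id, update_date, update_user_id, update_date, status
    FROM tmp_memberships""")
conn.commit()

migrated = cur.execute("SELECT * FROM memberships").fetchall()
print(migrated)  # [(7, 42, '2015-10-07', 1, '2015-10-07', 'active')]
```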

I see why you think I'm trying to do two things at once; I omitted some important information, excuse me! The strategy is to first import (restore) the new database schema into PostgreSQL, with no data and no foreign keys. Only then will I run this script to migrate the data, and at the end I will have to "re-activate" the foreign keys (that sounds scary now o,O ) – lese 1 hour ago

That's fine. In fact, disabling indexes and constraints is the recommended approach if speed matters. – 200_success♦ 1 hour ago

Great links man, I will read everything very carefully. About your recommendation to use only one transaction, I was under the impression that I was already doing it: psql commits only at the end, and the script exits without committing if an exception is raised. I'm still missing the rollback command inside the exception block – lese 1 hour ago

I stand corrected! psycopg2 does not do autocommit by default. – 200_success♦ 1 hour ago

You still helped me figure out that I'm missing the rollback. Thank you! – lese 1 hour ago
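The commit-at-the-end plus rollback-on-error pattern discussed in these comments might look roughly like the sketch below. The function name migrate_all and the sample statements are hypothetical, and SQLite stands in for PostgreSQL so the example can run anywhere. With psycopg2 the connection offers the same commit() and rollback() methods, and the exception to catch would be psycopg2.Error.

```python
import sqlite3


def migrate_all(dst_conn, statements):
    """Run every (sql, params) pair in one transaction: all or nothing."""
    cur = dst_conn.cursor()
    try:
        for sql, params in statements:
            cur.execute(sql, params)
    except sqlite3.Error as exc:  # psycopg2.Error when targeting PostgreSQL
        dst_conn.rollback()       # undo every insert made so far
        print("migration failed, rolled back:", exc)
        return False
    else:
        dst_conn.commit()         # a single commit once everything succeeded
        return True
    finally:
        cur.close()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_keys (resource_id INTEGER PRIMARY KEY, user_key TEXT)")
conn.commit()

insert = "INSERT INTO user_keys (resource_id, user_key) VALUES (?, ?)"
# The second statement reuses the primary key, so it fails on purpose.
ok = migrate_all(conn, [(insert, (1, "a")), (insert, (1, "duplicate id"))])

remaining = conn.execute("SELECT COUNT(*) FROM user_keys").fetchone()[0]
print(ok, remaining)  # False 0
```

Because the first insert is rolled back together with the failing one, the target table is left exactly as it was before the run.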
