Python/Django/MySQL “Incorrect string value” error

Question

I'm running a Django 1.4.2/Python 2.7.3/MySQL 5.5.28 site. One of the site's features is that the admin can send an email to the server, where procmail calls a Python script that parses the email and stores it in the DB. I maintain two versions of the site: development and production. The two sites use separate but identical virtualenvs (I even deleted both and reinstalled all packages just to make sure).

I'm experiencing a weird issue. The exact same script succeeds on the dev server and fails on the production server. It fails with this error:

...django/db/backends/mysql/base.py:114: Warning: Incorrect string value: '\x92t kno...' for column 'message' at row 1

I'm well aware of the Unicode issues Django has, and I know there are a ton of questions here on SO about this error, but I made sure to set up the database as UTF-8 from the beginning:

mysql> show variables like "character_set_database";
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| character_set_database | utf8  |
+------------------------+-------+
1 row in set (0.00 sec)

mysql> show variables like "collation_database";
+--------------------+-----------------+
| Variable_name      | Value           |
+--------------------+-----------------+
| collation_database | utf8_general_ci |
+--------------------+-----------------+
1 row in set (0.00 sec)

Additionally, I know that each column can have its own charset, but the message column is indeed UTF-8:

mysql> show full columns in listserv_post;
+------------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+
| Field      | Type         | Collation       | Null | Key | Default | Extra          | Privileges                      | Comment |
+------------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+
| id         | int(11)      | NULL            | NO   | PRI | NULL    | auto_increment | select,insert,update,references |         |
| thread_id  | int(11)      | NULL            | NO   | MUL | NULL    |                | select,insert,update,references |         |
| timestamp  | datetime     | NULL            | NO   |     | NULL    |                | select,insert,update,references |         |
| from_name  | varchar(100) | utf8_general_ci | NO   |     | NULL    |                | select,insert,update,references |         |
| from_email | varchar(75)  | utf8_general_ci | NO   |     | NULL    |                | select,insert,update,references |         |
| message    | longtext     | utf8_general_ci | NO   |     | NULL    |                | select,insert,update,references |         |
+------------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+
6 rows in set (0.00 sec)

Does anyone have any idea why I'm getting this error? Why is it happening under the production config but not the dev config?

Thanks!

[edit 1]
To be clear, the data are the same as well. I send a single email to the server, and procmail sends it off. This is what the .procmailrc looks like:

VERBOSE=off
:0
{
    :0c
    | <path>/dev/ein/scripts/process_new_mail.py dev > outputdev

    :0
    | <path>/prd/ein/scripts/process_new_mail.py prd > outputprd
}

There are 2 copies of process_new_mail.py, but that's just because it's version controlled so that I can maintain two separate environments. If I diff the two output files (which contain the message received), they're identical.

[edit 2] I actually just discovered that both dev and prd configs are failing. The difference is that the dev config fails silently (maybe having to do with the DEBUG setting?). The problem is that there are some unicode characters in one of the messages, and Django is choking on them for some reason. I'm making progress....

I've tried editing the code to explicitly encode the message as ASCII and UTF-8, but it's still not working. I'm getting closer, though.

You say the code is identical, but what about the data? Maybe the bug exists in the dev environment as well, you only don't know about it. Try adding that row to the dev environment and see if the same error happens. — mgibsonbr, Nov 19 '12 at 23:47
The data are exactly the same, too. I'm sending an email to the server, and I have procmail calling both the dev and production scripts simultaneously. I've even outputted the message just to make sure and diffed them, and they're identical. — Geoff, Nov 19 '12 at 23:49
My intuition may be wrong, but it just seems like it should be a DB issue. There are two different DBs, one for dev and one for prd, but I can't spot a difference between the two that would cause this problem. — Geoff, Nov 19 '12 at 23:57
Also, to better determine whether the error is happening in the database or the python code, try swapping the DBs: have the dev environment access the prd database and vice-versa, and see which one fails. — mgibsonbr, Nov 20 '12 at 0:03
Are you using the same connection options for both databases? Connection options like charset and use_unicode will make a difference. — Pedro Romano, Nov 20 '12 at 0:13
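For reference, the connection options mentioned in the last comment live under OPTIONS in Django's DATABASES setting and are passed through to the MySQLdb connection. The fragment below is only a sketch: the database name, user, password, and host are placeholders, not values from the question. To my knowledge the MySQL backend already uses charset 'utf8' and use_unicode=True when nothing is specified, so spelling them out mainly helps when comparing two environments.

```python
# settings.py fragment. NAME, USER, PASSWORD and HOST are hypothetical
# placeholders; only the OPTIONS keys relate to the comment above.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "example_db",
        "USER": "example_user",
        "PASSWORD": "example_password",
        "HOST": "localhost",
        # Passed through to the MySQLdb connection.
        "OPTIONS": {
            "charset": "utf8",
            "use_unicode": True,
        },
    }
}
```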

Geoff · Accepted Answer · 2012-11-20 20:15:02Z

I fixed it! The problem was that I wasn't decoding the email body according to its declared charset. My fixed email-parsing code is adapted from the two posts credited in the code comments below:

#get the charset of an email
#courtesy http://ginstrom.com/scribbles/2007/11/19/parsing-multilingual-email-with-python/
def get_charset(message, default='ascii'):
    if message.get_content_charset():
        return message.get_content_charset()

    if message.get_charset():
        #get_charset() returns a Charset object, so convert it to its name.
        return str(message.get_charset())

    return default

#courtesy http://stackoverflow.com/questions/7166922/extracting-the-body-of-an-email-from-mbox-file-decoding-it-to-plain-text-regard
def get_body(message):
    body = None

    #Walk through the parts of the email to find the text body.
    if message.is_multipart():
        for part in message.walk():
            #If part is multipart, walk through the subparts.
            if part.is_multipart():
                for subpart in part.walk():
                    if subpart.get_content_type() == 'text/plain':
                        #Get the subpart payload (i.e., the message body).
                        charset = get_charset(subpart, get_charset(message))
                        body = unicode(subpart.get_payload(decode=True), charset)
            #Part isn't multipart so get the email body.
            elif part.get_content_type() == 'text/plain':
                charset = get_charset(part, get_charset(message))
                body = unicode(part.get_payload(decode=True), charset)
    #If this isn't a multi-part message then get the payload (i.e., the message body).
    elif message.get_content_type() == 'text/plain':
        charset = get_charset(message)
        body = unicode(message.get_payload(decode=True), charset)

    return body
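To illustrate why decoding by charset matters here: the byte \x92 in the warning is the Windows-1252 encoding of the right single quotation mark (U+2019), and on its own it is not valid UTF-8. The sketch below is a compact variant of the same idea written for Python 3, where unicode() no longer exists and bytes.decode() is used instead. The function name extract_text_body and the sample message are made up for illustration. It also returns the first text/plain part rather than the last, and it uses a single loop because Message.walk() already visits nested parts recursively.

```python
import email


def get_charset(part, default="ascii"):
    """Return the charset declared on a message part, or a default."""
    return part.get_content_charset() or default


def extract_text_body(message):
    """Return the first text/plain part of a message as decoded text."""
    fallback = get_charset(message)
    # walk() yields the message itself, then every nested part.
    for part in message.walk():
        if part.get_content_type() == "text/plain":
            # decode=True undoes the transfer encoding and returns bytes.
            payload = part.get_payload(decode=True)
            # Undecodable bytes become U+FFFD instead of raising an error.
            return payload.decode(get_charset(part, fallback), "replace")
    return None


# Hypothetical sample message: a Windows-1252 body containing byte 0x92.
raw = (
    b"From: sender@example.com\r\n"
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=\"windows-1252\"\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"I don\x92t know\r\n"
)

message = email.message_from_bytes(raw)
body = extract_text_body(message)

# The decoded text can now be encoded as real UTF-8 for the database.
utf8_bytes = body.encode("utf-8")
```

Once the body is decoded text rather than raw Windows-1252 bytes, the quotation mark is stored as a proper three-byte UTF-8 sequence, which a utf8 column accepts.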

Thanks very much for the help!

asked	1 year ago
viewed	1158 times
active	1 year ago
