Improving Python script to drop HTML part of multipart/mixed e-mails

Question

I have a working script that I wrote in Python which for e-mail messages that consist only of a plain text part + an HTML part, discards the HTML and keeps only the plain text part.

The script is not exactly ellegant and, as can be seen from the code, it smells like it is C (in particular, I simulate the use of bitmasks) and I am not exactly satisfied with some points of it.

I know that the script has some issues (like code duplication and the hack mentioned above), but I don't know the idiomatic Pythonic way of writing it and I would appreciate any kind of criticism to improve it in any way to make the code ellegant.

So, without further ado, the code is:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""
Author: Rogério Theodoro de Brito <[email protected]>
License: GPL-2+
Copyright: 2010-2012 Rogério Theodoro de Brito

drop-alternatives is a simple Python script for those who hate emails in
HTML and who prefer their inbox to have as many messages in pure text as
feasible. This script is generally meant to be run as a filter with procmail
or some other mail delivery agent.

It tries to be moderately conservative and only act when things are
moderately safe:

* If the message is `multipart` and has a `text/plain` and a `text/html`
  part, keep the `text/plain` part only.

* In all other cases keep the message intact.
"""

import email
import email.message


def compose_message(orig, body):
    """
    Create new message with headers from `orig` and body from `body`.

    * `orig`: The original message.
    * `body`: The body that we want the new message to have.
    * Returns a new message.

    `compose_message` creates a new message with most of the fields from
    `orig`, with fields from `body` (if any) and with the payload of
    `body`. The fields excluded from `orig` are the following:

    * `content-length`
    * `content-type`
    * `lines`
    * `status`
    """
    wanted = email.message.Message()
    wanted.set_payload(body.get_payload())

    unwanted_fields = ["content-length", "content-type", "lines", "status"]

    # The dictionaries `orig` and `body` have only headers as their items.
    for field in unwanted_fields:
        del orig[field]

    for k, v in orig.items():
        wanted[k] = v
    for k, v in body.items():
        wanted[k] = v

    return wanted


def sanitize(msg):
    """
    Given an RFC-2822 message `msg`, generate its 'sanitized' version.

    * `msg`: The message to be sanitized.
    * Returns a sanitized version of `msg`.

    `sanitize` tries to be moderately conservative and only act when things
    are moderately safe:

    * If the message is multipart and has a `text/plain` and a `text/html`
      part, keep the `text/plain` part only.

    * In all other cases keep the message intact.
    """
    if not msg.is_multipart():
        return msg

    # 'composition' is a bitmask containing the kind of the parts
    TEXTPLAIN = 1  # text/plain
    TEXTHTML = 2  # text/html
    MISCPARTS = 4  # anything else

    composition = 0
    text_taken = False

    for part in msg.walk():
        if (part.get_content_maintype() == "multipart" or
            part.get_content_type() == "message/external-body" or
            part.get_payload() == ""):
            continue
        elif part.get_content_type() == "text/plain":
            if not text_taken:
                text_taken = True
                body = part
                composition |= TEXTPLAIN
            else:
                # if we are seeing a second text/plain part, stop throwing
                # things
                composition |= MISCPARTS
                break
        elif part.get_content_type() == "text/html":
            composition |= TEXTHTML
        else:
            composition |= MISCPARTS

    if composition == (TEXTPLAIN + TEXTHTML) or composition == TEXTPLAIN:
        return compose_message(msg, body)
    else:
        return msg


if __name__ == "__main__":
    import sys
    res = sanitize(email.message_from_file(sys.stdin))
    print res.as_string(unixfrom=False),

Konrad Rudolph · Accepted Answer · 2012-06-23 12:22:09Z

I think the code is quite fine as it is. Some small notes:

if composition == (TEXTPLAIN + TEXTHTML) or composition == TEXTPLAIN:

Just to make sure: this code tests that TEXTPLAIN is set and that MISCPARTS is not set. I would make that explicit, otherwise this code is hard to understand (in particular, it’s easy to miss that MISCPARTS mustn’t be set):

if (composition & TEXTPLAIN) != 0 and (composition & MISCPARTS) == 0:

If you’re not happy with bit operations, you can a set instead.

composition = set()

# …

if TEXTPLAIN in composition and MISCPARTS not in composition:

… and define TEXTPLAIN etc. as simple consecutive constants rather than bitmasks.

This is more pythonic certainly, but I actually find the use of bit masks entirely appropriate here.

Thanks, @Konrad. I thought that the bitfields would be condemned in python, but as you mention right now, they even feel a bit natural. BTW, in C I would use an enum for the constants. Is there any Pythonic idiom with the same "flavour" as C's enums? — rbrito, Jun 24 '12 at 15:58

rahul · Answer 2 · 2012-06-24 04:19:33Z

Your code is fine as it is, and there isn't much I can recommend. However, I believe that this may profit from being made into a class.

#!/usr/bin/env python
import email
import email.message

class MyMail:
    unwanted_fields = ["content-length", "content-type", "lines", "status"]

    def __init__(self, fp):
        self.res = self.sanitize(email.message_from_file(fp))

    def display(self):
        print self.res.as_string(unixfrom=False)

    def compose_message(self, orig, body):
        wanted = email.message.Message()
        wanted.set_payload(body.get_payload())

        # The dictionaries `orig` and `body` have only headers as their items.
        for field in self.unwanted_fields: del orig[field]
        for k, v in orig.items() + body.items(): wanted[k] = v

        return wanted

Your code does not really care about text/html mime, other than to skip it. Other than that, it cares that text/plain is seen only once. This seemed to be an overkill for bit fiddling. Removing that,

    def sanitize(self, msg):
        if not msg.is_multipart(): return msg

        compose = False
        text_taken = False

        for part in msg.walk():
            if (part.get_content_maintype() == "multipart" or
                part.get_content_type() == "message/external-body" or
                part.get_content_type() == "text/html" or
                part.get_payload() == ""):
                continue
            elif part.get_content_type() == "text/plain" and not text_taken:
                 body = part
                 compose = True
                 text_taken = True
                 # if we are seeing a second text/plain part, stop throwing
                 # things
            else: return msg

        return self.compose_message(msg, body) if compose else msg

if __name__ == "__main__":
    import sys
    s = MyMail(sys.stdin)
    s.display()

I’m not sure I agree. While this looks OK and conventional, there’s this great talk arguing that you should Stop Writing Classes. It makes a lot of good points. Well, it could be argued that the presence of display makes a class here worthwhile. Anyway, +1. — Konrad Rudolph, Jun 24 '12 at 8:46
Thanks for analysing my code and concluding that it is not utter terrible. :) I think that, as with @Konrad, I will still keep the code outside of a class (I may change my mind in the future), but I will adopt your absense of bitfields. Your solution is more elegant, IMVHO. — rbrito, Jun 24 '12 at 16:10

asked	2 years ago
viewed	224 times
active	2 years ago

current community

your communities

more stack exchange communities

Improving Python script to drop HTML part of multipart/mixed e-mails

2 Answers 2

Your Answer

Not the answer you're looking for? Browse other questions tagged python or ask your own question.

Hot Network Questions

current community

your communities

more stack exchange communities

Improving Python script to drop HTML part of multipart/mixed e-mails

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged python or ask your own question.

Related

Hot Network Questions