Cool Python Module: html2text

Tags:


This content is part of the Python Zone, which is presented to you by DZone and New Relic. Visit the Python Zone for news, tips, and tutorials on the Python programming language.  provides the resources and best practices to help you monitor these applications.

Parts of the Fiesta API (as well as some new features we’re rolling out for Fiesta itself) rely on the ability to automatically generate clean text (markdown) versions of incoming messages. Our parser tends to prefer using the text version of a message if possible, as it’s generally easier to parse than the HTML version. That said, sometimes a message only contains an HTML version - we need a way to generate our canonical markdown representation from that alone. Enter html2text.

html2text is a great little Python module by Aaron Swartz that takes HTML as input and generates a markdown version as output. Here’s an example:

>>> import html2text
>>> print html2text.html2text('''<html><body><p>Hello World</p>
<ul><li>Here's one thing</li>
<li>And here's another!</li></ul></body></html>''')
Hello World

  * Here's one thing
  * And here's another!

The output is nicely formatted markdown text, exactly what we were looking for. The only problem we’ve noticed is that the module has some trouble dealing with malformed HTML. Our approach has been to run things through BeautifulSoup first, which tends to do a great job even with crappy markup.

 

Source: http://blog.fiesta.cc/post/14575090271/cool-python-module-html2text

0

Python is a fast, powerful, dynamic, and versatile programming language that is being used in a variety of application domains. It has flourished as a beginner-friendly language that is penetrating more and more industries. The Python Zone is a community that features a diverse collection of news, tutorials, advice, and opinions about Python and Django. The Python Zone is sponsored by New Relic, the all-in-one web application performance tool that lets you see performance from the end user experience, through servers, and down to the line of application code.

Mike is co-founder of Fiesta - https://fiesta.cc. Before Fiesta, he was a developer working on the MongoDB project. Mike is a DZone MVB and is not an employee of DZone and has posted 8 posts at DZone.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)