Convert XML/HTML Entities into Unicode String in Python

Question

I'm doing some web scraping, and sites frequently use HTML entities to represent non-ASCII characters. Does Python have a utility that takes a string with HTML entities and returns a unicode type?

For example:

I get back:

&#x01ce;

which represents an "ǎ" with a tone mark. Its code point is U+01CE, i.e. the 16-bit value 01ce in hexadecimal. I want to convert the HTML entity into the value u'\u01ce'.

dF. · Accepted Answer · 2008-09-12 01:40:41Z

up vote 40 down vote accepted

Python has the htmlentitydefs module, but this doesn't include a function to unescape HTML entities.

Python developer Fredrik Lundh (author of elementtree, among other things) has such a function on his website, which works with decimal, hex and named entities.
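For illustration, here is a minimal sketch of the kind of regex-based function described above. It is not Fredrik Lundh's code; it assumes Python 3, where htmlentitydefs is available as html.entities and unichr() is simply chr(). The name unescape_entities is hypothetical.

```python
import re
from html.entities import name2codepoint  # Python 3 name for htmlentitydefs

# Matches hex (&#x01ce;), decimal (&#462;) and named (&copy;) references.
_ENTITY_RE = re.compile(r"&(#x[0-9a-fA-F]+|#[0-9]+|\w+);")


def unescape_entities(text):
    """Replace character references in text; unknown ones are left as-is."""
    def fixup(match):
        body = match.group(1)
        try:
            if body.startswith("#x"):
                return chr(int(body[2:], 16))   # hexadecimal reference
            if body.startswith("#"):
                return chr(int(body[1:]))       # decimal reference
            return chr(name2codepoint[body])    # named entity
        except (KeyError, ValueError, OverflowError):
            return match.group(0)  # unknown name or out-of-range code point
    return _ENTITY_RE.sub(fixup, text)


print(repr(unescape_entities("&#x01ce; &#462; &copy; &nosuch;")))
```

Unrecognized names and out-of-range code points are returned unchanged rather than raising, so the function can be run over arbitrary scraped text.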


That function works wonderfully. Long live Fredrik – Vinko Vrsalovic Oct 16 '08 at 9:39

Absolutely. Why is it not in the stdlib? – smci Aug 13 '12 at 11:21

Looking at its code, it doesn't seem to work with &amp; and such, does it? – jnns Jun 14 at 12:22

Vladislav Polukhin · Answer 2 · 2012-09-27 05:34:44Z

up vote 16 down vote

The standard lib’s very own HTMLParser has an undocumented function unescape() which does exactly what you think it does:

import HTMLParser
h = HTMLParser.HTMLParser()
h.unescape('&copy; 2010') # u'\xa9 2010'
h.unescape('&#169; 2010') # u'\xa9 2010'


It also works for hex entities. The implementation is very similar to the unescape() function from @dF.'s answer. – J.F. Sebastian Oct 2 '12 at 21:26

This method isn't documented in Python's HTMLParser documentation, and there's a comment in the source stating it's intended for internal use. However, it works like a treat in Python 2.6 through 2.7, and is probably the best solution out there. Prior to version 2.6, it would only decode named entities like &amp; or &gt;. – Aram Dulyan Oct 17 '12 at 0:34
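For readers on Python 3 (an assumption, since this thread is about Python 2), the same job is done by a documented function, html.unescape(), available since Python 3.4. A short sketch using the inputs from this thread:

```python
import html

samples = ["&copy; 2010", "&#169; 2010", "&#x01ce;", "&amp; &gt;"]

# html.unescape handles named, decimal and hexadecimal references alike.
decoded = [html.unescape(s) for s in samples]

for raw, text in zip(samples, decoded):
    print(repr(raw), "->", repr(text))
```

Since it is part of the public API, it does not carry the "internal use" caveat of HTMLParser.unescape().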

chryss · Answer 3 · 2008-09-11 23:09:08Z

up vote 15 down vote

Use the builtin unichr -- BeautifulSoup isn't necessary:

>>> entity = '&#x01ce'
>>> unichr(int(entity[3:],16))
u'\u01ce'


But that requires you to automatically and unambiguously know where in the string the encoded Unicode character is/are - which you can't know. And you need to try...catch the resulting exception for when you get it wrong. – smci Aug 13 '12 at 11:22
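One way to address that objection is to let a regular expression locate the references and to handle bad values where they occur. This is only a sketch, assuming Python 3 (chr() instead of unichr()); the names NUMERIC_REF and decode_numeric_refs are hypothetical, and named entities are deliberately not handled.

```python
import re

# Numeric character references only: &#x01ce; (hex) or &#462; (decimal).
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def decode_numeric_refs(text):
    """Decode numeric references wherever they appear in text."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        try:
            code = int(hex_digits, 16) if hex_digits else int(dec_digits)
            return chr(code)
        except (ValueError, OverflowError):
            # Code point outside the Unicode range: keep the original text.
            return match.group(0)
    return NUMERIC_REF.sub(replace, text)


print(repr(decode_numeric_refs("n&#x01ce;o and &#462;")))
```

The regex requires the trailing ';', so a bare '&#x01ce' is left untouched instead of being guessed at.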

J.F. Sebastian · Answer 4 · 2011-10-10 17:07:39Z

You could find an answer here -- Getting international characters from a web page?

EDIT: It seems like BeautifulSoup doesn't convert entities written in hexadecimal form. It can be fixed:

import copy, re
from BeautifulSoup import BeautifulSoup

hexentityMassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
# replace hexadecimal character reference by decimal one
hexentityMassage += [(re.compile('&#x([^;]+);'), 
                     lambda m: '&#%d;' % int(m.group(1), 16))]

def convert(html):
    return BeautifulSoup(html,
        convertEntities=BeautifulSoup.HTML_ENTITIES,
        markupMassage=hexentityMassage).contents[0].string

html = '<html>&#x01ce;&#462;</html>'
print repr(convert(html))
# u'\u01ce\u01ce'

EDIT:

The unescape() function mentioned by @dF, which uses the htmlentitydefs standard module and unichr(), might be more appropriate in this case.

This solution doesn't work with the example: print BeautifulSoup('<html>&#x01ce;</html>', convertEntities=BeautifulSoup.HTML_ENTITIES). This returns the same HTML entity.

pragmar · Answer 5 · 2012-02-09 19:01:45Z

up vote 6 down vote

An alternative, if you have lxml:

>>> import lxml.html
>>> lxml.html.fromstring('&#x01ce').text
u'\u01ce'

edited Feb 9 '12 at 19:01

answered Feb 9 '12 at 18:55


karlcow · Answer 6 · 2009-02-21 19:45:58Z

This is a function which should help you to get it right and convert entities back to utf-8 characters.

import re
import htmlentitydefs

def unescape(text):
   """Removes HTML or XML character references 
      and entities from a text string.
   @param text The HTML (or XML) source text.
   @return The plain text, as a Unicode string, if necessary.
   from Fredrik Lundh
   2008-01-03: input only unicode characters string.
   http://effbot.org/zone/re-sub.htm#unescape-html
   """
   def fixup(m):
      text = m.group(0)
      if text[:2] == "&#":
         # character reference
         try:
            if text[:3] == "&#x":
               return unichr(int(text[3:-1], 16))
            else:
               return unichr(int(text[2:-1]))
         except ValueError:
            pass
      else:
         # named entity
         # reescape the reserved characters.
         try:
            if text[1:-1] == "amp":
               text = "&amp;amp;"
            elif text[1:-1] == "gt":
               text = "&amp;gt;"
            elif text[1:-1] == "lt":
               text = "&amp;lt;"
            else:
               text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
         except KeyError:
            pass
      return text # leave as is
   return re.sub("&#?\w+;", fixup, text)

because the person wanted the character in unicode instead of utf-8 characters. I guess :)

Balthazar Rouberol · Answer 7 · 2013-03-14 15:58:17Z

Not sure why the Stack Overflow thread does not include the ';' in the search/replace (i.e. lambda m: '&#%d*;*'). If you don't, BeautifulSoup can barf because the adjacent character can be interpreted as part of the HTML code (e.g. &#39B for &#39Blackout).

This worked better for me:

import re
from BeautifulSoup import BeautifulSoup

html_string='<a href="/cgi-bin/article.cgi?f=/c/a/2010/12/13/BA3V1GQ1CI.DTL"title="">&#x27;Blackout in a can; on some shelves despite ban</a>'

hexentityMassage = [(re.compile('&#x([^;]+);'),
                     lambda m: '&#%d;' % int(m.group(1), 16))]

soup = BeautifulSoup(html_string,
                     convertEntities=BeautifulSoup.HTML_ENTITIES,
                     markupMassage=hexentityMassage)

The int(m.group(1), 16) call converts the hexadecimal (base-16) digits back to an integer.
m.group(0) returns the entire match; m.group(1) returns the first capturing group of the regexp.
Basically, using markupMassage is the same as:
html_string = re.sub('&#x([^;]+);', lambda m: '&#%d;' % int(m.group(1), 16), html_string)
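The substitution can be tried on its own, without BeautifulSoup. A small sketch, assuming Python 3; HEX_REF and hex_refs_to_decimal are hypothetical names, while the pattern and lambda are the ones used above.

```python
import re

HEX_REF = re.compile(r"&#x([^;]+);")

# group(0) is the whole match, group(1) only the captured hex digits.
match = HEX_REF.search("&#x27;Blackout")
whole, digits = match.group(0), match.group(1)


def hex_refs_to_decimal(html_string):
    """Rewrite hexadecimal character references as decimal ones."""
    return HEX_REF.sub(lambda m: "&#%d;" % int(m.group(1), 16), html_string)


converted = hex_refs_to_decimal("&#x27;Blackout in a can; &#x01ce;")
print(whole, digits, converted)
```

Note that [^;]+ also matches non-hex text such as '&#xzz;', in which case int() raises ValueError; restricting the group to [0-9a-fA-F]+ would avoid that.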

