Not sure why the Stack Overflow thread does not include the ';' in the search/replace (i.e. lambda m: '&#%d;'). If you leave it out, BeautifulSoup can barf because the character right after the entity can be interpreted as part of the character reference (e.g. the B in &#39Blackout).
This worked better for me:
import re
from BeautifulSoup import BeautifulSoup
html_string = '<a href="/cgi-bin/article.cgi?f=/c/a/2010/12/13/BA3V1GQ1CI.DTL"title="">&#x27;Blackout in a can&#x27; on some shelves despite ban</a>'
hexentityMassage = [(re.compile('&#x([^;]+);'),
                     lambda m: '&#%d;' % int(m.group(1), 16))]
soup = BeautifulSoup(html_string,
                     convertEntities=BeautifulSoup.HTML_ENTITIES,
                     markupMassage=hexentityMassage)
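- int(m.group(1), 16) parses the captured hex digits (base 16) into an integer, which '&#%d;' then writes back out in decimal.
- m.group(0) returns the entire match; m.group(1) returns the first capturing group, here the hex digits between '&#x' and ';'.
- Basically, using markupMassage is the same as running this substitution on the string before parsing: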
html_string = re.sub('&#x([^;]+);', lambda m: '&#%d;' % int(m.group(1), 16), html_string)
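To see what the semicolon changes, here is a minimal standalone sketch that uses only the re module, so it runs without BeautifulSoup. The sample string, the helper names and the stricter [0-9A-Fa-f]+ pattern are my own assumptions for illustration, not something from the thread.

```python
import re

# Hypothetical sample input; only the hex entities matter here.
html_string = '<a href="#">&#x27;Blackout in a can&#x27; on some shelves</a>'

# Assumption: match hex digits only, a stricter variant of '&#x([^;]+);'.
HEX_ENTITY = re.compile(r'&#x([0-9A-Fa-f]+);')

def to_decimal_entity(match):
    # group(1) holds the hex digits; int(..., 16) parses them as base 16.
    return '&#%d;' % int(match.group(1), 16)

def to_decimal_entity_no_semicolon(match):
    # Same conversion, but the terminating ';' is dropped.
    return '&#%d' % int(match.group(1), 16)

with_semicolon = HEX_ENTITY.sub(to_decimal_entity, html_string)
without_semicolon = HEX_ENTITY.sub(to_decimal_entity_no_semicolon, html_string)

print(with_semicolon)     # <a href="#">&#39;Blackout in a can&#39; on some shelves</a>
print(without_semicolon)  # <a href="#">&#39Blackout in a can&#39 on some shelves</a>
```

In the second output the entity runs straight into the following text, which is the situation that trips up the parser.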