UnicodeDecodeError: Python HTML Parsing

Question

I'm using the HTMLParser class from html.parser to get the data out of a collection of HTML files. It goes pretty well until a file comes along that throws an error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8a in position 419: invalid start byte

My code goes as follows:

import re
import string
from collections import Counter
from html.parser import HTMLParser

wordList = Counter()  # assumed: module-level counter shared across files

class customHTML(HTMLParser):
    # Parses the data found
    def handle_data(self, data):
        data = data.strip()
        if(data):
            splitData = data.split()
            # Remove punctuation!
            for i in range(len(splitData)):
                splitData[i] = re.sub('[%s]' % re.escape(string.punctuation), '', splitData[i])
            newCounter = Counter(splitData)
            global wordList
            wordList += newCounter


This is in main:

for aFile in os.listdir(inputDirectory):
    if aFile.endswith(".html"):     
        parser = customHTML(strict=False)
        infile = open(inputDirectory+"/"+aFile)
        for line in infile:
            parser.feed(line)

The parser.feed(line) call is where everything breaks, and it's always the same UnicodeDecodeError. I have no control over what the HTML files contain, so I need a way to get their contents into the parser regardless. Any ideas?

Ryne Everett · Answer 1 · 2014-02-11 02:32:52Z


While subclassing HTMLParser might be a good exercise, if your HTML isn't UTF-8 I'd advise using the BeautifulSoup parser, which is quite good at detecting the encoding automatically.
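As a quick check of what "not UTF-8" means here, the byte from the traceback can be decoded on its own. This is only an illustration: cp1252 is a guess at a plausible legacy encoding, not something the question confirms.

```python
# The byte reported in the traceback, decoded in isolation.
raw = b"\x8a"

try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    # 0x8a is a continuation byte in UTF-8, so it cannot start a character.
    valid_utf8 = False

# Assumption: the files might be Windows-1252, where 0x8a is a letter.
as_cp1252 = raw.decode("cp1252")

print(valid_utf8)       # False
print(repr(as_cp1252))  # 'Š'
```

If the files really are in a single-byte legacy encoding, every byte above 0x7f is a candidate for this error, which would explain why only some files fail.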


larsks · Answer 2 · 2014-02-11 02:21:46Z

This is a relatively common problem, and quite a few SO threads cover it. Check out this one: Python: Is there a way to determine the encoding of text file?
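If a detection library is not an option, one workaround is to read each file as bytes and try a short list of encodings before feeding the text to the parser. The sketch below is a minimal version of that idea; `decode_html`, `TextCollector`, and the candidate list are hypothetical names and choices made for illustration, and the right candidates depend on where the files came from.

```python
from html.parser import HTMLParser


def decode_html(raw, candidates=("utf-8", "cp1252")):
    """Return (text, encoding) using the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this never raises,
    # though the result may be wrong for some characters.
    return raw.decode("latin-1"), "latin-1"


class TextCollector(HTMLParser):
    """Hypothetical stand-in for the question's customHTML parser."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        data = data.strip()
        if data:
            self.chunks.append(data)


# Bytes as they would come from open(path, "rb").read()
raw = b"<p>caf\xe9 \x8a</p>"
text, used = decode_html(raw)

parser = TextCollector()
parser.feed(text)
parser.close()
print(used, parser.chunks)
```

Reading in binary mode moves the decoding step out of the file iteration, so a bad byte can no longer raise in the middle of the loop. Trying the encodings in order is a heuristic, not detection: a file can decode without error and still be in a different encoding.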

I'd like to take a moment to comment on your code as well.

Python does not need parentheses around conditionals. Use

if foo:
    action()

not

if (foo):
    action()

You should declare the global once at the top of the function or method, not partway through its body.
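A different option, offered here as an alternative rather than what this answer prescribes, is to drop the global and keep the counts on the parser instance. `WordCounter` and `word_counts` are hypothetical names for this sketch.

```python
from collections import Counter
from html.parser import HTMLParser


class WordCounter(HTMLParser):
    """Hypothetical variant of customHTML that owns its own Counter."""

    def __init__(self):
        super().__init__()
        self.word_counts = Counter()

    def handle_data(self, data):
        # Counter.update adds the counts of the words in this text node.
        self.word_counts.update(data.split())


parser = WordCounter()
parser.feed("<p>spam eggs</p><p>spam</p>")
parser.close()
print(parser.word_counts["spam"])
```

With one parser per file, the per-file counters can be summed afterwards; with one shared parser, the totals accumulate on the instance.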

This code:

for i in range(len(splitData)):
    splitData[i] = re.sub('[%s]' % re.escape(string.punctuation), '', splitData[i])

is better written as

for i, data in enumerate(splitData):
    splitData[i] = re.sub('[%s]' % re.escape(string.punctuation), '', data)

or as

splitData = [re.sub('[%s]' % re.escape(string.punctuation), '', data)
             for data in splitData]
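Both rewrites still escape and format the pattern string on every iteration. As a sketch of two further options that are not part of the original answer, the pattern can be compiled once, or the punctuation can be removed with `str.translate` and no regular expression at all. The constant names and the sample sentence are made up for illustration.

```python
import re
import string

# Build the character class once instead of on every re.sub call.
PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))

# Alternative without regular expressions: a table that deletes punctuation.
DELETE_PUNCTUATION = str.maketrans('', '', string.punctuation)

split_data = "Hello, world! It's fine.".split()

with_regex = [PUNCTUATION.sub('', word) for word in split_data]
with_translate = [word.translate(DELETE_PUNCTUATION) for word in split_data]

print(with_regex)
```

Both lists hold the same words. Note that `string.punctuation` only covers ASCII punctuation, so characters such as curly quotes pass through either version unchanged.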
