
I'm using the HTMLParser class from html.parser to get the data out of a collection of HTML files. It goes pretty well until a file comes along that throws an error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8a in position 419: invalid start byte

My code goes as follows:

class customHTML(HTMLParser):
    # Parses the data found
    def handle_data(self, data):
        data = data.strip()
        if(data):
            splitData = data.split()
            # Remove punctuation!
            for i in range(len(splitData)):
                splitData[i] = re.sub('[%s]' % re.escape(string.punctuation), '', splitData[i])
            newCounter = Counter(splitData)
            global wordList
            wordList += newCounter

.

.

.

This is in main:

for aFile in os.listdir(inputDirectory):
    if aFile.endswith(".html"):     
        parser = customHTML(strict=False)
        infile = open(inputDirectory+"/"+aFile)
        for line in infile:
            parser.feed(line)

The parser.feed(line) call, though, is where everything breaks. It's always the same UnicodeDecodeError. I have no control over what the HTML files contain, so I need a way to read them that lets me send their contents to the parser. Any ideas?


2 Answers

While subclassing HTMLParser might be a good exercise, if your HTML isn't UTF-8 I'd advise using the BeautifulSoup parser, which is quite good at detecting encodings automatically.


This is a relatively common problem with quite a few SO threads. Check out this one: Python: Is there a way to determine the encoding of text file?
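If detecting the encoding reliably is not an option, one workaround is to read each file as bytes and try a short list of candidate encodings before decoding. The sketch below is a minimal illustration using only the standard library. The helper name read_html and the candidate list ("utf-8", then "cp1252") are assumptions made for the example, not something taken from the question, and the right candidates depend on where the files came from.

```python
import os
import tempfile


def read_html(path, encodings=("utf-8", "cp1252")):
    """Return the file's text, trying each candidate encoding in turn.

    read_html and the default candidate list are hypothetical; adjust
    the encodings to whatever your files are likely to use.
    """
    with open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a character, so it never
    # raises, although the result may contain the wrong characters.
    return raw.decode("latin-1")


# Demo: the byte 0x8a is not valid UTF-8 here, so the helper falls
# through to cp1252.
with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
    tmp.write(b"<p>caf\xe9 \x8a</p>")
text = read_html(tmp.name)
os.remove(tmp.name)
```

The returned string can then be passed to parser.feed() in one call instead of iterating over the file line by line. Note that a fallback like this only guarantees that decoding succeeds; it does not guarantee that the guessed encoding is the one the file was written in.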

I'd like to take a moment to comment on your code as well.

Python does not need parentheses around conditionals. Use

if foo:
    action()

not

if (foo):
    action()

You should declare the global once at the top of the function or method, not every time through the loop.
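As a rough sketch of that advice, the global declaration can be the first statement of the method. The class name CustomHTML and the sample input are made up for this example; wordList is the module-level counter from the question.

```python
from collections import Counter
from html.parser import HTMLParser

wordList = Counter()  # module-level counter, as in the question


class CustomHTML(HTMLParser):
    def handle_data(self, data):
        global wordList  # declared once, as the first statement
        data = data.strip()
        if data:
            wordList += Counter(data.split())


parser = CustomHTML()
parser.feed("<p>spam spam eggs</p>")
parser.close()
```

Another option would be to keep the Counter as an attribute on the parser instance, which removes the need for a global altogether.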

This code:

for i in range(len(splitData)):
    splitData[i] = re.sub('[%s]' % re.escape(string.punctuation), '', splitData[i])

is better written as

for i, data in enumerate(splitData):
    splitData[i] = re.sub('[%s]' % re.escape(string.punctuation), '', data)

or as

splitData = [ re.sub('[%s]' % re.escape(string.punctuation), '', data) 
              for data in splitData ]
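Since the pattern never changes, it could also be compiled once and reused, rather than rebuilt for every word. The following is only a sketch; PUNCT_RE and strip_punctuation are names invented for the example.

```python
import re
import string

# Compile the character class once at import time.
PUNCT_RE = re.compile('[%s]' % re.escape(string.punctuation))


def strip_punctuation(words):
    """Return a new list with ASCII punctuation removed from each word."""
    return [PUNCT_RE.sub('', word) for word in words]


cleaned = strip_punctuation("Hello, world! It's fine.".split())
```

With the sample sentence above, cleaned is ['Hello', 'world', 'Its', 'fine'].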
