
I have a problem with the encoding of text I am scraping from a website. Specifically, the Danish letters æ, ø, and å come out wrong. I'm fairly confident the page is encoded in UTF-8, since the browser displays it correctly with that encoding.

I have tried using BeautifulSoup, as many other posts suggest, but it didn't improve things. Then again, I probably used it incorrectly.

I am using Python 2.7 on 32-bit Windows 7.

The code I have is this:

# -*- coding: UTF-8 -*-

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class Sale(Item):
    Adresse = Field()
    Pris = Field()
    Salgsdato = Field()
    SalgsType = Field()
    KvmPris = Field()
    Rum = Field()
    Postnummer = Field()
    Boligtype = Field()
    Kvm = Field()
    Bygget = Field()

class HouseSpider(BaseSpider):
    name = 'House'
    allowed_domains = ["boliga.dk"]
    # One search-result page per value of the page parameter p (here pages 1 and 2)
    start_urls = ['http://www.boliga.dk/salg/resultater?so=1&type=Villa&type=Ejerlejlighed&type=R%%C3%%A6kkehus&kom=&amt=&fraPostnr=&tilPostnr=&iPostnr=&gade=&min=&max=&byggetMin=&byggetMax=&minRooms=&maxRooms=&minSize=&maxSize=&minsaledate=1992&maxsaledate=today&kode=&p=%d' %n for n in xrange(1, 3, 1)]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Each row of the search-result table becomes one Sale item
        sites = hxs.select("id('searchresult')/tr")
        items = []
        for site in sites:
            item = Sale()
            item['Adresse'] = site.select("td[1]/a[1]/text()").extract()
            item['Pris'] = site.select("td[2]/text()").extract()
            item['Salgsdato'] = site.select("td[3]/text()").extract()
            item['SalgsType'] = site.select("td[4]/text()").extract()
            item['KvmPris'] = site.select("td[5]/text()").extract()
            item['Rum'] = site.select("td[6]/text()").extract()
            item['Postnummer'] = site.select("td[7]/text()").extract()
            item['Boligtype'] = site.select("td[8]/text()").extract()
            item['Kvm'] = site.select("td[9]/text()").extract()
            item['Bygget'] = site.select("td[10]/text()").extract()
            items.append(item)
        return items

It is the fields 'Adresse' and 'SalgsType' that contain æ, ø, and å. Any help is greatly appreciated!

Cheers,

Don't trust the browser; check the charset declared in the Content-Type header of the HTTP response. Also, what do you mean by the letters coming out wrong? How do they appear? – Paulo Bu 2 days ago
@PauloBu: How do I check the character encoding of the HTTP response? The capital Æ comes out as \xc6 and the lowercase æ as \xe6 when I run my spider from cmd.exe. – Mace 2 days ago
To check the HTTP response headers you need either Chrome's developer tools or Firefox with Firebug. As for cmd.exe: first run chcp 65001, then run your script and see if the letters come out right. chcp 65001 switches cmd.exe to the UTF-8 code page. – Paulo Bu 2 days ago
Can you post the code where you print the results to the cmd? – Paulo Bu 2 days ago
@PauloBu: It still looks the same after doing chcp 65001. I do have Firebug, but I can't figure out where to find the encoding. – Mace 2 days ago
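For reference, one way to see which charset the server declares, outside the browser, is to print the Content-Type response header directly. A minimal sketch in Python 2 using urllib2 (the URL is just the site's front page):

# -*- coding: utf-8 -*-
# Print the Content-Type header the server sends; it usually includes the
# declared charset, e.g. "text/html; charset=utf-8".
import urllib2

resp = urllib2.urlopen('http://www.boliga.dk/')
print resp.info().getheader('Content-Type')

Inside a Scrapy callback, response.headers.get('Content-Type') gives the same information.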

1 Answer


OK, after doing some research I checked that those characters are indeed the right letters, just in Unicode. Since your cmd.exe doesn't understand Unicode, it dumps the bytes of the characters.

You'll have to encode them to UTF-8 first and change the code page of cmd.exe to UTF-8.

Do this:

For every string you're going to output to the console, call its encode('utf-8') method, like this:

print whatever_string.encode('utf-8')

That's in your code. In your console, before invoking your script, do this:

> chcp 65001
> python your_script.py

I tested this in my Python interpreter:

>>> u'\xc6blevangen'.encode('utf-8')
'\xc3\x86blevangen'

That is exactly the Æ character encoded in UTF-8 :)

Hope it helps!
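To make that concrete with one of the scraped values, here is a minimal sketch (adresse is just a stand-in for whatever the spider extracts):

# -*- coding: utf-8 -*-
# Encode the unicode value to UTF-8 bytes before printing, so a console
# switched to code page 65001 can display it.
adresse = u'\xc6blevangen'       # a unicode string, as the spider extracts it
print adresse.encode('utf-8')    # prints: Æblevangen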

Thanks, good to know that it's the letters in Unicode. However, I still have two problems. First, item['Adresse'] = site.select("td[1]/a[1]/text()").extract() returns [u'\xc6blevangen'], which I guess somehow needs to be turned into a string before I can use encode. Second, what matters to me is not the output in cmd.exe but storing the data in a .csv file; will the .encode('utf-8') method work for that as well? – Mace 2 days ago
That is a list containing one unicode string: [u'\xc6blevangen']. Take the element out and you'll have the .encode method: item['Adresse'][0]. The console was just an example; if you write it to a file it will work fine, only the file's contents will be encoded in UTF-8 :) – Paulo Bu 2 days ago
Check the proof of concept I made in my answer :) – Paulo Bu 2 days ago
When I try to do u'\xc6blevangen'.encode('utf-8') in my Python interpreter, it says LookupError: unknown encoding: cp65001. – Mace 2 days ago
If you run Python's interpreter in cmd.exe with that code page, the interpreter will complain. Just write the code, encode the strings as UTF-8 and write them to a .txt file; the characters will come out fine. – Paulo Bu 2 days ago
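Putting these comments together, a minimal sketch of writing the scraped fields to a CSV file in Python 2 (the file name and values are illustrative; each value is what item['Adresse'][0] and the like would yield):

# -*- coding: utf-8 -*-
# Python 2's csv module works with byte strings, so encode each unicode
# field to UTF-8 before writing it. The rows here are stand-ins for real items.
import csv

rows = [{'Adresse': u'\xc6blevangen', 'Pris': u'1.000.000'}]

with open('sales.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerow(['Adresse', 'Pris'])
    for row in rows:
        writer.writerow([row['Adresse'].encode('utf-8'),
                         row['Pris'].encode('utf-8')])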
