
I have a problem with the encoding of text I am scraping from a website. Specifically, the Danish letters æ, ø, and å come out wrong. I'm fairly confident the page is encoded in UTF-8, since the browser displays it correctly with that encoding.

I have tried using BeautifulSoup, as many other posts suggest, but it didn't improve things. Then again, I probably used it incorrectly.

I am using Python 2.7 on 32-bit Windows 7.

The code I have is this:

# -*- coding: UTF-8 -*-

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class Sale(Item):
    Adresse = Field()
    Pris = Field()
    Salgsdato = Field()
    SalgsType = Field()
    KvmPris = Field()
    Rum = Field()
    Postnummer = Field()
    Boligtype = Field()
    Kvm = Field()
    Bygget = Field()

class HouseSpider(BaseSpider):
    name = 'House'
    allowed_domains = ["boliga.dk"]
    # One search-result page per value of the page parameter p (here pages 1 and 2)
    start_urls = ['http://www.boliga.dk/salg/resultater?so=1&type=Villa&type=Ejerlejlighed&type=R%%C3%%A6kkehus&kom=&amt=&fraPostnr=&tilPostnr=&iPostnr=&gade=&min=&max=&byggetMin=&byggetMax=&minRooms=&maxRooms=&minSize=&maxSize=&minsaledate=1992&maxsaledate=today&kode=&p=%d' %n for n in xrange(1, 3, 1)]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Each row of the search-result table becomes one Sale item
        sites = hxs.select("id('searchresult')/tr")
        items = []
        for site in sites:
            item = Sale()
            item['Adresse'] = site.select("td[1]/a[1]/text()").extract()
            item['Pris'] = site.select("td[2]/text()").extract()
            item['Salgsdato'] = site.select("td[3]/text()").extract()
            item['SalgsType'] = site.select("td[4]/text()").extract()
            item['KvmPris'] = site.select("td[5]/text()").extract()
            item['Rum'] = site.select("td[6]/text()").extract()
            item['Postnummer'] = site.select("td[7]/text()").extract()
            item['Boligtype'] = site.select("td[8]/text()").extract()
            item['Kvm'] = site.select("td[9]/text()").extract()
            item['Bygget'] = site.select("td[10]/text()").extract()
            items.append(item)
        return items

It is the fields 'Adresse' and 'SalgsType' that contain æ, ø, and å. Any help is greatly appreciated!

Cheers,

Don't trust the browser; check the charset declared in the Content-Type header of the HTTP response. Also, what do you mean by the letters coming out wrong? How do they appear? – Paulo Bu 2 days ago
@PauloBu: How do I check the character encoding of the HTTP response? The capital Æ comes out as \xc6 and the lowercase æ as \xe6 when I run my spider from cmd.exe. – Mace 2 days ago
To check the HTTP response headers you need either Chrome's developer tools or Firefox with Firebug. As for cmd.exe: first run chcp 65001, then run your script and see if the letters come out right. chcp 65001 switches cmd.exe to the UTF-8 code page. – Paulo Bu 2 days ago
Can you post the code where you print the results to the cmd? – Paulo Bu 2 days ago
@PauloBu: It still looks the same after doing chcp 65001. I do have Firebug, but I can't figure out where to find the encoding. – Mace 2 days ago
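For reference, one way to see which charset the server declares, outside the browser, is to print the Content-Type response header directly. A minimal sketch in Python 2 using urllib2 (the URL is just the site's front page):

# -*- coding: utf-8 -*-
# Print the Content-Type header the server sends; it usually includes the
# declared charset, e.g. "text/html; charset=utf-8".
import urllib2

resp = urllib2.urlopen('http://www.boliga.dk/')
print resp.info().getheader('Content-Type')

Inside a Scrapy callback, response.headers.get('Content-Type') gives the same information.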

1 Answer


OK, after doing some research I checked that those characters are indeed the right letters, just in Unicode. Since your cmd.exe doesn't understand Unicode, it dumps the bytes of the characters.

You'll have to encode them to UTF-8 first and change the code page of cmd.exe to UTF-8.

Do this:

For every string you're going to output to the console, call its encode('utf-8') method, like this:

print whatever_string.encode('utf-8')

That's in your code. In your console, before invoking your script, do this:

> chcp 65001
> python your_script.py

I tested this in my Python interpreter:

>>> u'\xc6blevangen'.encode('utf-8')
'\xc3\x86blevangen'

That is exactly the Æ character encoded in UTF-8 :)

Hope it helps!
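To make that concrete with one of the scraped values, here is a minimal sketch (adresse is just a stand-in for whatever the spider extracts):

# -*- coding: utf-8 -*-
# Encode the unicode value to UTF-8 bytes before printing, so a console
# switched to code page 65001 can display it.
adresse = u'\xc6blevangen'       # a unicode string, as the spider extracts it
print adresse.encode('utf-8')    # prints: Æblevangen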

Thanks, good to know that it's the letters in Unicode. However, I still have two problems. First, item['Adresse'] = site.select("td[1]/a[1]/text()").extract() returns [u'\xc6blevangen'], which I guess somehow needs to be turned into a string before I can use encode. Second, what matters to me is not the output in cmd.exe but storing the data in a .csv file; will the .encode('utf-8') method work for that as well? – Mace 2 days ago
That is a list containing one unicode string: [u'\xc6blevangen']. Take the element out and you'll have the .encode method: item['Adresse'][0]. The console was just an example; if you write it to a file it will work fine, only the file's contents will be encoded in UTF-8 :) – Paulo Bu 2 days ago
Check the proof of concept I made in my answer :) – Paulo Bu 2 days ago
When I try to do u'\xc6blevangen'.encode('utf-8') in my Python interpreter, it says LookupError: unknown encoding: cp65001. – Mace 2 days ago
If you run Python's interpreter in cmd.exe with that code page, the interpreter will complain. Just write the code, encode the strings as UTF-8 and write them to a .txt file; the characters will come out fine. – Paulo Bu 2 days ago
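Putting these comments together, a minimal sketch of writing the scraped fields to a CSV file in Python 2 (the file name and values are illustrative; each value is what item['Adresse'][0] and the like would yield):

# -*- coding: utf-8 -*-
# Python 2's csv module works with byte strings, so encode each unicode
# field to UTF-8 before writing it. The rows here are stand-ins for real items.
import csv

rows = [{'Adresse': u'\xc6blevangen', 'Pris': u'1.000.000'}]

with open('sales.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerow(['Adresse', 'Pris'])
    for row in rows:
        writer.writerow([row['Adresse'].encode('utf-8'),
                         row['Pris'].encode('utf-8')])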
