I have a problem with the encoding of text I am scraping from a website. Specifically, the Danish letters æ, ø, and å come out wrong. I am fairly confident that the webpage is encoded as UTF-8, since the browser displays it correctly with this encoding.
I have tried using BeautifulSoup, as many other posts suggest, but it did not help. However, I probably used it incorrectly.
I am using Python 2.7 on 32-bit Windows 7.
The code I have is this:
# -*- coding: UTF-8 -*-
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field


class Sale(Item):
    Adresse = Field()
    Pris = Field()
    Salgsdato = Field()
    SalgsType = Field()
    KvmPris = Field()
    Rum = Field()
    Postnummer = Field()
    Boligtype = Field()
    Kvm = Field()
    Bygget = Field()


class HouseSpider(BaseSpider):
    name = 'House'
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ["boliga.dk"]
    # '%%' is a literal '%' under %-formatting; '%d' is the page number
    start_urls = ['http://www.boliga.dk/salg/resultater?so=1&type=Villa&type=Ejerlejlighed&type=R%%C3%%A6kkehus&kom=&amt=&fraPostnr=&tilPostnr=&iPostnr=&gade=&min=&max=&byggetMin=&byggetMax=&minRooms=&maxRooms=&minSize=&maxSize=&minsaledate=1992&maxsaledate=today&kode=&p=%d' % n for n in xrange(1, 3, 1)]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # One table row per sale
        sites = hxs.select("id('searchresult')/tr")
        items = []
        for site in sites:
            item = Sale()
            # extract() returns a list of unicode strings
            item['Adresse'] = site.select("td[1]/a[1]/text()").extract()
            item['Pris'] = site.select("td[2]/text()").extract()
            item['Salgsdato'] = site.select("td[3]/text()").extract()
            item['SalgsType'] = site.select("td[4]/text()").extract()
            item['KvmPris'] = site.select("td[5]/text()").extract()
            item['Rum'] = site.select("td[6]/text()").extract()
            item['Postnummer'] = site.select("td[7]/text()").extract()
            item['Boligtype'] = site.select("td[8]/text()").extract()
            item['Kvm'] = site.select("td[9]/text()").extract()
            item['Bygget'] = site.select("td[10]/text()").extract()
            items.append(item)
        return items
The fields 'Adresse' and 'SalgsType' are the ones that contain æ, ø, and å. Any help is greatly appreciated!
Cheers,
[...] Character-Encoding header in the HTTP response. Also, what do you mean by the letters coming out wrong? How do they come out? – Paulo Bu 2 days ago
[...] character-encoding of the HTTP response? The capital Æ comes out as \xc6 and the small letter æ as \xe6 when I run my spider from cmd.exe. – Mace 2 days ago
[...] chcp 65001 and then run your script and see if the letters are fine now. chcp 65001 switches cmd.exe to the UTF-8 code page. – Paulo Bu 2 days ago
[...] chcp 65001. I have Firebug, though; I just can't seem to find out where I get the encoding. – Mace 2 days ago
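A note on the \xc6 and \xe6 values mentioned in the comments: these match the Unicode code points of Æ (U+00C6) and æ (U+00E6), which suggests the page may already be decoded correctly and that the console is only showing the escaped representation of a unicode string (Python 2 does this when it prints a list, and extract() returns a list). The following is a minimal sketch of that idea, not a confirmed diagnosis. The sample bytes and the helper name describe are hypothetical and are not taken from the spider or the site.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals


def describe(raw_bytes):
    """Decode UTF-8 bytes and return (text, escaped form, non-ASCII code points)."""
    text = raw_bytes.decode("utf-8")
    # unicode_escape yields the same ASCII-only \xNN notation that Python 2
    # uses in the repr of a unicode string.
    escaped = text.encode("unicode_escape").decode("ascii")
    code_points = [hex(ord(ch)) for ch in text if ord(ch) > 127]
    return text, escaped, code_points


# Hypothetical sample: the word "Rækkehus" as UTF-8 bytes (æ is C3 A6 in UTF-8).
raw = b"R\xc3\xa6kkehus"
text, escaped, code_points = describe(raw)

print(escaped)      # R\xe6kkehus
print(code_points)  # ['0xe6']

# Encoding the decoded text back to UTF-8 reproduces the original bytes,
# so nothing was lost even though the escaped form looks unfamiliar.
assert text.encode("utf-8") == raw
```

If the spider's values behave like this sample, the text itself is intact, and writing it to a file encoded as UTF-8 (rather than reading it off the cmd.exe window) would be one way to check.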