I have written some Python code that uses Scrapy and Selenium to scrape restaurant names and addresses from a website. I need Selenium because the button that shows more restaurants on a page is driven by JavaScript.

The actual list of restaurants per area is two links deep (three in London), hence the three initial methods parse / parse_dir_contents / parse_dir_contents1 that get me to that list. I then use Selenium to click the "Show more" button until the whole list is visible. From there I go one more link deeper, to each restaurant's own page, to get the data I need.
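
The routing decision described above can be sketched in isolation, without Scrapy. This is only an illustration: it assumes Python 3 (`urllib.parse`; under Python 2 the same function lives in `urlparse`), the helper name `next_callback_name` is hypothetical, and the example hrefs are made up.

```python
from urllib.parse import urljoin  # Python 2 equivalent: urlparse.urljoin

BASE_URL = "https://hungryhouse.co.uk/"


def next_callback_name(href):
    """Resolve an href against the site root and pick the next parsing step.

    Hypothetical helper: London area pages need one extra level of links,
    every other city page leads straight to the restaurant list.
    """
    url = urljoin(BASE_URL, href)
    if "london-takeaway" in url:
        return url, "parse_dir_contents1"
    return url, "parse_dir_contents2"


# Illustrative hrefs only; the real paths on the site may differ.
print(next_callback_name("/london-takeaway"))
print(next_callback_name("/leeds-takeaway"))
```

Each URL is routed to exactly one next step, so a London page is not also treated as a restaurant list.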

The one piece of code I am still unsure about is where to put self.driver.close(). If I leave it where it is, the program tends to end before it should.
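
One possible arrangement, sketched under assumptions: Scrapy calls a spider's `closed(reason)` method once when the crawl ends, so the driver could be released there instead of inside a callback. The sketch below uses a stand-in `FakeDriver` and calls `closed` by hand so that it runs without Scrapy or Selenium; both class names are hypothetical.

```python
class FakeDriver(object):
    """Stand-in for a Selenium webdriver (hypothetical, for illustration)."""

    def __init__(self):
        self.is_open = True

    def get(self, url):
        if not self.is_open:
            raise RuntimeError("driver used after it was shut down")
        return "page for " + url

    def quit(self):
        self.is_open = False


class SpiderSketch(object):
    """Mimics the spider's shape: one shared driver, many callbacks."""

    def __init__(self):
        self.driver = FakeDriver()

    def parse_dir_contents2(self, url):
        # Called once per listing page; the driver must stay open here.
        return self.driver.get(url)

    def closed(self, reason):
        # In a real scrapy.Spider, Scrapy calls this once when the crawl ends.
        self.driver.quit()


spider = SpiderSketch()
pages = [spider.parse_dir_contents2(u) for u in ("area-1", "area-2")]
spider.closed("finished")  # simulated here; Scrapy would make this call
print(pages, spider.driver.is_open)
```

Shutting the driver at the end of a callback that runs once per listing page would close the browser after the first page, which would likely explain a crawl that stops early.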

Also, though I suspect this is an issue with my PC: when I launch the program I get the message "Please insert a disk into drive E". Once I dismiss it, the program works as it should.

If anyone has ideas on how to improve the code, or on where self.driver.close() should go, I would be glad to hear them.

import scrapy
import urlparse
import time

from hungryhouse.items import HungryhouseItem
from selenium import webdriver
from scrapy.http import TextResponse

class HungryhouseSpider(scrapy.Spider):
    name = "hungryhouse"
    allowed_domains = ["hungryhouse.co.uk"]
    start_urls = ["https://hungryhouse.co.uk/takeaway"]

    # parse / parse_dir_contents / parse_dir_contents1 follow the links to the page with the restaurant lists
    # parse starts at the city list page
    def parse(self, response):
        for href in response.xpath('//*[@class="CmsRestcatCityLandingLocations"]/ul[@class="cities"]/li/a/@href'):
           url = urlparse.urljoin('https://hungryhouse.co.uk/',href.extract())
           yield scrapy.Request(url, callback=self.parse_dir_contents)

    # parse_dir_contents will get to the web page with the lists except for London
    def parse_dir_contents(self, response):
        for href1 in response.xpath('//*[contains(text(),"Choose your location")]/../ul/li/a/@href'):
            url1 = urlparse.urljoin('https://hungryhouse.co.uk/',href1.extract())
            if "london-takeaway" in url1:
                # London is one link deeper, so it goes through parse_dir_contents1 first
                yield scrapy.Request(url1, callback=self.parse_dir_contents1)
            else:
                yield scrapy.Request(url1, callback=self.parse_dir_contents2)

    # parse_dir_contents1 is needed for London which is one link deeper
    def parse_dir_contents1(self, response):
        for href2 in response.xpath('//*[contains(text(),"Choose your location")]/../ul/li/a/@href'):
           url2 = urlparse.urljoin('https://hungryhouse.co.uk/',href2.extract())
           yield scrapy.Request(url2, callback=self.parse_dir_contents2)       

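    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider run its own initialisation before adding the driver
        super(HungryhouseSpider, self).__init__(*args, **kwargs)
        self.driver = webdriver.Chrome("C:/Users/andrew/Downloads/chromedriver_win32/chromedriver.exe")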

    # and now we are on the web page where the restaurants are listed
    # now we need to use Selenium to activate a JavaScript button to reveal the whole page
    def parse_dir_contents2(self, response):
        self.driver.get(response.url)

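        # Press the "Show More" button until it is no longer on the page, so that the whole list is revealed
        while True:
            try:
                # The lookup is inside the try block: once the button is gone, find_element_by_xpath raises and the loop ends
                show_more = self.driver.find_element_by_xpath('//*[@id="restsPages"]/a')
                show_more.click()
                time.sleep(3) # waiting 3 seconds for the page to load fully
            except Exception:
                break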

        # Now that the whole page is revealed, Scrapy can pull down all the restaurant URLs
        # I.e. we need to follow the link for every restaurant to get onto its page to get our data
        response1 = TextResponse(url=response.url, body=self.driver.page_source, encoding='utf-8')
        for href in response1.xpath('//*[@class="restsRestInfo"]/a/@href'):
            url = response1.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents3)

#        self.driver.close()

    # and now Scrapy can take the name, address and postcode of each restaurant from its own page
    def parse_dir_contents3(self, response):
        for sel in response.xpath('//*[@class="restBoxInfo"]'):
            # Create a fresh item per match so that yielded items do not share state
            item = HungryhouseItem()
            item['name'] = sel.xpath('//div/div/div/h1/span/text()').extract()[0].strip()
            item['address'] = sel.xpath('//*[@id="restMainInfoWrapper"]/div[2]/div/h2/span/span/text()').extract()[0].strip()
            item['postcode'] = sel.xpath('//*[@id="restMainInfoWrapper"]/div[2]/div/h2/span/span[last()]/text()').extract()[0].strip()
            yield item
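
The fixed `time.sleep(3)` after each click could be swapped for a bounded polling loop that returns as soon as the page is ready. The helper below is a generic sketch, not Selenium's API: `wait_until`, its parameters, and the fake condition are hypothetical names. Selenium ships its own `WebDriverWait` for the same purpose.

```python
import time


def wait_until(condition, timeout=3.0, interval=0.1):
    """Poll condition() until it returns a truthy value or timeout elapses.

    Returns the truthy value, or None if the timeout was reached.
    """
    deadline = time.time() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.time() >= deadline:
            return None
        time.sleep(interval)


# Fake condition standing in for "the next batch of restaurants has loaded":
# it reports success on the third check.
checks = {"count": 0}


def loaded_after_three_checks():
    checks["count"] += 1
    return checks["count"] >= 3


print(wait_until(loaded_after_three_checks, timeout=2.0, interval=0.01))
print(wait_until(lambda: False, timeout=0.05, interval=0.01))
```

With a wait of this kind, a fast page costs only a few short polls, while a slow page still gets the full timeout before the loop gives up.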