Python Scrapy code using Selenium

I have written some Python code that uses Scrapy and Selenium to scrape restaurant names and addresses from a website. I needed to use Selenium because the button that shows more restaurants on a page is implemented in JavaScript.

The actual list of restaurants per area is two links deep (three in London), hence the initial three callbacks (parse, parse_dir_contents and parse_dir_contents1) to get me to that list. I then need to use Selenium to hit the "Show more" button. At that point I can go one more link deeper to get the data I need.
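The routing described above can be sketched without Scrapy. This is only an illustration, assuming Python 3 (where urljoin lives in urllib.parse rather than in the urlparse module used in the spider); the helper name next_callback and the sample hrefs are hypothetical and not taken from the site.

```python
from urllib.parse import urljoin

BASE_URL = "https://hungryhouse.co.uk/"


def next_callback(href):
    """Resolve a relative link and name the callback that should handle it.

    London area pages need one extra hop (parse_dir_contents1); every
    other area page already lists restaurants (parse_dir_contents2).
    """
    url = urljoin(BASE_URL, href)
    if "london-takeaway" in url:
        return url, "parse_dir_contents1"
    return url, "parse_dir_contents2"


# The hrefs below are made-up examples, not links taken from the site.
print(next_callback("/london-takeaway"))
print(next_callback("/leeds-takeaway"))
```

Each link gets exactly one callback here, which keeps the London special case visible in one place.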

The one piece of code I am still thinking about is where to put self.driver.close(). If I leave it where it is (at the end of parse_dir_contents2, currently commented out), the program tends to end before it should.
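One likely reason is that parse_dir_contents2 runs once per listing page while all of those calls share a single driver, so closing it at the end of the first call leaves later calls with no browser. Scrapy calls a spider's closed(reason) method when the crawl finishes, which makes that a candidate place for the cleanup. The sketch below only mimics that lifecycle with stand-ins: FakeDriver, SpiderSketch and the URLs are hypothetical, and no real Scrapy or Selenium objects are involved.

```python
class FakeDriver(object):
    """Stand-in for selenium's webdriver.Chrome, so the sketch runs without a browser."""

    def __init__(self):
        self.is_open = True

    def get(self, url):
        if not self.is_open:
            raise RuntimeError("browser already closed")

    def quit(self):
        self.is_open = False


class SpiderSketch(object):
    """Mimics only the parts of a Scrapy spider that matter for the driver's lifetime."""

    def __init__(self, driver):
        self.driver = driver

    def parse_dir_contents2(self, url):
        # Called once per listing page; the shared driver must stay open between calls.
        self.driver.get(url)
        return url

    def closed(self, reason):
        # Scrapy calls closed(reason) on a spider once, when the crawl is over.
        self.driver.quit()


spider = SpiderSketch(FakeDriver())
for page in ["https://hungryhouse.co.uk/a", "https://hungryhouse.co.uk/b"]:  # made-up URLs
    spider.parse_dir_contents2(page)
spider.closed("finished")
print(spider.driver.is_open)  # False: the driver is shut down only after every page
```

In the real spider the same closed method would sit next to __init__, with self.driver.quit() (or close()) as its body.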

Also, when I launch the program I get the message "Please insert a disk into drive E". I suspect this is an issue with my PC; once I dismiss the message, the program works as it should.

If anyone has ideas on how to improve the code, or on where self.driver.close() should go, I would be glad to hear them.

import scrapy
import urlparse
import time

from hungryhouse.items import HungryhouseItem
from selenium import webdriver
from scrapy.http import TextResponse

class HungryhouseSpider(scrapy.Spider):
    name = "hungryhouse"
    allowed_domains = ["hungryhouse.co.uk"]
    start_urls = ["https://hungryhouse.co.uk/takeaway"]

    # parse / parse_dir_contents / parse_dir_contents1 follow the links to the page with the restaurant lists
    # parse starts at the city list page
    def parse(self, response):
        for href in response.xpath('//*[@class="CmsRestcatCityLandingLocations"]/ul[@class="cities"]/li/a/@href'):
            url = urlparse.urljoin('https://hungryhouse.co.uk/', href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    # parse_dir_contents will get to the web page with the lists, except for London
    def parse_dir_contents(self, response):
        for href1 in response.xpath('//*[contains(text(),"Choose your location")]/../ul/li/a/@href'):
            url1 = urlparse.urljoin('https://hungryhouse.co.uk/', href1.extract())
            if "london-takeaway" in url1:
                # London needs one more level of links before the restaurant lists
                yield scrapy.Request(url1, callback=self.parse_dir_contents1)
            else:
                yield scrapy.Request(url1, callback=self.parse_dir_contents2)

    # parse_dir_contents1 is needed for London, which is one link deeper
    def parse_dir_contents1(self, response):
        for href2 in response.xpath('//*[contains(text(),"Choose your location")]/../ul/li/a/@href'):
            url2 = urlparse.urljoin('https://hungryhouse.co.uk/', href2.extract())
            yield scrapy.Request(url2, callback=self.parse_dir_contents2)

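    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider run its own initialisation before the browser is started
        super(HungryhouseSpider, self).__init__(*args, **kwargs)
        self.driver = webdriver.Chrome("C:/Users/andrew/Downloads/chromedriver_win32/chromedriver.exe")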

    # and now we are on the web page where the restaurants are listed
    # now we need to use Selenium to activate a JavaScript button to reveal the whole page
    def parse_dir_contents2(self, response):
        self.driver.get(response.url)

        # Press the "Show More" button until there are none left, to reveal the whole page
        while True:
            try:
                # The lookup is inside the try so that a missing button also ends the loop
                show_more = self.driver.find_element_by_xpath('//*[@id="restsPages"]/a')
                show_more.click()
                time.sleep(3)  # waiting 3 seconds for the page to load fully
            except Exception:
                break

        # Now that the web page is fully revealed, Scrapy can collect all the restaurant URLs,
        # i.e. we need to follow the link for every restaurant to get onto its page to get our data
        response1 = TextResponse(url=response.url, body=self.driver.page_source, encoding='utf-8')
        for href in response1.xpath('//*[@class="restsRestInfo"]/a/@href'):
            url = response1.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents3)

#        self.driver.close()

    # and now Scrapy can take the names, addresses and postcodes of all the restaurants from their URL page
    def parse_dir_contents3(self, response):
        for sel in response.xpath('//*[@class="restBoxInfo"]'):
            # A fresh item per match, so that yielded items do not share one object
            item = HungryhouseItem()
            # Note: XPaths starting with // search the whole document, not just sel
            item['name'] = sel.xpath('//div/div/div/h1/span/text()').extract()[0].strip()
            item['address'] = sel.xpath('//*[@id="restMainInfoWrapper"]/div[2]/div/h2/span/span/text()').extract()[0].strip()
            item['postcode'] = sel.xpath('//*[@id="restMainInfoWrapper"]/div[2]/div/h2/span/span[last()]/text()').extract()[0].strip()
            yield item
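
The "Show More" loop in parse_dir_contents2 could also be pulled out into a small helper that is bounded and only swallows the errors that mean "nothing left to click". What follows is a sketch under assumptions, not a drop-in replacement: click_until_gone and FakePage are hypothetical names, and LookupError merely stands in for whatever exception the real driver raises once the button is gone.

```python
def click_until_gone(find_button, stop_errors, pause=None, max_clicks=500):
    """Click the button returned by find_button() until it can no longer be clicked.

    find_button -- callable returning an object with a click() method
    stop_errors -- exception class (or tuple of classes) meaning "nothing left to click"
    pause       -- optional callable run after each click, e.g. a wait for new content
    max_clicks  -- safety cap so a button that never disappears cannot loop forever
    Returns the number of successful clicks.
    """
    clicks = 0
    while clicks < max_clicks:
        try:
            find_button().click()
        except stop_errors:
            break
        clicks += 1
        if pause is not None:
            pause()
    return clicks


class FakePage(object):
    """Stand-in for a page whose button disappears after a few clicks."""

    def __init__(self, remaining):
        self.remaining = remaining

    def find_button(self):
        if self.remaining == 0:
            raise LookupError("no 'Show more' button left")
        return self

    def click(self):
        self.remaining -= 1


page = FakePage(remaining=3)
print(click_until_gone(page.find_button, LookupError))  # 3
```

In the spider, find_button would wrap the existing self.driver.find_element_by_xpath call, pause would wrap time.sleep(3), and stop_errors would be the Selenium exceptions the page actually raises when the button is gone (NoSuchElementException from selenium.common.exceptions is one candidate).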

asked Dec 22 '16 at 19:43

nevster

284

Browse other questions tagged python web-scraping selenium scrapy or ask your own question.

asked	2 months ago
viewed	255 times
