
I am trying to scrape this website. My spider works, but the site uses JavaScript embedded in the form to produce the results, which I can't get past. I've read about Selenium and how I have to drive a browser, but I'm still stuck on how to scrape the dynamically generated HTML once it loads, while still passing the form arguments.

Here is the code for my spider; any help, pointers, or code snippets are welcome. I've read through many threads to no avail.

from scrapy.http import FormRequest
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from tax1.items import Tax1Item
from selenium import webdriver
import time


class tax1Spider(CrawlSpider):
    name = "tax1Spider"
    # allowed_domains takes bare domain names, not URLs
    allowed_domains = ["taxesejour.impots.gouv.fr"]
    start_urls = ["http://taxesejour.impots.gouv.fr/DTS_WEB/UK/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(),
                               restrict_xpaths=('//div[@class="lh0 dzSpan dzA15"]',)),
             callback="parse1", follow=True),  # trailing comma: rules must be a tuple
    )

    def __init__(self, *args, **kwargs):
        CrawlSpider.__init__(self, *args, **kwargs)
        # use any browser you wish
        self.browser = webdriver.Firefox()

    def __del__(self):
        self.browser.close()

    # CrawlSpider reserves parse() for its own rule dispatching,
    # so the start page is handled in parse_start_url() instead
    def parse_start_url(self, response):
        yield FormRequest.from_response(response,
                                        formname='PAGE_DELIBV2',
                                        formdata={'A8': '05 - Hautes-alpes',
                                                  'A10': 'AIGUILLES'},
                                        callback=self.parse1)

    def parse1(self, response):
        # load the page in the browser so the JavaScript runs
        self.browser.get(response.url)
        time.sleep(3)

        # parse the rendered HTML, not the original response
        sel = Selector(text=self.browser.page_source)
        item = Tax1Item()
        item['message'] = sel.xpath('//td[@id="tzA18"]').extract()
        print item['message']
        return item
