I am trying to scrape this website. My spider is functional, but the website has JavaScript embedded in the form to produce the result, which I can't get past. I've read about Selenium and how I have to drive a real browser to handle it, but I'm still unsure how to scrape the page after the dynamically generated HTML is loaded, while still passing the form arguments.
Here is the code for my spider, any help, referrals or code snippets are welcome. I've navigated and read many threads to no avail.
from scrapy.spiders import BaseSpider
from scrapy.http import FormRequest
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from tax1.items import Tax1Item
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from selenium import webdriver
import time
class tax1Spider(BaseSpider):
    """Spider for taxesejour.impots.gouv.fr: submits the department/commune
    search form, then re-renders the result page with Selenium so that
    JavaScript-generated content exists before extraction.
    """
    name = "tax1Spider"
    # allowed_domains entries must be bare domains, NOT URLs: a scheme
    # prefix ("http://...") makes the offsite middleware drop every request.
    allowed_domains = ["taxesejour.impots.gouv.fr"]
    start_urls = ["http://taxesejour.impots.gouv.fr/DTS_WEB/UK/"]

    # NOTE: the original defined a `rules` attribute, but rules are only
    # honoured by CrawlSpider (and the value was not even a tuple — the
    # trailing comma was missing). With BaseSpider it was dead code, so it
    # has been removed.

    def __init__(self, *args, **kwargs):
        # Initialise the *actual* base class. (The original called
        # CrawlSpider.__init__ even though this spider extends BaseSpider.)
        BaseSpider.__init__(self, *args, **kwargs)
        # Any WebDriver works here; Firefox is used as in the original.
        self.browser = webdriver.Firefox()

    def __del__(self):
        # quit() terminates the whole browser process; close() only closes
        # the current window and would leave the driver running.
        self.browser.quit()

    def parse(self, response):
        """Submit the search form; parse1 handles the resulting page."""
        yield FormRequest.from_response(
            response,
            formname='PAGE_DELIBV2',
            formdata={'A8': '05 - Hautes-alpes',
                      'A10': 'AIGUILLES'},
            callback=self.parse1)

    def parse1(self, response):
        """Load the result URL in a real browser so embedded JavaScript
        runs, then extract the target cell from the rendered DOM.

        Returns a Tax1Item with a 'message' field holding the extracted
        markup (a list of matched nodes, possibly empty).
        """
        self.browser.get(response.url)
        # Crude fixed wait for the JS to finish rendering; a
        # WebDriverWait on the target element would be more robust.
        time.sleep(3)
        # Select over the *rendered* page source, and actually bind the
        # selector (the original built it and threw it away, then
        # referenced an undefined `hxs`).
        sel = Selector(text=self.browser.page_source)
        item = Tax1Item()  # was commented out, leaving `item` undefined
        item['message'] = sel.xpath('//td[@id="tzA18"]').extract()
        # Use the spider log instead of the Python-2-only print statement.
        self.log('message: %s' % item['message'])
        return item