python / scrapy code optimization

Question

i wrote this code in python/ scrapy for data extraction it works fine for me but i am newbie in python so any suggestions for code optimization would be welcomed , and thanks in advance.

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    item = GenesishcsItem()
    item['graduate_education']=[]
    item['medical_school']=[]
    item['specialty']=[]
    item['name']=ParseUtils.get_line_from_node(hxs.select(".//*[@id='DrDetail']//div[@class='Designations']/h1"))
    image_url=hxs.select("//img[contains(@id,'DoctorProfilePicture')]/@src").extract()
    if image_url:
        item['image_url']=urljoin(response.request.meta['cached_meta']['response_url'],image_url[0])
        for  key,value in {'Gender:':'gender','Year joined staff:':'year_joined_staff'}.iteritems():
            result=ParseUtils.get_line_from_node(hxs.select("//dt[contains(.,'%s')]/following-sibling::dd"%key))
            if result:
                item[value]=result

    edu_data=['Education','Residency','Fellowship','Internships']
    for edu in edu_data:
        result_date=hxs.select("//h4[contains(.,'%s')]/following-sibling::dl[@class='%s']/dt"%(edu,edu))
        result=hxs.select("//h4[contains(.,'%s')]/following-sibling::dl[@class='%s']/dd"%(edu,edu))
        for date,value in zip(result_date,result):
            date=date.select('./text()').extract()
            value=value.select('./text()').extract()
            if date:
                date=date[0]
            else:
                date=None
            if value:
                value=value[0]
            else:
                value=None
            if 'Education' in edu:
                pass
                item['medical_school'].append({'name':value,'graduation_year':date})
            else:
                item['graduate_education'].append({'name':value,'start_year_end_year':date,'type':edu.lower().strip()})
    for rs in ParseUtils.get_text_from_node(hxs.select("//h3[contains(.,'Board Certified')]/following-sibling::p")):
        item['specialty'].append({'name':rs,'certified':True})
    return item

San4ez · Answer 1 · 2012-02-28 11:56:03Z

up vote 1 down vote

Your code is messy and difficult to understand. I believe that in a few month you will waste time to understand what you have written earlier. So it is can be ok if you are not going to support/extend/re-use your project.

Scrapy offer quite good architecture to organize your code properly: Item Loaders and Input and Output processors, Item Pipeline. It will move processing logic and make your main parse method cleaner.

answered Feb 28 '12 at 11:56

San4ez
388129

	Thanks @San4ez, its a business requirement to implement processing logic in parse method, that will process again in item pipelines,so i am just looking for suggestions to make it more efficient and cool looking in parse method . – akhter wahab Feb 28 '12 at 12:35
	You also can move processing logic to your own functions or Spider methods if you don't use scrapy processors. – San4ez Feb 29 '12 at 5:13

Quentin Pradet · Answer 2 · 2012-02-28 09:56:44Z

The code looks good, but as always, there are always issues one can bring up. I don't believe them to be important though. Your code seems to be getting the job code in a concise way, and if your XPath expressions are robust, there's no need to worry.

Do you have actual performance issues? If so, you should profile your code to see where your program is spending most of its time.
The pass statement at the end of your code is useless.
It's a bit sad to see the code duplication when you're doing date=date[0] but then it's a bit hard to avoid in Python. There's always the possibility to write date = date[0] if date else none but you could say it's more difficult to read.
The way you're reassigning your variables is a bit dangerous since the names refer to different types depending only on the "current line number". For small scripts it's OK.
Try to find more descriptive names, result and value are vague.
You could want to "cache" some of the nodes (eg. "//h4[contains(.,'%s')]/following-sibling::dl[@class='%s']) so that there's less duplication in your XPath expressions.

thanks for your valuable and very detailed suggestions.will definitely follow..
i tried this value=value[0] if value.select('./text()').extract() else None instead of value=value.select('./text()').extract() if value: value=value[0] else: value=None i am getting error value=value[0] if value.select('./text()').extract() else None exceptions.TypeError: 'HtmlXPathSelector' object does not support indexing
Your problem is precisely an example of the problems you can encounter when shadowing earlier variables. value in your new code refers to the variabled defined in for date,value in zip(result_date,result): whereas in the old code it was defined by value=value.select('./text()').extract(). Simply keep the value=value.select('./text()').extract() assignment and write value = value[0] if value else None.

asked	1 year ago
viewed	400 times
active	1 year ago

python / scrapy code optimization

2 Answers

Your Answer

Not the answer you're looking for? Browse other questions tagged python scrapy or ask your own question.

Community Bulletin

python / scrapy code optimization

2 Answers

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged python scrapy or ask your own question.

Community Bulletin

Related