Tell me more ×
Code Review Stack Exchange is a question and answer site for peer programmer code reviews. It's 100% free, no registration required.

i wrote this code in python/ scrapy for data extraction it works fine for me but i am newbie in python so any suggestions for code optimization would be welcomed , and thanks in advance.

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    item = GenesishcsItem()
    item['graduate_education']=[]
    item['medical_school']=[]
    item['specialty']=[]
    item['name']=ParseUtils.get_line_from_node(hxs.select(".//*[@id='DrDetail']//div[@class='Designations']/h1"))
    image_url=hxs.select("//img[contains(@id,'DoctorProfilePicture')]/@src").extract()
    if image_url:
        item['image_url']=urljoin(response.request.meta['cached_meta']['response_url'],image_url[0])
        for  key,value in {'Gender:':'gender','Year joined staff:':'year_joined_staff'}.iteritems():
            result=ParseUtils.get_line_from_node(hxs.select("//dt[contains(.,'%s')]/following-sibling::dd"%key))
            if result:
                item[value]=result

    edu_data=['Education','Residency','Fellowship','Internships']
    for edu in edu_data:
        result_date=hxs.select("//h4[contains(.,'%s')]/following-sibling::dl[@class='%s']/dt"%(edu,edu))
        result=hxs.select("//h4[contains(.,'%s')]/following-sibling::dl[@class='%s']/dd"%(edu,edu))
        for date,value in zip(result_date,result):
            date=date.select('./text()').extract()
            value=value.select('./text()').extract()
            if date:
                date=date[0]
            else:
                date=None
            if value:
                value=value[0]
            else:
                value=None
            if 'Education' in edu:
                pass
                item['medical_school'].append({'name':value,'graduation_year':date})
            else:
                item['graduate_education'].append({'name':value,'start_year_end_year':date,'type':edu.lower().strip()})
    for rs in ParseUtils.get_text_from_node(hxs.select("//h3[contains(.,'Board Certified')]/following-sibling::p")):
        item['specialty'].append({'name':rs,'certified':True})
    return item
share|improve this question

2 Answers

Your code is messy and difficult to understand. I believe that in a few month you will waste time to understand what you have written earlier. So it is can be ok if you are not going to support/extend/re-use your project.

Scrapy offer quite good architecture to organize your code properly: Item Loaders and Input and Output processors, Item Pipeline. It will move processing logic and make your main parse method cleaner.

share|improve this answer
Thanks @San4ez, its a business requirement to implement processing logic in parse method, that will process again in item pipelines,so i am just looking for suggestions to make it more efficient and cool looking in parse method . – akhter wahab Feb 28 '12 at 12:35
You also can move processing logic to your own functions or Spider methods if you don't use scrapy processors. – San4ez Feb 29 '12 at 5:13

The code looks good, but as always, there are always issues one can bring up. I don't believe them to be important though. Your code seems to be getting the job code in a concise way, and if your XPath expressions are robust, there's no need to worry.

  • Do you have actual performance issues? If so, you should profile your code to see where your program is spending most of its time.
  • The pass statement at the end of your code is useless.
  • It's a bit sad to see the code duplication when you're doing date=date[0] but then it's a bit hard to avoid in Python. There's always the possibility to write date = date[0] if date else none but you could say it's more difficult to read.
  • The way you're reassigning your variables is a bit dangerous since the names refer to different types depending only on the "current line number". For small scripts it's OK.
  • Try to find more descriptive names, result and value are vague.
  • You could want to "cache" some of the nodes (eg. "//h4[contains(.,'%s')]/following-sibling::dl[@class='%s']) so that there's less duplication in your XPath expressions.
share|improve this answer
thanks for your valuable and very detailed suggestions.will definitely follow.. – akhter wahab Feb 28 '12 at 10:33
i tried this value=value[0] if value.select('./text()').extract() else None instead of value=value.select('./text()').extract() if value: value=value[0] else: value=None i am getting error value=value[0] if value.select('./text()').extract() else None exceptions.TypeError: 'HtmlXPathSelector' object does not support indexing – akhter wahab Feb 28 '12 at 10:56
Your problem is precisely an example of the problems you can encounter when shadowing earlier variables. value in your new code refers to the variabled defined in for date,value in zip(result_date,result): whereas in the old code it was defined by value=value.select('./text()').extract(). Simply keep the value=value.select('./text()').extract() assignment and write value = value[0] if value else None. – Quentin Pradet Feb 28 '12 at 11:35
thank you @Cygal.. – akhter wahab Feb 28 '12 at 11:39

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.