URL and source page scraper

Question

The code does seem a bit repetitive in places such as the parenturlscraper module and the childurlscraper module.

Does anyone have any recommendations for improving my code and condensing it a little?

In essence, the code scrapes this site and populates a table with details about each crash, geocoding the location data extracted from the site using Google.

__version__ = '0.1'
__author__ = 'antmancoder'

# Importing of modules required for the script to run successfully
import scraperwiki
import lxml.html
import urlparse
import urllib2
import dateutil.parser
from geopy import geocoders

# Introduction of various global variables required throughout the running of the code
urlstem = "http://planecrashinfo.com"
urlyeardb = "database.htm"
yearsource = urlparse.urljoin(urlstem, urlyeardb)
yearlist = []
sourcepageurl = []


def parenturlscraper():
    """Function scrapes all of the parent URLs from 'planecrashinfo.com/database'"""
    html = scraperwiki.scrape(yearsource)
    root = lxml.html.fromstring(html)

    hrefs = root.cssselect('td a')

    for href in hrefs:
        link = href.attrib['href']
        url = urlparse.urljoin(urlstem, link)
        yearlist.append(url)

def childurlscraper():
    """Function scrapes all of the child URLs from those scraped in the parenturlscraper module"""
    for url in yearlist:
        html = scraperwiki.scrape(url)
        root = lxml.html.fromstring(html)
        hrefs = root.cssselect('td a')
        url = url[0:34]
        for href in hrefs:
            linkurl = href.attrib['href']
            url = urlparse.urljoin(url, linkurl)
            sourcepageurl.append(url)

def sourcepagescraper(): 
    """Function scrapes respective data for each accident and placed it into DB"""
    for url in sourcepageurl:
        try: 
            html = scraperwiki.scrape(url)
            root = lxml.html.fromstring(html)
            for tr in root.cssselect("body"):
                tds = tr.cssselect("td")
                location = coorlookup(tds[7].text_content())
                for td in tds:
                    crashinfo = {}
                    crashinfo['url'] = url
                    crashinfo['date'] = dateutil.parser.parse(tds[3].text_content()).date()
                    crashinfo['time'] = tds[5].text_content()
                    crashinfo['location'] = tds[7].text_content()
                    crashinfo['latitude'] = location[1][0]
                    crashinfo['longitude'] = location[1][1]
                    crashinfo['operator'] = tds[9].text_content()
                    crashinfo['flight no'] = tds[11].text_content()
                    crashinfo['route'] = tds[13].text_content()
                    crashinfo['aircraft type'] = tds[15].text_content()
                    crashinfo['registration'] = tds[17].text_content()
                    crashinfo['cn ln'] = tds[19].text_content()
                    crashinfo['aboard'] = tds[21].text_content()
                    crashinfo['fatalities'] = tds[23].text_content()
                    crashinfo['ground'] = tds[25].text_content()
                    crashinfo['summary'] = tds[27].text_content()

                scraperwiki.sqlite.save(unique_keys=['url'], data=crashinfo)
        except urllib2.HTTPError, err:
            if err.code == 404:
                continue

def coorlookup(location):
    """Function is called from the 'sourcepagescraper' function to geolocate locations listed on website for each accident"""
    g = geocoders.Google()
    try:
        loc = g.geocode(location, exactly_one=True)
        return loc
    except:
        return ("",("",""))

parenturlscraper()
childurlscraper()
sourcepagescraper()

Winston Ewert · Answer 1 · 2013-02-13 16:35:49Z

__version__ = '0.1'
__author__ = 'antmancoder'

# Importing of modules required for the script to run successfully

Comments should be used for the non-obvious parts of your code. Explaining that we have imports is obvious. It doesn't need a comment.

import scraperwiki
import lxml.html
import urlparse
import urllib2
import dateutil.parser
from geopy import geocoders

# Introduction of various global variables required throughout the running of the code
urlstem = "http://planecrashinfo.com"
urlyeardb = "database.htm"

Global constants should be in ALL_CAPS, by convention

yearsource = urlparse.urljoin(urlstem, urlyeardb)
yearlist = []
sourcepageurl = []

You shouldn't have global variables. It is better to return and pass parameters.

def parenturlscraper():
    """Function scrapes all of the parent URLs from 'planecrashinfo.com/database'"""
    html = scraperwiki.scrape(yearsource)
    root = lxml.html.fromstring(html)

    hrefs = root.cssselect('td a')

    for href in hrefs:
        link = href.attrib['href']
        url = urlparse.urljoin(urlstem, link)
        yearlist.append(url)

Instead of appending to a global variable, you should append them into a local list and return it.

def childurlscraper():
    """Function scrapes all of the child URLs from those scraped in the parenturlscraper module"""
    for url in yearlist:
        html = scraperwiki.scrape(url)
        root = lxml.html.fromstring(html)

Just did this, you should write a function that that combines the above two lines

        hrefs = root.cssselect('td a')
        url = url[0:34]

Why? This is a good place to have a comment explaining why you just did that.

        for href in hrefs:
            linkurl = href.attrib['href']
            url = urlparse.urljoin(url, linkurl)

You should store in child_url. As it is you are reusing the variable from before. Your code will actually reeatedly join the elements onto each other. So I'm pretty sure that's not what you wanted.

            sourcepageurl.append(url)

Again, append onto a local list, and then return that.

def sourcepagescraper(): 
    """Function scrapes respective data for each accident and placed it into DB"""
    for url in sourcepageurl:
        try: 
            html = scraperwiki.scrape(url)
            root = lxml.html.fromstring(html)
            for tr in root.cssselect("body"):

Okay, you call it tr but its actually the body. Also, isn't there only ever one body? Do you really need a lloop?

                tds = tr.cssselect("td")
                location = coorlookup(tds[7].text_content())
                for td in tds:

This doesn't seem right. You wnat to iterate over each cell? But then you look at all the cells to fetch the data.

                    crashinfo = {}
                    crashinfo['url'] = url
                    crashinfo['date'] = dateutil.parser.parse(tds[3].text_content()).date()
                    crashinfo['time'] = tds[5].text_content()
                    crashinfo['location'] = tds[7].text_content()
                    crashinfo['latitude'] = location[1][0]
                    crashinfo['longitude'] = location[1][1]
                    crashinfo['operator'] = tds[9].text_content()
                    crashinfo['flight no'] = tds[11].text_content()
                    crashinfo['route'] = tds[13].text_content()
                    crashinfo['aircraft type'] = tds[15].text_content()
                    crashinfo['registration'] = tds[17].text_content()
                    crashinfo['cn ln'] = tds[19].text_content()
                    crashinfo['aboard'] = tds[21].text_content()
                    crashinfo['fatalities'] = tds[23].text_content()
                    crashinfo['ground'] = tds[25].text_content()
                    crashinfo['summary'] = tds[27].text_content()

You can use something like:

crashinfo = {
        'url' : url,
        'date' : dateutil.parser.parse(tds[3].text_content())).date()
        ...
}

which will probably be a bit better. Also, I'd avoid reference the cells by the index. That'll break if new cells are added. Are there classes on the cells you can use? I'd also probably make a big list like:

DATA = [
    ('time', 5),
    ('operator', 9)
]

And then loop over the list to fill in the dict.

                scraperwiki.sqlite.save(unique_keys=['url'], data=crashinfo)
        except urllib2.HTTPError, err:
            if err.code == 404:
                continue

Firstly, you want to put as little as possible in the try block. This is helped by the fact that python has an else block. Everything after .scrape should be in the else block. But the error handling also is a bit off. I think you want to ignore scrapings when the page cannot be found. But you should really reraise the exception when its not that. As it stands, you ignore it. Also, continue is exactly the same as pass here.

def coorlookup(location):
    """Function is called from the 'sourcepagescraper' function to geolocate locations listed on website for each accident"""
    g = geocoders.Google()

Avoid single letter variables names, they are harder to follow

    try:
        loc = g.geocode(location, exactly_one=True)
        return loc

Combine those two lines

    except:

Don't ever (well almost ever) use bare excepts. Catch the specif error you want to handle. return ("",("",""))

parenturlscraper()
childurlscraper()
sourcepagescraper()

asked	2 years ago
viewed	264 times
active	1 year ago

current community

your communities

more stack exchange communities

URL and source page scraper

1 Answer 1

Your Answer

Not the answer you're looking for? Browse other questions tagged python url web-scraping or ask your own question.

Hot Network Questions

current community

your communities

more stack exchange communities

URL and source page scraper

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged python url web-scraping or ask your own question.

Related

Hot Network Questions