I wrote a crawler that collects the status code of every page it visits. Below is my solution. Can this code be optimized?
import urllib

def getfromurl(url):
    # Download a page and return its body as a single string.
    start = urllib.urlopen(url)
    raw = ''
    for lines in start.readlines():
        raw += lines
    start.close()
    return raw

def dumbwork(start_link, start_url, text, pattern, counter):
    # Scan `text` for links matching `pattern`, log each URL with its HTTP
    # status code, and recurse into each page at most two levels deep.
    if counter < 2:
        counter = counter + 1
        while start_link != -1:
            try:
                start_url = text.find('/', start_link)
                end_url = text.find('"', start_url + 1)
                url = 'http:/' + text[start_url + 1 : end_url]
                page_status = str(urllib.urlopen(url).getcode())
                row = url + ', ' + page_status
                t.write(row + '\n')
                temp = str(getfromurl(url))
                print row
                dumbwork(temp.find(pattern), 0, temp, pattern, counter)
                start_link = text.find(pattern, end_url + 1)
            except Exception, e:
                break
    else:
        pass

t = open('inout.txt', 'w')
text = str(getfromurl('http://www.site.it'))
pattern = '<a href="http:/'
start_link = text.find(pattern)
dumbwork(start_link, 0, text, pattern, 0)
t.close()
Doesn't `return requests.head(url).status_code` from the requests module do this for you? I usually use this module as it's straightforward and you have a lot fewer headaches if you use it over urllib.
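A minimal sketch of that suggestion, assuming the third-party `requests` package is installed; the helper name `check_status` is illustrative and not part of the original code:

import requests

def check_status(url):
    # A HEAD request asks only for the response headers, so no page body is downloaded.
    return requests.head(url, allow_redirects=True).status_code

print(check_status('http://www.site.it'))  # e.g. 200

`allow_redirects=True` makes the call follow 301/302 responses like a browser would (requests.head does not follow them by default); drop it if you want the raw status of the original URL.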