Python webpage parser

Question

I have a working code, but it has a lot of flaws.

The main project is to get a value from a website. I need to complete this action million of times, so i download the webpage returned from the class method, and parse it afterwards. In case the page is already downloaded i just open that one up.

This code is working, however even with the multiple except socket.error i do sometimes get socket errors when running it in parallel.

Also i am kinda new to programming, so any advice would be really appreciated. (I know this code is kinda too long)

class Uniprot:
    """
    Gets the UNIPROT XML or HTML, very bad code.... Cant come up with a better one yet :(
    """

    def __init__(self, uniid):
        """
        Initialize stuff
        :param uniid: Same as in parent class
        :return:
        """
        self.uniid = uniid.upper()
        self.soup = ""

    def xml(self):
        """
        returns the full XML file
        :return:
        """
        xmlpage = urllib2.urlopen("http://www.uniprot.org/uniprot/%s.xml" % self.uniid, timeout=2)
        if xmlpage.getcode() != 200:
            return
        self.soup = BeautifulSoup(xmlpage)
        return self.soup.get_text

    def web(self):
        """
        Returns the full HTML file
        :return:
        """
        webpage = urllib2.urlopen("http://www.uniprot.org/uniprot/%s" % self.uniid, timeout=2)
        # UNIPROT error handler
        if webpage.getcode() != 200:
            return
        websoup = BeautifulSoup(webpage)
        return websoup.get_text

def uniprotannotation():
    """
    Chaos, and anarchy right again, same with the errors, plus its an HTML variable instead of an XML
    :return: Annotation score from UNIPROT
    """
    uniprotxml = ""
    soup = ""
    # Check if the folder is present
    if not os.path.isdir("%s/%s/UNIPROT" % (DATA_DIR, FASTA_OBJECT.organism())):
        os.makedirs("%s/%s/UNIPROT" % (DATA_DIR, FASTA_OBJECT.organism()))
    # Check if the file is present, if not create it
    try:
        f = open("%s/%s/UNIPROT/%s.html" % (DATA_DIR,
                                            FASTA_OBJECT.organism(), FASTA_OBJECT.accession()), "r")
        for i in f:
            uniprotxml += i
        f.close()
        # Check if the file is empty
        if os.stat("%s/%s/UNIPROT/%s.html" % (DATA_DIR,
                                              FASTA_OBJECT.organism(), FASTA_OBJECT.accession())).st_size == 0:
            os.remove("%s/%s/UNIPROT/%s.html" % (DATA_DIR,
                                                 FASTA_OBJECT.organism(), FASTA_OBJECT.accession()))
            sys.stderr.write(
                    "%s, removed empty html, running it again\n" % sys.argv[1])
            uniprotannotation()
        soup = BeautifulSoup(uniprotxml)
    except IOError:
        try:
            soup = BeautifulSoup(str(FASTA_OBJECT.Uniprot(sys.argv[1]).web()))
        except socket.timeout:
            sys.stderr.write(
                    "%s, HTML socket timeout, trying again\n" % sys.argv[1])
            uniprotannotation()
        except socket.error:
            sys.stderr.write(
                    "%s, HTML socket error, trying again\n" % sys.argv[1])
            uniprotannotation()
        f = open("%s/%s/UNIPROT/%s.html" % (DATA_DIR,
                                            FASTA_OBJECT.organism(), FASTA_OBJECT.accession()), "w")
        f.write(str(FASTA_OBJECT.Uniprot(sys.argv[1]).web()))
        f.close()
    if not soup:
        return "-"
    result = re.search("Annotation score: \d out of \d", str(soup.get_text))
    if not result:
        return "-"
    return re.search("\d", str(result.group())).group()[0]

Code Review isn't really a bug fixing site, but it is still helpful to explain where you're getting those socket errors, it would help anyone who wanted to give feedback on the bug that interferes with parallel processing. — SuperBiasedMan, Feb 10 at 17:26
I’m no expert but shouldn't Beautifulsoup be explicitly told to process XML? — Mathias Ettinger, Feb 10 at 20:40

SuperBiasedMan · Accepted Answer · 2016-02-10 17:25:15Z

This looks good, nice work! I do think Uniprot is a bit long, but mostly because of comments.

"""
Gets the UNIPROT XML or HTML, very bad code.... Cant come up with a better one yet :(
"""

No need to criticise yourself in docstrings. Aside from everything else, this makes your comment too long without information. If it's really that bad, people will notice! Just stick with explaining unclear aspects to the code, or giving context to why the code does what it does. Other unnecessary comments:

    Initialize stuff

Most people familiar with classes will know what __init__ does. Also avoid vague language like "stuff". Initialise parameters would tell you slightly more, initialise uniid or even Unification ID. I obviously made that up, but only that last one adds information (telling the user what a uniid is).

    :return:

I'm not familiar with the docstring convention you're using, but does it really insist on including return lines even when they're blank? Not to mention the fact that you do return things from two of these functions, but your docstring claims otherwise. Either remove it or make the return lines accurate.

Next, you only ever use uniid to do this:

    "http://www.uniprot.org/uniprot/%s.xml" % self.uniid

Why not just initialise the URL in __init__ instead then?

def __init__(self, uniid):
    self.url = uniid.upper("http://www.uniprot.org/uniprot/%s" % self.uniid)

Now you can use self.url and self.url + '.xml' instead. Still on __init__, is there any need to initialise soup? Or even to store it as an attribute at all? You only need to return it from the function. Like you do with websoup. If there's an advantage to keeping it, why not do the same with websoup? You may have good reason, but if there's inconsistent treatment in your code you should comment it for the sake of clarity.

Next, you use a couple magic numbers in your code. Magic numbers are basically when you put numbers into code but it's not really clear from context what they mean or why they have to be that value.

    if xmlpage.getcode() != 200:

This is presumable a success code that the page returns? But it would be more readable like this:

    if xmlpage.getcode() != Uniprot.SUCCESS_CODE:

This is a constant value now, that you'd define as part of the class here:

class Uniprot:

    SUCCESS_CODE = 200

Neater and more readable, it makes perfect sense now (not that your initial case was the most egregious). Similarly, you set timeout to 2 in both cases. This is clearer to follow since timeout=2 is quite readable. But it's better organised if you put this as a class constant or even a value in __init__ that defaults to 2. Either way, it allows a user to change the value. What if they have woefully slow internet? Allowing them access makes the class more useful.

Your file reading seems a bit messy. I can't even tell why you're using such a complex method to check if the file is empty after reading it. So let's start with what I do know. All your file reading can be reduced to this:

    filename = "{}/{}/UNIPROT/{}.html".format(DATA_DIR,
                                              FASTA_OBJECT.organism(),
                                              FASTA_OBJECT.accession()))
    with open(filename) as f:
        uniprotxml = f.read()

format is the newer string formatting function that's recommended over the old % formatting. with is a way of opening files that will always ensure they are closed properly, no matter what. It's also generally recommended. And f.read() just reads the whole file as a string in one go. No need to iterate over the file to read line by line and add that to uniprotxml.

Also you're using recursion to retry. But that's a messy solution. Instead consider using a loop, and call either return or break to end the loop when you get a successful result, but otherwise loop indefinitely (or for a limited number of tries).

Tanks a lot! I really appreciate that you went through the full code. All of your suggestions are great, and i learned a lot from them. Thanks again! — Gábor Erdős, Feb 11 at 10:24
@GáborErdős Glad to help, and I'm happy to hear you learned from them. :) — SuperBiasedMan, Feb 11 at 10:30

asked	8 months ago
viewed	68 times
active	8 months ago

current community

your communities

more stack exchange communities

Python webpage parser

1 Answer 1

Your Answer

Not the answer you're looking for? Browse other questions tagged python beginner html xml beautiful-soup or ask your own question.

Hot Network Questions

current community

your communities

more stack exchange communities

Python webpage parser

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged python beginner html xml beautiful-soup or ask your own question.

Related

Hot Network Questions