HTML downloader and parser - Would you review my code?

Question

Good morning! (or afternoon, or even evening!)

This program downloads CodeReview HTML file and parses it.

Could you review my program?

There are 2 classes:

First - main.java

import java.net.URL;

public class Main {
    public static void main(String[] args) throws Exception {
        final String site1 = "http://codereview.stackexchange.com/questions/";
        String site2;
        String site3;
        URL url;
        HtmlGetter htmlGetter = new HtmlGetter();

        while(true) {
            site2 = "69";
            site3 = "is-this-implementation-of-shamos-hoey-algorithm-ok";
            url = new URL(site1 + site2 + "/" + site3);
            htmlGetter.setFileName("[" + site2 + "]" + site3 + ".html");
            htmlGetter.download(url);
            htmlGetter.parse();
            break;
        }
    }
}

Second - HtmlGetter.java

import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.net.URL;
import java.util.Scanner;

import org.apache.commons.lang3.StringEscapeUtils; 
import org.apache.commons.io.IOUtils;

public class HtmlGetter {
    private String fileName;

    public void setFileName(String fileName) {
        this.fileName = fileName;
    }

    public String getFileName() {
        return fileName;
    }

    public void download(URL url) throws Exception {
        final InputStream in = url.openStream();
        final OutputStream out = new FileOutputStream(fileName);
        IOUtils.copy(in, out);
        in.close();
        out.close();
    }

    public void parse() throws Exception {
        Scanner scanner = new Scanner(new File(fileName));
        String oneLine;
        String htmlTagRegex = "<(/)?([a-zA-Z]*)(\\s[a-zA-Z]*=[^>]*)?(\\s)*(/)?>";

        while(scanner.hasNextLine()) {
            oneLine = scanner.nextLine();
            if(oneLine.matches(".*<code>.*")) {
                while(scanner.hasNextLine()&&(!oneLine.matches(".*</code>.*"))) {
                    oneLine = StringEscapeUtils.unescapeHtml4(oneLine.replaceAll(htmlTagRegex, ""));
                    System.out.println(oneLine);
                    oneLine = scanner.nextLine();
                    // This cannot handle such kind of a line which has <code> and </code> together.
                }
                System.out.println("-------------------------------------------");
            }
        }           
    }
}

Thank you in advance!!! :)

Winston Ewert · Accepted Answer · 2012-05-06 00:39:00Z

public static void main(String[] args) throws Exception {
    final String site1 = "http://codereview.stackexchange.com/questions/";
    String site2;
    String site3;

Why do define site1 as a final constant, but delay the other assignments until later? They should really be static finals for the class. Also some clearer names would not go amiss.

    URL url;

Declare them where they are used, don't declare them ahead of time.

    HtmlGetter htmlGetter = new HtmlGetter();

    while(true) {
        site2 = "69";
        site3 = "is-this-implementation-of-shamos-hoey-algorithm-ok";
        url = new URL(site1 + site2 + "/" + site3);
        htmlGetter.setFileName("[" + site2 + "]" + site3 + ".html");
        htmlGetter.download(url);
        htmlGetter.parse();
        break;

Why do you have a loop if you are just going to break out of it? } }

public void parse() throws Exception {

You shouldn't throw Exception, your specification should really only include the exceptions that'll actually be thrown

    Scanner scanner = new Scanner(new File(fileName));
    String oneLine;
    String htmlTagRegex = "<(/)?([a-zA-Z]*)(\\s[a-zA-Z]*=[^>]*)?(\\s)*(/)?>";

Parsing HTML by regex is asking for trouble. Instead, you'll find the task much easier if you grab a html parsing library and use that.

    while(scanner.hasNextLine()) {
        oneLine = scanner.nextLine();
        if(oneLine.matches(".*<code>.*")) {

You are just searching for a substring here, there isn't much point in using regex to do that.

            while(scanner.hasNextLine()&&(!oneLine.matches(".*</code>.*"))) {

If you insist on regex, use a single regex to capture the entire code tag.

                oneLine = StringEscapeUtils.unescapeHtml4(oneLine.replaceAll(htmlTagRegex, ""));
                System.out.println(oneLine);
                oneLine = scanner.nextLine();
                // This cannot handle such kind of a line which has <code> and </code> together.
            }
            System.out.println("-------------------------------------------");
        }
    }           
}

Using something like jsoup: http://jsoup.org/

Document question = Jsoup.connect("http://codereview.stackexchange.com/questions/69").get();
Elements codes = question.select("code")
for(Element code: codes)
{
   System.out.println(code.text())
}

Thank you for teaching me jsoup!!! – S.J. May 7 '12 at 12:03

asked	2 years ago
viewed	245 times
active	2 years ago

current community

your communities

more stack exchange communities

HTML downloader and parser - Would you review my code?

1 Answer

Your Answer

Not the answer you're looking for? Browse other questions tagged java html or ask your own question.

Linked

Hot Network Questions

current community

your communities

more stack exchange communities

HTML downloader and parser - Would you review my code?

1 Answer

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged java html or ask your own question.

Linked

Related

Hot Network Questions