Take the 2-minute tour ×
Code Review Stack Exchange is a question and answer site for peer programmer code reviews. It's 100% free, no registration required.

Good morning! (or afternoon, or even evening!)

This program downloads CodeReview HTML file and parses it.

Could you review my program?

There are 2 classes:

First - main.java

import java.net.URL;

public class Main {
    public static void main(String[] args) throws Exception {
        final String site1 = "http://codereview.stackexchange.com/questions/";
        String site2;
        String site3;
        URL url;
        HtmlGetter htmlGetter = new HtmlGetter();

        while(true) {
            site2 = "69";
            site3 = "is-this-implementation-of-shamos-hoey-algorithm-ok";
            url = new URL(site1 + site2 + "/" + site3);
            htmlGetter.setFileName("[" + site2 + "]" + site3 + ".html");
            htmlGetter.download(url);
            htmlGetter.parse();
            break;
        }
    }
}

Second - HtmlGetter.java

import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.net.URL;
import java.util.Scanner;

import org.apache.commons.lang3.StringEscapeUtils; 
import org.apache.commons.io.IOUtils;

public class HtmlGetter {
    private String fileName;

    public void setFileName(String fileName) {
        this.fileName = fileName;
    }

    public String getFileName() {
        return fileName;
    }

    public void download(URL url) throws Exception {
        final InputStream in = url.openStream();
        final OutputStream out = new FileOutputStream(fileName);
        IOUtils.copy(in, out);
        in.close();
        out.close();
    }

    public void parse() throws Exception {
        Scanner scanner = new Scanner(new File(fileName));
        String oneLine;
        String htmlTagRegex = "<(/)?([a-zA-Z]*)(\\s[a-zA-Z]*=[^>]*)?(\\s)*(/)?>";

        while(scanner.hasNextLine()) {
            oneLine = scanner.nextLine();
            if(oneLine.matches(".*<code>.*")) {
                while(scanner.hasNextLine()&&(!oneLine.matches(".*</code>.*"))) {
                    oneLine = StringEscapeUtils.unescapeHtml4(oneLine.replaceAll(htmlTagRegex, ""));
                    System.out.println(oneLine);
                    oneLine = scanner.nextLine();
                    // This cannot handle such kind of a line which has <code> and </code> together.
                }
                System.out.println("-------------------------------------------");
            }
        }           
    }
}

Thank you in advance!!! :)

share|improve this question
add comment

1 Answer

up vote 3 down vote accepted
public static void main(String[] args) throws Exception {
    final String site1 = "http://codereview.stackexchange.com/questions/";
    String site2;
    String site3;

Why do define site1 as a final constant, but delay the other assignments until later? They should really be static finals for the class. Also some clearer names would not go amiss.

    URL url;

Declare them where they are used, don't declare them ahead of time.

    HtmlGetter htmlGetter = new HtmlGetter();

    while(true) {
        site2 = "69";
        site3 = "is-this-implementation-of-shamos-hoey-algorithm-ok";
        url = new URL(site1 + site2 + "/" + site3);
        htmlGetter.setFileName("[" + site2 + "]" + site3 + ".html");
        htmlGetter.download(url);
        htmlGetter.parse();
        break;

Why do you have a loop if you are just going to break out of it? } }

public void parse() throws Exception {

You shouldn't throw Exception, your specification should really only include the exceptions that'll actually be thrown

    Scanner scanner = new Scanner(new File(fileName));
    String oneLine;
    String htmlTagRegex = "<(/)?([a-zA-Z]*)(\\s[a-zA-Z]*=[^>]*)?(\\s)*(/)?>";

Parsing HTML by regex is asking for trouble. Instead, you'll find the task much easier if you grab a html parsing library and use that.

    while(scanner.hasNextLine()) {
        oneLine = scanner.nextLine();
        if(oneLine.matches(".*<code>.*")) {

You are just searching for a substring here, there isn't much point in using regex to do that.

            while(scanner.hasNextLine()&&(!oneLine.matches(".*</code>.*"))) {

If you insist on regex, use a single regex to capture the entire code tag.

                oneLine = StringEscapeUtils.unescapeHtml4(oneLine.replaceAll(htmlTagRegex, ""));
                System.out.println(oneLine);
                oneLine = scanner.nextLine();
                // This cannot handle such kind of a line which has <code> and </code> together.
            }
            System.out.println("-------------------------------------------");
        }
    }           
}

Using something like jsoup: http://jsoup.org/

Document question = Jsoup.connect("http://codereview.stackexchange.com/questions/69").get();
Elements codes = question.select("code")
for(Element code: codes)
{
   System.out.println(code.text())
}
share|improve this answer
    
Thank you for teaching me jsoup!!! –  S.J. May 7 '12 at 12:03
add comment

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.