Take the 2-minute tour ×
Code Review Stack Exchange is a question and answer site for peer programmer code reviews. It's 100% free, no registration required.

I am trying to parse HTML using "jsoup". This is my first time working with "jsoup" and I read some tutorial on it as well.

Below is my HTML table, which I am trying to parse.

If you see my table, it has three <tr> as of now (I have shortened it down to have three table rows just for understanding purpose, but in general it will be more). Now I would like to extract Cluster Name from my table and its corresponding host name; so for example, I would extract Titan as cluster name and all its hostname whose status are down.

As you can see below, for Titan cluster name, I have two hostnames machineA.abc.com and machineB.abc.com in which machineA status is up but machineB status is down.

So, I will print out Titan as cluster name and print out machineB.abc.com as the hostname, since it is down.

<table border=1>
   <tr>
      <td>&nbsp;</td>
      <td>&nbsp;</td>
      <td>Alert</td>
      <td>Cluster Name</td>
      <td>IP addr</td>
      <td>Host Name</td>
      <td>Type</td>
      <td>Status</td>
      <td>Free</td>
      <td>Version</td>
      <td>Restart Time</td>
      <td>UpTime(Days)</td>
      <td>Last probed</td>
      <td>Last up</td>
   </tr>
   <tr bgcolor="ffffff">
      <td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
      <td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
      <td bgcolor="ffffff">&nbsp</td>
      <td>Titan</td>
      <td>10.100.111.77</td>
      <td>machineA.abc.com</td>
      <td></td>
      <td bgcolor="ffffff">up</td>
      <td bgcolor="ffffff" align=right>88%</td>
      <td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
      <td bgcolor="ffffff">2014-07-04 01:49:08,220</td>
      <td bgcolor="ffffff" align=right>381</td>
      <td>07-14 20:01:59</td>
      <td>07-14 20:01:59</td>
   </tr>
   <tr bgcolor="ffffff">
      <td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
      <td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
      <td bgcolor="ffffff">&nbsp</td>
      <td></td>
      <td>10.200.192.99</td>
      <td>machineB.abc.com</td>
      <td></td>
      <td bgcolor="ffffff">down</td>
      <td bgcolor="ffffff" align=right>85%</td>
      <td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
      <td bgcolor="ffffff">2014-07-04 01:52:20,613</td>
      <td bgcolor="ffffff" align=right>103</td>
      <td>07-14 20:01:59</td>
      <td>07-14 20:01:59</td>
   </tr>
   <tr bgcolor="ffffff">
      <td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
      <td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
      <td bgcolor="ffffff">&nbsp</td>
      <td>Goldy</td>
      <td>10.100.111.77</td>
      <td>machineH.pqr.com</td>
      <td></td>
      <td bgcolor="ffffff">up</td>
      <td bgcolor="ffffff" align=right>88%</td>
      <td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
      <td bgcolor="ffffff">2014-07-04 01:49:08,220</td>
      <td bgcolor="ffffff" align=right>381</td>
      <td>07-14 20:01:59</td>
      <td>07-14 20:01:59</td>
   </tr>       
</table>

Now if you see in above HTML, I have two cluster names: one is Titan and the other is Goldy; so I want to find all the machines which are down for Titan cluster name only.

Below is my code, which is working fine. So I am opting for code review to see whether we can improve this code slightly.

public static void main(String[] args) throws JSONException, IOException {
    URL url = new URL("some_url");
    Document doc = Jsoup.parse(url, 3000);

    ArrayList<String> downServers = new ArrayList<>();
    Element table = doc.select("table").get(0); //select the first table.
    Elements rows = table.select("tr");

    for (int i = 1; i < rows.size(); i++) { //first row is the col names so skip it.
        Element row = rows.get(i);
        Elements cols = row.select("td");

        if (cols.get(3).text().equals("Titan")) {
            if (cols.get(7).text().equals("down"))
                downServers.add(cols.get(5).text());

            do {
                if(i < rows.size() - 1)
                   i++;
                row = rows.get(i);
                cols = row.select("td");
                if (cols.get(7).text().equals("down") && cols.get(3).text().equals("")) {
                    downServers.add(cols.get(5).text());
                }
                if(i == rows.size() - 1)
                    break;
            }
            while (cols.get(3).text().equals(""));
            i--;
        }
    }
    System.out.println(downServers);
}   
share|improve this question

1 Answer 1

up vote 4 down vote accepted

If you don't need implementation specific features, declare variables using interface types. That is, instead of:

ArrayList<String> downServers = new ArrayList<>();

Use the List interface:

List<String> downServers = new ArrayList<>();

You can remove the JSONException throws declaration from the main, as it will never happen.


Your main loop can be written a bit simpler, for example:

Iterator<Element> rowIterator = rows.iterator();
rowIterator.next();
boolean wasMatch = false;

while (rowIterator.hasNext()) {
    Element row = rowIterator.next();
    Elements cols = row.select("td");
    String clusterName = cols.get(3).text();

    if (wasMatch && clusterName.isEmpty() || clusterName.equals("Titan")) {
        wasMatch = true;
        if (cols.get(7).text().equals("down")) {
            downServers.add(cols.get(5).text());
        }
    } else {
        wasMatch = false;
    }
}

I would also wrap a row in a class with convenient accessors like this:

class ServerInfo {
    final Elements cols;

    ServerRowWrapper(Elements cols) {
        this.cols = cols;
    }

    String getClusterName() {
        return cols.get(3).text();
    }

    String getServerName() {
        return cols.get(5).text();
    }

    boolean isDown() {
        return cols.get(7).text().equals("down");
    }
}

which will make the main loop much more intuitive and free from hardcoded numbers like 3, 5, 7:

while (rowIterator.hasNext()) {
    ServerInfo serverInfo = new ServerInfo(rowIterator.next().select("td"));
    String clusterName = serverInfo.getClusterName();

    if (wasMatch && clusterName.isEmpty() || clusterName.equals("Titan")) {
        wasMatch = true;
        if (serverInfo.isDown()) {
            downServers.add(serverInfo.getServerName());
        }
    } else {
        wasMatch = false;
    }
}

Finally, to make future changes easier to test, add some unit tests.

As the first step, extract the middle part from your main method, to take a Document as parameter and return the list of servers that are down:

static List<String> findServersDown(Document document) {
    List<String> downServers = new ArrayList<>();
    // ... rest of the code from main
    return downServers;
}

Change the main method to use the new findServersDown method:

public static void mainx(String[] args) throws IOException {
    URL url = new URL("some_url");
    Document doc = Jsoup.parse(url, 3000);
    System.out.println(ServerInfoParser.findServersDown(doc));
}

Add a unit test (or more!), for example:

public class ServerInfoParserTest {
    @Test
    public void testSecondServerDown() throws IOException {
        Document doc = Jsoup.parse(new File("src/test/resources/test-data/table1.html"), "utf-8");
        List<String> downServers = ServerInfoParser.findServersDown(doc);
        assertEquals(Arrays.asList("machineB.abc.com"), downServers);
    }
}

UPDATE (in response to your comment)

Instead of hardcoding Titan, it's better to generalize. It's easy enough to do.

Pass in a set of names to findServersDown and adjust the if condition in the loop:

static List<String> findServersDown(Document document, Set<String> names) {
    // ...

    while (rowIterator.hasNext()) {
        // ...

        if (wasMatch && clusterName.isEmpty() || names.contains(clusterName)) {
            // ...

Then in the unit test:

Set<String> names = new HashSet<>(Arrays.asList("Titan"));
List<String> downServers = ServerInfoParser.findServersDown(doc, names);

This way the method will return the list of server names in any of the clusters you specified.

If instead of a flat list of server names, if you want them grouped by the name of the cluster, that should be easy to do as well, just rework the return type to a Map<String, Set<String>> instead.

share|improve this answer
    
Thanks a lot Janos. Now the code looks much more simpler. Appreciated your help. One last thing, this might not be related to code review - As of now you can see that I need to find hostnames which are down for Titan cluster. Is it possible to extend this to find the hostnames which are down for Titan and Platinum cluster. We can have a map of string and list<string> - here key will be Titan or Platinum and list will be hostnames which are down.? Is that possible here? –  lining Jul 19 '14 at 19:50
    
Yeah, easily. I added one more point to my answer, see at the bottom. –  janos Jul 19 '14 at 20:21
    
That makes sense, then it means, it will return a single list which will have hosts down for both the cluster. Right? Is that possible to return a Map of String and List<String>, here key will be clusterName and list will be hosts down? –  lining Jul 19 '14 at 20:24
    
Yes, and yes ;-) –  janos Jul 19 '14 at 20:33

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.