I am trying to parse HTML using "jsoup". This is my first time working with "jsoup" and I read some tutorial on it as well.
Below is my HTML table, which I am trying to parse.
If you see my table, it has three <tr>
as of now (I have shortened it down to have three table rows just for understanding purpose, but in general it will be more). Now I would like to extract Cluster Name
from my table and its corresponding host name
; so for example, I would extract Titan
as cluster name and all its hostname whose status are down.
As you can see below, for Titan
cluster name, I have two hostnames machineA.abc.com
and machineB.abc.com
in which machineA
status is up
but machineB
status is down
.
So, I will print out Titan
as cluster name and print out machineB.abc.com
as the hostname, since it is down.
<table border=1>
<tr>
<td> </td>
<td> </td>
<td>Alert</td>
<td>Cluster Name</td>
<td>IP addr</td>
<td>Host Name</td>
<td>Type</td>
<td>Status</td>
<td>Free</td>
<td>Version</td>
<td>Restart Time</td>
<td>UpTime(Days)</td>
<td>Last probed</td>
<td>Last up</td>
</tr>
<tr bgcolor="ffffff">
<td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
<td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
<td bgcolor="ffffff"> </td>
<td>Titan</td>
<td>10.100.111.77</td>
<td>machineA.abc.com</td>
<td></td>
<td bgcolor="ffffff">up</td>
<td bgcolor="ffffff" align=right>88%</td>
<td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
<td bgcolor="ffffff">2014-07-04 01:49:08,220</td>
<td bgcolor="ffffff" align=right>381</td>
<td>07-14 20:01:59</td>
<td>07-14 20:01:59</td>
</tr>
<tr bgcolor="ffffff">
<td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
<td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
<td bgcolor="ffffff"> </td>
<td></td>
<td>10.200.192.99</td>
<td>machineB.abc.com</td>
<td></td>
<td bgcolor="ffffff">down</td>
<td bgcolor="ffffff" align=right>85%</td>
<td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
<td bgcolor="ffffff">2014-07-04 01:52:20,613</td>
<td bgcolor="ffffff" align=right>103</td>
<td>07-14 20:01:59</td>
<td>07-14 20:01:59</td>
</tr>
<tr bgcolor="ffffff">
<td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
<td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
<td bgcolor="ffffff"> </td>
<td>Goldy</td>
<td>10.100.111.77</td>
<td>machineH.pqr.com</td>
<td></td>
<td bgcolor="ffffff">up</td>
<td bgcolor="ffffff" align=right>88%</td>
<td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
<td bgcolor="ffffff">2014-07-04 01:49:08,220</td>
<td bgcolor="ffffff" align=right>381</td>
<td>07-14 20:01:59</td>
<td>07-14 20:01:59</td>
</tr>
</table>
Now if you see in above HTML, I have two cluster names: one is Titan
and the other is Goldy
; so I want to find all the machines which are down for Titan
cluster name only.
Below is my code, which is working fine. So I am opting for code review to see whether we can improve this code slightly.
public static void main(String[] args) throws JSONException, IOException {
URL url = new URL("some_url");
Document doc = Jsoup.parse(url, 3000);
ArrayList<String> downServers = new ArrayList<>();
Element table = doc.select("table").get(0); //select the first table.
Elements rows = table.select("tr");
for (int i = 1; i < rows.size(); i++) { //first row is the col names so skip it.
Element row = rows.get(i);
Elements cols = row.select("td");
if (cols.get(3).text().equals("Titan")) {
if (cols.get(7).text().equals("down"))
downServers.add(cols.get(5).text());
do {
if(i < rows.size() - 1)
i++;
row = rows.get(i);
cols = row.select("td");
if (cols.get(7).text().equals("down") && cols.get(3).text().equals("")) {
downServers.add(cols.get(5).text());
}
if(i == rows.size() - 1)
break;
}
while (cols.get(3).text().equals(""));
i--;
}
}
System.out.println(downServers);
}