I am trying to parse an HTML file using Python without any external module. The reason is that I am triggering a Jenkins job and running into import issues with lxml and BeautifulSoup (I tried resolving them, and I suspect I am over-engineering things just to get this done).
Input:
<tr class="test">
<td class="test">
<a href="a.html">BA</a>
</td>
<td class="duration">
0.000s
</td>
<td class="zero number">0</td>
<td class="zero number">0</td>
<td class="zero number">0</td>
<td class="passRate">
N/A
</td>
</tr>
<tr class="test">
<td class="test">
<a href="o.html">Aa</a>
</td>
<td class="duration">
0.000s
</td>
<td class="zero number">0</td>
<td class="zero number">0</td>
<td class="zero number">0</td>
<td class="passRate">
N/A
</td>
</tr>
<tr class="test">
<td class="test">
<a href="g.html">GG</a>
</td>
<td class="duration">
0.390s
</td>
<td class="zero number">0</td>
<td class="zero number">0</td>
<td class="zero number">0</td>
<td class="passRate">
N/A
</td>
</tr>
<tr class="suite">
<td colspan="2" class="totalLabel">Total</td>
<td class="zero number">271</td>
<td class="zero number">0</td>
<td class="fail number">3</td>
<td class="zero number">4</td>
<td class="passRate suite">
98%
</td>
</tr>
Output:
I want to take the tr block with the class "suite" (the last one in the input), pull out the values of all its td tags, and assign them to variables.
~~~~~~~~~~~~~~~~~~~~~~~~~~
E.g. the extracted values will be:
271
0
3
4
98%
Finally, I want to assign these values to variables, so my final output will be:
A = 271
B = 0
C = 3
D = 4
E = 98%
~~~~~~~~~~~~~~~~~~~~~~~~~~ Here is what I tried with lxml:
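from lxml.html import parse  # third-party; assuming parse comes from lxml.html

tree = parse(HTML_FILE)
# Text nodes of every td inside the tr whose class is exactly "suite"
tds = tree.xpath("//tr[@class='suite']//td/text()")
val = [td.strip() for td in tds]  # a list, also under Python 3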
This works locally, but I really want to do this without any external dependencies. Should I use strip(), or open the file after checking it with os.path.isfile()? I may be off track, so please advise or walk me through what a solution would look like.
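One dependency-free direction might be the standard library's html.parser module. The following is only a minimal sketch under assumptions: the class name SuiteRowParser, the helper suite_values, and the shortened SAMPLE string are made up for illustration, and the real file would be read with open(HTML_FILE) instead. It looks only at the class of the tr tag, so the classes on the individual td tags do not matter.

```python
from html.parser import HTMLParser


class SuiteRowParser(HTMLParser):
    """Collect the text of every <td> inside a <tr> whose class includes "suite"."""

    def __init__(self):
        super().__init__()
        self.in_suite_row = False  # True while inside the suite <tr>
        self.in_cell = False       # True while inside a <td> of that row
        self.cells = []            # raw text of each cell, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # attrs is a list of (name, value) pairs; class may hold several names
            classes = (dict(attrs).get("class") or "").split()
            self.in_suite_row = "suite" in classes
        elif tag == "td" and self.in_suite_row:
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "tr":
            self.in_suite_row = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data


def suite_values(html_text):
    """Return the stripped cell texts of the suite row (hypothetical helper)."""
    parser = SuiteRowParser()
    parser.feed(html_text)
    parser.close()
    return [cell.strip() for cell in parser.cells]


# Shortened stand-in for the real input; in practice something like:
#     with open(HTML_FILE) as f: html_text = f.read()
SAMPLE = """
<tr class="test">
<td class="test"><a href="g.html">GG</a></td>
<td class="zero number">0</td>
</tr>
<tr class="suite">
<td colspan="2" class="totalLabel">Total</td>
<td class="zero number">271</td>
<td class="zero number">0</td>
<td class="fail number">3</td>
<td class="zero number">4</td>
<td class="passRate suite">
98%
</td>
</tr>
"""

values = suite_values(SAMPLE)     # ['Total', '271', '0', '3', '4', '98%']
A, B, C, D, E = values[1:]        # drop the "Total" label cell
```

The values come back as strings, so A would be '271' rather than the integer 271; converting with int() is left out because the last value ('98%') is not a plain number.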
**The most difficult part I can think of: in the last tr block of my input, a couple of the td tags have class "zero number", so how do you handle that?
**The approach I could think of is to take out that block, remove all the tags so that only the content remains, and then assign the values line by line. However, I am not good at regular expressions.
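The approach just described (take out the block, remove the tags, assign the values) could be sketched with the standard re module roughly as follows. This is an assumption-laden illustration, not a robust HTML parser: the function name suite_values_regex and the SAMPLE string are hypothetical, and the patterns assume double-quoted class attributes and no nested tables, as in the input shown above.

```python
import re


def suite_values_regex(html_text):
    """Return stripped td texts from the first <tr> whose class contains "suite"."""
    # Step 1: take out the suite row, from its opening <tr ...> to the next </tr>.
    row = re.search(
        r'<tr\b[^>]*\bclass="[^"]*\bsuite\b[^"]*"[^>]*>(.*?)</tr>',
        html_text,
        re.DOTALL | re.IGNORECASE,
    )
    if row is None:
        return []
    # Step 2: pull the contents of each <td>, whatever its class is.
    cells = re.findall(
        r'<td\b[^>]*>(.*?)</td>', row.group(1), re.DOTALL | re.IGNORECASE
    )
    # Step 3: remove any leftover tags, then surrounding whitespace.
    return [re.sub(r'<[^>]+>', '', cell).strip() for cell in cells]


# Shortened stand-in for the real file contents.
SAMPLE = """
<tr class="test">
<td class="zero number">0</td>
</tr>
<tr class="suite">
<td colspan="2" class="totalLabel">Total</td>
<td class="zero number">271</td>
<td class="zero number">0</td>
<td class="fail number">3</td>
<td class="zero number">4</td>
<td class="passRate suite">
98%
</td>
</tr>
"""

values = suite_values_regex(SAMPLE)  # ['Total', '271', '0', '3', '4', '98%']
A, B, C, D, E = values[1:]           # skip the "Total" label cell
```

Because the row is isolated first, the repeated "zero number" class on the td tags never has to be matched at all; only the tr's class is used to locate the block.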
This is not a duplicate of "Parse HTML file using Python without external module"; the input and the expected output here are different.