
I am trying to parse an HTML file using Python without any external modules. The reason is that I am triggering a Jenkins job and running into import issues with lxml and BeautifulSoup. I tried resolving them, but I think I am over-engineering things just to get this done.

Input:

    <tr class="test">
    <td class="test">
      <a href="a.html">BA</a>
    </td>
    <td class="duration">
      0.000s
    </td>

        <td class="zero number">0</td>

        <td class="zero number">0</td>

        <td class="zero number">0</td>

    <td class="passRate">
            N/A
          </td>
  </tr>

  <tr class="test">
    <td class="test">
      <a href="o.html">Aa</a>
    </td>
    <td class="duration">
      0.000s
    </td>

        <td class="zero number">0</td>

        <td class="zero number">0</td>

        <td class="zero number">0</td>

    <td class="passRate">
            N/A
          </td>
  </tr>

  <tr class="test">
    <td class="test">
      <a href="g.html">GG</a>
    </td>
    <td class="duration">
      0.390s
    </td>

        <td class="zero number">0</td>

        <td class="zero number">0</td>

        <td class="zero number">0</td>

    <td class="passRate">
            N/A
          </td>
  </tr>

  <tr class="suite">
    <td colspan="2" class="totalLabel">Total</td>

        <td class="zero number">271</td>

        <td class="zero number">0</td>

        <td class="fail number">3</td>

        <td class="zero number">4</td>


    <td class="passRate suite">
            98%
          </td>

  </tr>

Output:

I want to take the specific `tr` block with the class "suite" (the last one in the input), pull the values of all its `td` tags, and assign them to variables.

For example, the extracted values would be:

    271
    0
    3
    4
    98%

Finally, I want to assign these values to variables, each on its own line, so my final output will be:

    A = 271
    B = 0
    C = 3
    D = 4
    E = 98%

Here is what I tried with lxml:

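    from lxml.html import parse  # assumed import; the original snippet omitted it

    tree = parse(HTML_FILE)
    # Text of every <td> inside the <tr class="suite"> row
    tds = tree.xpath("//tr[@class='suite']//td/text()")
    val = map(str.strip, tds)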

This works locally, but I really want to do it without any external dependencies. Should I use `strip()`, or check for and open the file using `os.path.isfile()`? I may be off track, so please advise or walk me through a way to do this.
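One dependency-free option is the standard library's `html.parser` module. The following is only a minimal sketch, assuming Python 3 (on Python 2 the module is named `HTMLParser`); the names `SuiteRowParser` and `parse_suite_row` are made up for illustration and are not part of any library.

```python
from html.parser import HTMLParser


class SuiteRowParser(HTMLParser):
    """Collect the text of each <td> inside a <tr> whose class list contains "suite"."""

    def __init__(self):
        super().__init__()
        self.in_suite_row = False
        self.in_cell = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        # The class attribute may hold several names, e.g. "zero number".
        classes = (dict(attrs).get("class") or "").split()
        if tag == "tr" and "suite" in classes:
            self.in_suite_row = True
        elif tag == "td" and self.in_suite_row:
            self.in_cell = True
            self.values.append("")

    def handle_endtag(self, tag):
        if tag == "tr":
            self.in_suite_row = False
        elif tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Accumulate text, since a cell's text can arrive in more than one chunk.
        if self.in_cell:
            self.values[-1] += data


def parse_suite_row(html):
    """Return the stripped cell texts of the suite row(s) in the given HTML string."""
    parser = SuiteRowParser()
    parser.feed(html)
    parser.close()
    return [value.strip() for value in parser.values]
```

Reading the report with `open(HTML_FILE).read()` and passing the string to `parse_suite_row` should give a list that starts with the "Total" label, followed by the numeric cells. The class of each `td` does not matter here, because cells are selected by the row they sit in.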

The most difficult part I can think of is that, in the last `tr` block of my input, several of the `td` tags share the same class ("zero number"), so I am not sure how to tell them apart.

The approach I could think of is to extract that block, remove all the tags so only the content is left, and then assign the values line by line. However, I am not good at regular expressions.
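The approach described above can be sketched with the standard `re` module. This is a rough illustration rather than a robust parser: it assumes the row is written exactly as `<tr class="suite">`, that there are no nested tables, and that the cells contain plain text. The function name `suite_values` and the variable names are hypothetical.

```python
import re


def suite_values(html):
    """Return the stripped text of every <td> in the first <tr class="suite"> row."""
    # Step 1: cut out the block between <tr class="suite"> and its closing </tr>.
    row = re.search(r'<tr\s+class="suite"\s*>(.*?)</tr>', html, re.DOTALL)
    if row is None:
        return []
    # Step 2: capture each cell's content, whatever its class attribute is.
    cells = re.findall(r'<td\b[^>]*>(.*?)</td>', row.group(1), re.DOTALL)
    # Step 3: drop the surrounding whitespace and newlines.
    return [cell.strip() for cell in cells]


sample = """
  <tr class="suite">
    <td colspan="2" class="totalLabel">Total</td>
        <td class="zero number">271</td>
        <td class="zero number">0</td>
        <td class="fail number">3</td>
        <td class="zero number">4</td>
    <td class="passRate suite">
            98%
          </td>
  </tr>
"""

values = suite_values(sample)
# values == ['Total', '271', '0', '3', '4', '98%']

# Skip the "Total" label and assign the rest by position.
_label, a, b, c, d, e = values
for name, value in zip("ABCDE", (a, b, c, d, e)):
    print(name, "=", value)
```

Because the cells are taken in document order, the repeated "zero number" class is not a problem: position in the row, not the class, identifies each value.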

This is not a duplicate of "Parse HTML file using Python without external module"; the input and the expected output here are different.

  • Why do you need to do without any external dependencies? Commented Feb 4, 2016 at 19:02
  • @alecxe I will be triggering a Jenkins job for automation, so in order to use external dependencies such as lxml or BeautifulSoup, which are no doubt useful, I would need to set up a virtual env to do the pip install. That means the overhead of configuring all 20 of my slave machines to support virtual envs. So I really want to do this with regular expressions. Commented Feb 4, 2016 at 19:34
  • Guys, any suggestions on how to solve this using regular expressions? Commented Feb 5, 2016 at 5:00
