Specific HTML block fetch and parse using Regular Expression (Python)

Asked 9 years, 7 months ago

Modified 9 years, 7 months ago

Viewed 116 times

I am trying to parse an HTML file using Python without any external modules. The reason is that I am triggering a Jenkins job and running into import issues with lxml and BeautifulSoup (I tried resolving them, and I think I am over-engineering somewhere just to get this done).

Input:

    <tr class="test">
    <td class="test">
      <a href="a.html">BA</a>
    </td>
    <td class="duration">
      0.000s
    </td>

        <td class="zero number">0</td>

        <td class="zero number">0</td>

        <td class="zero number">0</td>

    <td class="passRate">
            N/A
          </td>
  </tr>

  <tr class="test">
    <td class="test">
      <a href="o.html">Aa</a>
    </td>
    <td class="duration">
      0.000s
    </td>

        <td class="zero number">0</td>

        <td class="zero number">0</td>

        <td class="zero number">0</td>

    <td class="passRate">
            N/A
          </td>
  </tr>

  <tr class="test">
    <td class="test">
      <a href="g.html">GG</a>
    </td>
    <td class="duration">
      0.390s
    </td>

        <td class="zero number">0</td>

        <td class="zero number">0</td>

        <td class="zero number">0</td>

    <td class="passRate">
            N/A
          </td>
  </tr>

  <tr class="suite">
    <td colspan="2" class="totalLabel">Total</td>

        <td class="zero number">271</td>

        <td class="zero number">0</td>

        <td class="fail number">3</td>

        <td class="zero number">4</td>


    <td class="passRate suite">
            98%
          </td>

  </tr>

Output:

I want to take the specific tr block with the class "suite" (see the end of the input), pull the values from all of its td tags, and assign them to variables.

E.g., the output will be:

    271
    0
    3
    4
    98%

Finally, I want to assign these values to variables, so my final output will be (each variable on its own line):

    A = 271
    B = 0
    C = 3
    D = 4
    E = 98%
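The assignment step on its own needs no parsing library. A minimal sketch, assuming the five cell texts have already been extracted into a list in document order; the names `values` and `a` through `e` are illustrative only and do not come from the question:

```python
# Hypothetical input: the five stripped cell texts of the "suite" row.
values = ["271", "0", "3", "4", "98%"]

# Sequence unpacking assigns one variable per cell. It raises ValueError
# if the list does not contain exactly five items, which doubles as a
# cheap sanity check on the extraction step.
a, b, c, d, e = values

# Print each variable on its own line.
for name, value in zip("ABCDE", values):
    print("{} = {}".format(name, value))
```

The values stay strings here; converting the counts with `int()` is a separate decision, since the last cell ("98%") is not a plain number.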

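Here is what I tried with lxml:

    from lxml.html import parse  # assumed import; the original snippet omitted it

    tree = parse(HTML_FILE)
    tds = tree.xpath("//tr[@class='suite']//td/text()")
    val = [text.strip() for text in tds]  # a list, unlike the lazy map() in Python 3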

This works locally, but I really want to do this without any external dependencies. Should I use strip(), or open the file after checking it with os.path.isfile()? I may not be correct, so please advise or walk me through a possible solution.

The most difficult part I can think of is that, in the last tr block of my input, several of the td tags have class="zero number", which also appears on cells in the other rows. How do you handle that?
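One way around the shared td class is to select by the enclosing row rather than by the cell. The sketch below uses `html.parser` from the Python 3 standard library, so it needs no pip install. The class name `SuiteRowParser`, the `SAMPLE` string (a shortened copy of the input above), and the choice to skip the "totalLabel" cell are all assumptions made for illustration:

```python
from html.parser import HTMLParser

# Shortened copy of the input: one ordinary row plus the "suite" row.
SAMPLE = """
<tr class="test">
  <td class="zero number">0</td>
</tr>
<tr class="suite">
  <td colspan="2" class="totalLabel">Total</td>
  <td class="zero number">271</td>
  <td class="zero number">0</td>
  <td class="fail number">3</td>
  <td class="zero number">4</td>
  <td class="passRate suite">
        98%
  </td>
</tr>
"""


class SuiteRowParser(HTMLParser):
    """Collect the text of the <td> cells inside <tr class="suite">."""

    def __init__(self):
        super().__init__()
        self.in_suite_row = False  # True while inside <tr class="suite">
        self.in_cell = False       # True while inside a <td> we want
        self.cells = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "tr":
            self.in_suite_row = "suite" in classes
        elif tag == "td" and self.in_suite_row:
            # Skip the "Total" label cell; keep every other cell.
            if "totalLabel" not in classes:
                self.in_cell = True
                self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "tr":
            self.in_suite_row = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell.
        if self.in_cell:
            self.cells[-1] += data


parser = SuiteRowParser()
parser.feed(SAMPLE)
parser.close()
cells = [text.strip() for text in parser.cells]
print(cells)
```

Because the parser only records cells while `in_suite_row` is set, the "zero number" cells in the other rows are ignored even though they share a class with the cells in the total row.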

The approach I could think of is to take out that block, remove all the tags so that only the content remains, and then assign the values line by line. However, I am not good at regular expressions.
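The approach described above can be sketched with the standard `re` module. This is a fragile sketch rather than a general HTML parser: it assumes the report always has exactly this shape (a `<tr class="suite">` opening tag with no other attributes, and no nested tables), and `SAMPLE` is a shortened copy of the input used only for illustration:

```python
import re

# Shortened copy of the input: one ordinary row plus the "suite" row.
SAMPLE = """
<tr class="test">
  <td class="zero number">0</td>
</tr>
<tr class="suite">
  <td colspan="2" class="totalLabel">Total</td>
  <td class="zero number">271</td>
  <td class="zero number">0</td>
  <td class="fail number">3</td>
  <td class="zero number">4</td>
  <td class="passRate suite">
        98%
  </td>
</tr>
"""

# Step 1: take out the block, i.e. the <tr> whose class is exactly "suite".
# re.DOTALL lets "." match newlines; ".*?" stops at the first </tr>.
row_match = re.search(r'<tr\s+class="suite"\s*>(.*?)</tr>', SAMPLE, re.DOTALL)
row_html = row_match.group(1) if row_match else ""

# Step 2: remove the tags by capturing only what sits between <td ...> and </td>.
cell_texts = re.findall(r"<td\b[^>]*>(.*?)</td>", row_html, re.DOTALL)

# Step 3: strip surrounding whitespace, then drop the first cell,
# which is the "Total" label in this sample.
values = [text.strip() for text in cell_texts][1:]
print(values)
```

If the markup ever changes (extra attributes on the tr, a different attribute order, nested rows), the first pattern silently stops matching and `values` comes back empty, so checking its length before assigning is worthwhile.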

This is not a duplicate of "Parse HTML file using Python without external module"; it has a different input and a different expected output.

edited May 23, 2017 at 11:59

CommunityBot

asked Feb 4, 2016 at 16:28

Pratik Jaiswal

Why do you need to do without any external dependencies?
– alecxe, Commented Feb 4, 2016 at 19:02

@alecxe I will be triggering a Jenkins job for automation, and to use external dependencies such as lxml or BeautifulSoup (which are no doubt useful) I would need to set up a virtual env to do pip install. That would involve the overhead of configuring all 20 of my slave machines to support virtual env. So I really want to do this with regular expressions.
– Pratik Jaiswal, Commented Feb 4, 2016 at 19:34

Guys, any suggestions on how to solve the question using regular expressions?
– Pratik Jaiswal, Commented Feb 5, 2016 at 5:00
