Using regexes to find result from HTML table

Question

I am stuck on a regular expression problem.

I have a huge HTML file and I need to extract some text (model numbers) from it.

<table>......
<td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr> 
.......

<table>/.....
<td colspan="2" align="center" class="thumimages"><b>SK1998</b></td></tr> 

... and so on.

The page is huge and is laid out entirely with tables, with no divs.

The class "thumimages" is repeated on almost every td, so it gives me no way to distinguish the required content from the rest of the page.

There are about 10,000 model numbers and I need to extract them.

Is there any way to do this with a regex, like

"/<td colspan="2" align="center" class="thumimages"><b>{[1-9]}</b></td></tr>/"

and return an array of all the matched results? Note that I have tried HTML parsing, but the document contains too many HTML validation errors.

Any help would be greatly appreciated.

Don't use regular expressions to parse HTML. You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See htmlparsing.com/php for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. — Andy Lester, Jun 16 '13 at 19:31

Denomales · Accepted Answer · 2013-06-16 20:30:00Z

Description

This will match all td fields with class="thumimages" and retrieve the contents of the inner b tag. The inner text needs to contain at least one non-whitespace character, and any leading or trailing spaces are excluded from the captured value.

<td\b(?=\s)(?=[^>]*\s\bclass=(["'])thumimages\1)[^>]*><b>\s*(?!<)([^<\s]+)\s*<\/b><\/td>

Groups

Group 0 gets the entire td tag from open tag to close tag
Group 1 gets the opening quote around the class value, to ensure the matching closing quote is also found
Group 2 gets the desired text

PHP Code Example:

Input text

<table>......
<td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr> 
.......
<table>/.....
<td colspan="2" align="center" class="thumimages"><b>     </b></td></tr> 


<table>/.....
<td colspan="2" align="center" class="thumimages"><b>   SK1998    </b></td></tr>

Code

<?php
$sourcestring="your source string";
// The ' inside the character class is written as \' because the pattern is a single-quoted PHP string
preg_match_all('/<td\b(?=\s)(?=[^>]*\s\bclass=(["\'])thumimages\1)[^>]*><b>\s*(?!<)([^<\s]+)\s*<\/b><\/td>/imsx',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>

Matches

$matches Array:
(
    [0] => Array
        (
            [0] => <td colspan="2" align="center" class="thumimages"><b>SK10014</b></td>
            [1] => <td colspan="2" align="center" class="thumimages"><b>   SK1998    </b></td>
        )

    [1] => Array
        (
            [0] => "
            [1] => "
        )

    [2] => Array
        (
            [0] => SK10014
            [1] => SK1998
        )

)
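
Since the question asks for an array of the matched results, the pattern above can be wrapped in a small function that returns only group 2. This is a minimal sketch, not part of the original answer: the function name `extractModelNumbers` and the sample string are assumptions made for illustration.

```php
<?php
// Hypothetical helper (not from the original answer): returns the model
// numbers captured by group 2 of the pattern shown above.
function extractModelNumbers(string $html): array
{
    $pattern = '/<td\b(?=\s)(?=[^>]*\s\bclass=(["\'])thumimages\1)[^>]*><b>\s*(?!<)([^<\s]+)\s*<\/b><\/td>/imsx';

    // preg_match_all() returns false if the regex engine hits an error.
    if (preg_match_all($pattern, $html, $matches) === false) {
        return [];
    }

    return $matches[2]; // group 2 holds the text without surrounding spaces
}

// Sample input modelled on the "Input text" above.
$sample = <<<'HTML'
<table>......
<td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr>
<table>/.....
<td colspan="2" align="center" class="thumimages"><b>     </b></td></tr>
<table>/.....
<td colspan="2" align="center" class="thumimages"><b>   SK1998    </b></td></tr>
HTML;

$models = extractModelNumbers($sample);
print_r($models);
```

The cell that contains only spaces is skipped, because the pattern requires at least one character that is neither whitespace nor "<" inside the b tag.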

That worked like a charm.. Amazing... Thanks Man. – Gaurav Mehra Jun 17 '13 at 3:05

Casimir et Hippolyte · Answer 2 · 2013-06-16 21:10:14Z

Method with DOMDocument:

// $html stands for your html content
$doc = new DOMDocument();
@$doc->loadHTML($html);
$td_nodes = $doc->getElementsByTagName('td');

foreach($td_nodes as $td_node){
    if ($td_node->getAttribute('class')=='thumimages')
        echo $td_node->firstChild->textContent.'<br/>';
 }
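
Because the asker reports many validation errors, it may help to let libxml collect the parse errors instead of silencing them with `@`, and to select the b elements directly with XPath. The following is only a sketch under assumptions: the sample markup (with a deliberately missing closing table tag) is made up, and whether libxml recovers a usable tree from the real page would have to be tested.

```php
<?php
// Hypothetical sample input; the closing </table> is left out on purpose
// to imitate invalid markup.
$html = '<table><tr><td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr>'
      . '<tr><td colspan="2" align="center" class="thumimages"><b>   SK1998    </b></td></tr>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Discard the collected errors and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);
$models = [];

// Select every <b> that is a direct child of a td with class "thumimages".
foreach ($xpath->query('//td[@class="thumimages"]/b') as $b) {
    $text = trim($b->textContent);
    if ($text !== '') {   // skip cells that contain only whitespace
        $models[] = $text;
    }
}

print_r($models);
```

Note that `@class="thumimages"` only matches when the attribute value is exactly that string; a cell with several classes would need a different XPath test.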

Method with regex:

$pattern = <<<'LOD'
~
<td (?>[^>c]++|\bc(?!lass\b))+ # beginning of td tag until the word "class"
class \s*+ = \s*+              # "class=" with variable spaces around the "="
(["']?+) thumimages\b \1       # "thumimages" between quotes or not 
(?>[^>]++|(?<!b)>)+>           # all characters until the ">" from "<b>"
\s*+  \K                       # any spaces and pattern reset

[^<\s]++                    # all chars that are not a "<" or a space
~xi
LOD;

preg_match_all($pattern, $html, $matches);

echo '<pre>' . print_r($matches[0], true);

I agree that HTML parsing is probably the best solution, however the requester did leave a comment on another answer here saying that the html source code was poorly formatted and was dropping validation errors. — Denomales, Jun 16 '13 at 20:36

tntu · Answer 3 · 2013-06-16 19:31:54Z

/(<td colspan="2" align="center" class="thumimages"><b>)([a-z0-9]+)(</b></td></tr>)/i

This works.

I am getting a blank arrays with this.. Array ( [0] => Array ( ) [1] => Array ( ) [2] => Array ( ) [3] => Array ( ) ).... – Gaurav Mehra Jun 16 '13 at 19:39

I used preg_match_all('|(<td colspan="2" align="center" class="thumimages"><b>)([a-z0-9]+)(</b></td></tr>)|i', $content, $matchesarray); – Gaurav Mehra Jun 16 '13 at 19:40

It works here: gskinner.com/RegExr – tntu Jun 16 '13 at 19:46

I think you need to escape with a \ certain html characters like / " and perhaps = – tntu Jun 16 '13 at 19:48
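
On the escaping question: in PHP's preg functions only the delimiter character has to be escaped inside the pattern, so with "/" as the delimiter each "/" in a closing tag becomes "\/", while " and = need no escaping. The sketch below shows both options on a made-up one-line sample; the variable names are assumptions for illustration.

```php
<?php
// Hypothetical one-line sample based on the markup in the question.
$html = '<td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr>';

// Option 1: keep "/" as the delimiter and escape every "/" in the pattern.
$escaped = '/<td colspan="2" align="center" class="thumimages"><b>([a-z0-9]+)<\/b><\/td><\/tr>/i';

// Option 2: use a delimiter that does not occur in the pattern, such as "~".
$alternative = '~<td colspan="2" align="center" class="thumimages"><b>([a-z0-9]+)</b></td></tr>~i';

preg_match_all($escaped, $html, $m1);
preg_match_all($alternative, $html, $m2);

print_r($m1[1]);
print_r($m2[1]);
```

Both patterns still require the markup to match character for character, so extra spaces inside the b tag or a different attribute order in the real page would produce empty results.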

Khawer Zeshan · Answer 4 · 2013-06-16 19:36:37Z

You can use PHP's DOMDocument class:

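<?php
    $dom = new DOMDocument();
    @$dom->loadHTMLFile('load.html'); // @ hides the warnings raised by invalid markup
    $xpath = new DOMXPath($dom);

     foreach($xpath->query('//tr') as $tr){
        // item(0) is null when the row has no matching cell
        $td = $xpath->query('.//td[@class="thumimages"]', $tr)->item(0);
        if ($td !== null) {
            echo $td->nodeValue.'<br/>';
        }
     }
?>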

Tried it but the document contains too many HTML validation errors. – Gaurav Mehra Jun 16 '13 at 19:50
