I am stuck on a regular expression problem.

I have a huge HTML file and I need to extract some text (the model numbers) from it.

<table>......
<td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr> 
.......

<table>/.....
<td colspan="2" align="center" class="thumimages"><b>SK1998</b></td></tr> 

.... so on

It is a huge page built entirely from tables, with no divs.

The class "thumimages" is repeated on almost every td, so the class alone does not distinguish the required content from the rest of the page.

There are about 10,000 model numbers and I need to extract all of them.

Is there any way to do this with a regex, something like

"/<td colspan="2" align="center" class="thumimages"><b>{[1-9]}</b></td></tr>/"

and return an array of all the matched results? Note: I have tried HTML parsing, but the document contains too many HTML validation errors.

Any help would be greatly appreciated.

Don't use regular expressions to parse HTML. You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See htmlparsing.com/php for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. – Andy Lester Jun 16 '13 at 19:31

4 Answers

Accepted answer (2 votes)

Description

This will match all td elements with class="thumimages" and retrieve the contents of the inner b tag. The inner text must be non-empty, and any leading or trailing whitespace is excluded from the captured value.

<td\b(?=\s)(?=[^>]*\s\bclass=(["'])thumimages\1)[^>]*><b>\s*(?!<)([^<\s]+)\s*<\/b><\/td>

Groups

Group 0 gets the entire td tag, from the open tag to the close tag

  1. Group 1 gets the opening quote around the class value, so that the same quote character is required to close it
  2. Group 2 gets the desired text

PHP Code Example:

Input text

<table>......
<td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr> 
.......
<table>/.....
<td colspan="2" align="center" class="thumimages"><b>     </b></td></tr> 


<table>/.....
<td colspan="2" align="center" class="thumimages"><b>   SK1998    </b></td></tr> 

Code

<?php
$sourcestring="your source string";
// The single quote inside the character class is escaped so it does not end the PHP string.
preg_match_all('/<td\b(?=\s)(?=[^>]*\s\bclass=(["\'])thumimages\1)[^>]*><b>\s*(?!<)([^<\s]+)\s*<\/b><\/td>/imsx',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>

Matches

$matches Array:
(
    [0] => Array
        (
            [0] => <td colspan="2" align="center" class="thumimages"><b>SK10014</b></td>
            [1] => <td colspan="2" align="center" class="thumimages"><b>   SK1998    </b></td>
        )

    [1] => Array
        (
            [0] => "
            [1] => "
        )

    [2] => Array
        (
            [0] => SK10014
            [1] => SK1998
        )

)
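If only the model numbers are needed, group 2 of the result is enough. The following is a minimal sketch, assuming the markup looks like the sample input above; the names `$html`, `$pattern` and `$modelNumbers` are illustrative and not from the original answer. It uses `~` as the delimiter and a nowdoc so that neither `/` nor the quotes need escaping.

```php
<?php
// Hypothetical sample input mirroring the markup in the question.
$html = <<<'HTML'
<table>
<tr><td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr>
<tr><td colspan="2" align="center" class="thumimages"><b>     </b></td></tr>
<tr><td colspan="2" align="center" class="thumimages"><b>   SK1998    </b></td></tr>
</table>
HTML;

// Same expression as in the answer, written with "~" delimiters.
// A nowdoc keeps the backslashes and both quote characters literal.
$pattern = <<<'REGEX'
~<td\b(?=\s)(?=[^>]*\s\bclass=(["'])thumimages\1)[^>]*><b>\s*(?!<)([^<\s]+)\s*</b></td>~i
REGEX;

preg_match_all($pattern, $html, $matches);

// Group 2 holds the text inside <b>, without surrounding whitespace.
$modelNumbers = $matches[2];
print_r($modelNumbers);
```

The row whose b tag contains only spaces is skipped, because group 2 requires at least one non-space character.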
That worked like a charm.. Amazing... Thanks Man. – Gaurav Mehra Jun 17 '13 at 3:05

Method with DOMDocument:

// $html stands for your html content
$doc = new DOMDocument();
@$doc->loadHTML($html);
$td_nodes = $doc->getElementsByTagName('td');

foreach($td_nodes as $td_node){
    if ($td_node->getAttribute('class')=='thumimages')
        echo $td_node->firstChild->textContent.'<br/>';
 }

Method with regex:

$pattern = <<<'LOD'
~
<td (?>[^>c]++|\bc(?!lass\b))+ # beginning of td tag until the word "class" 
class \s*+ = \s*+              # "class=" with variable spaces around the "="
(["']?+) thumimages\b \1       # "thumimages" between quotes or not 
(?>[^>]++|(?<!b)>)+>           # all characters until the ">" from "<b>"
\s*+  \K                       # any spaces and pattern reset

[^<\s]++                    # all chars that are not a "<" or a space
~xi
LOD;

preg_match_all($pattern, $html, $matches);

echo '<pre>' . print_r($matches[0], true);
I agree that HTML parsing is probably the best solution, however the requester did leave a comment on another answer here saying that the html source code was poorly formatted and was dropping validation errors. – Denomales Jun 16 '13 at 20:36
/(<td colspan="2" align="center" class="thumimages"><b>)([a-z0-9]+)(</b></td></tr>)/i

This works.

I am getting blank arrays with this: Array ( [0] => Array ( ) [1] => Array ( ) [2] => Array ( ) [3] => Array ( ) ) – Gaurav Mehra Jun 16 '13 at 19:39
    
I used preg_match_all('|(<td colspan="2" align="center" class="thumimages"><b>)([a-z0-9]+)(</b></td></tr>)|i', $content, $matchesarray); – Gaurav Mehra Jun 16 '13 at 19:40
    
It works here: gskinner.com/RegExr – tntu Jun 16 '13 at 19:46
I think you need to escape with a \ certain html characters like / " and perhaps = – tntu Jun 16 '13 at 19:48
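In PHP only the delimiter character has to be escaped inside the pattern; `"` and `=` have no special meaning in a regex. With `/` as the delimiter, the unescaped `/` in `</b>` ends the pattern early, so a different delimiter such as `~` (or `|`, as in the comment above) avoids the problem. Below is a minimal sketch, assuming every cell is written exactly as in the question; `$content` and its sample rows are hypothetical.

```php
<?php
// Hypothetical input that matches the question's markup character for character.
$content = '<tr><td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr>' . "\n"
         . '<tr><td colspan="2" align="center" class="thumimages"><b>SK1998</b></td></tr>';

// "~" as delimiter: the "/" in the closing tags needs no escaping.
$pattern = '~<td colspan="2" align="center" class="thumimages"><b>([a-z0-9]+)</b></td></tr>~i';

preg_match_all($pattern, $content, $matches);

// Group 1 is the model number.
print_r($matches[1]);
```

A literal pattern like this returns empty arrays as soon as the real page differs in whitespace, attribute order, or quoting, which is one possible explanation for the blank result reported above.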

You can use PHP's DOMDocument class:

<?php
    $dom = new DOMDocument();
    @$dom->loadHTMLFile('load.html');
    $xpath = new DOMXPath($dom);

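     // Query the td cells directly: rows without a matching td would otherwise
     // make item(0) return null and trigger an error on ->nodeValue.
     foreach($xpath->query('//td[@class="thumimages"]') as $td){
        echo trim($td->nodeValue).'<br/>';
     }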
?>
Tried it, but the document contains too many HTML validation errors. – Gaurav Mehra Jun 16 '13 at 19:50
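Validation warnings do not necessarily stop DOMDocument from building a usable tree. The sketch below collects libxml's errors internally instead of suppressing them with `@`, then reads the b tags with XPath. It assumes the page is broken only in recoverable ways, such as a missing `</td>` or a stray closing tag; the sample markup and the `$models` variable are hypothetical.

```php
<?php
// Hypothetical, deliberately invalid input: a missing </td> and a stray </font>.
$html = <<<'HTML'
<table>
<tr><td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr>
<tr><td colspan="2" align="center" class="thumimages"><b>   SK1998    </b></font></tr>
</table>
HTML;

// Buffer libxml errors instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Discard the buffered errors and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
$models = [];

// Select the <b> directly inside each td with the wanted class.
foreach ($xpath->query('//td[@class="thumimages"]/b') as $b) {
    $text = trim($b->textContent);
    if ($text !== '') {
        $models[] = $text;
    }
}

print_r($models);
```

If the real document is damaged badly enough that the parser drops or misplaces the cells, this approach will miss entries, and a regex may be the more practical fallback.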
