I want to parse an HTML string using PHP (simple number matching).

<i>1002</i><i>999</i><i>344</i><i>663</i>

and I want the result as an array, e.g. [1002,999,344,663,...]. I tried this:

<?php
    $html="<i>1002</i><i>999</i><i>344</i><i>663</i>";
    if(preg_match_all("/<i>[0-9]*<\/i>/",$html, $matches,PREG_SET_ORDER))
        foreach($matches as $match) {
            echo strip_tags($match[0])."<br/>";
        }
?>

and I got exactly the output I wanted.

1002
999
344
663

But when I run the same code with a small change to the regular expression, I get a different result:

<?php
    $html="<i>1002</i><i>999</i><i>344</i><i>663</i>";
    if(preg_match_all("/<i>.*<\/i>/",$html, $matches,PREG_SET_ORDER))
        foreach($matches as $match) {
            echo strip_tags($match[0])."<br/>";
        }
?>

Output :

1002999344663

(The regular expression matched the entire string.)

Now I want to know why this happens. What is the difference between using .* (zero or more of any character) and [0-9]*?

  • * is greedy by default. Commented Feb 19, 2013 at 21:54
  • OK, so what does '?' do there? Commented Feb 19, 2013 at 22:02
  • @VishalVijay: I'll explain that in an answer :P Commented Feb 19, 2013 at 22:03

1 Answer

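The .* in your regex matches any character (except a newline, by default), including the characters that make up </i><i>. [0-9]* only matches digits, so it has to stop at the first < and can never run across a tag boundary. The regex /<i>.*<\/i>/ matches: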

<i>1002</i><i>999</i><i>344</i><i>663</i>
^ from here ------------------- to here ^

The whole string starts with <i> and ends with </i>, so all of it is one valid match for the pattern.

This is because * is greedy: it takes the maximum number of characters it can while still letting the rest of the pattern match.
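
The greedy behaviour is easy to check directly. Here is a minimal sketch using the string from the question; the variable names $count and $matches are only illustrative.

```php
<?php
$html = "<i>1002</i><i>999</i><i>344</i><i>663</i>";

// Greedy .* first runs to the end of the string, then backtracks only
// as far as the last </i>, so the pattern matches exactly once.
$count = preg_match_all('/<i>.*<\/i>/', $html, $matches);

echo $count, "\n";          // number of matches found
echo $matches[0][0], "\n";  // the single match is the whole input string
```

With the default PREG_PATTERN_ORDER flag, $matches[0] holds the full matches, so $matches[0][0] is the first (and here the only) one.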

To fix your problem, use .*? instead. The ? after * makes the quantifier lazy, so it takes the minimum number of characters it can match.

The regex /<i>.*?<\/i>/ will work as you want.
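
Since the question asks for the result as an array, one possible way to get there is to add a capture group around the lazy part and read the captured values. This is a sketch under that assumption, not the only approach; $lazy and $numbers are illustrative names.

```php
<?php
$html = "<i>1002</i><i>999</i><i>344</i><i>663</i>";

// Lazy .*? stops at the first </i> it can reach, giving one match per element.
// The parentheses capture only the text between the tags,
// so strip_tags() is not needed afterwards.
preg_match_all('/<i>(.*?)<\/i>/', $html, $lazy);

// $lazy[1] holds the captured strings; convert them to integers.
$numbers = array_map('intval', $lazy[1]);

echo implode(",", $numbers), "\n"; // 1002,999,344,663
```

If only digits are expected between the tags, the original [0-9]* pattern with the same capture group gives the same array and rejects non-numeric content.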
