Before you tell me not to use regex to parse HTML: I'm aware of that advice, but my company uses Iconico Data Extractor to extract data from its website. It allows you to create custom scripts, but they have to be regular expressions in JavaScript, so I'm stuck with using regex to achieve my goal.

What I need is to take the following example HTML and extract each line:

  <b>Item 1</b> Text <br>
  <b>Item 2</b> Text <br>
  <b>Item 3</b> Text <br>
<p><font color="#000000" face="Arial, Helvetica, sans-serif"><b>Item 4:</b></font></p>
<p><font color="#000000" face="Arial, Helvetica, sans-serif">Detailed Description</font></p>

I need an expression for each item that retrieves the whole line complete with tags, exactly as it appears in the HTML. I have tried /<b>*details(.|\s)*?\/a>/gi, which gets me Item 4, but I cannot work out how to get items 1 - 3, as what I require is just the line from <b> to <br>.
/<b>*Item 1(.|\s)*?\br>/gi simply does not work, and after hours of playing around with it I'm no further forward. I also need to get rid of the font tags, if that's possible. I think it's complicated by the fact that there is a closing </b> in the middle.

Can anyone offer some advice on how to set up the expression? I already know that the general consensus is no to regex, so no need to go down that route again :)

This is all quite new to me, so I hope I've explained what I'm trying to do.

Thanks in advance

I can't really tell what the inputs and the actual/expected outputs are. Could you provide a jsFiddle illustrating your needs? – sp00m May 15 at 13:41
@sp00m I've created a jsFiddle (as best I can) at link. When I run it there, it seems to do what is expected, but when I execute it through the actual data extractor it does nothing! – Tom May 15 at 13:58

2 Answers

I've used regex to parse HTML before and it worked just fine. I used something like the following. As you can see, there are a lot of ".*?", which is a non-greedy match of any character. Very useful.

What language are you using? You may have to set an option so that "." also matches newlines; otherwise the pattern cannot match across line breaks.

In Python, add the re.DOTALL option. In PHP, the s modifier after the closing delimiter does the same.

<b>(.*?)<br>.*?<b>(.*?)<br><b>(.*?)<br><p.*?sans-serif"><b>(.*?)</p>.*?serif">(.*?)</p>
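Since the asker turns out to be working in JavaScript, here is a minimal sketch of the newline point. Older JavaScript engines have no equivalent of re.DOTALL (the s flag only arrived in ES2018), so a common workaround is the character class [\s\S], which matches any character including a newline. Whether the Data Extractor accepts either form is an assumption, and the variable names are illustrative only.

```javascript
// Sample markup copied from the question, joined with real newlines.
const html = [
  '<b>Item 1</b> Text <br>',
  '<b>Item 2</b> Text <br>',
  '<b>Item 3</b> Text <br>',
  '<p><font color="#000000" face="Arial, Helvetica, sans-serif"><b>Item 4:</b></font></p>',
  '<p><font color="#000000" face="Arial, Helvetica, sans-serif">Detailed Description</font></p>'
].join('\n');

// "." does not match a newline, so this cannot span the two <p> lines.
const dotPattern = /<b>Item 4:.*?Detailed Description/i;

// [\s\S] matches any character, newlines included, so this can.
const anyCharPattern = /<b>Item 4:[\s\S]*?Detailed Description/i;

const dotMatch = html.match(dotPattern);         // null: blocked by the line break
const anyCharMatch = html.match(anyCharPattern); // match array spanning both lines

console.log(dotMatch === null, anyCharMatch !== null);
```

The same substitution can be made in the longer pattern above wherever a ".*?" needs to cross a line break.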
Thanks, I have tried this code and unfortunately neither RegExr nor the Data Extractor software seems to like it. I am working with JavaScript. It doesn't seem to like the quotation marks. – Tom May 15 at 14:00
<b>(.*?)<br>.*?<b>(.*?)<br><b>(.*?)<br><p.*?sans-serif.*?><b>(.*?)</p>.*?serif.*?>(.*?)</p> - sans quotation marks – beiller May 15 at 14:03
Thank you, but still no go yet, I'm stumped! – Tom May 15 at 14:19

For the purposes of using this with the data extractor, I've done some research on getting data between two keywords, and /(Item 1:.*?<br>)/gi works brilliantly.
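As a rough sketch of the "between two keywords" idea in plain JavaScript: the helper name extractBetween and the sample string are made up for illustration, and the sample assumes the real page has a colon after each item label, as the pattern above implies.

```javascript
// Hypothetical helper: returns every span that starts at `label`
// and ends at the next <br>.
function extractBetween(html, label) {
  // Escape regex metacharacters so the label is matched literally.
  const escaped = label.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // [\s\S]*? is a lazy "any character" run that may also cross newlines.
  const pattern = new RegExp('(' + escaped + '[\\s\\S]*?<br>)', 'gi');
  return html.match(pattern) || [];
}

// Illustrative input, not the real page.
const sample = '<b>Item 1:</b> Text <br>\n<b>Item 2:</b> Text <br>';

const fromKeyword = extractBetween(sample, 'Item 1:');
const fromTag = extractBetween(sample, '<b>Item 2:');

console.log(fromKeyword); // [ 'Item 1:</b> Text <br>' ]
console.log(fromTag);     // [ '<b>Item 2:</b> Text <br>' ]
```

Note that the match starts at the keyword itself, so the opening <b> is only included if it is part of the label, as in the second call.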

Unfortunately, I've now been told that the tags have to be stripped off from now on, so I need to scratch my head over that one. I'll post a new question if I need help with it.
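One possible starting point for the tag stripping, assuming the extracted text can be post-processed with an ordinary JavaScript replace (I don't know whether the Data Extractor allows that step): remove anything that looks like <...>. This is a naive sketch, and stripTags is a made-up name. It will misbehave if an attribute value ever contains a ">" character.

```javascript
// Hypothetical helper: drops anything of the form <...> and trims
// the leftover whitespace.
function stripTags(fragment) {
  return fragment.replace(/<[^>]+>/g, '').trim();
}

// A line as captured by the keyword pattern, and the Item 4 markup
// from the question with its font tags.
const plainLine = stripTags('Item 1:</b> Text <br>');
const plainHeading = stripTags(
  '<p><font color="#000000" face="Arial, Helvetica, sans-serif"><b>Item 4:</b></font></p>'
);

console.log(plainLine);    // Item 1: Text
console.log(plainHeading); // Item 4:
```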

Thanks so much for responding and trying to help
