Regex extract html source with multiple elements

Question

Before you tell me not to use regex to parse HTML: I'm aware of that, but my company uses Iconico Data Extractor to extract data from its website. It allows you to create custom scripts, but they have to be regular expressions in JavaScript, so I am stuck with regex to achieve my goal.

What I need is to take the following example HTML and extract each line:

  <b>Item 1</b> Text <br>
  <b>Item 2</b> Text <br>
  <b>Item 3</b> Text <br>
<p><font color="#000000" face="Arial, Helvetica, sans-serif"><b>Item 4:</b></font></p>
<p><font color="#000000" face="Arial, Helvetica, sans-serif">Detailed Description</font></p>

What I need is an expression for each item that retrieves the whole line complete with tags, exactly as it appears in the HTML. I have tried /*details(.|\s)*?\/a>/gi, which gets me Item 4. But I cannot work out how to get items 1 - 3, as what I require is just the line from <b> to <br>.
/*Item 1(.|\s)*?\br>/gi simply does not work, and after hours of playing around with it I'm no further forward. I also need to get rid of the font tags, if that's possible. I think it's complicated by the fact that there is a closing </b> in the middle.

Can anyone offer some advice on how to set up the expression? I already know that the general consensus is no to regex, so no need to go down that route again :)

This is all quite new to me, so I hope I've explained what I'm trying to do.

Thanks in advance

I can't really tell what the inputs and the actual/expected outputs are. Could you provide a jsFiddle illustrating your needs?
@sp00m I've created a jsFiddle (as best I can) at link. When I run it, it seems to do what is expected, but when I execute it through the actual data extractor it does nothing!

beiller · Answer 1 · 2013-05-15 13:41:05Z

I've used regex to parse HTML before and it worked just fine. I used something like the following. As you can see, there are a lot of ".*?" tokens, each of which is a non-greedy match of any character. Very useful.

What language are you using? You may have to set options to allow parsing of newlines, otherwise it could be treating each line as a separate input.

In Python, add the re.DOTALL option. In PHP, add the s modifier after the closing delimiter of the pattern.

<b>(.*?)<br>.*?<b>(.*?)<br><b>(.*?)<br><p.*?sans-serif"><b>(.*?)</p>.*?serif">(.*?)</p>

Thanks, I have tried this code and unfortunately neither RegExr nor the Data Extractor software seems to like it. I am working with JavaScript. It doesn't seem to like the quotation marks.
(.*?) .*?(.*?) (.*?) <p.*?sans-serif.*?>(.*?).*?serif.*?>(.*?) - sans quotation marks
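
Since the asker is working in JavaScript, here is a minimal sketch of the same non-greedy idea in that language. JavaScript regexes at the time had no "dot matches newline" option, so the sketch uses [\s\S] instead of a dot, and the pattern contains no quotation marks. It assumes a plain JavaScript engine; whether Iconico Data Extractor accepts the pattern is untested. The names html, linePattern and itemLines are made up for illustration.

```javascript
// Sample input copied from the question.
const html = [
  '  <b>Item 1</b> Text <br>',
  '  <b>Item 2</b> Text <br>',
  '  <b>Item 3</b> Text <br>',
  '<p><font color="#000000" face="Arial, Helvetica, sans-serif"><b>Item 4:</b></font></p>',
  '<p><font color="#000000" face="Arial, Helvetica, sans-serif">Detailed Description</font></p>'
].join('\n');

// [\s\S] matches any character, newlines included, so no DOTALL-style
// option is needed. The non-greedy *? stops at the first <br> after each <b>.
const linePattern = /<b>[\s\S]*?<br>/gi;

// With the g flag, match() returns every full match, or null if there are none.
const itemLines = html.match(linePattern) || [];

console.log(itemLines);
```

On the sample above this picks up the three lines that run from <b> to <br>, tags included. Item 4 is not matched because no <br> follows its <b>.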

Tom · Accepted Answer · 2013-05-17 03:53:50Z

For use with the data extractor, I did some research on getting data between two keywords, and (Item 1:.*? )/gi works brilliantly.
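
For readers who can run ordinary JavaScript, the "between two keywords" idea might be sketched as a small helper. This is an illustration only, not the exact expression used in the data extractor. The function names escapeRegExp and extractBetween and the sample string are hypothetical.

```javascript
// Escape characters that have a special meaning inside a regular expression,
// so the keywords are matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: returns every shortest span that starts at `start`
// and ends at the next `end`, both included, ignoring case.
function extractBetween(source, start, end) {
  const pattern = new RegExp(
    escapeRegExp(start) + '[\\s\\S]*?' + escapeRegExp(end),
    'gi'
  );
  return source.match(pattern) || [];
}

const sample = '<b>Item 1</b> Text <br>\n<b>Item 2</b> Text <br>';
const found = extractBetween(sample, '<b>Item 1', '<br>');

console.log(found);
```

The match begins at the start keyword itself, so the opening tag has to be part of that keyword if it should appear in the result.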

Unfortunately, I've now been told that the tags have to be stripped off from now on, so I need to scratch my head over that one. I'll post a new question if I need help with it.

Thanks so much for responding and trying to help.
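
On stripping the tags, one possible approach in plain JavaScript is a replace with an empty string. The sketch below is a rough take, assuming no attribute value contains a ">" character (true for the sample in the question, not for HTML in general). The helper names stripTags and stripFontTags are hypothetical.

```javascript
// Hypothetical helper: removes anything that looks like a tag, then tidies whitespace.
function stripTags(fragment) {
  return fragment
    .replace(/<[^>]*>/g, '') // drop <b>, </b>, <br>, <font ...>, <p> and so on
    .replace(/\s+/g, ' ')    // collapse the whitespace left behind
    .trim();
}

// Hypothetical helper: removes only opening and closing font tags.
function stripFontTags(fragment) {
  return fragment.replace(/<\/?font\b[^>]*>/gi, '');
}

const plain = stripTags('<b>Item 1</b> Text <br>');
const noFont = stripFontTags(
  '<p><font color="#000000" face="Arial, Helvetica, sans-serif">Detailed Description</font></p>'
);

console.log(plain);  // Item 1 Text
console.log(noFont); // <p>Detailed Description</p>
```

stripFontTags covers the case from the question where only the font tags should go and the rest of the markup should stay.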

asked	17 days ago
viewed	56 times
active	15 days ago
