Parse html meta keywords using regex

Question

I need to parse HTML meta keywords using regex. The source string is always in the same format, like:

<meta name="description" content="description text">
<meta name="keywords" content="Keyword1, Keyword2, Keyword3...">
<link rel="alternate" type="application/xml+rss" href="http://example.com/rss">

I want to get Keyword1, Keyword2 and Keyword3 as a List<string>.

Do not use a regex. Use an HTML parser like HtmlAgilityPack.
– Dustin Kingen, commented May 24, 2013 at 15:26

Ro Yo Mi · Accepted Answer · 2013-05-24 15:40:49Z

Description

If you're looking for a simple regex solution and your input isn't complex, then you can try this:

<meta\b[^>]*\bname=["]keywords["][^>]*\bcontent=(['"]?)((?:[^,>"'],?){1,})\1[>]

This pulls the value out of the content attribute.

Group 1 is the open quote, which would then be required to close at the end of the value. Group 2 is the contents which could then be split on the comma.

Disclaimer

This expression could fail on some simple edge cases, which is why regex shouldn't be used for parsing HTML. Instead, you should look to use an HTML parsing engine.
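To make the disclaimer concrete, here is a minimal sketch that runs the pattern above against the format from the question and against two ordinary variations of the same tag. The class and method names (EdgeCaseDemo, Matches) are hypothetical and only serve the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class EdgeCaseDemo
{
    // The pattern from the answer above, unchanged.
    const string Pattern =
        @"<meta\b[^>]*\bname=[""]keywords[""][^>]*\bcontent=(['""]?)((?:[^,>""'],?){1,})\1[>]";

    // Hypothetical helper: true when the pattern finds a keywords meta tag.
    public static bool Matches(string html) =>
        Regex.IsMatch(html, Pattern, RegexOptions.IgnoreCase);

    static void Main()
    {
        // Format from the question: matches.
        Console.WriteLine(Matches("<meta name=\"keywords\" content=\"Keyword1, Keyword2\">"));

        // Attributes in the other order: no match, the pattern expects name before content.
        Console.WriteLine(Matches("<meta content=\"Keyword1, Keyword2\" name=\"keywords\">"));

        // Self-closing tag: no match, the pattern expects '>' right after the closing quote.
        Console.WriteLine(Matches("<meta name=\"keywords\" content=\"Keyword1, Keyword2\" />"));
    }
}
```

Both variations are valid HTML that an HTML parser would handle without any change to the calling code.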

C# Example

using System;
using System.Text.RegularExpressions;

namespace myapp
{
    class Class1
    {
        static void Main(string[] args)
        {
            // Placeholder input; replace it with the HTML to search.
            string sourcestring = "source string to match with pattern";

            // Group 1 is the opening quote (if any), group 2 is the content value.
            Regex re = new Regex(
                @"<meta\b[^>]*\bname=[""]keywords[""][^>]*\bcontent=(['""]?)((?:[^,>""'],?){1,})\1[>]",
                RegexOptions.IgnoreCase);
            string[] groupNames = re.GetGroupNames();

            MatchCollection mc = re.Matches(sourcestring);
            int mIdx = 0;
            foreach (Match m in mc)
            {
                // Print every group of every match as [match][group] = value.
                for (int gIdx = 0; gIdx < m.Groups.Count; gIdx++)
                {
                    Console.WriteLine("[{0}][{1}] = {2}", mIdx, groupNames[gIdx], m.Groups[gIdx].Value);
                }
                mIdx++;
            }
        }
    }
}

$matches Array:
(
    [0] => Array
        (
            [0] => <meta name="keywords" content="Keyword1, Keyword2, Keyword3...">
        )

    [1] => Array
        (
            [0] => "
        )

    [2] => Array
        (
            [0] => Keyword1, Keyword2, Keyword3...
        )

)
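Group 2 still has to be split to get the List<string> the question asks for. The following is a minimal sketch of that last step, assuming the input follows the format shown in the question; KeywordExtractor and Extract are hypothetical names, and trimming and dropping empty entries are choices made here, not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class KeywordExtractor
{
    // Same pattern as in the answer above.
    static readonly Regex KeywordsMeta = new Regex(
        @"<meta\b[^>]*\bname=[""]keywords[""][^>]*\bcontent=(['""]?)((?:[^,>""'],?){1,})\1[>]",
        RegexOptions.IgnoreCase);

    // Returns the trimmed keywords, or an empty list when the tag is not found.
    public static List<string> Extract(string html)
    {
        Match m = KeywordsMeta.Match(html);
        if (!m.Success)
            return new List<string>();

        return m.Groups[2].Value
            .Split(',')
            .Select(k => k.Trim())        // drop the space after each comma
            .Where(k => k.Length > 0)     // ignore empty entries
            .ToList();
    }

    static void Main()
    {
        string html =
            "<meta name=\"description\" content=\"description text\">\n" +
            "<meta name=\"keywords\" content=\"Keyword1, Keyword2, Keyword3\">\n" +
            "<link rel=\"alternate\" type=\"application/xml+rss\" href=\"http://example.com/rss\">";

        // Prints: Keyword1|Keyword2|Keyword3
        Console.WriteLine(string.Join("|", Extract(html)));
    }
}
```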

Anirudha · Accepted Answer · 2013-05-24 15:43:08Z

Regex is not a good choice for parsing HTML files.

HTML is neither strict nor regular in its format.

Use HtmlAgilityPack.

You can use this code to retrieve all the keywords using HtmlAgilityPack:

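// Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://yourWebSite.com");

// SelectSingleNode returns null when the page has no keywords meta tag.
HtmlNode meta = doc.DocumentNode.SelectSingleNode("//meta[@name='keywords']");

List<string> keyLst = meta == null
    ? new List<string>()
    : meta.GetAttributeValue("content", "")
          .Split(',')
          .Select(k => k.Trim())   // drop the space after each comma
          .ToList();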

keyLst now contains all the keywords.

edited May 24, 2013 at 15:43

answered May 24, 2013 at 15:32 by Anirudha

Community · Accepted Answer · 2017-05-23 11:53:04Z

I wish I could comment instead of submitting this as an answer, but my rep is too low :(

I understand the need to use regex sometimes, but as everyone else recommends, a standard XML or HTML parser is preferable. It copes better with unintended input variations and can even be faster.

See: https://stackoverflow.com/a/701177/1002098

edited May 23, 2017 at 11:53 by Community Bot

answered May 24, 2013 at 15:39 by delrocco
