1

I need to parse HTML meta keywords using regex. The source string is always in the same format, like:

<meta name="description" content="description text">
<meta name="keywords" content="Keyword1, Keyword2, Keyword3...">
<link rel="alternate" type="application/xml+rss" href="http://example.com/rss">

I want to get Keyword1, Keyword2 and Keyword3 as a List<string>.

3 Answers

2

Description

If you're looking for a simple regex solution and your input isn't complex, you can try this expression, which pulls out the value of the content attribute:

<meta\b[^>]*\bname=["]keywords["][^>]*\bcontent=(['"]?)((?:[^,>"'],?){1,})\1[>]

Group 1 captures the opening quote, which the \1 backreference then requires again at the end of the value. Group 2 captures the contents, which can then be split on the comma.
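
To get from group 2 to the List<string> the question asks for, something like the following might work. It is only a sketch: the class name MetaKeywords and the method Extract are made up for illustration, and it assumes the input looks like the sample in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class MetaKeywords
{
    // Same pattern as in this answer; group 2 holds the raw content value.
    private static readonly Regex KeywordsRegex = new Regex(
        @"<meta\b[^>]*\bname=[""]keywords[""][^>]*\bcontent=(['""]?)((?:[^,>""'],?){1,})\1[>]",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the keywords, or an empty list if no tag matches.
    public static List<string> Extract(string html)
    {
        Match m = KeywordsRegex.Match(html);
        if (!m.Success)
        {
            return new List<string>();
        }

        return m.Groups[2].Value
            .Split(',')
            .Select(k => k.Trim())      // drop the space that follows each comma
            .Where(k => k.Length > 0)   // ignore empty entries such as a trailing comma
            .ToList();
    }
}
```

Calling MetaKeywords.Extract(html) on the three sample lines from the question should give a list with one entry per keyword, without the leading spaces.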

Disclaimer

This expression can fail on some simple edge cases, which is why regex shouldn't be used for parsing HTML. You should look at using an HTML parsing engine instead.
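
To make those edge cases concrete, here is a small sketch that runs the expression against a few variations of the tag. The class RegexEdgeCases and its methods are hypothetical names used only for this illustration; the inputs are hand-written examples, not taken from a real page.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexEdgeCases
{
    private static readonly Regex KeywordsRegex = new Regex(
        @"<meta\b[^>]*\bname=[""]keywords[""][^>]*\bcontent=(['""]?)((?:[^,>""'],?){1,})\1[>]",
        RegexOptions.IgnoreCase);

    public static bool Matches(string html)
    {
        return KeywordsRegex.IsMatch(html);
    }

    public static void Demo()
    {
        // Expected form: name before content, double quotes around the name value.
        Console.WriteLine(Matches("<meta name=\"keywords\" content=\"a, b\">"));    // True

        // Attributes in the other order: the pattern needs name before content.
        Console.WriteLine(Matches("<meta content=\"a, b\" name=\"keywords\">"));    // False

        // Single quotes around the name value: the pattern only accepts double quotes there.
        Console.WriteLine(Matches("<meta name='keywords' content=\"a, b\">"));      // False

        // Self-closing tag: the pattern wants '>' right after the closing quote.
        Console.WriteLine(Matches("<meta name=\"keywords\" content=\"a, b\" />"));  // False
    }
}
```

All four inputs are valid ways to write the same tag, but only the first one is matched.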

C# Example

using System;
using System.Text.RegularExpressions;

namespace myapp
{
    class Class1
    {
        static void Main(string[] args)
        {
            // Sample input: the keywords line from the question
            string sourcestring = @"<meta name=""keywords"" content=""Keyword1, Keyword2, Keyword3..."">";
            Regex re = new Regex(@"<meta\b[^>]*\bname=[""]keywords[""][^>]*\bcontent=(['""]?)((?:[^,>""'],?){1,})\1[>]", RegexOptions.IgnoreCase);
            MatchCollection mc = re.Matches(sourcestring);
            int mIdx = 0;
            foreach (Match m in mc)
            {
                // Print every group: 0 = whole match, 1 = opening quote, 2 = content value
                for (int gIdx = 0; gIdx < m.Groups.Count; gIdx++)
                {
                    Console.WriteLine("[{0}][{1}] = {2}", mIdx, re.GetGroupNames()[gIdx], m.Groups[gIdx].Value);
                }
                mIdx++;
            }
        }
    }
}

$matches Array:
(
    [0] => Array
        (
            [0] => <meta name="keywords" content="Keyword1, Keyword2, Keyword3...">
        )

    [1] => Array
        (
            [0] => "
        )

    [2] => Array
        (
            [0] => Keyword1, Keyword2, Keyword3...
        )

)
2

Regex is not a good choice for parsing HTML files.

HTML is neither strict nor regular in its format.

Use HtmlAgilityPack instead.

You can use this code to retrieve all the keywords with HtmlAgilityPack:

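using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://yourWebSite.com");

// Split on commas and trim the space that follows each comma
List<string> keyLst = doc.DocumentNode
                         .SelectSingleNode("//meta[@name='keywords']")
                         .Attributes["content"].Value
                         .Split(',')
                         .Select(k => k.Trim())
                         .ToList();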

keyLst now contains all the keywords.
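
Since the question starts from a source string rather than a URL, a variation along these lines may fit better. It is a sketch that assumes the HtmlAgilityPack NuGet package is referenced; MetaKeywordsParser and Extract are hypothetical names, and the null check is there because SelectSingleNode returns null when no matching tag exists.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class MetaKeywordsParser
{
    // Hypothetical helper: returns an empty list when the tag or attribute is missing.
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);   // parse an in-memory string instead of fetching a URL

        HtmlNode node = doc.DocumentNode.SelectSingleNode("//meta[@name='keywords']");
        if (node == null)
        {
            return new List<string>();
        }

        return node.GetAttributeValue("content", string.Empty)
            .Split(',')
            .Select(k => k.Trim())      // drop the space that follows each comma
            .Where(k => k.Length > 0)   // ignore empty entries
            .ToList();
    }
}
```

Unlike the regex, this does not care about attribute order, quote style, or whether the tag is self-closing.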

0

I wish I could comment instead of submitting this as an answer, but my rep is too low :(

I understand the need to use regex sometimes, but as everyone else recommends, it's better to use a standard XML or HTML parser. It copes more safely with unexpected input variations and can even be faster.

See: https://stackoverflow.com/a/701177/1002098
