
I have a collection of XML files in a standard format that I'd like to search to see if they match two strings.

Here is the idea:

<ELEMENT1>Dave</ELEMENT>
<DON'TCARE1>Blaa</DON'TCARE2>
<DON'TCARE2>Blaa2</DON'TCARE2>
<ELEMENT2>History</ELEMENT2>

How can I match the content of ELEMENT1 and ELEMENT2 with egrep and return the filename that contains them?

    
You should first read this to see why using regexps to parse HTML or XML is a bad idea: stackoverflow.com/questions/1732348/…. To look for an element in an XML file, use an XPath expression instead. –  lgeorget Mar 7 '14 at 12:43
    
Shouldn't it be </ELEMENT1> instead of </ELEMENT> above? –  Stéphane Chazelas Mar 7 '14 at 12:48

2 Answers

up vote 3 down vote accepted

With a recent GNU grep built with a recent PCRE library (-P enables Perl-compatible regexps, -o prints only the matched text):

grep -Po '<(ELEMENT[12]>)\K.*?(?=</\1)'
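Where a PCRE-enabled grep is not available, the same pattern can be approximated elsewhere. The following is only a rough Python sketch of the idea (capture the tag name, then match lazily up to its closing tag); the function name element_contents and the sample string are made up for illustration. Like the grep version, it works line by line and assumes the opening and closing tags sit on the same line.

```python
import re

# Same idea as the grep pattern: remember which tag was opened (ELEMENT1 or
# ELEMENT2), then lazily match everything up to the matching closing tag.
PATTERN = re.compile(r"<(ELEMENT[12])>(.*?)</\1>")


def element_contents(text):
    """Return the text found inside ELEMENT1/ELEMENT2 tags, in order of appearance."""
    return [match.group(2) for match in PATTERN.finditer(text)]


# Hypothetical sample modelled on the question's snippet.
sample = (
    "<ELEMENT1>Dave</ELEMENT1>\n"
    "<OTHER>Blaa</OTHER>\n"
    "<ELEMENT2>History</ELEMENT2>\n"
)
print(element_contents(sample))  # ['Dave', 'History']
```

Because the closing tag must repeat the captured name, a line such as <ELEMENT1>Dave</ELEMENT> produces no match.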

The following XQuery should give you the desired output:

for $x in (/content/element1,/content/element2)
return $x/text()

For example, with an XQuery interpreter such as XQilla and an input file like

<?xml version="1.0" ?>
<content>
   <element1>truc</element1>
   <dontcare>blah</dontcare>
   <dontcare>blah</dontcare>
   <element2>truc2</element2>
   <dontcare>blah</dontcare>
   <dontcare>blah</dontcare>
</content>

xqilla -i 1.xml 1.query outputs

truc
truc2

For your example, regexps might be sufficient, but in the general case it's a bad idea to use them for XML parsing, because XML is not a regular language (i.e. a language that regular expressions can parse).
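Since the question asks for the names of the files that contain both values, here is a minimal parser-based sketch in Python using the standard library's xml.etree.ElementTree. It is an illustration under assumptions, not a definitive solution: the helper names has_both and matching_files are hypothetical, the tag names default to those of the sample file above, and a file that is not well-formed XML is simply treated as a non-match.

```python
import xml.etree.ElementTree as ET


def has_both(xml_data, first, second, tag1="element1", tag2="element2"):
    """True if some tag1 element contains `first` and some tag2 element contains `second`."""
    try:
        root = ET.fromstring(xml_data)
    except ET.ParseError:
        return False  # not well-formed XML: treat as "no match"
    # Collect the (whitespace-stripped) text of every element with each tag name.
    texts1 = {(elem.text or "").strip() for elem in root.iter(tag1)}
    texts2 = {(elem.text or "").strip() for elem in root.iter(tag2)}
    return first in texts1 and second in texts2


def matching_files(paths, first, second):
    """Yield the paths whose XML content contains both values."""
    for path in paths:
        # Read bytes so the parser can honour any encoding declared in the file.
        with open(path, "rb") as handle:
            if has_both(handle.read(), first, second):
                yield path


# Sample modelled on the input file shown above.
sample = (
    '<?xml version="1.0" ?>\n'
    "<content>\n"
    "   <element1>truc</element1>\n"
    "   <dontcare>blah</dontcare>\n"
    "   <element2>truc2</element2>\n"
    "</content>\n"
)
print(has_both(sample, "truc", "truc2"))  # True
print(has_both(sample, "truc", "other"))  # False
```

To get file names, something like list(matching_files(["1.xml", "2.xml"], "truc", "truc2")) would return the paths that match; the file names here are placeholders.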

