I have a working Perl regex that I use with grep. I am trying to understand how it works.

Here is the command:

grep -oP '(?<=location>)[^<]+' testFile1.xml

Here are the contents of testFile1.xml

<con:location>C:/test/file1.txt</con:location></con:dataFile>/con:dataFiles></con:groupFile>

And this is the result

C:/test/file1.txt

I am trying to understand the regex, i.e. this part (?<=location>)[^<]+

up vote 7 down vote accepted

(?<=...) is a look-behind PCRE operator. By itself, it doesn't match anything but acts as a condition (that what's on the left matches ...).

(?<=X)Y matches Y provided that what's on the left matches X. In blahYfooXYbar, that matches the second Y; the X is not part of what is being matched. The (?<=X) itself matches the zero-width (imaginary) spot just before that Y. To illustrate:

$ echo X-RAY THE FOX | perl -lpe 's/(?<=X)/<there>/g'
X<there>-RAY THE FOX<there>

Because with -o, grep only prints the matched portion, that's a way to make it print what's after the location> (here what matches [^<]+: one or more (+) non-< characters ([^<]), so everything up to (but not including) the next < character or the end of the line, provided it's not empty).
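To make the role of the negated character class concrete, here is a minimal sketch, assuming GNU grep built with PCRE support. The sample string and the function names `after_location` and `non_lt_runs` are made up for illustration and are not part of the question.

```shell
# Hypothetical one-line sample modelled on testFile1.xml.
sample='<con:location>C:/test/file1.txt</con:location>'

# (?<=location>) only checks what is on the left and consumes nothing;
# [^<]+ then consumes one or more characters that are not "<".
after_location() {
    printf '%s\n' "$1" | grep -oP '(?<=location>)[^<]+'
}

# [^<] on its own: any single character except "<". With -o, each run of
# such characters is printed on its own line.
non_lt_runs() {
    printf '%s\n' "$1" | grep -oE '[^<]+'
}

after_location "$sample"
non_lt_runs 'abc<def'
```

The first call should print C:/test/file1.txt, and the second should print abc and def on separate lines. So [^<]+ does not mean "starts with <"; it means "a run of characters that stops before the next <".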

Another approach is to use \K (in newer versions of PCRE) to reset the start of the matched portion:

grep -Po 'location>\K[^<]+'
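As a rough comparison of the two forms, the sketch below runs both against the same input, again assuming GNU grep built with PCRE support. The sample string and the function names `with_keep` and `with_lookbehind` are hypothetical.

```shell
# Hypothetical one-line sample modelled on testFile1.xml.
sample='<con:location>C:/test/file1.txt</con:location>'

# "location>" has to match, but \K drops it from the reported match,
# so -o prints only what [^<]+ consumed afterwards.
with_keep() {
    printf '%s\n' "$1" | grep -oP 'location>\K[^<]+'
}

# The same extraction written with the look-behind from the question.
with_lookbehind() {
    printf '%s\n' "$1" | grep -oP '(?<=location>)[^<]+'
}

with_keep "$sample"
with_lookbehind "$sample"
```

Both calls should print C:/test/file1.txt for this input.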

Note that -P and -o are GNU extensions. With recent versions (8.11 or later) of pcregrep (another grep implementation that uses PCRE), you can also do:

pcregrep -o1 'location>([^<]+)'

(-o1 prints what's captured by the 1st (here unique) (...))

    
I understand the look-behind part. I am still not sure what "[^<]+" means. Does it mean anything that starts with the "<" symbol? – Sas 2 days ago
    
Also, I tried pcregrep and it says "pcregrep: Unknown option letter '1' in "-o1"". – Sas 2 days ago
    
@Sas, see if my latest edit makes it clearer. – Stéphane Chazelas 2 days ago
    
@Sas, -o<n> was added in 8.11 (Dec 2010). You probably have an older version. (I do mention recent versions in my answer though I hadn't realised it was almost 6 years ago it was added. Time flies...). – Stéphane Chazelas 2 days ago
@Sas If you don't understand basics like [^abc], then I recommend reading through the docs once or twice. Especially for PCRE, this is time well spent, as almost every programming language that implements regexes will use syntax & semantics similar to this. (But, keep in mind that POSIX/shell regexes are fairly different; those are the two most common variants one needs to use.) – jpaugh yesterday
