Extracting a regex matched with 'sed' without printing the surrounding characters

Question

To all the 'sed' doctors out there:

I have a seemingly trivial 'sed' question to which I have not been able to find a solution.

How can you get 'sed' to extract a regular expression it has matched in a line?

In other words, I want just the string corresponding to the regular expression, with all the non-matching characters from the containing line stripped away.

I tried using the back-reference feature like below

regular expression to be isolated 
         gets `inserted` 
              here     
               |
               v  
 sed -n 's/.*\( \).*/\1/p'

this works for some expressions like

 sed -n 's/.*\(CONFIG_[a-zA-Z0-9_]*\).*/\1/p'

which neatly extracts all macro names starting with 'CONFIG_ ....' ( found in some '*.h' file ) and prints them all out line by line

          CONFIG_AT91_GPIO
          CONFIG_DRIVER_AT91EMAC
                   .
                   .   
          CONFIG_USB_ATMEL
          CONFIG_USB_OHCI_NEW
                   .
                 e.t.c.

BUT the above breaks down for something like

  sed -n 's/.*\([0-9][0-9]*\).*/\1/p'

This always returns a single digit rather than extracting a contiguous number field.

P.S.: I would be grateful for feedback on how this is achieved in 'sed'. I know how to do this with 'grep' and 'awk'. I would like to find out if my - albeit limited - understanding of 'sed' has holes in it and if there is a way to do this in 'sed' which I have simply overlooked.

Gilles · Accepted Answer · 2012-02-12 15:08:14Z

When a regexp contains groups, there may be more than one way to match a string against it: regexps with groups are ambiguous. For example, consider the regexp ^.*\([0-9][0-9]*\)$ and the string a12. There are two possibilities:

Match a against .* and 2 against [0-9]*; 1 is matched by [0-9].
Match a1 against .* and the empty string against [0-9]*; 2 is matched by [0-9].

Sed, like all other regexp tools out there, applies the earliest longest match rule: it first tries to match the first variable-length portion against a string that's as long as possible. If it finds a way to match the rest of the string against the rest of the regexp, fine. Otherwise, sed tries the next longest match for the first variable-length portion and tries again.

Here, the match with the longest string first is a1 against .*, so the group only matches 2. If you want the group to start earlier, some regexp engines let you make the .* less greedy, but sed doesn't have such a feature. So you need to remove the ambiguity. Specify that the leading .* cannot end with a digit, so that the first digit of the group is the first possible match.
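To see which part of a12 each piece of the regexp actually takes, one can capture the leading .* as well and print both groups. This is only an illustrative sketch: the helper name show_split and the bracketed output format are made up for the demonstration.

```shell
# Hypothetical helper: show what the leading .* and the digit group each match.
show_split() {
  sed -n 's/^\(.*\)\([0-9][0-9]*\)$/[\1][\2]/p'
}

printf 'a12\n' | show_split    # prints [a1][2]
```

The first bracket holds what .* consumed, which leaves only the final digit for the group.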

If the group of digits cannot be at the beginning of the line:
```
sed -n 's/^.*[^0-9]\([0-9][0-9]*\).*/\1/p'
```
If the group of digits can be at the beginning of the line, and your sed supports the \? operator for optional parts:
```
sed -n 's/^\(.*[^0-9]\)\?\([0-9][0-9]*\).*/\2/p'
```
If the group of digits can be at the beginning of the line, sticking to standard regexp constructs:
```
sed -n -e 's/^.*[^0-9]\([0-9][0-9]*\).*/\1/p' -e t -e 's/^\([0-9][0-9]*\).*/\1/p'
```
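A quick way to check the standard-constructs variant is to feed it lines where the digits sit in the middle, at the start, and nowhere at all. The function name last_number and the sample lines are invented for this sketch.

```shell
# Hypothetical helper wrapping the two-expression form (no \? needed).
last_number() {
  sed -n -e 's/^.*[^0-9]\([0-9][0-9]*\).*/\1/p' -e t \
         -e 's/^\([0-9][0-9]*\).*/\1/p'
}

printf '%s\n' 'abc123def' '456xyz' 'a1b22' 'no digits' | last_number
# prints 123, 456 and 22, one per line; the last input line prints nothing
```

The t command skips the second substitution whenever the first one already succeeded, so each line is printed at most once.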

By the way, it's that same earliest longest match rule that makes [0-9]* match the digits after the first one, rather than the subsequent .*.

Note that if there are multiple sequences of digits on a line, your program will always extract the last sequence of digits, again because of the earliest longest match rule applied to the initial .*. If you want to extract the first sequence of digits, you need to specify that what comes before is a sequence of non-digits.

sed -n 's/^[^0-9]*\([0-9][0-9]*\).*$/\1/p'
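The difference between the two anchoring choices shows up on any line with more than one number. In this sketch the function names and the sample line are assumptions chosen for illustration.

```shell
# Leading .* is greedy, so the last run of digits wins.
last_digits() {
  sed -n 's/^.*[^0-9]\([0-9][0-9]*\).*/\1/p'
}

# Leading [^0-9]* cannot skip over a digit, so the first run wins.
first_digits() {
  sed -n 's/^[^0-9]*\([0-9][0-9]*\).*$/\1/p'
}

line='foo 12 bar 345 baz'
printf '%s\n' "$line" | last_digits     # prints 345
printf '%s\n' "$line" | first_digits    # prints 12
```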

More generally, to extract the first match of a regexp, you need to compute the negation of that regexp. While this is always theoretically possible, the size of the negation grows exponentially with the size of the regexp you're negating, so this is often impractical.

Consider your other example:

sed -n 's/.*\(CONFIG_[a-zA-Z0-9_]*\).*/\1/p'

This example actually exhibits the same issue, but you don't see it on typical inputs. If you feed it hello CONFIG_FOO_CONFIG_BAR, then the command above prints out CONFIG_BAR, not CONFIG_FOO_CONFIG_BAR.
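This is easy to reproduce at a prompt. The wrapper name greedy_config is made up for the sketch; the sed expression is the one from the question.

```shell
# Hypothetical wrapper around the question's command.
greedy_config() {
  sed -n 's/.*\(CONFIG_[a-zA-Z0-9_]*\).*/\1/p'
}

printf 'hello CONFIG_FOO_CONFIG_BAR\n' | greedy_config    # prints CONFIG_BAR
```

The leading .* swallows "hello CONFIG_FOO_" because that is the longest prefix that still leaves a CONFIG_ for the group to match.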

There's a way to print the first match with sed, but it's a little tricky:

sed -n -e 's/\(CONFIG_[a-zA-Z0-9_]*\).*/\n\1/' -e T -e 's/^.*\n//' -e p

(Assuming your sed supports \n to mean a newline in the s replacement text.) This works because sed looks for the earliest match of the regexp, and we don't try to match what precedes the CONFIG_… bit. Since there is no newline inside the line, we can use it as a temporary marker. The T command, a GNU sed extension, skips the rest of the script for that line if the preceding s command didn't match.
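If your sed lacks T or \n in the replacement, the same newline-marker idea can be sketched with an address guard and a backslash followed by a literal newline. This is a variant of the command above rather than the same command; the function name first_config and the sample lines are assumptions.

```shell
# Hypothetical helper: print the first CONFIG_ name on each line that has one.
first_config() {
  sed -n '/CONFIG_/{
    s/\(CONFIG_[a-zA-Z0-9_]*\).*/\
\1/
    s/^.*\n//
    p
  }'
}

printf '%s\n' 'hello CONFIG_FOO_CONFIG_BAR baz' 'no macro here' | first_config
# prints CONFIG_FOO_CONFIG_BAR
```

The \1 has to start in the first column, because any indentation after the backslash-newline would become part of the replacement text.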

When you can't figure out how to do something in sed, turn to awk. The following command prints the earliest longest match of a regexp:

awk 'match($0, /[0-9]+/) {print substr($0, RSTART, RLENGTH)}'
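As a small usage sketch, the awk command can be wrapped in a function and tried on a line with two numbers. The name first_number and the sample input are invented here.

```shell
# Hypothetical wrapper around the awk one-liner above.
first_number() {
  awk 'match($0, /[0-9]+/) { print substr($0, RSTART, RLENGTH) }'
}

printf '%s\n' 'foo 12 bar 345' 'no digits' | first_number    # prints 12
```

match() returns 0 when nothing is found, so lines without digits produce no output.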

And if you feel like keeping it simple, use Perl.

perl -l -ne '/[0-9]+/ && print $&'       # first match
perl -l -ne '/^(?:.*[^0-9])?([0-9]+)/ && print $1'  # last match

Hello Gilles, just had a chance to check out your reply and I must thank you since your answer was an eye opener because as a result of this problem and your feedback I did some research and found out quite a bit on greediness of the '*', '+' and '{}', how to make them lazy and most importantly the backtracking feature of regular expression engines. I grep a lot and have been trying add sed to my repertoire. Time well spent learning about regex's! Although I had a hunch that relying on sed alone at times might be verging on masochism. Thanks for pointing me in the direction of awk and perl. — darbehdar, Feb 16 '12 at 10:34
