find a specific string pattern from a file in Unix Shell Scripting

Question

I have the below command.

 unzip -p GLP.K4C.S06F5.BG57218-rdf.zip | grep  ":taxonomies-" | head -1

which gives me this output:

    <j.2:Taxo_Version rdf:resource="refmat:taxonomies-8.2.0"/>

However, I need to extract only taxonomies-8.2.0 instead of the full string above.

In the future, please provide an example of your input file; it lets us give you more specific answers than a single line does. For example, my solutions below might work for other lines of your file, but I can't know since you haven't shown it.
– terdon ♦, Commented Oct 21, 2014 at 16:28

terdon · Accepted Answer · 2014-10-21 16:27:39Z

One way is to use grep's -o option, combined with the power of PCREs (-P):

   -o, --only-matching
          Print  only  the  matched  (non-empty) parts of a matching line,
          with each such part on a separate output line.
   -P, --perl-regexp
          Interpret  PATTERN  as  a  Perl  regular  expression  (PCRE, see
          below).  This is highly experimental and grep  -P  may  warn  of
          unimplemented features.

So, you could do

 unzip -p GLP.K4C.S06F5.BG57218-rdf.zip | grep -oP ':\Ktaxonomies-[^"]*' | head -1

The \K causes anything matched up to that point to be dropped from the output (so the : is not printed), and [^"]* means "match as many non-" characters as possible", that is, everything up to the closing double quote.
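To see what -o and \K do in isolation, here is a minimal sketch run against the single line shown in the question. It assumes GNU grep built with PCRE support (-P is not available in every grep), and the `sample` variable is only a stand-in for the `unzip -p` output.

```shell
# Stand-in for one line of the unzip -p output (copied from the question).
sample='    <j.2:Taxo_Version rdf:resource="refmat:taxonomies-8.2.0"/>'

# Without \K the leading colon is part of the match, so -o prints it.
with_colon=$(printf '%s\n' "$sample" | grep -oP ':taxonomies-[^"]*')

# With \K everything matched before it is left out of the printed match.
without_colon=$(printf '%s\n' "$sample" | grep -oP ':\Ktaxonomies-[^"]*')

printf '%s\n' "$with_colon"      # :taxonomies-8.2.0
printf '%s\n' "$without_colon"   # taxonomies-8.2.0
```

In both cases [^"]* stops at the closing double quote, so the trailing "/> never becomes part of the match.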

Other options include:

sed
```
unzip -p GLP.K4C.S06F5.BG57218-rdf.zip |
    sed -n 's/.*:\(taxonomies-[^"]*\).*/\1/p' | head -1
```
The -n causes sed to print nothing unless explicitly told to and the s/// is the substitution operator. It will replace everything on the line with the part of the line between the parentheses (\1). The p causes the resulting line to be printed.
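The interplay of -n and the p flag is easier to see on a tiny input. This is only a sketch: the first line of `input` is a hypothetical non-matching line invented for the illustration, and the second is the line from the question.

```shell
# Two stand-in lines: the first is a hypothetical non-matching line,
# the second is the line from the question.
input='<rdf:Description rdf:about="doc1">
    <j.2:Taxo_Version rdf:resource="refmat:taxonomies-8.2.0"/>'

# With -n and the p flag, only lines where the substitution succeeded are printed.
only_matches=$(printf '%s\n' "$input" |
    sed -n 's/.*:\(taxonomies-[^"]*\).*/\1/p')

# Without -n, sed also auto-prints every line, so the match appears twice
# and the non-matching line passes through unchanged.
everything=$(printf '%s\n' "$input" |
    sed 's/.*:\(taxonomies-[^"]*\).*/\1/p')

printf '%s\n' "$only_matches"
printf '%s\n' "$everything"
```

The first printf shows just taxonomies-8.2.0; the second shows three lines, which is why -n is needed here.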

Perl

unzip -p GLP.K4C.S06F5.BG57218-rdf.zip |
  perl -lne 's/.*:(taxonomies-[^"]*).*/$1/ && print' | head -1

The same basic idea as the sed. If the substitution was successful, the line is printed. An alternative would be

unzip -p GLP.K4C.S06F5.BG57218-rdf.zip |
  perl -lne '/.*:(taxonomies-[^"]*)/ && print $1' | head -1

Still not getting the expected output. This gives the correct output:
[cpc22776 141029_134901]$ unzip -p GLP.K4C.S1BB7.BG49087-rdf.zip | sed -n 's/.*:(taxonomies-[^"].).*/\1/p' | head -1
taxonomies-07
but when it has a taxonomy version, it prints only up to the . after the 8:
cpclb2a670:/usr/local/afs7/PaF/LNK4C/C2B_75/LN_input/zip_processed/0KFQ/FRA-DEV_176YYY-K4C_28Oct/141029_062955_load $ unzip -p GLP.K4C.S0700.BG75448-rdf.zip | sed -n 's/.*:(taxonomies-[^"].).*/\1/p' | head -1
taxonomies-8.
– Atil Thakor, Commented Oct 29, 2014 at 19:05
@AtilThakor yes, that's why I said you need to show your input file. Please edit your question and add an example of your file.
– terdon ♦, Commented Oct 29, 2014 at 21:54
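The truncated taxonomies-8. reported in the comment is what a pattern ending in [^"]. would produce, since that matches exactly two characters after the hyphen. Whether the * was lost when the command was copied is a guess; the sketch below only contrasts the two patterns on the line from the question.

```shell
# The line from the question, used as stand-in input.
line='    <j.2:Taxo_Version rdf:resource="refmat:taxonomies-8.2.0"/>'

# [^"]. matches one non-quote character plus any one character,
# so only two characters of the version are captured.
truncated=$(printf '%s\n' "$line" |
    sed -n 's/.*:\(taxonomies-[^"].\).*/\1/p')

# [^"]* matches every character up to the closing double quote.
full=$(printf '%s\n' "$line" |
    sed -n 's/.*:\(taxonomies-[^"]*\).*/\1/p')

printf '%s\n' "$truncated"   # taxonomies-8.
printf '%s\n' "$full"        # taxonomies-8.2.0
```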

Community · Accepted Answer · 2017-04-13 12:36:51Z


If you know where the : characters occur in your input, you could do something like this.

echo ' <j.2:Taxo_Version rdf:resource="refmat:taxonomies-8.2.0"/>' |
awk -F: '{print $4}' | sed 's/...$//'

The awk command prints the 4th colon-delimited field, and the sed command removes the last 3 characters ("/>) to get the desired output. The sample line is single-quoted so that the shell keeps its inner double quotes, as they appear in the real file.

However, whether this method works depends on your input, as terdon points out in his comments.
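A quick way to check which field holds the value for a given input is to print every field with its index. This is a sketch on the line from the question, single-quoted so the inner double quotes survive as they would in the real file.

```shell
# The line from the question, used as stand-in input.
line='    <j.2:Taxo_Version rdf:resource="refmat:taxonomies-8.2.0"/>'

# Print each colon-delimited field together with its field number.
fields=$(printf '%s\n' "$line" |
    awk -F: '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')

printf '%s\n' "$fields"
# 1=[    <j.2]
# 2=[Taxo_Version rdf]
# 3=[resource="refmat]
# 4=[taxonomies-8.2.0"/>]
```

Field 4 carries the value plus the trailing "/>, which still has to be stripped. If other lines of the file contain a different number of colons, the field number changes, which is why the result depends on the input.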

EDIT

The final pipe to sed can be avoided by using the solution jasonwryan suggested in the comments. The command then becomes:

 echo ' <j.2:Taxo_Version rdf:resource="refmat:taxonomies-8.2.0"/>' |
 awk -F: '{sub(/"\/>/,""); print $4}'

Another solution, using only cut and rev:

echo ' <j.2:Taxo_Version rdf:resource="refmat:taxonomies-8.2.0"/>' |
cut -d ':' -f4 | rev | cut -c 4- | rev

Again, the delimiter and field number depend on the input file. In the example you provided, the characters to extract are in the 4th colon-delimited field. cut extracts that field, the good old rev technique reverses the string so that cut -c 4- can drop what were the last 3 characters ("/>), and a second rev restores the original order.
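The cut and rev pipeline is easier to follow one stage at a time. The sketch below assumes the rev utility is installed (it ships with util-linux on most Linux systems) and keeps the inner double quotes of the sample line, so three trailing characters have to go, hence cut -c 4-.

```shell
# The line from the question, used as stand-in input.
line='    <j.2:Taxo_Version rdf:resource="refmat:taxonomies-8.2.0"/>'

# Stage 1: take the 4th colon-delimited field.
field=$(printf '%s\n' "$line" | cut -d ':' -f4)      # taxonomies-8.2.0"/>

# Stage 2: reverse it so the unwanted tail becomes the head.
reversed=$(printf '%s\n' "$field" | rev)             # >/"0.2.8-seimonoxat

# Stage 3: keep everything from the 4th character on, dropping >/".
trimmed=$(printf '%s\n' "$reversed" | cut -c 4-)     # 0.2.8-seimonoxat

# Stage 4: reverse again to restore the original order.
result=$(printf '%s\n' "$trimmed" | rev)             # taxonomies-8.2.0

printf '%s\n' "$result"
```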

answered Oct 21, 2014 at 16:29 by Ramesh · edited Apr 13, 2017 at 12:36 by Community Bot

...or split fields on two delimiters (-F"[:"]) or use gsub to avoid the pipe to sed.
– jasonwryan, Commented Oct 21, 2014 at 16:31
@jasonwryan, Thanks. But I am afraid I am not following your suggestion. Could you please let me know how I could improve the command further?
– Ramesh, Commented Oct 21, 2014 at 16:39
@jasonwryan, added another example without using awk or sed. Hope this one is a little better.
– Ramesh, Commented Oct 21, 2014 at 16:46
Either of these will work (and don't require another process): awk -F: '{sub(/\/>/,""); print $4}' or awk -F'[:/]' '{print $4}': let Awk do the lifting... :)
– jasonwryan, Commented Oct 21, 2014 at 18:02
@jasonwryan, thanks a lot. I added your suggestion to the answer. :)
– Ramesh, Commented Oct 21, 2014 at 18:08
