find a specific string pattern from a file in Unix Shell Scripting

Question

I have the below command.

 unzip -p GLP.K4C.S06F5.BG57218-rdf.zip | grep  ":taxonomies-" | head -1

which gives me the output as,

    <j.2:Taxo_Version rdf:resource="refmat:taxonomies-8.2.0"/>

However, I need to extract only taxonomies-8.2.0 instead of the full string above.

In the future, please provide an example of your input file, it lets us give you more specific answers than just a single line. For example, my solutions below might work for other lines of your file but I can't know since you haven't shown it. — terdon♦, Oct 21 '14 at 16:28

Ramesh · Answer 1 · 2014-10-21 18:16:42Z


If you know where the : characters occur in your input, you could do something like this.

echo ' <j.2:Taxo_Version rdf:resource="refmat:taxonomies-8.2.0"/>' |
awk -F: '{print $4}' | sed 's/...$//'

The sample line is single-quoted so that the shell keeps its inner double quotes, as in the real file. The awk command prints the 4th field using : as the delimiter, and the sed command removes the last 3 characters ("/>) to get the desired output.

However, whether this method works depends on your input, as terdon points out in his comment.

EDIT

The final pipe to sed can be avoided by using the solution jasonwryan suggested in the comments. The command then becomes:

 echo ' <j.2:Taxo_Version rdf:resource="refmat:taxonomies-8.2.0"/>' |
 awk -F: '{sub(/"\/>/,""); print $4}'

Another solution, using just cut and rev:

echo ' <j.2:Taxo_Version rdf:resource="refmat:taxonomies-8.2.0"/>' |
cut -d ':' -f4 | rev | cut -c 4- | rev

Again, the choice of delimiter depends on the input file. In the example you provided, the characters to extract are in the 4th colon-delimited field. I use cut to extract that field, then the good old rev technique: reverse the string, drop its first 3 characters (the trailing "/> of the original), and apply rev again to restore the original order.
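jasonwryan's comment about splitting on two delimiters can be sketched as follows. This is only an illustration against the single sample line from the question; the function name extract_taxonomy is made up here, and the field number would change if the real file has a different number of : or " characters before the value.

```shell
# Sample line copied from the question; single quotes keep the inner double quotes.
line='    <j.2:Taxo_Version rdf:resource="refmat:taxonomies-8.2.0"/>'

# Split on ':' and '"' together. For the sample line the fields are:
#   $1 = "    <j.2"   $2 = "Taxo_Version rdf"   $3 = "resource="
#   $4 = "refmat"     $5 = "taxonomies-8.2.0"   $6 = "/>"
# extract_taxonomy is a hypothetical helper name, not part of any tool.
extract_taxonomy() {
    awk -F'[:"]' '{print $5}'
}

printf '%s\n' "$line" | extract_taxonomy
```

Because the closing quote is itself a delimiter, nothing has to be trimmed from the end of the field afterwards.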

edited Oct 21 '14 at 18:16

answered Oct 21 '14 at 16:29

Ramesh

...or split fields on two delimiters (-F"[:"]) or use gsub to avoid the pipe to sed. – jasonwryan Oct 21 '14 at 16:31

@jasonwryan, Thanks. But, I am afraid I am not following your suggestion. Could you please let me know how I could improve the command further? – Ramesh Oct 21 '14 at 16:39

@jasonwryan, added another example without using awk or sed. Hope this one is little better. – Ramesh Oct 21 '14 at 16:46


Either of these will work (and don't require another process): awk -F: '{sub(/\/>/,""); print $4}' or awk -F'[:/]' '{print $4}': let Awk do the lifting... :) – jasonwryan Oct 21 '14 at 18:02

@jasonwryan, thanks a lot. I added your suggestion to the answer. :) – Ramesh Oct 21 '14 at 18:08


terdon · Answer 2 · 2014-10-21 16:27:39Z

One way is to use grep's -o option, combined with the power of PCREs (-P):

   -o, --only-matching
          Print  only  the  matched  (non-empty) parts of a matching line,
          with each such part on a separate output line.
   -P, --perl-regexp
          Interpret  PATTERN  as  a  Perl  regular  expression  (PCRE, see
          below).  This is highly experimental and grep  -P  may  warn  of
          unimplemented features.

So, you could do

 unzip -p GLP.K4C.S06F5.BG57218-rdf.zip | grep -oP ':\Ktaxonomies-[^"]*' | head -1

The \K causes anything matched up to that point to be dropped from the reported match (so the : is not printed), and [^"]* means "match as many non-" characters as possible".
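To see what \K contributes, here is a small sketch run against the one sample line from the question. It assumes GNU grep built with PCRE support (needed for -P); the variable names are only for illustration.

```shell
# Sample line copied from the question.
line='    <j.2:Taxo_Version rdf:resource="refmat:taxonomies-8.2.0"/>'

# With \K: the leading ':' must match but is left out of the printed result.
with_k=$(printf '%s\n' "$line" | grep -oP ':\Ktaxonomies-[^"]*')

# Without \K (plain -o, no -P needed): the ':' is part of the printed match.
without_k=$(printf '%s\n' "$line" | grep -o ':taxonomies-[^"]*')

printf '%s\n' "$with_k" "$without_k"
# prints:
#   taxonomies-8.2.0
#   :taxonomies-8.2.0
```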

Other options include:

sed
```
unzip -p GLP.K4C.S06F5.BG57218-rdf.zip | 
    sed -n 's/.*:\(taxonomies-[^"]*\).*/\1/p' | head -1
```
The -n causes sed to print nothing unless explicitly told to and the s/// is the substitution operator. It will replace everything on the line with the part of the line between the parentheses (\1). The p causes the resulting line to be printed.
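A minimal sketch of how -n and p work together, using a three-line input. Only the middle line comes from the question; the surrounding rdf:Description lines are invented here to stand in for the rest of the file, which was not shown.

```shell
# Hypothetical three-line input; only the middle line is from the question.
sample='<rdf:Description rdf:about="example">
    <j.2:Taxo_Version rdf:resource="refmat:taxonomies-8.2.0"/>
</rdf:Description>'

# -n suppresses the default printing of every line; the p flag prints a line
# only when the substitution succeeded, so non-matching lines disappear.
result=$(printf '%s\n' "$sample" |
    sed -n 's/.*:\(taxonomies-[^"]*\).*/\1/p')

printf '%s\n' "$result"
# prints: taxonomies-8.2.0
```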

Perl

unzip -p GLP.K4C.S06F5.BG57218-rdf.zip | 
  perl -lne 's/.*:(taxonomies-[^"]*).*/$1/ && print' | head -1

The same basic idea as the sed. If the substitution was successful, the line is printed. An alternative would be

unzip -p GLP.K4C.S06F5.BG57218-rdf.zip | 
  perl -lne '/.*:(taxonomies-[^"]*)/ && print $1' | head -1

Still not getting expected output. This will give you correct output [cpc22776 141029_134901]$ unzip -p GLP.K4C.S1BB7.BG49087-rdf.zip | sed -n 's/.*:(taxonomies-[^"].).*/\1/p' | head -1 taxonomies-07 but when it has taxonomy version it prints . after 8---- cpclb2a670:/usr/local/afs7/PaF/LNK4C/C2B_75/LN_input/zip_processed/0KFQ/FRA-DEV_176YYY-K4C_28Oct/141029_062955_load $ unzip -p GLP.K4C.S0700.BG75448-rdf.zip | sed -n 's/.*:(taxonomies-[^"].).*/\1/p' | head -1 taxonomies-8. — Atil Thakor, Oct 29 '14 at 19:05
@AtilThakor yes, that's why I said you need to show your input file. Please edit your question and add an example of your file. — terdon♦, Oct 29 '14 at 21:54

