
I have the below command.

 unzip -p GLP.K4C.S06F5.BG57218-rdf.zip | grep  ":taxonomies-" | head -1

which gives me the output as,

    <j.2:Taxo_Version rdf:resource="refmat:taxonomies-8.2.0"/>

However, I need to extract only taxonomies-8.2.0 instead of the full string shown above.

  • In the future, please provide an example of your input file; it lets us give you more specific answers than a single line does. For example, my solutions below might work for other lines of your file, but I can't know since you haven't shown it. Commented Oct 21, 2014 at 16:28

2 Answers


One way is to use grep's -o option, combined with the power of PCREs (-P):

   -o, --only-matching
          Print  only  the  matched  (non-empty) parts of a matching line,
          with each such part on a separate output line.
   -P, --perl-regexp
          Interpret  PATTERN  as  a  Perl  regular  expression  (PCRE, see
          below).  This is highly experimental and grep  -P  may  warn  of
          unimplemented features.

So, you could do

 unzip -p GLP.K4C.S06F5.BG57218-rdf.zip | grep -oP ':\Ktaxonomies-[^"]*' | head -1

The \K causes anything matched up to that point to be discarded (so the : is not printed), and [^"]* means "match as many non-" characters as possible".
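If your grep has no -P (BSD grep, for example), a rough sketch of the same idea is to start the match at taxonomies- itself, so no \K is needed. This assumes the text taxonomies- only ever appears right after the colon you care about; the sample line is the one from the question, and the variable names are just for illustration.

```shell
# Sample line copied from the question; the variable names are illustrative.
line='    <j.2:Taxo_Version rdf:resource="refmat:taxonomies-8.2.0"/>'

# -o prints only the matched text; [^"]* stops the match at the closing quote.
version=$(printf '%s\n' "$line" | grep -o 'taxonomies-[^"]*' | head -1)
echo "$version"
```

Note that -o is an extension rather than part of POSIX grep, although both GNU and BSD grep provide it.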

Other options include:

  1. sed

    unzip -p GLP.K4C.S06F5.BG57218-rdf.zip | 
        sed -n 's/.*:\(taxonomies-[^"]*\).*/\1/p' | head -1
    

    The -n causes sed to print nothing unless explicitly told to and the s/// is the substitution operator. It will replace everything on the line with the part of the line between the parentheses (\1). The p causes the resulting line to be printed.

  2. Perl

    unzip -p GLP.K4C.S06F5.BG57218-rdf.zip | 
      perl -lne 's/.*:(taxonomies-[^"]*).*/$1/ && print' | head -1
    

    The same basic idea as the sed. If the substitution was successful, the line is printed. An alternative would be

    unzip -p GLP.K4C.S06F5.BG57218-rdf.zip | 
      perl -lne '/.*:(taxonomies-[^"]*)/ && print $1' | head -1
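
All of these pipelines end in head -1, so one possible variation is to let the tool stop at the first match itself. The sketch below does that with sed; the first sample line is from the question, while the second line (taxonomies-9.9.9) is made up purely to show that later matches are ignored.

```shell
# First line is from the question; the second is a hypothetical later match.
input='    <j.2:Taxo_Version rdf:resource="refmat:taxonomies-8.2.0"/>
    <j.2:Other rdf:resource="refmat:taxonomies-9.9.9"/>'

# On the first line containing :taxonomies-, substitute, print, then quit.
first=$(printf '%s\n' "$input" |
    sed -n '/:taxonomies-/{s/.*:\(taxonomies-[^"]*\).*/\1/p;q;}')
echo "$first"
```

The q command ends sed after the first matching line, so the trailing head -1 is no longer needed.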
    
  • Still not getting the expected output. This gives the correct output: [cpc22776 141029_134901]$ unzip -p GLP.K4C.S1BB7.BG49087-rdf.zip | sed -n 's/.*:(taxonomies-[^"].).*/\1/p' | head -1 taxonomies-07 but when the file has a taxonomy version, the output stops at the . after the 8: cpclb2a670:/usr/local/afs7/PaF/LNK4C/C2B_75/LN_input/zip_processed/0KFQ/FRA-DEV_176YYY-K4C_28Oct/141029_062955_load $ unzip -p GLP.K4C.S0700.BG75448-rdf.zip | sed -n 's/.*:(taxonomies-[^"].).*/\1/p' | head -1 taxonomies-8. Commented Oct 29, 2014 at 19:05
  • @AtilThakor yes, that's why I said you need to show your input file. Please edit your question and add an example of your file. Commented Oct 29, 2014 at 21:54

If you know how many : characters come before the value in your input, you could do something like this.

echo ' <j.2:Taxo_Version rdf:resource="refmat:taxonomies-8.2.0"/>' |
awk -F: '{print $4}' | sed 's/...$//'

The awk command prints the 4th :-delimited field, and the sed command removes the last 3 characters ("/>) to give the desired output. The sample line is single-quoted so that the shell keeps its inner double quotes, as in the real file.

However, whether this method works depends on your input, as terdon points out in his comment.

EDIT

The final pipe to sed can be avoided by using the solution jasonwryan suggested in the comments. The command then becomes,

 echo ' <j.2:Taxo_Version rdf:resource="refmat:taxonomies-8.2.0"/>' |
 awk -F: '{sub(/"\/>/,""); print $4}'

Another solution, using only cut and rev, is:

echo ' <j.2:Taxo_Version rdf:resource="refmat:taxonomies-8.2.0"/>' |
cut -d ':' -f4 | rev | cut -c 4- | rev

Again, the choice of delimiter and field number depends on the input file. In the example you have provided, the text I need is the 4th :-delimited field. I use cut to extract that field, then the good old rev technique: reverse the string, drop its first 3 characters (the trailing "/> of the original), and apply rev again to get the actual string.
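
Counting colons ties these commands to one exact line layout. As a hedged alternative, the sketch below uses awk's match() to locate the taxonomies-... text wherever it sits on the line. The sample line is a made-up variant of the one in the question, with a hypothetical extra xml:lang attribute added so that the value is no longer the 4th field.

```shell
# Hypothetical variant of the question's line: the extra xml:lang attribute
# adds another colon, so a fixed field number would no longer work.
line=' <j.2:Taxo_Version xml:lang="en" rdf:resource="refmat:taxonomies-8.2.0"/>'

# match() sets RSTART and RLENGTH for the first match; exit stops after it.
version=$(printf '%s\n' "$line" |
    awk 'match($0, /taxonomies-[^"]*/) { print substr($0, RSTART, RLENGTH); exit }')
echo "$version"
```

match(), RSTART and RLENGTH are standard awk, so this should not need GNU-specific features.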

  • ...or split fields on two delimiters (-F'[:"]') or use gsub to avoid the pipe to sed. Commented Oct 21, 2014 at 16:31
  • @jasonwryan, Thanks. But, I am afraid I am not following your suggestion. Could you please let me know how I could improve the command further? Commented Oct 21, 2014 at 16:39
  • @jasonwryan, added another example without using awk or sed. Hope this one is a little better. Commented Oct 21, 2014 at 16:46
  • Either of these will work (and don't require another process): awk -F: '{sub(/"\/>/,""); print $4}' or awk -F'[:"]' '{print $5}': let Awk do the lifting... :) Commented Oct 21, 2014 at 18:02
  • @jasonwryan, thanks a lot. I added your suggestion to the answer. :) Commented Oct 21, 2014 at 18:08
