I have a huge text file called dictionary.txt with entries like

    ABC_SEQ_NUM This represents....
    ABC_RANK This represents....
    ABC_BSC_ID This represents...
    PQR_TA_DATE_AF This represents...
    XYZ_C_ID This represents...

In another file, I have the source for a program that is using some of these abbreviations as part of its variable names. The variable names often use the above entries as follows

     Facilitator.TMP_ABC_SEQ_NUM 

So I am not able to simply search for TMP_ABC_SEQ_NUM using grep, because it would return no match. However, the last part of the variable name ("ABC_SEQ_NUM") is actually present in the text file.

So I would like to say something like

      grep (longest match for) TMP_ABC_SEQ_NUM in dictionary.txt

So that it would return the match for

      ABC_SEQ_NUM

How can I write such a command?


3 Answers

up vote 2 down vote accepted

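This strips characters from the beginning of the name, one at a time, until the remainder matches:

    t=TMP_ABC_SEQ_NUM
    # stop before ${t:n} becomes empty, since an empty pattern matches every line
    for n in $(seq 0 $((${#t} - 1)))
    do
      grep -- "${t:n}" dictionary.txt && break
    done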

This searches for the longest sequence, no matter where it starts:

    # try every substring of $t, longest first, down to a minimum length of 3
    for len in $(seq ${#t} -1 3)
    do
       for start in $(seq 0 $((${#t} - len)))
       do
           grep -- "${t:start:len}" dictionary.txt && break 2
       done
    done

Requirement: a bash-like shell, because the ${t:start:len} substring expansion is not part of plain POSIX sh. On Windows, native win32 ports of many GNU utilities are available, such as sh.exe, grep, sed, awk, bc, cat, tac, rev, col, cut, ...
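
The longest-substring search above can also be wrapped in a function that compares each candidate against the first field of a dictionary line only, so that text in the descriptions cannot produce a hit. This is a minimal sketch, not a definitive implementation: the function name longest_key_match is hypothetical, and it assumes bash plus an awk on the PATH.

```bash
#!/bin/bash
# Hypothetical helper: print the dictionary line whose key (first field) is
# the longest substring of the given name. Returns non-zero if nothing matches.
longest_key_match() {
    local name=$1 dict=$2 len start
    # Try every substring of the name, longest first.
    for ((len = ${#name}; len >= 1; len--)); do
        for ((start = 0; start + len <= ${#name}; start++)); do
            # Exact comparison against the first field, so characters such
            # as "." in the name are never interpreted as a regex.
            if awk -v key="${name:start:len}" \
                   '$1 == key { print; found = 1; exit } END { exit !found }' "$dict"
            then
                return 0
            fi
        done
    done
    return 1
}
```

It would be called as longest_key_match Facilitator.TMP_ABC_SEQ_NUM dictionary.txt. It starts one awk process per candidate substring, which is acceptable for a single variable name but slow for thousands of them.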


A possible approach is to shorten the string from the head until it matches:

    #!/bin/sh
    string="TMP_ABC_SEQ_NUM"
    while ! grep "$string" dictionary.txt; do
      # give up once no "_" is left; otherwise the loop would never end
      case $string in *_*) ;; *) break ;; esac
      # remove the shortest leading string ending with "_"
      string="${string#*_}"
    done
Would this work under Windows? I am using grep under Windows. – CodeBlue Apr 2 at 14:29
@CodeBlue: it does not depend on grep, but on the availability of a POSIX shell. One is certainly available through Cygwin. – enzotib Apr 2 at 14:30
What if the string was FOO_ABQ_SEQ_NUM_BAR? – Gilles Apr 2 at 23:18
@Gilles: the problem does not seem to be well defined, so mine was only a tentative solution – enzotib Apr 3 at 5:45

Could you reverse the way you're looking at this? Rather than looking for TMP_ABC_SEQ_NUM in dictionary.txt, could you look for the first field of each line in dictionary.txt (such as ABC_SEQ_NUM) in the source file?

If this is the case, the following should work:

    #!/bin/bash
    # print every line of the file given as $1 that contains a dictionary key
    for i in $(awk '{print $1}' dictionary.txt); do
        grep -- "$i" "$1"
    done

Pass the above script the name of the file you want to check for sequences present in dictionary.txt. Apologies if this isn't what you wanted.
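
The same reversed lookup can be sketched as a single awk pass that prints the dictionary entries (with their descriptions) whose key occurs anywhere in the source file, instead of printing the matching source lines. The function name used_entries is hypothetical, and the sketch assumes the source file is small enough to hold in memory.

```bash
#!/bin/bash
# Hypothetical helper: list the dictionary entries whose key (first field)
# occurs somewhere in the given source file.
used_entries() {
    local source_file=$1 dict=$2
    awk '
        # First file (the source): collect all of its text.
        FNR == NR { text = text "\n" $0; next }
        # Second file (the dictionary): print entries whose key appears in
        # the text. NF skips blank lines.
        NF && index(text, $1) { print }
    ' "$source_file" "$dict"
}
```

It would be called as used_entries program.src dictionary.txt, where program.src stands for whatever the source file is called. Because index() does a plain substring test, a short key also counts as present when it only occurs inside a longer identifier.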

