Code Review Stack Exchange is a question and answer site for peer programmer code reviews.

I have the following bash script that:

  • Finds all files with .cfc and .cfm extension and converts them to lowercase
  • Stores the relative file paths to those files (filenames.txt)
  • Chops those files to get only the name, excluding the extension (files.txt)
  • Loops through 1500 files checking for references to any of the other 1500 files and converts those to lower case

#!/bin/bash
# Search for references to JS functions in all .cfm and .cfc files
# Prompt to make sure
while true; do
    read -p "All .cfc and .cfm files in the theradoc/ directory and lower directories will be converted. Do you wish to continue? (y/n)" yn
case $yn in
    [Yy]* ) make install; break;;
    [Nn]* ) exit;;
    * ) echo "Please answer yes or no.";;
esac
done
echo "Renaming files..."
for f in `find theradoc/ -d -name '*.cfc'`;
    do mv -v $f `echo $f | tr '[A-Z]' '[a-z]'`
done
for f in `find theradoc/ -d -name '*.cfm'`;
    do mv -v $f `echo $f | tr '[A-Z]' '[a-z]'`
done
echo "Indexing file names..."
find theradoc/ -d -name '*.cfc' > filenames.txt
find theradoc/ -d -name '*.cfm' >> filenames.txt
echo "Editing file names..."
sed 's/theradoc.*\///g' filenames.txt > tmp.txt
sed 's/\.cf.*//g' tmp.txt > files.txt
rm tmp.txt
echo "Searching all files..."
a=($(wc filenames.txt))
lines=${a[0]}
count=0
while read fn; do
    echo "$fn | $count/$lines finished..."
    while read f; do
        perl -pi -e "s/$f/$f/gi" "$fn"
    done < files.txt
    count=$((count+1))
done < filenames.txt

Runtime: 4 hours

Hardware: MacBook 16GB RAM

I would definitely like to decrease the runtime of this as it may be needed to run on other systems with the same files.

It might be faster to construct one perl script instead of calling perl many times for each file. – choroba 11 hours ago

I am willing to learn Perl if needed, as long as it's not a complete, time-consuming change. This project is winding down and I need to move on from it. I was just looking to improve this script before we ended. – theblindprophet 11 hours ago

In my testing, constructing one Perl script and running it repeatedly is much faster (0.5s versus 3.6s) than running a new Perl instance for each replacement:

while read f; do
    echo "s/$f/$f/gi;" 
done < files.txt > s.pl

while read fn; do
    perl -pi s.pl "$fn"
    echo "$fn | $count/$lines finished..."
    count=$((count+1))
done < filenames.txt

rm s.pl
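To make the generation step concrete, here is a self-contained run with a hypothetical two-name files.txt (the names are invented for illustration):

```shell
# Hypothetical sample of base names (invented for illustration)
printf '%s\n' 'login' 'userutil' > files.txt

# Same generation loop as above: one substitution rule per name
while read f; do
    echo "s/$f/$f/gi;"
done < files.txt > s.pl

cat s.pl
# s/login/login/gi;
# s/userutil/userutil/gi;

rm files.txt s.pl
```

Because the files were renamed to lowercase before files.txt was built, each $f is already lowercase, so s/$f/$f/gi matches any capitalization and rewrites it to the lowercase name.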

But it seems even faster (0.05s) to rewrite the whole thing in Perl:

#! /usr/bin/perl
use warnings;
use strict;

use File::Find;

my $dir = 'theradoc2';

my %change;

find(sub {
    return unless -f;
    undef $change{$_};
    rename $_, lc $_;
}, $dir);

my $regex = join '|',
            map quotemeta,
            sort { length $b <=> length $a }
            keys %change;
find(sub {
    return unless -f;
    my $file = $_;
    open my $IN, '<', $file or die $!;
    open my $OUT, '>', "$file.new" or die $!;
    while (<$IN>) {
        s/($regex)/\L$1/g;
        print {$OUT} $_;
    }
    close $OUT or die $!;
    unlink $file or die $!;
    rename "$file.new", $file or die $!;
}, $dir);

This might not make a huge difference, but I'd start by combining your find calls. It's been a while, but I think something like this will work (basically, match against *.cf followed by c or m):

find theradoc/ -d -name '*.cf[cm]' > filenames.txt

Or

find theradoc/ -d -name '*.cf?' > filenames.txt

You also use tmp.txt only as an intermediate, so I'd tend to just pipe the output from the first sed call into the second:

sed 's/theradoc.*\///g' filenames.txt | sed 's/\.cf.*//g' > files.txt
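A single sed process can also apply both expressions with -e, saving one fork on every run; a quick check with an invented path:

```shell
# Two -e expressions run in sequence inside one sed process
printf '%s\n' 'theradoc/sub/login.cfc' |
    sed -e 's/theradoc.*\///g' -e 's/\.cf.*//g'
# prints: login
```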

At least as I understand it, you want to get the file names with relative paths. Right now, your invocation of find produces paths that include the directory prefix, and then you're using sed to remove the prefix you specified on the command line.

Assuming that's correct, you can tell find to produce just the part you want by specifying -printf "%P\n" (a GNU find extension; the BSD find that ships with macOS doesn't support -printf).
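For illustration, a quick demo of %P with an invented directory (GNU find only):

```shell
# %P prints each match with the starting-point prefix removed
mkdir -p demo_theradoc/sub
touch demo_theradoc/sub/login.cfc
find demo_theradoc/ -name '*.cf[cm]' -printf '%P\n'
# prints: sub/login.cfc
rm -r demo_theradoc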

For the part that remains (removing the file extensions) you can pipe the data directly from find to sed.

You also probably want to use -depth instead of -d, since the latter is deprecated.

Combining those, we end up with something like this:

find theradoc/ -depth -name '*.cf[cm]' -printf "%P\n" | sed 's/\.cf.*//g' > files.txt

I was going to suggest creating a sed script to handle the editing inside the index, but I see @choroba has already suggested roughly the same, so I'll leave that for now.


While the main problem of your script is obviously the while read fn; do loop (for which @choroba has already provided a good solution, I think), I want to point out a few problems which often appear in shell scripts:

for f in `find theradoc/ -d -name '*.cfc'`;

It's generally a bad idea to collect the output of find in a subshell like this, because that might be memory intensive. Looping over the result with for is also discouraged, since it word-splits on whitespace in file names.

There is also no need to employ tr to lowercase variable contents, since bash has a built-in way to do this (but it's probably one of the lesser-known features). I also don't see why you need four separate find calls. So in total I think you can replace the whole block

echo "Renaming files..."
for f in `find theradoc/ -d -name '*.cfc'`;
    do mv -v $f `echo $f | tr '[A-Z]' '[a-z]'`
done
for f in `find theradoc/ -d -name '*.cfm'`;
    do mv -v $f `echo $f | tr '[A-Z]' '[a-z]'`
done
echo "Indexing file names..."
find theradoc/ -d -name '*.cfc' > filenames.txt
find theradoc/ -d -name '*.cfm' >> filenames.txt
echo "Editing file names..."
sed 's/theradoc.*\///g' filenames.txt > tmp.txt
sed 's/\.cf.*//g' tmp.txt > files.txt
rm tmp.txt

with

echo "Renaming files..."
find theradoc/ -d -name '*.cf[cm]' | tee filenames.txt | while read f
do
    mv -v "$f" "${f,,}"
done
echo "Editing file names..."
sed 's/theradoc.*\/\(.*\)\.cf[cm]$/\1/' filenames.txt > files.txt
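As an aside, the ${f,,} expansion used above lowercases the whole parameter but requires bash 4 or newer; the stock macOS bash is 3.2, where the tr approach is still needed. A quick check:

```shell
# ${var,,} lowercases every character (bash >= 4)
f='THERADOC/Sub/MyPage.CFM'
echo "${f,,}"    # prints: theradoc/sub/mypage.cfm
```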

Of course this alone won't make a difference to your 4 hours runtime.


Here are some things that may help improve your code.

Avoid the complication of using make if practical

Rather than invoking make install, I'd instead advocate something like this:

read -p "All .cfc and .cfm files in the $1 directory and lower directories will be converted.  Do you wish to continue?" affirm
case "$affirm" in 
    y|Y ) echo "yes";;
    n|N ) echo "no -- quitting program"; exit;;
    * ) echo "Invalid response -- quitting program"; exit;;
esac

The difference is that now the user doesn't need make. Reducing dependencies results in a more portable script.

Pass in the target directory as a parameter

Rather than hard code theradoc, it might be more convenient, especially for testing, to have the directory name as a command line argument.

Use full path names for security

It's probably better to invoke /usr/bin/sed than just sed, because the latter would be easy to substitute. I could have a malicious sed on my path somewhere ahead of the real sed, but it's generally harder to overwrite a system file. Specifying a full path uses that to your (and your users') advantage.

Use find directly

Rather than creating multiple files and processing them, I think you will find that it's faster to simply use find directly rather than redirecting to a file and then processing that file line by line.

Use awk for complex replacement

While perl can do regular expression matching and everything awk can do, you're currently invoking perl once per filename per file. That is, if there are 1500 files, you're invoking perl 1500 * 1500 = 2,250,000 times. It's already been suggested to just do the whole thing in perl, which is certainly another option, although I find perl to be a "write-only" language: once I write it, six months later I find that even I can't read and understand it.

Construct just the replacement list

The only output file you really need is the one containing just the list of base file names. That can be done like this:

find "$1" -type f -iname '*.cf[cm]' -exec /usr/bin/basename {} ';' >basename.txt

Note that this uses -exec to run basename to extract just the base part of the name (e.g. /usr/bin/basename would be converted to basename).

Create a bash function to handle the replacement

I'd advocate exporting a function and then invoking find again like this:

export -f replaceAndRename
find "$1" -type f -iname '*.cf[cm]' -exec bash -c 'replaceAndRename "$0"' {} \;

The replaceAndRename function might be implemented like this with awk:

replaceAndRename () {
    lcfile="$(echo $(/usr/bin/basename "$1") | tr '[A-Z]' '[a-z]')"
    lcdir="$(/usr/bin/dirname "$1")"
    lcfile="${lcdir}/${lcfile}"
    /usr/bin/awk 'NR==FNR { map[$1]=tolower($1); next }{ 
        for (old in map) {
            gsub(old,map[old])
        }
        print
    }' basename.txt "$1" >"tmp.foo"
    retval="$?"
    if [ "$retval" -eq 0 ]; then
        rm "$1"
        mv tmp.foo "$lcfile"
    fi
}

This probably looks more complicated than it really is. The first three lines simply create a version of the name that uses a lowercase base name. Note, too, that if there is something like a subdirectory with a name like My Directory we don't want to alter the directory name -- just the file name.

Next, we invoke awk and pass in the basename.txt file created by the first invocation of find, as well as the current file name to be processed, redirecting the output to a temporary file I've arbitrarily named tmp.foo, though one could probably improve that by using mktemp instead.

The awk script reads the first file in and creates an associative array of the original version of the filename mapped to the lowercase version. Next, the second file is scanned and the map is used to make the appropriate substitutions. Finally the print within awk just prints the possibly modified line to the output which is redirected to tmp.foo.
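The NR==FNR idiom can be seen in isolation with a tiny invented map file and target file:

```shell
# NR==FNR is only true while the first file is being read: build the map, then skip.
# For the second file, every mapped name is replaced by its lowercase form.
printf '%s\n' 'Login.cfc' > basename.txt
printf '%s\n' 'include "Login.cfc";' > page.cfm
awk 'NR==FNR { map[$1] = tolower($1); next }
     { for (old in map) gsub(old, map[old]); print }' basename.txt page.cfm
# prints: include "login.cfc";
rm basename.txt page.cfm
```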

Finally, if awk seemed to run successfully, we remove the original file and then move tmp.foo into place using the lowercase version of the name. We do it in this order in case one of the files is already in lowercase but may have had file references changed.

Putting it all together

I don't have a convenient way to do benchmarking, but I believe this version of the script will improve your speed.

#!/bin/bash
read -p "All .cfc and .cfm files in the $1 directory and lower directories will be converted.  Do you wish to continue?" affirm
case "$affirm" in 
    y|Y ) echo "yes";;
    n|N ) echo "no -- quitting program"; exit;;
    * ) echo "Invalid response -- quitting program"; exit;;
esac
find "$1" -type f -iname '*.cf[cm]' -exec /usr/bin/basename {} ';' >basename.txt
replaceAndRename () {
    lcfile="$(echo $(/usr/bin/basename "$1") | tr '[A-Z]' '[a-z]')"
    lcdir="$(/usr/bin/dirname "$1")"
    lcfile="${lcdir}/${lcfile}"
    /usr/bin/awk 'NR==FNR { map[$1]=tolower($1); next }{ 
        for (old in map) {
            gsub(old,map[old])
        }
        print
    }' basename.txt "$1" >"tmp.foo"
    retval="$?"
    if [ "$retval" -eq 0 ]; then
        rm "$1"
        mv tmp.foo "$lcfile"
    fi
}
export -f replaceAndRename
find "$1" -type f -iname '*.cf[cm]' -exec bash -c 'replaceAndRename "$0"' {} \;
