Code Review Stack Exchange is a question and answer site for peer programmer code reviews.

I have the following bash script that:

  • Finds all files with .cfc and .cfm extension and converts them to lowercase
  • Stores the relative file paths to those files (filenames.txt)
  • Chops those files to get only the name, excluding the extension (files.txt)
  • Loops through 1500 files checking for references to any of the other 1500 files and converts those to lower case

#!/bin/bash
# Search for references to JS functions in all .cfm and .cfc files
# Prompt to make sure
while true; do
    read -p "All .cfc and .cfm files in the theradoc/ directory and lower directories will be converted. Do you wish to continue? (y/n)" yn
case $yn in
    [Yy]* ) make install; break;;
    [Nn]* ) exit;;
    * ) echo "Please answer yes or no.";;
esac
done
echo "Renaming files..."
for f in `find theradoc/ -d -name '*.cfc'`;
    do mv -v $f `echo $f | tr '[A-Z]' '[a-z]'`
done
for f in `find theradoc/ -d -name '*.cfm'`;
    do mv -v $f `echo $f | tr '[A-Z]' '[a-z]'`
done
echo "Indexing file names..."
find theradoc/ -d -name '*.cfc' > filenames.txt
find theradoc/ -d -name '*.cfm' >> filenames.txt
echo "Editing file names..."
sed 's/theradoc.*\///g' filenames.txt > tmp.txt
sed 's/\.cf.*//g' tmp.txt > files.txt
rm tmp.txt
echo "Searching all files..."
a=($(wc filenames.txt))
lines=${a[0]}
count=0
while read fn; do
    echo "$fn | $count/$lines finished..."
    while read f; do
        perl -pi -e "s/$f/$f/gi" "$fn"
    done < files.txt
    count=$((count+1))
done < filenames.txt

Runtime: 4 hours

Hardware: MacBook 16GB RAM

I would definitely like to decrease the runtime of this as it may be needed to run on other systems with the same files.

It might be faster to construct one perl script instead of calling perl many times for each file. – choroba 11 hours ago

I am willing to learn Perl if needed, as long as it's not a complete, time-consuming change. This project is winding down and I need to move on from it. I was just looking to improve this script before we ended. – theblindprophet 11 hours ago

In my testing, constructing one Perl script and running it repeatedly is much faster (0.5s versus 3.6s) than running a new Perl instance for each replacement:

while read f; do
    echo "s/$f/$f/gi;" 
done < files.txt > s.pl

while read fn; do
    perl -pi s.pl "$fn"
    echo "$fn | $count/$lines finished..."
    count=$((count+1))
done < filenames.txt

rm s.pl
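To make the generation step concrete, here is a self-contained run with a hypothetical two-name files.txt (the names are invented for illustration):

```shell
# Hypothetical sample of base names (invented for illustration)
printf '%s\n' 'login' 'userutil' > files.txt

# Same generation loop as above: one substitution rule per name
while read f; do
    echo "s/$f/$f/gi;"
done < files.txt > s.pl

cat s.pl
# s/login/login/gi;
# s/userutil/userutil/gi;

rm files.txt s.pl
```

Because the files were renamed to lowercase before files.txt was built, each $f is already lowercase, so s/$f/$f/gi matches any capitalization and rewrites it to the lowercase name.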

But it seems even faster (0.05s) to rewrite the whole thing in Perl:

#! /usr/bin/perl
use warnings;
use strict;

use File::Find;

my $dir = 'theradoc2';

my %change;

find(sub {
    return unless -f;
    undef $change{$_};
    rename $_, lc $_;
}, $dir);

my $regex = join '|',
            map quotemeta,
            sort { length $b <=> length $a }
            keys %change;
find(sub {
    return unless -f;
    my $file = $_;
    open my $IN, '<', $file or die $!;
    open my $OUT, '>', "$file.new" or die $!;
    while (<$IN>) {
        s/($regex)/\L$1/g;
        print {$OUT} $_;
    }
    close $OUT or die $!;
    unlink $file or die $!;
    rename "$file.new", $file or die $!;
}, $dir);

This might not make a huge difference, but I'd start by combining your find calls. It's been a while, but I think something like this will work (basically, match against *.cf followed by c or m):

find theradoc/ -d -name '*.cf[cm]' > filenames.txt

Or

find theradoc/ -d -name '*.cf?' > filenames.txt

You also use tmp.txt only as an intermediate, so I'd tend to just pipe the output from the first sed call into the second:

sed 's/theradoc.*\///g' filenames.txt | sed 's/\.cf.*//g' > files.txt
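A single sed process can also apply both expressions with -e, saving one fork on every run; a quick check with an invented path:

```shell
# Two -e expressions run in sequence inside one sed process
printf '%s\n' 'theradoc/sub/login.cfc' |
    sed -e 's/theradoc.*\///g' -e 's/\.cf.*//g'
# prints: login
```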

At least as I understand it, you want to get the file names with relative paths. Right now, your invocation of find produces paths that include the directory prefix, and then you're using sed to remove the prefix you specified on the command line.

Assuming that's correct, you can tell find to produce just the part you want by specifying -printf "%P\n" (a GNU find extension; the BSD find that ships with macOS doesn't support -printf).
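For illustration, a quick demo of %P with an invented directory (GNU find only):

```shell
# %P prints each match with the starting-point prefix removed
mkdir -p demo_theradoc/sub
touch demo_theradoc/sub/login.cfc
find demo_theradoc/ -name '*.cf[cm]' -printf '%P\n'
# prints: sub/login.cfc
rm -r demo_theradoc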

For the part that remains (removing the file extensions) you can pipe the data directly from find to sed.

You also probably want to use -depth instead of -d, since the latter is deprecated.

Combining those, we end up with something like this:

find theradoc/ -depth -name '*.cf[cm]' -printf "%P\n" | sed 's/\.cf.*//g' > files.txt

I was going to suggest creating a sed script to handle the editing inside the index, but I see @choroba has already suggested roughly the same, so I'll leave that for now.


While the main problem of your script is obviously the while read fn; do loop (for which @choroba has already provided a good solution, I think), I want to point out a few problems which often appear in shell scripts:

for f in `find theradoc/ -d -name '*.cfc'`;

It's generally a bad idea to collect the output of find in a subshell like this, because that might be memory intensive. Looping over the result with for is also discouraged, since it word-splits on whitespace in file names.

There is also no need to employ tr to lowercase variable contents, since bash has a built-in way to do this (but it's probably one of the lesser-known features). I also don't see why you need four separate find calls. So in total I think you can replace the whole block

echo "Renaming files..."
for f in `find theradoc/ -d -name '*.cfc'`;
    do mv -v $f `echo $f | tr '[A-Z]' '[a-z]'`
done
for f in `find theradoc/ -d -name '*.cfm'`;
    do mv -v $f `echo $f | tr '[A-Z]' '[a-z]'`
done
echo "Indexing file names..."
find theradoc/ -d -name '*.cfc' > filenames.txt
find theradoc/ -d -name '*.cfm' >> filenames.txt
echo "Editing file names..."
sed 's/theradoc.*\///g' filenames.txt > tmp.txt
sed 's/\.cf.*//g' tmp.txt > files.txt
rm tmp.txt

with

echo "Renaming files..."
find theradoc/ -d -name '*.cf[cm]' | tee filenames.txt | while read f
do
    mv -v "$f" "${f,,}"
done
echo "Editing file names..."
sed 's/theradoc.*\/\(.*\)\.cf[cm]$/\1/' filenames.txt > files.txt
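As an aside, the ${f,,} expansion used above lowercases the whole parameter but requires bash 4 or newer; the stock macOS bash is 3.2, where the tr approach is still needed. A quick check:

```shell
# ${var,,} lowercases every character (bash >= 4)
f='THERADOC/Sub/MyPage.CFM'
echo "${f,,}"    # prints: theradoc/sub/mypage.cfm
```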

Of course this alone won't make a difference to your 4 hours runtime.


Here are some things that may help improve your code.

Avoid the complication of using make if practical

Rather than invoking make install, I'd instead advocate something like this:

read -p "All .cfc and .cfm files in the $1 directory and lower directories will be converted.  Do you wish to continue?" affirm
case "$affirm" in 
    y|Y ) echo "yes";;
    n|N ) echo "no -- quitting program"; exit;;
    * ) echo "Invalid response -- quitting program"; exit;;
esac

The difference is that now the user doesn't need make. Reducing dependencies results in a more portable script.

Pass in the target directory as a parameter

Rather than hard code theradoc, it might be more convenient, especially for testing, to have the directory name as a command line argument.

Use full path names for security

It's probably better to invoke /usr/bin/sed than just sed, because the latter would be easy to substitute. I could have a malicious sed on my path somewhere ahead of the real sed, but it's generally harder to overwrite a system file. Specifying a full path uses that to your (and your users') advantage.

Use find directly

Rather than creating multiple files and processing them, I think you will find that it's faster to simply use find directly rather than redirecting to a file and then processing that file line by line.

Use awk for complex replacement

While perl can do regular expression matching and everything awk can do, you're currently invoking perl once per filename per file. That is, if there are 1500 files, you're invoking perl 1500 * 1500 = 2,250,000 times. It's already been suggested to just do the whole thing in perl, which is certainly another option, although I find perl to be a "write-only" language: once I write it, six months later I find that even I can't read and understand it.

Construct just the replacement list

The only output file you really need is the one containing just the list of base file names. That can be done like this:

find "$1" -type f -iname '*.cf[cm]' -exec /usr/bin/basename {} ';' >basename.txt

Note that this uses -exec to run basename to extract just the base part of the name (e.g. /usr/bin/basename would be converted to basename).

Create a bash function to handle the replacement

I'd advocate exporting a function and then invoking find again like this:

export -f replaceAndRename
find "$1" -type f -iname '*.cf[cm]' -exec bash -c 'replaceAndRename "$0"' {} \;

The replaceAndRename function might be implemented like this with awk:

replaceAndRename () {
    lcfile="$(echo $(/usr/bin/basename "$1") | tr '[A-Z]' '[a-z]')"
    lcdir="$(/usr/bin/dirname "$1")"
    lcfile="${lcdir}/${lcfile}"
    /usr/bin/awk 'NR==FNR { map[$1]=tolower($1); next }{ 
        for (old in map) {
            gsub(old,map[old])
        }
        print
    }' basename.txt "$1" >"tmp.foo"
    retval="$?"
    if [ "$retval" -eq 0 ]; then
        rm "$1"
        mv tmp.foo "$lcfile"
    fi
}

This probably looks more complicated than it really is. The first three lines simply create a version of the name that uses a lowercase base name. Note, too, that if there is something like a subdirectory with a name like My Directory we don't want to alter the directory name -- just the file name.

Next, we invoke awk and pass in the basename.txt file created by the first invocation of find, as well as the current file name to be processed, redirecting the output to a temporary file I've arbitrarily named tmp.foo, though one could probably improve that by using mktemp instead.

The awk script reads the first file in and creates an associative array of the original version of the filename mapped to the lowercase version. Next, the second file is scanned and the map is used to make the appropriate substitutions. Finally the print within awk just prints the possibly modified line to the output which is redirected to tmp.foo.
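The NR==FNR idiom can be seen in isolation with a tiny invented map file and target file:

```shell
# NR==FNR is only true while the first file is being read: build the map, then skip.
# For the second file, every mapped name is replaced by its lowercase form.
printf '%s\n' 'Login.cfc' > basename.txt
printf '%s\n' 'include "Login.cfc";' > page.cfm
awk 'NR==FNR { map[$1] = tolower($1); next }
     { for (old in map) gsub(old, map[old]); print }' basename.txt page.cfm
# prints: include "login.cfc";
rm basename.txt page.cfm
```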

Finally, if awk seemed to run successfully, we remove the original file and then move tmp.foo into place using the lowercase version of the name. We do it in this order in case one of the files is already in lowercase but may have had file references changed.

Putting it all together

I don't have a convenient way to do benchmarking, but I believe this version of the script will improve your speed.

#!/bin/bash
read -p "All .cfc and .cfm files in the $1 directory and lower directories will be converted.  Do you wish to continue?" affirm
case "$affirm" in 
    y|Y ) echo "yes";;
    n|N ) echo "no -- quitting program"; exit;;
    * ) echo "Invalid response -- quitting program"; exit;;
esac
find "$1" -type f -iname '*.cf[cm]' -exec /usr/bin/basename {} ';' >basename.txt
replaceAndRename () {
    lcfile="$(echo $(/usr/bin/basename "$1") | tr '[A-Z]' '[a-z]')"
    lcdir="$(/usr/bin/dirname "$1")"
    lcfile="${lcdir}/${lcfile}"
    /usr/bin/awk 'NR==FNR { map[$1]=tolower($1); next }{ 
        for (old in map) {
            gsub(old,map[old])
        }
        print
    }' basename.txt "$1" >"tmp.foo"
    retval="$?"
    if [ "$retval" -eq 0 ]; then
        rm "$1"
        mv tmp.foo "$lcfile"
    fi
}
export -f replaceAndRename
find "$1" -type f -iname '*.cf[cm]' -exec bash -c 'replaceAndRename "$0"' {} \;
