I consistently see answers quoting this link stating definitively "Don't parse ls!" This bothers me for a couple of reasons:
It seems the information in that link has been accepted wholesale with little question, though I can pick out at least a few errors in casual reading.
It also seems as if the problems stated in that link have sparked no desire to find a solution.
Here's the first paragraph:
The ls(1) command is pretty good at showing you the attributes of a single file (at least in some cases), but when you ask it for a list of files, there's a huge problem: Unix allows almost any character in a filename, including whitespace, newlines, commas, pipe symbols, and pretty much anything else you'd ever try to use as a delimiter except NUL. There are proposals to try and "fix" this within POSIX, but they won't help in dealing with the current situation (see also how to deal with filenames correctly). In its default mode, if standard output isn't a terminal, ls separates filenames with newlines. This is fine until you have a file with a newline in its name. And since I don't know of any implementation of ls that allows you to terminate filenames with NUL characters instead of newlines, this leaves us unable to get a list of filenames safely with ls.
Bummer, right? How ever can we handle a newline-delimited list of data that might itself contain newlines? Well, if the people answering questions on this website didn't do this kind of thing on a daily basis, I might think we were in some trouble. Now it's the next part of this article that really gets me, though:
$ ls -l
total 8
-rw-r----- 1 lhunath lhunath 19 Mar 27 10:47 a
-rw-r----- 1 lhunath lhunath 0 Mar 27 10:47 a?newline
-rw-r----- 1 lhunath lhunath 0 Mar 27 10:47 a space
The problem is that from the output of ls, neither you or the computer can tell what parts of it constitute a filename. Is it each word? No. Is it each line? No. There is no correct answer to this question other than: you can't tell. Also notice how ls sometimes garbles your filename data (in our case, it turned the \n character in between the words "a" and "newline" into a ? question mark. Some systems put a \n instead.). On some systems it doesn't do this when its output isn't a terminal, on others it always mangles the filename. All in all, you really can't and shouldn't trust the output of ls to be a true representation of the filenames that you want to work with. So don't....
If you just want to iterate over all the files in the current directory, use a for loop and a glob:
for f in *; do
[[ -e $f ]] || continue
...
done
The author calls it garbling filenames when ls returns a list of filenames containing shell globs, and then recommends using a shell glob to retrieve a file list!
Consider the following:
mkdir -p test && cd $_
printf 'touch ./"%b"\n' "file\nname" "f i l e n a m e" | . /dev/stdin
ls
###OUTPUT
f i l e n a m e file?name
IFS="
" ; for f in $(ls -1q) ; do [ -f "$f" ] && echo "./$f" ; done
###OUTPUT
./f i l e n a m e
./file
name
ls -1q | wc -l
###OUTPUT
2
unset IFS
for f in $(ls -1q | tr " " "?") ; do [ -f "$f" ] && echo "./$f" ; done
###OUTPUT
./f i l e n a m e
./file
name
POSIX defines the -1 and -q ls options so:

-q - Force each instance of non-printable filename characters and <tab>s to be written as the question-mark ('?') character. Implementations may provide this option by default if the output is to a terminal device.

-1 - (The numeric digit one.) Force output to be one entry per line.
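To see why those two options matter so much, here's a minimal sketch - my own throwaway example in an empty scratch directory, not taken from the spec - of how a -q result round-trips back to the real file through the shell's pattern matching:

mkdir -p qtest && cd $_
touch 'a
b'
ls -1q
###OUTPUT
a?b
printf %s\\n a?b
###OUTPUT
a
b

The question mark that ls substitutes is exactly the single-character glob the shell needs to find the original name again.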
Patrick suggests that this information is entirely accurate and that the author's apparent "yes glob|no glob" stance is not a contradiction. The fact is, if I have to use the shell to resolve a glob anyway, I would much rather have ls generate that globbed data using any one of its many options - such as recursive searches or specified sorts - than I would otherwise. Besides, ls is fast. And I can resolve ls globs without a loop, of course:
set -- $(ls -1q | tr " " "?")
echo "Arg count: $#"
printf %s\\n "$@"
###OUTPUT
Arg count: 2
f i l e n a m e
file
name
So Patrick has demonstrated the very real possibility that a glob will resolve to more than one file. That is a genuine reason for caution - but it is easily handled. First I'll set up a real test set very like his own:
yes | rm *
printf %b $(printf \\%04o `seq 0 127`) |
tr -dc "[:lower:]" |
sed '/\([^ ]\)\([^ ]\)/s//touch "\1\n\2"\n \ntouch "\1\t\2"\n/g' |
. /dev/stdin
ls
##OUTPUT
a?b c?d e?f g?h i?j k?l m?n o?p q?r s?t u?v w?x y?z
a?b c?d e?f g?h i?j k?l m?n o?p q?r s?t u?v w?x y?z
Each of those is a letter then a tab or a newline then a letter.
Now watch:
ls -1iq |
sed '/ .*/s///;s/^/-inum /;$!s/$/ -o /' |
tr -d '\n' |
xargs find
###OUTPUT
./a?b
./c?d
./c?d
./e?f
./e?f
...
About the shell globs, though, I'll explain - because none of the answers do, though @terdon comes very close and demonstrates the problem perfectly anyway.
So if you have this glob:
x?x
It matches all of these:
x\nx ; x\ x ; xax; xbx ; x[a-z0-9$anything]x
So if you have a glob - as ls provides - and you have two files that match it, then you have twice the results. In the same way:
set -- * *
gets you twice the directory's contents in your shell array. Of course you can handle this. I thought to do so with uniq - which works if I'm only handling identical globs. It does not work if I'm handling, say:
x?x ; xax
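To make that concrete, here is a quick sketch of the failure mode with a pair of made-up names of my own - xax is matched once by its own literal name and once more by the x?x glob:

touch 'x x' xax
set -- 'x?x' 'xax'   # pretend these two strings came from ls -q
set -- $*            # unquoted, so each word is re-expanded as a glob
echo "Arg count: $#"
printf %s\\n "$@"
###OUTPUT
Arg count: 3
x x
xax
xax

Two files, three results - and uniq on the glob strings themselves can't save you, because x?x and xax are distinct strings.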
This is the same problem we all inevitably encounter with the inherent greediness of any kind of wildcard matching system - regex or otherwise. At first I didn't fully consider this, for which you have my apologies. Still it's not hard to handle - there are two ways to go about this. You can either filter explicit results against any possible globbed result and thereby include only the globs or you can restrict the globs' matches. I did not go the former way, but, if you do, you'll need to ensure that if you have, say:
x?x ; xax
Then you include only:
x?x
This is, instead, what I have done:
eval set -- $(ls -1qrR ././ | tr ' ' '?' |
sed -e '\|^\(\.\{,1\}\)/\.\(/.*\):|{' -e \
's//\1\2/;\|/$|!s|.*|&/|;h;s/.*//;b}' -e \
'/..*/!d;G;s/\(.*\)\n\(.*\)/\2\1/' -e \
"s/'/'\\\''/g;s/.*/'&'/;s/?/'[\"?\$IFS\"]'/g" |
uniq)
I know, it looks fairly involved. But it really does follow only a few simple rules. Please allow me a moment to explain.
- In the first line I translate (tr) spaces into question marks because, as I noted in the POSIX specs above, <tab> and <newline> are already globbed with ?. Adding <space> to that list accounts for all characters in the shell's default $IFS value.
- In that line I also query ls for a recursive listing of the current directory using the form ././. This ensures that my regexed search for the current directory in the next line will succeed and not include any false positives.
- I invoke sed and in the first line specify a search for a line containing the directory name for the current results, which, when found, is placed into sed's hold space in the second line. When invoked with the options I use above, ls will predictably and reliably display its results one per line, like:

    ././path:
    next file
    file
    file

    ././path/next:
    file
    file

- Having already selected the pathname in the first sed -e statement, in the second I clean it up and place it in hold space. I remove the added ./ and append a / if one is not already there. Having no further use for the line's contents, I s/.*// remove them and branch to the script's end.
- With the third sed -e statement I first delete any line not containing one or more characters, then Get my directory name from hold space and swap \2 (the latest directory name placed into hold space) in front of \1 (the current line's file name).
- With the last sed -e statement I prepare the line against the coming eval. I globally replace any ' single quote the line contains with a backslash-escaped version, I then enclose the entire & line between two ' single quotes, and last I replace all ? question-mark glob characters with the string ["?$IFS"], which, when evaled, will glob against all <space>, <tab>, <newline>, and ? question-mark characters a file and/or pathname might contain. (A small sketch of that bracket expression on its own follows this list.)
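Here is that last substitution in isolation - a toy example of mine, not part of the script above - showing what the ["?$IFS"] bracket expression matches once eval strips the outer quotes:

touch 'a b' 'a?b'   # one name with a space, one with a literal ?
eval 'set -- a["?$IFS"]b'
echo "Arg count: $#"
printf %s\\n "$@"
###OUTPUT
Arg count: 2
a b
a?b

The [ and ] stay unquoted so the shell still sees a bracket expression, while the quoted ? and $IFS characters inside it are taken literally as its members.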
After uniq ensures we have only one glob as needed, when eval finally does receive the results it looks something like this (output from a sh -vx run of the above):
eval set -- ''\''./y'\''["?$IFS"]'\''z'\''' ''\''./w'\''["?$IFS"]'\''x'\''' ''\''./u'\''["?$IFS"]'\''v'\''' ''\''./spa'\''["?$IFS"]'\''ces'\''' ''\''./s'\''["?$IFS"]'\''t'\''' ''\''./q'\''["?$IFS"]'\''r'\''' ''\''./o'\''["?$IFS"]'\''p'\''' ''\''./m'\''["?$IFS"]'\''n'\''' ''\''./k'\''["?$IFS"]'\''l'\''' ''\''./i'\''["?$IFS"]'\''j'\''' ''\''./g'\''["?$IFS"]'\''h'\''' ''\''./e'\''["?$IFS"]'\''f'\''' ''\''./crap'\''' ''\''./c'\''["?$IFS"]'\''d'\''' ''\''./acbb'\''' ''\''./aabb'\''' ''\''./aab'\''' ''\''./a'\''["?$IFS"]'\''b'\'''
set -- './y'["?$IFS"]'z' './w'["?$IFS"]'x' './u'["?$IFS"]'v' './spa'["?$IFS"]'ces' './s'["?$IFS"]'t' './q'["?$IFS"]'r' './o'["?$IFS"]'p' './m'["?$IFS"]'n' './k'["?$IFS"]'l' './i'["?$IFS"]'j' './g'["?$IFS"]'h' './e'["?$IFS"]'f' './crap' './c'["?$IFS"]'d' './acbb' './aabb' './aab' './a'["?$IFS"]'b'
When run, its output appears thus:
./y z
./y
z
./w x
./w
x
./u v
./u
v
./spa ces
./s t
./s
t
And even:
yes | rm -rf *
printf 'touch "s o m e\ncrazy;'\''\?\\pa\tth n\\ame%d"\n' $(seq 10) |
. /dev/stdin
sh <<\CMD
eval set -- $(ls -1qrR ././ | tr ' ' '?' |
sed -e '\|^\(\.\{,1\}\)/\.\(/.*\):|{' -e \
's//\1\2/;\|/$|!s|.*|&/|;h;s/.*//;b}' -e \
'/..*/!d;G;s/\(.*\)\n\(.*\)/\2\1/' -e \
"s/'/'\\\''/g;s/.*/'&'/;s/?/'[\"?\$IFS\"]'/g" |
uniq)
printf %s\\n "ARG COUNT: $#" "$@"
CMD
###OUTPUT
ARG COUNT: 10
./s o m e
crazy;'\?\pa th n\ame9
./s o m e
crazy;'\?\pa th n\ame8
./s o m e
crazy;'\?\pa th n\ame7
./s o m e
crazy;'\?\pa th n\ame6
./s o m e
crazy;'\?\pa th n\ame5
./s o m e
crazy;'\?\pa th n\ame4
./s o m e
crazy;'\?\pa th n\ame3
./s o m e
crazy;'\?\pa th n\ame2
./s o m e
crazy;'\?\pa th n\ame10
./s o m e
crazy;'\?\pa th n\ame1
presents no difficulty. You can see that it is even following the -r reverse sort-order option given to ls in the command - you can use any of the ls arguments compatible with the -1 and -q options.
But why do this thing? Admittedly, my primary motivation was that others kept telling me I couldn't. I know very well that ls output is as regular and predictable as you could wish, so long as you know what to look for. Misinformation bothers me more than most things do.
It is also useful. I can get nearly any information that ls can provide, delivered one fully resolved filename per positional parameter. I call that useful.
The truth is, though, with the notable exception of both Patrick's and Wumpus Q. Wumbley's answers (despite the latter's awesome handle), I regard most of the information in the answers here as mostly correct - a shell glob is both simpler to use and generally more effective than parsing ls when it comes to searching the current directory. They are not, however, at least in my regard, reason enough to justify propagating the misinformation quoted in the article above, nor are they acceptable justification to "never parse ls."
And last, here's a much simpler method of parsing ls that I happen to use quite often when in need of inode numbers:
ls -1i | grep -o '^ *[0-9]*'
I expanded on the above a little:
IFS="
" ; ls -1iR ./ | sed -e '\|\.[^:]*:|{s/://;h;s/.*//;b}' \
-e '/^\( \{,4\}[0-9]\{3,\}\)/{
s//'\''\1\t/;G
s/\t\(.*\)\n\(.*\)/\t\2\/\1'\''/}' |
xargs printf %s\\n
###OUTPUT:
149989 ./zh_CN/man8/ faillog.8.gz
149990 ./zh_CN/man8/ groupadd.8.gz
149991 ./zh_CN/man8/ groupdel.8.gz
149992 ./zh_CN/man8/ groupmems.8.gz
149999 ./zh_CN/man8/ groupmod.8.gz
150000 ./zh_CN/man8/ grpck.8.gz
150001 ./zh_CN/man8/ grpconv.8.gz
150002 ./zh_CN/man8/ grpunconv.8.gz
150003 ./zh_CN/man8/ lastlog.8.gz
150004 ./zh_CN/man8/ newusers.8.gz
150005 ./zh_CN/man8/ pwck.8.gz
150006 ./zh_CN/man8/ pwconv.8.gz
150007 ./zh_CN/man8/ pwunconv.8.gz
150008 ./zh_CN/man8/ useradd.8.gz
150009 ./zh_CN/man8/ userdel.8.gz
150010 ./zh_CN/man8/ usermod.8.gz
148847 ./zh_TW/ man5
148848 ./zh_TW/ man8
150011 ./zh_TW/man5/ passwd.5.gz
150012 ./zh_TW/man8/ chpasswd.8.gz
150013 ./zh_TW/man8/ groupadd.8.gz
150014 ./zh_TW/man8/ groupdel.8.gz
150015 ./zh_TW/man8/ groupmod.8.gz
150016 ./zh_TW/man8/ useradd.8.gz
150017 ./zh_TW/man8/ userdel.8.gz
150018 ./zh_TW/man8/ usermod.8.gz
It's pretty fast, actually. I think there's still a bug or two to work out of it, but not many. There aren't going to be any duplicates in that result - it is listed by inode number, and -i is another handy POSIX-specified option.
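And if you want to act on those inode numbers rather than just read them, here's a rough sketch - not part of the listing above - that hands each one back to find, which resolves it to a real pathname no matter what bytes the name contains:

# for each inode in the current directory, let find print the real pathname;
# the ! -name . -prune keeps find from descending into subdirectories
ls -1i | grep -o '^ *[0-9]*' |
while read -r inum ; do
    find . ! -name . -prune -inum "$inum"
done

It reruns find once per file, so it only makes sense for modest directories, but it never has to trust a filename that came out of ls.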
I still don't know why not to parse ls output, except that it can be a little tricky depending on the result you want. There have been a lot of good points made, but none of them justify never using it. I don't understand that argument - though I have tried. I don't care about the points on this question - I'd happily give them away to anyone who answers it. All yours.
Comments:
- time bash -c 'for i in {1..1000}; do ls -R &>/dev/null; done' = 3.18s vs time bash -c 'for i in {1..1000}; do echo **/* >/dev/null; done' = 1.28s – Patrick May 12 at 4:05
- stat in my answer, as it actually checks that each file exists. Your bit at the bottom with the sed thing does not work. – Patrick May 12 at 4:20
- ls in the first place? What you're describing is very hard. I'll need to deconstruct it to understand all of it, and I'm a relatively competent user. You can't possibly expect your average Joe to be able to deal with something like this. – terdon May 12 at 4:40
- ls output is wrong were covered well in the original link (and in plenty of other places). This question would have been reasonable if OP were asking for help understanding it, but instead OP is simply trying to prove his incorrect usage is ok. – R.. 2 days ago
- parsing ls is bad. Doing for something in $(command) and relying on word-splitting to get accurate results is bad for the large majority of commands which don't have simple output. – BroSlow yesterday