I consistently see answers quoting this link stating definitively "Don't parse ls!" This bothers me for a couple of reasons:
It seems the information in that link has been accepted wholesale with little question, though I can pick out at least a few errors in casual reading.
It also seems as if the problems stated in that link have sparked no desire to find a solution.
Here's the first paragraph:
The ls(1) command is pretty good at showing you the attributes of a single file (at least in some cases), but when you ask it for a list of files, there's a huge problem: Unix allows almost any character in a filename, including whitespace, newlines, commas, pipe symbols, and pretty much anything else you'd ever try to use as a delimiter except NUL. There are proposals to try and "fix" this within POSIX, but they won't help in dealing with the current situation (see also how to deal with filenames correctly). In its default mode, if standard output isn't a terminal, ls separates filenames with newlines. This is fine until you have a file with a newline in its name. And since I don't know of any implementation of ls that allows you to terminate filenames with NUL characters instead of newlines, this leaves us unable to get a list of filenames safely with ls.
Bummer, right? How ever can we handle a newline-delimited list of data that might itself contain newlines? Well, if the people answering questions on this website didn't do this kind of thing on a daily basis, I might think we were in some trouble. Now it's the next part of this article that really gets me, though:
$ ls -l
total 8
-rw-r----- 1 lhunath lhunath 19 Mar 27 10:47 a
-rw-r----- 1 lhunath lhunath 0 Mar 27 10:47 a?newline
-rw-r----- 1 lhunath lhunath 0 Mar 27 10:47 a space
The problem is that from the output of ls, neither you or the computer can tell what parts of it constitute a filename. Is it each word? No. Is it each line? No. There is no correct answer to this question other than: you can't tell. Also notice how ls sometimes garbles your filename data (in our case, it turned the \n character in between the words "a" and "newline" into a ? question mark. Some systems put a \n instead.). On some systems it doesn't do this when its output isn't a terminal, on others it always mangles the filename. All in all, you really can't and shouldn't trust the output of ls to be a true representation of the filenames that you want to work with. So don't....
If you just want to iterate over all the files in the current directory, use a for loop and a glob:
for f in *; do
[[ -e $f ]] || continue
...
done
The author calls it garbling filenames when ls returns a list of filenames containing shell globs, and then recommends using a shell glob to retrieve a file list!
Consider the following:
mkdir -p test && cd $_
printf 'touch ./"%b"\n' "file\nname" "f i l e n a m e" | . /dev/stdin
ls
###OUTPUT
f i l e n a m e file?name
IFS="
" ; for f in $(ls -1q) ; do [ -f "$f" ] && echo "./$f" ; done
###OUTPUT
./f i l e n a m e
./file
name
ls -1q | wc -l
###OUTPUT
2
unset IFS
for f in $(ls -1q | tr " " "?") ; do [ -f "$f" ] && echo "./$f" ; done
###OUTPUT
./f i l e n a m e
./file
name
POSIX defines the -1 and -q ls options so:

-q - Force each instance of non-printable filename characters and <tab>s to be written as the question-mark ('?') character. Implementations may provide this option by default if the output is to a terminal device.

-1 - (The numeric digit one.) Force output to be one entry per line.
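To see why those two options matter so much, here's a minimal sketch - my own throwaway example in an empty scratch directory, not taken from the spec - of how a -q result round-trips back to the real file through the shell's pattern matching:

mkdir -p qtest && cd $_
touch 'a
b'
ls -1q
###OUTPUT
a?b
printf %s\\n a?b
###OUTPUT
a
b

The question mark that ls substitutes is exactly the single-character glob the shell needs to find the original name again.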
Patrick suggests that this information is entirely accurate and that the author's apparent "yes glob|no glob" stance is not a contradiction. The fact is, if I have to use the shell to resolve a glob anyway, I would much rather have ls generate that globbed data using any one of its many options - such as recursive searches or specified sorts - than I would otherwise. Besides, ls is fast. And I can resolve ls globs without a loop, of course:
set -- $(ls -1q | tr " " "?")
echo "Arg count: $#"
printf %s\\n "$@"
###OUTPUT
Arg count: 2
f i l e n a m e
file
name
So Patrick has demonstrated the very real possibility that a glob will resolve to more than one file. That is a genuine reason for caution - but it is easily handled. First I'll set up a real test set very like his own:
yes | rm *
printf %b $(printf \\%04o `seq 0 127`) |
tr -dc "[:lower:]" |
sed '/\([^ ]\)\([^ ]\)/s//touch "\1\n\2"\n \ntouch "\1\t\2"\n/g' |
. /dev/stdin
ls
##OUTPUT
a?b c?d e?f g?h i?j k?l m?n o?p q?r s?t u?v w?x y?z
a?b c?d e?f g?h i?j k?l m?n o?p q?r s?t u?v w?x y?z
Each of those is a letter then a tab or a newline then a letter.
Now watch:
ls -1iq |
sed '/ .*/s///;s/^/-inum /;$!s/$/ -o /' |
tr -d '\n' |
xargs find
###OUTPUT
./a?b
./c?d
./c?d
./e?f
./e?f
...
About the shell globs, though, I'll explain - because none of the answers do, though @terdon comes very close and demonstrates the problem perfectly anyway.
So if you have this glob:
x?x
It matches all of these:
x\nx ; x\ x ; xax; xbx ; x[a-z0-9$anything]x
So if you have a glob - as ls provides - and you have two files that match it, then you have twice the results. In the same way:
set -- * *
gets you twice the directory's contents in your shell array. Of course you can handle this. I thought to do so with uniq - which works if I'm only handling identical globs. It does not work if I'm handling, say:
x?x ; xax
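To make that concrete, here is a quick sketch of the failure mode with a pair of made-up names of my own - xax is matched once by its own literal name and once more by the x?x glob:

touch 'x x' xax
set -- 'x?x' 'xax'   # pretend these two strings came from ls -q
set -- $*            # unquoted, so each word is re-expanded as a glob
echo "Arg count: $#"
printf %s\\n "$@"
###OUTPUT
Arg count: 3
x x
xax
xax

Two files, three results - and uniq on the glob strings themselves can't save you, because x?x and xax are distinct strings.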
This is the same problem we all inevitably encounter with the inherent greediness of any kind of wildcard matching system - regex or otherwise. At first I didn't fully consider this, for which you have my apologies. Still it's not hard to handle - there are two ways to go about this. You can either filter explicit results against any possible globbed result and thereby include only the globs or you can restrict the globs' matches. I did not go the former way, but, if you do, you'll need to ensure that if you have, say:
x?x ; xax
Then you include only:
x?x
This is, instead, what I have done:
eval set -- $(ls -1qrR ././ | tr ' ' '?' |
sed -e '\|^\(\.\{,1\}\)/\.\(/.*\):|{' -e \
's//\1\2/;\|/$|!s|.*|&/|;h;s/.*//;b}' -e \
'/..*/!d;G;s/\(.*\)\n\(.*\)/\2\1/' -e \
"s/'/'\\\''/g;s/.*/'&'/;s/?/'[\"?\$IFS\"]'/g" |
uniq)
I know, it looks fairly involved. But it really does follow only a few simple rules. Please allow me a moment to explain.
- In the first line I translate (tr) spaces into question marks because, as I noted in the POSIX specs above, <tab> and <newline> are already globbed with ?. Adding <space> to that list accounts for all characters in the shell's default $IFS value.
- In that line I also query ls for a recursive listing of the current directory using the form ././. This ensures that my regexed search for the current directory in the next line will succeed and not include any false positives.
- I invoke sed and in the first line specify a search for a line containing the directory name for the current results, which, when found, is placed into sed's hold space in the second line. When invoked with the options I use above, ls will predictably and reliably display its results one per line, like:

    ././path:
    next file
    file
    file

    ././path/next:
    file
    file

- Having already selected the pathname in the first sed -e statement, in the second I clean it up and place it in hold space. I remove the added ./ and append a / if one is not already there. Having no further use for the line's contents, I s/.*// remove them and branch to the script's end.
- With the third sed -e statement I first delete any line not containing one or more characters, then Get my directory name from hold space and swap \2 (the latest directory name placed into hold space) in front of \1 (the current line's file name).
- With the last sed -e statement I prepare the line against the coming eval. I globally replace any ' single quote the line contains with a backslash-escaped version, I then enclose the entire & line between two ' single quotes, and last I replace all ? question-mark glob characters with the string ["?$IFS"], which, when evaled, will glob against all <space>, <tab>, <newline>, and ? question-mark characters a file and/or pathname might contain. (A small sketch of that bracket expression on its own follows this list.)
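Here is that last substitution in isolation - a toy example of mine, not part of the script above - showing what the ["?$IFS"] bracket expression matches once eval strips the outer quotes:

touch 'a b' 'a?b'   # one name with a space, one with a literal ?
eval 'set -- a["?$IFS"]b'
echo "Arg count: $#"
printf %s\\n "$@"
###OUTPUT
Arg count: 2
a b
a?b

The [ and ] stay unquoted so the shell still sees a bracket expression, while the quoted ? and $IFS characters inside it are taken literally as its members.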
After uniq ensures we have only one glob as needed, when eval finally does receive the results it looks something like this (output from a sh -vx run of the above):
eval set -- ''\''./y'\''["?$IFS"]'\''z'\''' ''\''./w'\''["?$IFS"]'\''x'\''' ''\''./u'\''["?$IFS"]'\''v'\''' ''\''./spa'\''["?$IFS"]'\''ces'\''' ''\''./s'\''["?$IFS"]'\''t'\''' ''\''./q'\''["?$IFS"]'\''r'\''' ''\''./o'\''["?$IFS"]'\''p'\''' ''\''./m'\''["?$IFS"]'\''n'\''' ''\''./k'\''["?$IFS"]'\''l'\''' ''\''./i'\''["?$IFS"]'\''j'\''' ''\''./g'\''["?$IFS"]'\''h'\''' ''\''./e'\''["?$IFS"]'\''f'\''' ''\''./crap'\''' ''\''./c'\''["?$IFS"]'\''d'\''' ''\''./acbb'\''' ''\''./aabb'\''' ''\''./aab'\''' ''\''./a'\''["?$IFS"]'\''b'\'''
set -- './y'["?$IFS"]'z' './w'["?$IFS"]'x' './u'["?$IFS"]'v' './spa'["?$IFS"]'ces' './s'["?$IFS"]'t' './q'["?$IFS"]'r' './o'["?$IFS"]'p' './m'["?$IFS"]'n' './k'["?$IFS"]'l' './i'["?$IFS"]'j' './g'["?$IFS"]'h' './e'["?$IFS"]'f' './crap' './c'["?$IFS"]'d' './acbb' './aabb' './aab' './a'["?$IFS"]'b'
When run, its output appears thus:
./y z
./y
z
./w x
./w
x
./u v
./u
v
./spa ces
./s t
./s
t
And even:
yes | rm -rf *
printf 'touch "s o m e\ncrazy;'\''\?\\pa\tth n\\ame%d"\n' $(seq 10) |
. /dev/stdin
sh <<\CMD
eval set -- $(ls -1qrR ././ | tr ' ' '?' |
sed -e '\|^\(\.\{,1\}\)/\.\(/.*\):|{' -e \
's//\1\2/;\|/$|!s|.*|&/|;h;s/.*//;b}' -e \
'/..*/!d;G;s/\(.*\)\n\(.*\)/\2\1/' -e \
"s/'/'\\\''/g;s/.*/'&'/;s/?/'[\"?\$IFS\"]'/g" |
uniq)
printf %s\\n "ARG COUNT: $#" "$@"
CMD
###OUTPUT
ARG COUNT: 10
./s o m e
crazy;'\?\pa th n\ame9
./s o m e
crazy;'\?\pa th n\ame8
./s o m e
crazy;'\?\pa th n\ame7
./s o m e
crazy;'\?\pa th n\ame6
./s o m e
crazy;'\?\pa th n\ame5
./s o m e
crazy;'\?\pa th n\ame4
./s o m e
crazy;'\?\pa th n\ame3
./s o m e
crazy;'\?\pa th n\ame2
./s o m e
crazy;'\?\pa th n\ame10
./s o m e
crazy;'\?\pa th n\ame1
presents no difficulty. You can see that it is even following the -r reverse sort-order option given to ls in the command - you can use any of the ls arguments compatible with the -1 and -q options.
But why do this thing? Admittedly, my primary motivation was that others kept telling me I couldn't. I know very well that ls output is as regular and predictable as you could wish, so long as you know what to look for. Misinformation bothers me more than most things do.
It is also useful. I can get nearly any information that ls can provide, delivered one fully resolved filename per positional parameter. I call that useful.
The truth is, though, with the notable exception of both Patrick's and Wumpus Q. Wumbley's answers (despite the latter's awesome handle), I regard most of the information in the answers here as mostly correct - a shell glob is both simpler to use and generally more effective than parsing ls when it comes to searching the current directory. They are not, however, at least in my regard, reason enough to justify propagating the misinformation quoted in the article above, nor are they acceptable justification to "never parse ls."
And last, here's a much simpler method of parsing ls that I happen to use quite often when in need of inode numbers:
ls -1i | grep -o '^ *[0-9]*'
I expanded on the above a little:
IFS="
" ; ls -1iR ./ | sed -e '\|\.[^:]*:|{s/://;h;s/.*//;b}' \
-e '/^\( \{,4\}[0-9]\{3,\}\)/{
s//'\''\1\t/;G
s/\t\(.*\)\n\(.*\)/\t\2\/\1'\''/}' |
xargs printf %s\\n
###OUTPUT:
149989 ./zh_CN/man8/ faillog.8.gz
149990 ./zh_CN/man8/ groupadd.8.gz
149991 ./zh_CN/man8/ groupdel.8.gz
149992 ./zh_CN/man8/ groupmems.8.gz
149999 ./zh_CN/man8/ groupmod.8.gz
150000 ./zh_CN/man8/ grpck.8.gz
150001 ./zh_CN/man8/ grpconv.8.gz
150002 ./zh_CN/man8/ grpunconv.8.gz
150003 ./zh_CN/man8/ lastlog.8.gz
150004 ./zh_CN/man8/ newusers.8.gz
150005 ./zh_CN/man8/ pwck.8.gz
150006 ./zh_CN/man8/ pwconv.8.gz
150007 ./zh_CN/man8/ pwunconv.8.gz
150008 ./zh_CN/man8/ useradd.8.gz
150009 ./zh_CN/man8/ userdel.8.gz
150010 ./zh_CN/man8/ usermod.8.gz
148847 ./zh_TW/ man5
148848 ./zh_TW/ man8
150011 ./zh_TW/man5/ passwd.5.gz
150012 ./zh_TW/man8/ chpasswd.8.gz
150013 ./zh_TW/man8/ groupadd.8.gz
150014 ./zh_TW/man8/ groupdel.8.gz
150015 ./zh_TW/man8/ groupmod.8.gz
150016 ./zh_TW/man8/ useradd.8.gz
150017 ./zh_TW/man8/ userdel.8.gz
150018 ./zh_TW/man8/ usermod.8.gz
It's pretty fast, actually. I think there's still a bug or two to work out of it, but not many. There aren't going to be any duplicates in that result - it is listed by inode number, and -i is another handy POSIX-specified option.
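And if you want to act on those inode numbers rather than just read them, here's a rough sketch - not part of the listing above - that hands each one back to find, which resolves it to a real pathname no matter what bytes the name contains:

# for each inode in the current directory, let find print the real pathname;
# the ! -name . -prune keeps find from descending into subdirectories
ls -1i | grep -o '^ *[0-9]*' |
while read -r inum ; do
    find . ! -name . -prune -inum "$inum"
done

It reruns find once per file, so it only makes sense for modest directories, but it never has to trust a filename that came out of ls.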
I still don't know why not to parse ls output, except that it can be a little tricky depending on the result you want. There have been a lot of good points made, but none of them justify never using it. I don't understand that argument - though I have tried. I don't care about the points on this question - I'd happily give them away to anyone who answers it. All yours.
Comments:
- time bash -c 'for i in {1..1000}; do ls -R &>/dev/null; done' = 3.18s vs time bash -c 'for i in {1..1000}; do echo **/* >/dev/null; done' = 1.28s – Patrick May 12 at 4:05
- stat in my answer, as it actually checks that each file exists. Your bit at the bottom with the sed thing does not work. – Patrick May 12 at 4:20
- ls in the first place? What you're describing is very hard. I'll need to deconstruct it to understand all of it, and I'm a relatively competent user. You can't possibly expect your average Joe to be able to deal with something like this. – terdon May 12 at 4:40
- ls output is wrong were covered well in the original link (and in plenty of other places). This question would have been reasonable if OP were asking for help understanding it, but instead OP is simply trying to prove his incorrect usage is ok. – R.. 2 days ago
- parsing ls is bad. Doing for something in $(command) and relying on word-splitting to get accurate results is bad for the large majority of commands which don't have simple output. – BroSlow yesterday