Only list text files (i.e., exclude binaries) #749

NightMachinary · 2021-03-21T11:52:00Z

I want an option to only list text files (i.e., exclude binaries).

tmccombs · 2021-03-23T19:01:00Z

How do you define a text file? A file that only contains ASCII characters? Only valid UTF-8 sequences? It has a text/* MIME type? All of those would be kind of expensive to calculate. And determining the MIME type would be tricky to do in a portable way.

NightMachinary · 2021-03-23T20:22:55Z

@tmccombs commented on Mar 23, 2021, 11:31 PM GMT+4:30:

> How do you define a text file? A file that only contains ASCII characters? Only valid UTF-8 sequences? It has a text/* MIME type? All of those would be kind of expensive to calculate. And determining the MIME type would be tricky to do in a portable way.

The heuristic commonly employed is to scan for NUL characters. It can fail (e.g., when I was storing NUL-separated strings in a text file in git), but it's good enough.
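A minimal sketch of the NUL-byte heuristic mentioned above, written in Rust. The function name `looks_binary` is hypothetical and is not taken from fd or any library; it only illustrates the idea.

```rust
/// Hypothetical helper: classify a buffer as binary if it contains a NUL byte.
/// This is a heuristic, not a definition of "text".
fn looks_binary(buf: &[u8]) -> bool {
    buf.contains(&0)
}
```

Under this check, a file of NUL-separated strings is classified as binary even though a person might consider it text, which is the failure case described above. UTF-16 encoded text would be misclassified in the same way, since it usually contains zero bytes.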

furkanusta · 2021-03-26T10:29:18Z

I needed this functionality recently to pipe the output to fzf (with previews enabled).
I used the "file" command to filter. It seems to perform multiple checks (magic numbers, encoding, and mime type I suppose)

fd -t f -x file | awk -F: '/ASCII text/ {print $1}'

Though it has a significant performance overhead

sourlemon207 · 2021-04-05T14:25:34Z

> I needed this functionality recently to pipe the output to fzf (with previews enabled).
> I used the "file" command to filter. It seems to perform multiple checks (magic numbers, encoding, and mime type I suppose)
>
> fd -t f -x file | awk -F: '/ASCII text/ {print $1}'
>
> Though it has a significant performance overhead

You might want to try rg (ripgrep) instead, for example:
rg -lU '^[\x00-\x7F]*$'
I don't know whether you can pipe a file list to rg, but this approach is many times faster.

sharkdp · 2021-08-07T21:15:24Z

> How do you define a text file? A file that only contains ASCII characters? Only valid UTF-8 sequences? It has a text/* MIME type? All of those would be kind of expensive to calculate. And determining the MIME type would be tricky to do in a portable way.

> The heuristic commonly employed is to scan for NUL characters. It can fail (e.g., when I was storing NUL-separated strings in a text file in git), but it's good enough.

I wrote a blog post on this topic once 😄. I also implemented the NUL-heuristic in a Rust library. It's extremely lightweight, so we could actually consider adding text/binary filters for --type. But I'm not 100% sold on that idea. And I haven't thought through all of the implications this would have for combinations with other (--type) filters.

sharkdp · 2021-08-08T21:20:28Z

I think this should be relatively easy to implement, if someone wants to try this.

But please note that I'm not 100% sure that we want to add this to fd. I'm inclined to say yes, though. Reasons for a rejection might be if this ends up being rather complicated to implement or if it conflicts with other fd features.

tsoutsman · 2021-08-09T12:47:52Z

Hi, I'd like to give this a crack. I'm fairly new to Rust and open source in general, so I have a few questions:

From @sharkdp's comment, I'd assume it should be implemented using content_inspector. Should the entire file be read in and checked or just the first n-number of bytes? I haven't worked with files or any programs where performance actually matters, so I genuinely have no clue about the performance implications.
Is the default for the --type option all filetypes? The default for FileTypes seems to suggest that the default is nothing, but running fd without any type parameters obviously prints out results.
The current test setup doesn't seem to support file contents. If I were to try and add this feature, how should I go about testing it?

If this is implemented using the --type option, then there will be 2 more types, binary and text (obviously). Below is my proposed behaviour:

If the file option is added, it automatically overrides the other two, and the results include both binary and text files regardless of whether or not they were enabled by the option. If only one of them is enabled, and file isn't enabled, then, obviously, it will display either binary or text based on the provided configuration. Finally, if both of them are enabled, the program behaves as if the file option was passed, regardless of whether the file option was actually given.

To implement this behaviour, I think the files field from FileTypes should be replaced with binaries and text. If files is passed to the program, it should just enable both fields.
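The proposed behaviour can be sketched as follows. This is only an illustration under assumptions: the struct below is a hypothetical, reduced stand-in for fd's `FileTypes` (it omits every other field), and `from_flags` and `accepts` are made-up names.

```rust
/// Hypothetical stand-in for fd's `FileTypes`, reduced to the two proposed fields.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct FileTypes {
    binaries: bool,
    text: bool,
}

impl FileTypes {
    /// `file` enables both fields; `binary` and `text` enable one field each.
    fn from_flags(file: bool, binary: bool, text: bool) -> Self {
        FileTypes {
            binaries: file || binary,
            text: file || text,
        }
    }

    /// Whether a regular file with the given classification passes the filter.
    fn accepts(&self, is_binary: bool) -> bool {
        if is_binary {
            self.binaries
        } else {
            self.text
        }
    }
}
```

With this mapping, passing file alone and passing binary together with text produce the same state, and adding file to either single option widens the result to all regular files.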

I'm not that well versed in Rust, and definitely not familiar with this crate, so I quite possibly have missed something or made some incorrect assumptions.

tmccombs · 2021-08-09T19:02:27Z

> Should the entire file be read in and checked or just the first n-number of bytes

Probably the first n bytes, so that it isn't really slow if there are large files.

sharkdp · 2021-08-09T20:32:56Z

> From @sharkdp's comment, I'd assume it should be implemented using content_inspector. Should the entire file be read in and checked or just the first n-number of bytes? I haven't worked with files or any programs where performance actually matters, so I genuinely have no clue about the performance implications.

I agree with @tmccombs. We should probably do something similar to diff, reading the first 1024 bytes I think. And yes, this will nevertheless slow down a search significantly.
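A rough sketch of reading only a prefix of the file, using just the Rust standard library. The names `PREFIX_LEN` and `prefix_has_nul` are hypothetical, and the plain NUL scan stands in for whatever check would actually be used; 1024 is the figure suggested above, not a verified constant from diff.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Hypothetical prefix length, following the 1024 bytes suggested above.
const PREFIX_LEN: u64 = 1024;

/// Reads at most `PREFIX_LEN` bytes from the start of `path` and reports
/// whether that prefix contains a NUL byte.
fn prefix_has_nul(path: &Path) -> io::Result<bool> {
    let mut prefix = Vec::new();
    // `take` caps the number of bytes read, so large files stay cheap.
    File::open(path)?.take(PREFIX_LEN).read_to_end(&mut prefix)?;
    Ok(prefix.contains(&0))
}
```

The trade-off is that a NUL byte located after the prefix goes unnoticed, and every candidate file still has to be opened and read once.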

> Is the default for the --type option all filetypes? The default for FileTypes seems to suggest that the default is nothing but running fd without any type parameters obviously prints out results.

Hm, yes. But note this section:

fd/src/options.rs, lines 69 to 71 in 42dce35:

    /// The type of file to search for. If set to `None`, all file types are displayed. If
    /// set to `Some(..)`, only the types that are specified are shown.
    pub file_types: Option<FileTypes>,

> The current test setup doesn't seem to support file contents. If I were to try and add this feature, how should I go about testing it?

I think we could open one of the generated test files and either write some text to it, or some "binary" content (i.e. something with \0).

> If the file option is added, it automatically overrides the other two, and the results include both binary and text files regardless of whether or not they were enabled by the option. If only one of them is enabled, and file isn't enabled, then, obviously, it will display either binary or text based on the provided configuration. Finally, if both of them are enabled, the program behaves as if the file option was passed, regardless of whether the file option was actually given.

Okay. And we definitely want to make sure to only run the binary-check if either binary or text is specified.
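One way to keep the check off the common path, sketched under assumptions: `ContentFilter`, `needs_inspection` and `accepts` are hypothetical names, and the `inspect` closure stands in for the actual content check (it returns true for binary content).

```rust
/// Hypothetical filter state: which content classes were requested.
struct ContentFilter {
    binaries: bool,
    text: bool,
}

impl ContentFilter {
    /// File contents only matter when exactly one of the two classes was requested.
    fn needs_inspection(&self) -> bool {
        self.binaries != self.text
    }

    /// Decides whether a regular file is shown. `inspect` is called only
    /// when the answer depends on the file contents.
    fn accepts(&self, inspect: impl FnOnce() -> bool) -> bool {
        if !self.needs_inspection() {
            // Both requested: show every file. Neither requested: show none.
            return self.binaries;
        }
        let is_binary = inspect();
        if is_binary {
            self.binaries
        } else {
            self.text
        }
    }
}
```

Passing the check as a closure means no file is opened unless exactly one of binary or text was requested.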

sharkdp · 2021-08-09T20:34:46Z

Before you invest too much time into this, maybe we should reconsider if we REALLY want to integrate this in fd. @NightMachinary and @furkanusta could you tell us a bit more about your use case?

tavianator · 2021-08-09T20:49:31Z

If you're okay with grep's binary file detection, you can do something like

$ fd -t f -X grep -lI .

which will print the names of all non-binary, non-empty files. I hoped that a zero-length pattern like ^ would allow it to match empty files but it stubbornly refuses. And #410 may be an issue.

furkanusta · 2021-08-10T06:37:07Z

Initially, I had

export FZF_DEFAULT_OPTS='--preview "bat --style=numbers --color=always --line-range :500 {}"'
export FZF_DEFAULT_COMMAND='fd --type f'

in my config file, and whenever I used fzf I would get some binary files in the list (not a big problem, just an inconvenience). However, as I got comfortable with both fzf and fd, I decided to skip the binary file check, because most of the time I have a rough idea of what I am searching for and filtering the files quickly matters more.
Maybe others have different use cases, but nowadays I rarely need this functionality. (Maybe, rather than adding this functionality, a way to do this manually could be added to the wiki or README.)

sharkdp · 2021-08-10T18:44:24Z

@furkanusta If you are using (a recent version of) bat as a previewer, you should actually not have a problem with binary files. bat should show something like:

[bat warning]: Binary content from file 'rg' will not be printed to the terminal (but will be present if the output of 'bat' is piped). You can use 'bat -A' to show the binary file contents.

furkanusta · 2021-08-10T21:39:09Z

I know, that is why I said it was more of an inconvenience rather than a problem. Since I am rarely interested in binary files I didn't want to see even the warning.
However, the type checking was adding a significant delay. That is why I dropped the binary check after a while and got used to the warning.

1n40 · 2021-10-05T06:03:07Z

How about we use the file command to detect ASCII files instead?

NightMachinary added the feature-request label Mar 21, 2021

sharkdp added the good first issue label Aug 8, 2021

sharkdp changed the title ~~[FR] Only list text files (i.e., exclude binaries)~~ Only list text files (i.e., exclude binaries) Aug 8, 2021

sharkdp / fd Public
