Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only list text files (i.e., exclude binaries) #749

Open
NightMachinary opened this issue Mar 21, 2021 · 15 comments
Open

Only list text files (i.e., exclude binaries) #749

NightMachinary opened this issue Mar 21, 2021 · 15 comments

Comments

@NightMachinary
Copy link

@NightMachinary NightMachinary commented Mar 21, 2021

I want an option to only list text files (i.e., exclude binaries).

@tmccombs
Copy link
Collaborator

@tmccombs tmccombs commented Mar 23, 2021

How do you define a text file? A file that only contains ASCII characters? Only valid UTF-8 sequences? It has a text/* MIME type? All of those would be kind of expensive to calculate. And determining the MIME type would be tricky to do in a portable way.

@NightMachinary
Copy link
Author

@NightMachinary NightMachinary commented Mar 23, 2021

@tmccombs commented on Mar 23, 2021, 11:31 PM GMT+4:30:

How do you define a text file? A file that only contains ASCII characters? Only valid UTF-8 sequences? It has a text/* MIME type? All of those would be kind of expensive to calculate. And determining the MIME type would be tricky to do in a portable way.

The heuristic commonly employed is to scan for NUL characters. It can fail (e.g., when I was storing NUL-separated strings in a text file in git), but it's good enough.

@furkanusta
Copy link

@furkanusta furkanusta commented Mar 26, 2021

I needed this functionality recently to pipe to output to fzf (with previews enabled)
I used the "file" command to filter. It seems to perform multiple checks (magic numbers, encoding, and mime type I suppose)

fd -t f -x file | awk -F: '/ASCII text/ {print $1}'

Though it has a significant performance overhead

@sourlemon207
Copy link

@sourlemon207 sourlemon207 commented Apr 5, 2021

I needed this functionality recently to pipe to output to fzf (with previews enabled)
I used the "file" command to filter. It seems to perform multiple checks (magic numbers, encoding, and mime type I suppose)

fd -t f -x file | awk -F: '/ASCII text/ {print $1}'

Though it has a significant performance overhead

you might want to try using rg (ripgrep) instead such as
rg -lU '^[\x00-\x7F]*$'
I don't know if you can pipe a file list to rg but this approach is many times faster

@sharkdp
Copy link
Owner

@sharkdp sharkdp commented Aug 7, 2021

How do you define a text file? A file that only contains ASCII characters? Only valid UTF-8 sequences? It has a text/* MIME type? All of those would be kind of expensive to calculate. And determining the MIME type would be tricky to do in a portable way.

The heuristic commonly employed is to scan for NUL characters. It can fail (e.g., when I was storing NUL-separated strings in a text file in git), but it's good enough.

I wrote a blog post on this topic once 😄. I also implemented the NUL-heuristic in a Rust library. It's extremely lightweight, so we could actually consider adding text/binary filters for --type. But I'm not 100% sold on that idea. And I haven't thought through all of the implications this would have for combinations with other (--type) filters.

@sharkdp
Copy link
Owner

@sharkdp sharkdp commented Aug 8, 2021

I think this should be relatively easy to implement, if someone wants to try this.

But please note that I'm not 100% sure that we want to add this to fd. I'm inclined to say yes, though. Reasons for a rejection might be if this ends up being rather complicated to implement or if it conflicts with other fd features.

@sharkdp sharkdp changed the title [FR] Only list text files (i.e., exclude binaries) Only list text files (i.e., exclude binaries) Aug 8, 2021
@tsoutsman
Copy link
Contributor

@tsoutsman tsoutsman commented Aug 9, 2021

Hi, I'd like to give this a crack. I'm fairly new to Rust and open source in general, so I have a few questions:

  1. From @sharkdp s comment, I'd assume it should be implemented using content_inspector. Should the entire file be read in and checked or just the first n-number of bytes? I haven't worked with files or any programs where performance actually matters, so I genuinely have no clue about the performance implications.
  2. Is the default for the --type option all filetypes? The default for FileTypes seems to suggest that the default is nothing but running fd without any type parameters obviously prints out results.
  3. The current test setup doesn't seem to support file contents. If I were to try and add this feature, how should I go about testing it?

If this is implemented using the --type option, then there will be 2 more types, binary and text (obviously). Below is my proposed behaviour:

If the file option is added, it automatically overwrites the other two, and the results include both binary and text files regardless of whether or not they were enabled by the option. If only one of them is enabled, and file isn't enabled, then, obviously, it will display either binary or text based on the provided configuration. Finally, if both of them are enabled, the program behaves as if the file option was passed, regardless of whether the file option was actually given.

To implement this behaviour, I think the files field from FileTypes should be replaced with binaries and text. If files is passed to the program, it should just enable both fields.


I'm not that well versed in Rust, and definitely not familiar with this crate, so I quite possibly have missed something or made some incorrect assumptions.

@tmccombs
Copy link
Collaborator

@tmccombs tmccombs commented Aug 9, 2021

Should the entire file be read in and checked or just the first n-number of bytes

Probably the first n bytes, so that it isn't really slow if there are large files.

@sharkdp
Copy link
Owner

@sharkdp sharkdp commented Aug 9, 2021

From @sharkdp s comment, I'd assume it should be implemented using content_inspector. Should the entire file be read in and checked or just the first n-number of bytes? I haven't worked with files or any programs where performance actually matters, so I genuinely have no clue about the performance implications.

I agree with @tmccombs. We should probably do something similar to diff, reading the first 1024 bytes I think. And yes, this will nevertheless slow down a search significantly.

Is the default for the --type option all filetypes? The default for FileTypes seems to suggest that the default is nothing but running fd without any type parameters obviously prints out results.

Hm, yes. But note this section:

fd/src/options.rs

Lines 69 to 71 in 42dce35

/// The type of file to search for. If set to `None`, all file types are displayed. If
/// set to `Some(..)`, only the types that are specified are shown.
pub file_types: Option<FileTypes>,

The current test setup doesn't seem to support file contents. If I were to try and add this feature, how should I go about testing it?

I think we could open one of the generated test files and either write some text to it, or some "binary" content (i.e. something with \0).

If the file option is added, it automatically overwrites the other two, and the results include both binary and text files regardless of whether or not they were enabled by the option. If only one of them is enabled, and file isn't enabled, then, obviously, it will display either binary or text based on the provided configuration. Finally, if both of them are enabled, the program behaves as if the file option was passed, regardless of whether the file option was actually given.

Okay. And we definitely want to make sure to only run the binary-check if either binary or text is specified.

@sharkdp
Copy link
Owner

@sharkdp sharkdp commented Aug 9, 2021

Before you invest too much time into this, maybe we should reconsider if we REALLY want to integrate this in fd. @NightMachinary and @furkanusta could you tell us a bit more about your use case?

@tavianator
Copy link
Collaborator

@tavianator tavianator commented Aug 9, 2021

If you're okay with grep's binary file detection, you can do something like

$ fd -t f -X grep -lI .

which will print the names of all non-binary, non-empty files. I hoped that a zero-length pattern like ^ would allow it to match empty files but it stubbornly refuses. And #410 may be an issue.

@furkanusta
Copy link

@furkanusta furkanusta commented Aug 10, 2021

Initially, I had

export FZF_DEFAULT_OPTS='--preview "bat --style=numbers --color=always --line-range :500 {}"'
export FZF_DEFAULT_COMMAND='fd --type f'

in my config file and whenever I use fzf, I would get some binary files in the list (which were not a big problem but just an inconvenience). However, as I get comfortable with both fzf and fd I decided to skip the binary file check. Because most of the time I have a rough idea of what I am searching for and quickly filtering the files is better.
Maybe others have a different use cases, but nowadays I need this functionality rarely. (Maybe rather than adding this functionality, a way to do this manually can be added to wiki or README)

@sharkdp
Copy link
Owner

@sharkdp sharkdp commented Aug 10, 2021

@furkanusta If you are using (a recent version of) bat as a previewer, you should actually not have a problem with binary files. bat should show something like:

[bat warning]: Binary content from file 'rg' will not be printed to the terminal (but will be present if the output of 'bat' is piped). You can use 'bat -A' to show the binary file contents.

@furkanusta
Copy link

@furkanusta furkanusta commented Aug 10, 2021

I know, that is why I said it was more of an inconvenience rather than a problem. Since I am rarely interested in binary files I didn't want to see even the warning.
However, the delay from the type checking was creating a significant delay. That is why I stopped that binary check after a while and got used to the warning.

@1n40
Copy link

@1n40 1n40 commented Oct 5, 2021

How about we use file command to detect ascii files instead?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
8 participants