Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Immutably hardlink "large" files in a sandbox #17520

Merged
merged 38 commits into from Dec 6, 2022

Conversation

thejcannon
Copy link
Contributor

@thejcannon thejcannon commented Nov 9, 2022

Fixes #17282 and fixes #14070

This change represents the smallest footprint change to get support in for treating "large" files as immutable inputs.

  • immutable_inputs.rs has been moved to store (to avoid circular reference)
  • An additional method was added to support a hardlink file
  • Directory materialization takes an ImmutableInputs ref and a list of paths to ensure mutable
  • When materializing a file, if its above our threshold and not being forced mutable, we hardlink it to the immutable inputs
  • Process running seeds the mutable paths with the capture outputs

The future is primed for changes like:

  • Eventually removing the immutable_input_digests to a process, and letting the heuristic take over
  • And then cleaning the code up after that's ripped out
  • Adding more facilities to includelist/excludelist files from a Process object (e.g. we could includelist most/all PEXs since those shouldn't be mutated and we'd just have one top-level hardlink)
  • Have a directory huerstic
  • IDK more shenanigans 😄

Tested 3 ways:

  • ./pants --keep-sandboxes=always <something> and inspected the sandbox between 2 different runs using the same daemon and ensured the hardlink
  • Crafted an experimental_shell_command with a file in outputs that matches a large file and ensured the file in the sandbox wasn't hardlinked
  • Crafted an experimental_shell_command with a dir in outputs that matches the containing dir of a large file and ensured the file in the sandbox wasn't hardlinked

src/rust/engine/fs/store/src/lib.rs Outdated Show resolved Hide resolved
src/rust/engine/fs/store/src/lib.rs Outdated Show resolved Hide resolved
@thejcannon
Copy link
Contributor Author

thejcannon commented Nov 9, 2022

(FYI here was my little test)

# src/python/pants/engine/internals/BUILD
experimental_shell_command(
    name="mutable_dir",
    command='echo "hi"',
    tools=["echo"],
    dependencies=[":native_engine"],
    outputs=["src/python/pants/engine/internals/native_engine.so"],
)

experimental_shell_command(
    name="mutable_file",
    command='echo "hi"',
    tools=["echo"],
    dependencies=[":native_engine"],
    outputs=["src/python/pants/engine/internals/"],
)

experimental_run_shell_command(
    name="foo",
    command='echo "hi"',
    dependencies=[":mutable_dir", ":mutable_file"],
)

@thejcannon
Copy link
Contributor Author

thejcannon commented Nov 9, 2022

I also owe some fun perf numbers. Presumably test will be lifted, as the engine will be symlinked at the least...

@thejcannon thejcannon changed the title WIP: Immutably symlink "large" files in a sandbox Immutably symlink "large" files in a sandbox Nov 10, 2022
@thejcannon thejcannon marked this pull request as ready for review Nov 10, 2022
@thejcannon
Copy link
Contributor Author

thejcannon commented Nov 12, 2022

OK, I added directory symlinking as well. Sizes are still arbitrary.

One interesting note on the implementation as-is: big files will cause 2 symlinks. 1 for the directory but when the directory is being materialized, it also symlinks the file. It's not hard to fix (and gets cleaner once we can make this an implementation detail (e.g. remove immutable_digests), but I kinda think it might be worthwhile.

Consider requirements.pex containing torch.whl and some other deps. It's good we symlink the pex, in case it gets re-used, but ALSO we symlink torch so it's already written to it's own tempdir when a different requirements.pex containing torch.whl comes along.

Anywho, thought that was worth a shoutout. Looking forward to getting some pointers on maybe micro-benchmarking the numbers, as well as the eventual cleanup after.

@thejcannon
Copy link
Contributor Author

thejcannon commented Nov 12, 2022

(note to self)
There's a bug with dirs.

Force_mutable should only come from mutable_paths (for recursive mutability).
A second set should hold mutable_ancestors to just not immut an ancestor because some descendant is mutable.

We need two sets

@thejcannon
Copy link
Contributor Author

thejcannon commented Nov 14, 2022

Comparing main to this branch (comparing only Writeable because all external callers use Writeable, only ImmutableInputs use Readable).

materialize_directory/materialize_directory(Writable, 100, 100)                                                                            
                        time:   [8.4614 ms 9.0576 ms 9.7305 ms]
                        change: [-11.863% -0.4325% +11.966%] (p = 0.95 > 0.05)
                        No change in performance detected.
materialize_directory/materialize_directory(Writable, 20, 10000000)                                                                           
                        time:   [493.05 µs 499.85 µs 513.25 µs]
                        change: [-99.642% -99.598% -99.513%] (p = 0.00 < 0.05)
                        Performance has improved.
materialize_directory/materialize_directory(Writable, 1, 200000000)                                                                            
                        time:   [63.816 µs 65.047 µs 65.760 µs]
                        change: [-99.951% -99.944% -99.937%] (p = 0.00 < 0.05)
                        Performance has improved.
materialize_directory/materialize_directory(Writable, 10000, 100)                                                                          
                        time:   [1.1772 s 1.2574 s 1.3120 s]
                        change: [+0.8445% +7.5210% +15.315%] (p = 0.06 > 0.05)
                        No change in performance detected.

So when we aren't symlink we've introduced a smidge of overhead as checking the digest bytes is something. Otherwise the symlinks are very obviously fast. This doesn't measure the fact that the 1st run pays the cost of the readonly write, and only measures the "fast" Nth case.


main was:

materialize_directory/materialize_directory(Writable, 100, 100)                                                                            
                        time:   [8.3381 ms 10.027 ms 10.836 ms]
materialize_directory/materialize_directory(Writable, 20, 10000000)                                                                            
                        time:   [133.44 ms 140.14 ms 143.76 ms]
materialize_directory/materialize_directory(Writable, 1, 200000000)                                                                           
                        time:   [114.12 ms 132.72 ms 142.60 ms]
materialize_directory/materialize_directory(Writable, 10000, 100)                                                                          
                        time:   [1.1100 s 1.1479 s 1.1755 s]

@thejcannon
Copy link
Contributor Author

thejcannon commented Nov 14, 2022

To compare the 1st run (to see the increase when the directory is very first materialized):

materialize_directory/materialize_directory(Writable, 100, 100)                                                                            
                        time:   [9.3137 ms 10.185 ms 10.631 ms]
                        change: [-1.3595% +11.999% +28.115%] (p = 0.12 > 0.05)
                        No change in performance detected.
materialize_directory/materialize_directory(Writable, 20, 10000000)                                                                           
                        time:   [145.32 ms 147.38 ms 150.02 ms]
                        change: [-2.5926% -0.0611% +2.5870%] (p = 0.98 > 0.05)
                        No change in performance detected.
materialize_directory/materialize_directory(Writable, 1, 200000000)                                                                           
                        time:   [105.93 ms 109.19 ms 113.36 ms]
                        change: [-12.905% -2.4327% +9.7302%] (p = 0.69 > 0.05)
                        No change in performance detected.
materialize_directory/materialize_directory(Writable, 10000, 100)                                                                          
                        time:   [1.2358 s 1.3798 s 1.5742 s]
                        change: [+15.225% +32.607% +54.108%] (p = 0.00 < 0.05)
                        Performance has regressed

Copy link
Member

@stuhood stuhood left a comment

Thanks a lot!

It would be good to fix the recursive directory size re-calculations, but otherwise this looks great.

src/rust/engine/fs/store/src/lib.rs Outdated Show resolved Hide resolved
src/rust/engine/fs/store/src/lib.rs Outdated Show resolved Hide resolved
let can_be_immutable = immutable_inputs.is_some()
&& !mutable_path_ancestors.contains(&path)
&& !force_mutable;

match child {
Copy link
Member

@stuhood stuhood Nov 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method body ended up pretty deeply nested. Maybe it would be worth breaking out a "materialize_child" method...? Unclear.

src/rust/engine/process_execution/src/local.rs Outdated Show resolved Hide resolved
@thejcannon thejcannon enabled auto-merge (squash) Nov 22, 2022
@thejcannon thejcannon disabled auto-merge Nov 22, 2022
@thejcannon
Copy link
Contributor Author

thejcannon commented Nov 22, 2022

Oh yay, running Pants inside a Pants sandbox means we symlink the source, which then can't be globbed because they are absolute.

@stuhood I'm wondering if we can make the directory check non-recursive. Would that still "fix" the node issue, loosely?

Here's what I'm thinking:

  • This really shouldn't be used for in-repo sources (as mentioned above. For test, we'd have a bunch of dir symlinks we'll probably not use more than once)
  • However, there's no way to know (easily) what's in-repo source vs what's from the plugin code vs what's in-repo-but-codegenerated
  • Therefore, I'd imagine most repos have tons of files, but almost never tons-in-one-dir. So this would be a really rough approximation

I do think the "Javascript" use-case of this feature is a bit of hammering a nail with a screwdriver, and the real solution is a dedicated JS backend (where we can includelist dirs to symlink). However, I don't want to halt progress though.


We could also try and deduce the relevant in-repo dirs and excludelist those. Sounds hard though, given the Process API just has a digest which already includes codegen and other plugin-provided stuff 🤔


Alternatively, we can add a nuclear "just don't use immut inputs" option, and tell people they probably don't want it, but turn it on for our repo. Our repo is a bit special.

@chrisjrn
Copy link
Contributor

chrisjrn commented Nov 22, 2022

I do think the "Javascript" use-case of this feature is a bit of hammering a nail with a screwdriver, and the real solution is a dedicated JS backend (where we can includelist dirs to symlink). However, I don't want to halt progress though.

The answer, I think, is for being able to specifically make node_modules directories that are set as outputs from experimental_shell_commands immutable. If there were a hook that we could wire up from the implementation of experimental_shell_command e.g. specifying a list of immutable output directories from a process, that would probably work well enough for us.

@thejcannon
Copy link
Contributor Author

thejcannon commented Nov 22, 2022

I'd hesitate to expose what should be internal to Pants outside to the open world. Perhaps this case is more special, but I'd still hesitate.

@chrisjrn
Copy link
Contributor

chrisjrn commented Nov 22, 2022

To be clear, the hook would be in Pants rule code, not necessarily available in the target definitions

@thejcannon
Copy link
Contributor Author

thejcannon commented Nov 22, 2022

Oh I missed the outputs part. That seems... very tricky. We'd have to "track" which sources come from codegenerators which is what I pointed out above (and wouldn't be special for experimental_shell_command).

@thejcannon
Copy link
Contributor Author

thejcannon commented Nov 22, 2022

FWIW if a Process did the digest merging itself from one-or-more-inputs (E.g. (strawman) source_digests v generated_digests v additional_digests) we could mark the source_digests as excludelisted.

@thejcannon
Copy link
Contributor Author

thejcannon commented Nov 22, 2022

OK scoped this down to just big files.

Also had to includelist files that need to be re-materialized as not-symlinks so our integration tests won't be globbing absolute symlinks.

@thejcannon
Copy link
Contributor Author

thejcannon commented Nov 23, 2022

Oh archive is interesting. It looks like we're trying to archive an immutable symlink, which I suppose the archive just captures the link. I'm guessing we should make it symlink oblivious (which is backwards compatible), and if you want symlink aware do it yourself.

@thejcannon
Copy link
Contributor Author

thejcannon commented Nov 23, 2022

Hey look green CI!

stuhood
stuhood approved these changes Dec 2, 2022
Copy link
Member

@stuhood stuhood left a comment

Thanks!

I think that the archive thing might actually need fixing, but other than that this looks ready to land.

conftest.py Outdated
# These files are so big they exceed our "immutable symlink" hueristic. However, our integration
# tests run Pants in the sandbox, and Pants doesn't allow globbing absolute symlinks.
# Therefore we re-materialize them from their symlink.
Copy link
Member

@stuhood stuhood Dec 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably have this point to a ticket that discusses whether/how to expose an override for this behavior to users (only for test? in general?). I expect that we will encounter more issues like this during the alpha release.

Copy link
Contributor Author

@thejcannon thejcannon Dec 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect this issue really only matters in very niche cases, but I could be wrong. Here, we're running Pants in a Pants sandbox, and that's already a bit of a stretch 😛 But again, I could be wrong...

# NB: -h follows symlinks, which ensures we don't capture a symlink to an immutable input, for instance
return (self.path, f"c{compression}fh", output_filename, *input_files) + files_from
Copy link
Member

@stuhood stuhood Dec 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a tricky one. I would feel a little bit better about disabling the optimization in this case, because preserving symlinks (within node_modules, for example) is a huge benefit of your recent work to add symlink support. And tar'ing up a node_modules directory is likely to be common.

So... IMO, probably expose a "mutable paths" or "not symlinked" flag (as originally discussed on the ticket) on Process for this, and then the archive rules can enable/disable when they see tar (or by inspecting an abstract boolean property of the Archive base class)?

Copy link
Contributor Author

@thejcannon thejcannon Dec 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh this might be painful. So for sure we should remove -h then, but getting this PR to pass without that might be challenging. I'll try it next week.

Copy link
Contributor Author

@thejcannon thejcannon Dec 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha unfortunately I can't test this locally anymore because the cached value has the symlink. CI will hold true though.

src/rust/engine/fs/store/benches/store.rs Outdated Show resolved Hide resolved
src/rust/engine/fs/store/src/lib.rs Outdated Show resolved Hide resolved
@thejcannon thejcannon changed the title Immutably symlink "large" files in a sandbox Immutably hardlink "large" files in a sandbox Dec 5, 2022
stuhood
stuhood approved these changes Dec 5, 2022
Copy link
Member

@stuhood stuhood left a comment

Huzzah. Thanks!

src/python/pants/core/util_rules/archive.py Outdated Show resolved Hide resolved
@thejcannon thejcannon merged commit f8df117 into pantsbuild:main Dec 6, 2022
15 checks passed
@thejcannon thejcannon deleted the bigfiles branch Dec 6, 2022
chrisjrn added a commit to chrisjrn/pants that referenced this pull request Dec 8, 2022
 1�7)"

This reverts commit f8df117.

# Conflicts:
#	src/rust/engine/fs/store/src/tests.rs
#	src/rust/engine/process_execution/src/immutable_inputs.rs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Include/exclude and heuristic-based immutable_inputs Use immutable_inputs for PEXs
3 participants