New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Immutably hardlink "large" files in a sandbox #17520
Conversation
|
(FYI here was my little test) |
|
I also owe some fun perf numbers. Presumably |
|
OK, I added directory symlinking as well. Sizes are still arbitrary. One interesting note on the implementation as-is: big files will cause 2 symlinks. 1 for the directory but when the directory is being materialized, it also symlinks the file. It's not hard to fix (and gets cleaner once we can make this an implementation detail (e.g. remove Consider Anywho, thought that was worth a shoutout. Looking forward to getting some pointers on maybe micro-benchmarking the numbers, as well as the eventual cleanup after. |
|
(note to self) Force_mutable should only come from mutable_paths (for recursive mutability). We need two sets |
|
Comparing So when we aren't symlink we've introduced a smidge of overhead as checking the digest bytes is something. Otherwise the symlinks are very obviously fast. This doesn't measure the fact that the 1st run pays the cost of the readonly write, and only measures the "fast" Nth case.
|
|
To compare the 1st run (to see the increase when the directory is very first materialized): |
| let can_be_immutable = immutable_inputs.is_some() | ||
| && !mutable_path_ancestors.contains(&path) | ||
| && !force_mutable; | ||
|
|
||
| match child { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method body ended up pretty deeply nested. Maybe it would be worth breaking out a "materialize_child" method...? Unclear.
|
Oh yay, running Pants inside a Pants sandbox means we symlink the source, which then can't be globbed because they are absolute. @stuhood I'm wondering if we can make the directory check non-recursive. Would that still "fix" the node issue, loosely? Here's what I'm thinking:
I do think the "Javascript" use-case of this feature is a bit of hammering a nail with a screwdriver, and the real solution is a dedicated JS backend (where we can includelist dirs to symlink). However, I don't want to halt progress though. We could also try and deduce the relevant in-repo dirs and excludelist those. Sounds hard though, given the Process API just has a digest which already includes codegen and other plugin-provided stuff Alternatively, we can add a nuclear "just don't use immut inputs" option, and tell people they probably don't want it, but turn it on for our repo. Our repo is a bit special. |
The answer, I think, is for being able to specifically make |
|
I'd hesitate to expose what should be internal to Pants outside to the open world. Perhaps this case is more special, but I'd still hesitate. |
|
To be clear, the hook would be in Pants rule code, not necessarily available in the target definitions |
|
Oh I missed the |
|
FWIW if a |
|
OK scoped this down to just big files. Also had to includelist files that need to be re-materialized as not-symlinks so our integration tests won't be globbing absolute symlinks. |
|
Oh |
|
Hey look green CI! |
conftest.py
Outdated
| # These files are so big they exceed our "immutable symlink" hueristic. However, our integration | ||
| # tests run Pants in the sandbox, and Pants doesn't allow globbing absolute symlinks. | ||
| # Therefore we re-materialize them from their symlink. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably have this point to a ticket that discusses whether/how to expose an override for this behavior to users (only for test? in general?). I expect that we will encounter more issues like this during the alpha release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect this issue really only matters in very niche cases, but I could be wrong. Here, we're running Pants in a Pants sandbox, and that's already a bit of a stretch
| # NB: -h follows symlinks, which ensures we don't capture a symlink to an immutable input, for instance | ||
| return (self.path, f"c{compression}fh", output_filename, *input_files) + files_from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a tricky one. I would feel a little bit better about disabling the optimization in this case, because preserving symlinks (within node_modules, for example) is a huge benefit of your recent work to add symlink support. And tar'ing up a node_modules directory is likely to be common.
So... IMO, probably expose a "mutable paths" or "not symlinked" flag (as originally discussed on the ticket) on Process for this, and then the archive rules can enable/disable when they see tar (or by inspecting an abstract boolean property of the Archive base class)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh this might be painful. So for sure we should remove -h then, but getting this PR to pass without that might be challenging. I'll try it next week.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha unfortunately I can't test this locally anymore because the cached value has the symlink. CI will hold true though.
1�7)" This reverts commit f8df117. # Conflicts: # src/rust/engine/fs/store/src/tests.rs # src/rust/engine/process_execution/src/immutable_inputs.rs
Fixes #17282 and fixes #14070
This change represents the smallest footprint change to get support in for treating "large" files as immutable inputs.
immutable_inputs.rshas been moved tostore(to avoid circular reference)ImmutableInputsref and a list of paths to ensure mutableThe future is primed for changes like:
immutable_input_digeststo a process, and letting the heuristic take overProcessobject (e.g. we could includelist most/all PEXs since those shouldn't be mutated and we'd just have one top-level hardlink)Tested 3 ways:
./pants --keep-sandboxes=always <something>and inspected the sandbox between 2 different runs using the same daemon and ensured the hardlinkexperimental_shell_commandwith a file inoutputsthat matches a large file and ensured the file in the sandbox wasn't hardlinkedexperimental_shell_commandwith a dir inoutputsthat matches the containing dir of a large file and ensured the file in the sandbox wasn't hardlinked