proposal for remotable mutable caches like the 1.x jvm backend #10870

# Remotable Mutable Caching
I'd like to explore whether we could support an optimization that avoids re-snapshotting process execution output files/directories whose digests we already know. This would be useful if we wanted to start snapshotting mutable caches to make them remotable, without requiring any specific support in the remexec API.

## Background
### `digest_hint`
It looks like we don't set the `digest_hint` in `PathGlobsAndRoot` anywhere: https://github.com/pantsbuild/pants/blob/237e6a6e8b1c41d58a38426d92dac96f878f2159/src/python/pants/engine/fs.py#L182

That optimization does appear to have been implemented, in a one-off manner, in several places in the 1.x jvm backend, e.g.: https://github.com/pantsbuild/pants/blob/31cdbc84c9f9fc086050593ae1fad68e7d4c0cae/src/python/pants/backend/jvm/tasks/resolve_shared.py#L32-L41
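
Roughly, the pattern behind that optimization is "load an adjacent digest if one exists, otherwise capture and write one". The sketch below is only an illustration of that idea, not the pants implementation: `load_hint`, `dump_hint`, `capture`, and the `.digest` suffix are hypothetical names made up for this example, and a plain SHA-256 of the file stands in for a real snapshot.

```python
import hashlib
import os
import tempfile

HINT_SUFFIX = ".digest"  # hypothetical naming for the adjacent hint file


def load_hint(path):
    """Return the digest stored next to `path`, or None if there is no hint."""
    try:
        with open(path + HINT_SUFFIX) as f:
            return f.read().strip()
    except FileNotFoundError:
        return None


def dump_hint(path, digest):
    """Write `digest` next to `path` so later captures can skip hashing."""
    with open(path + HINT_SUFFIX, "w") as f:
        f.write(digest)


def capture(path):
    """Return (digest, was_hashed); hash the file only when no hint exists."""
    hint = load_hint(path)
    if hint is not None:
        return hint, False
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    dump_hint(path, digest)
    return digest, True


with tempfile.TemporaryDirectory() as tmp:
    jar = os.path.join(tmp, "example.jar")
    with open(jar, "wb") as f:
        f.write(b"jar contents")
    first = capture(jar)   # hashes the file and writes the hint
    second = capture(jar)  # served from the hint, no hashing
```

Note that a hint like this is only trustworthy if the file never changes after the hint is written, which is why the append-only assumption matters.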

### Symlink Optimization
See this comment on the mutable caches doc (https://docs.google.com/document/d/1n_MVVGjrkTKTPKHqRPlyfFzQyx2QioclMG_Q3DMUgYk/edit?disco=AAAAIva7gMw):
> I didn't realize we weren't snapshotting these caches. It seems like one way to avoid having any upstream support for this would be to implement something like this PR (#8905), which maintains symlinks to read-only materialized files.
>
> We wouldn't need to modify the Process struct to support this. Rather, we could implement the more general utility of "being able to provide symlinks to some snapshotted file or directory in local executions", then consume that in the exact way we already do, except that we would additionally then snapshot the caches at the end of the execution. This shouldn't introduce too much overhead, as we would only end up snapshotting any new entries. We can then introduce the same caches as part of the input digest to the remote execution request, and extract them from their known relative paths at the end of the request.

Luckily, we already do exactly this (write symlinks into the process execution dir) for local mutable caches. This proposal extends that idea to make those caches remote-friendly without requiring upstream support from BuildBarn/etc.
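
To make "write symlinks into the process execution dir" concrete, here is a rough sketch of the mechanism. It is not the engine code: `link_cache_into_sandbox`, the directory layout, and the `.cache/example` path are all hypothetical and chosen only for the example.

```python
import os
import tempfile


def link_cache_into_sandbox(cache_dir, sandbox_dir, relpath):
    """Expose `cache_dir` inside `sandbox_dir` at `relpath` via a symlink."""
    link_path = os.path.join(sandbox_dir, relpath)
    os.makedirs(os.path.dirname(link_path), exist_ok=True)
    os.symlink(cache_dir, link_path, target_is_directory=True)
    return link_path


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "named_caches", "example")  # hypothetical layout
    sandbox = os.path.join(tmp, "sandbox")
    os.makedirs(cache)
    os.makedirs(sandbox)
    link = link_cache_into_sandbox(cache, sandbox, ".cache/example")
    # A write through the sandbox path lands in the shared cache directory.
    with open(os.path.join(link, "entry.txt"), "w") as f:
        f.write("cached")
    is_link = os.path.islink(link)
    visible_in_cache = os.path.exists(os.path.join(cache, "entry.txt"))
```

Because the process writes through the link, any new cache entries already sit in the shared directory when the process exits, so the only extra step this proposal needs locally is snapshotting that directory afterwards.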

## Proposal: Snapshot our Mutable Caches
Because we have already successfully employed this method in the 1.x jvm backend, it seems reasonable to expect it will work again, especially now that we have a specific mutable cache concept. The idea I'm thinking of is:
- [ ] to automatically *write* a `digest_hint` file after snapshotting any directory **outside of the buildroot**.
  - We can do this recursively for each subdirectory, assuming as before we are using a global append-only cache.
- [ ] to automatically *check for* a `digest_hint` file before expanding globs into any directory **outside of the buildroot**.
- [ ] *after* a local process execution, to snapshot every mutable cache directory, and take advantage of every `digest_hint` file to avoid expanding globs into any existing directory.
  - This would **not** change anything that happens before the local process execution, as we already write symlinks to mutable cache directories.
- [ ] make the caches remotable:
  - *Before* a remote process execution, *merge* the digests for each mutable cache directory into the `input_digest` field of the remote execution proto.
  - *After* a remote process execution:
    - *extract* the mutable cache directory digests from the `output_digest` field.
      - If any cache digest differs from the one we merged into `input_digest`, recursively traverse the digest and materialize all the new cache entries into the corresponding mutable cache directory on the local disk.
    - *subset* the `output_digest` field to remove the mutable cache directories.
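
The remote half of the list above (merge, extract, subset) could look roughly like the following. This is a sketch under heavy assumptions: a flat `{relative_path: content}` dict stands in for a real directory digest, `fake_remote_process` stands in for remote execution, and every function name and the `.cache/tool` path is hypothetical. Only `input_digest` and `output_digest` come from the proposal itself.

```python
def merge(input_files, cache_files, cache_prefix):
    """Merge cache entries into the process input under `cache_prefix`."""
    merged = dict(input_files)
    for rel, content in cache_files.items():
        merged[f"{cache_prefix}/{rel}"] = content
    return merged


def extract(output_files, cache_prefix):
    """Pull cache entries out of the output, keyed relative to the cache root."""
    prefix = cache_prefix + "/"
    return {
        path[len(prefix):]: content
        for path, content in output_files.items()
        if path.startswith(prefix)
    }


def subset(output_files, cache_prefix):
    """Drop the cache entries so that only the real process outputs remain."""
    prefix = cache_prefix + "/"
    return {
        path: content
        for path, content in output_files.items()
        if not path.startswith(prefix)
    }


def new_entries(before, after):
    """Entries present after the run but not before (append-only assumption)."""
    return {path: content for path, content in after.items() if path not in before}


def fake_remote_process(files):
    """Stand-in for remote execution: returns the cache dir plus one output."""
    out = {p: c for p, c in files.items() if p.startswith(".cache/tool/")}
    out[".cache/tool/b"] = b"new entry"  # the tool appended a cache entry
    out["dist/app.bin"] = b"binary"      # the declared process output
    return out


local_cache = {"a": b"old entry"}
# Before: merge the cache into what would be the input_digest.
request_input = merge({"src/main.py": b"print()"}, local_cache, ".cache/tool")
output = fake_remote_process(request_input)
# After: extract the cache, materialize only what is new, then subset it away.
to_materialize = new_entries(local_cache, extract(output, ".cache/tool"))
local_cache.update(to_materialize)
outputs = subset(output, ".cache/tool")
```

Comparing by path in `new_entries` leans on the append-only assumption. A cache whose existing entries can change would need to compare content digests instead.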

### Possible Extensions
I could also see the `digest_hint` and symlink mechanisms above being useful outside of mutable caches. In particular:
- Anywhere that pantsd needs to be restarted (e.g. in pantsd integration testing) we could write such digest hint files within source file directories to avoid having to trawl the entire pants repo again. This *might* reduce the time it takes to run such tests.
- As postulated in [the same comment above](https://docs.google.com/document/d/1n_MVVGjrkTKTPKHqRPlyfFzQyx2QioclMG_Q3DMUgYk/edit?disco=AAAAIva7gMw), it seems possible to avoid the complex and lengthy workaround surrounding the `exclusive_spawn` argument for local process executions by expanding the number of things we can write symlinks for:
  > Looking again at the `exclusive_spawn` argument to `CommandRunner::run_in_workdir()`, it seems possible that the symlink optimization for materialized files could also be employed whenever we detect that argv[0] points inside the input_digest -- we could extract the file digest from the directory proto, materialize it into a canonical location (if this wasn't already done), then write it out as a symlink into the process execution sandbox before attempting to execute the process.
- MyPy uses non-append-only cache directories (see #10864). We could extend the approach above to solve this by simply snapshotting the entire cache directory after each local process execution (instead of relying on `digest_hint` files).

For reference, this is the 1.x jvm backend snippet linked under `digest_hint` above (`resolve_shared.py`, L32-L41):

```python
# Capture Snapshots for jars, using an optional adjacent digest. Create the digest afterward
# if it does not exist.
snapshots = self.context._scheduler.capture_snapshots(
    tuple(
        PathGlobsAndRoot(PathGlobs([jar]), get_buildroot(), Digest.load(jar),)
        for jar in jar_paths
    )
)
for snapshot, jar_path in zip(snapshots, jar_paths):
    snapshot.digest.dump(jar_path)
```
