proposal for remotable mutable caches like the 1.x jvm backend #10870

@cosmicexplorer


Remotable Mutable Caching

I'm looking to see whether we could support an optimization to avoid snapshotting process execution output files/directories that we already know about. This would be useful if we wanted to start snapshotting mutable caches to make them remotable, without requiring any specific support in the remexec API.

Background

digest_hint

It looks like we don't set the digest_hint in PathGlobsAndRoot anywhere:

digest_hint: Optional[Digest] = None

However, it looks like we did implement that optimization in a one-off manner in several places in the 1.x jvm backend:

# Capture Snapshots for jars, using an optional adjacent digest. Create the digest afterward
# if it does not exist.
snapshots = self.context._scheduler.capture_snapshots(
    tuple(
        PathGlobsAndRoot(PathGlobs([jar]), get_buildroot(), Digest.load(jar),)
        for jar in jar_paths
    )
)
for snapshot, jar_path in zip(snapshots, jar_paths):
    snapshot.digest.dump(jar_path)
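
To make the "adjacent digest" pattern above concrete outside of the Pants codebase, here is a minimal standalone sketch using only the standard library. Everything in it is an assumption for illustration: `FileDigest`, `hint_path`, `load_hint`, `dump_hint`, `capture`, and the `.digest` suffix are hypothetical names, not the Pants `Digest` API. It also assumes files are never modified after the hint is written (the append-only assumption), since a stale hint would otherwise be trusted.

```python
import hashlib
from pathlib import Path
from typing import Optional, Tuple

# Hypothetical stand-in for a (fingerprint, size) digest; not the Pants `Digest` class.
FileDigest = Tuple[str, int]


def hint_path(path: Path) -> Path:
    # Assumed naming convention: the hint lives next to the file it describes.
    return path.with_name(path.name + ".digest")


def load_hint(path: Path) -> Optional[FileDigest]:
    """Return the stored digest for `path`, or None if no hint file exists."""
    hint = hint_path(path)
    if not hint.is_file():
        return None
    fingerprint, size = hint.read_text().strip().split(":")
    return fingerprint, int(size)


def dump_hint(path: Path, digest: FileDigest) -> None:
    """Record `digest` in the hint file adjacent to `path`."""
    hint_path(path).write_text(f"{digest[0]}:{digest[1]}")


def capture(path: Path) -> Tuple[FileDigest, bool]:
    """Return (digest, was_hashed); hashing is skipped when a hint is present."""
    hinted = load_hint(path)
    if hinted is not None:
        return hinted, False
    data = path.read_bytes()
    digest = (hashlib.sha256(data).hexdigest(), len(data))
    dump_hint(path, digest)
    return digest, True
```

Calling `capture` twice on the same file hashes its contents only the first time; the second call reads the hint file instead.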

Symlink Optimization

See this comment on the mutable caches doc (https://docs.google.com/document/d/1n_MVVGjrkTKTPKHqRPlyfFzQyx2QioclMG_Q3DMUgYk/edit?disco=AAAAIva7gMw):

I didn't realize we weren't snapshotting these caches. It seems like one way to avoid having any upstream support for this would be to implement something like this PR (#8905), which maintains symlinks to read-only materialized files.

We wouldn't need to modify the Process struct to support this. Rather, we could implement the more general utility of "being able to provide symlinks to some snapshotted file or directory in local executions", then consume that in the exact way we already do, except that we would additionally then snapshot the caches at the end of the execution. This shouldn't introduce too much overhead, as we would only end up snapshotting any new entries. We can then introduce the same caches as part of the input digest to the remote execution request, and extract them from their known relative paths at the end of the request.

Luckily, we already do exactly this (write symlinks into the process execution dir) for local mutable caches. This proposal describes an extension of that idea which allows them to be remote-friendly without upstream support from BuildBarn/etc.
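
As a rough illustration of what "write symlinks into the process execution dir" means, the following sketch exposes a cache directory inside a sandbox through a symlink. The function name and the paths used with it are hypothetical, and this is not how the engine implements it; it only shows that writes made through the link land in the shared cache, which is what makes a post-execution snapshot of the cache possible.

```python
import os
from pathlib import Path


def link_cache_into_sandbox(cache_dir: Path, sandbox: Path, relpath: str) -> Path:
    """Expose `cache_dir` inside `sandbox` at `relpath` via a symlink.

    Hypothetical helper: the process sees an ordinary directory at `relpath`,
    but anything it writes there is stored in the shared cache directory.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    link = sandbox / relpath
    link.parent.mkdir(parents=True, exist_ok=True)
    os.symlink(cache_dir, link, target_is_directory=True)
    return link
```

A process run with the sandbox as its working directory can then read and write `relpath` as usual, and the new entries remain in `cache_dir` after the sandbox is deleted.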

Proposal: Snapshot our Mutable Caches

Because we have already successfully employed this method in the 1.x jvm backend, it seems reasonable to expect it will work again, especially now that we have a specific mutable cache concept. The idea I'm thinking of is:

  • Automatically write a digest_hint file after snapshotting any directory outside of the buildroot.
    • We can do this recursively for each subdirectory, assuming as before that we are using a global append-only cache.
  • Automatically check for a digest_hint file before expanding globs into any directory outside of the buildroot.
  • After a local process execution, snapshot every mutable cache directory, using every digest_hint file to avoid expanding globs into any existing directory.
    • This would not change anything that happens before the local process execution, as we already write symlinks to mutable cache directories.
  • Make the caches remotable:
    • Before a remote process execution, merge the digests for each mutable cache directory into the input_digest field of the remote execution proto.
    • After a remote process execution:
      • Extract the mutable cache directory digests from the output_digest field.
        • If any cache digest differs from its value before the execution, recursively traverse it and materialize all the new cache entries into the corresponding mutable cache directory on the local disk.
      • Subset the output_digest field to remove the mutable cache directories.
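
The remote half of the list above (merge into the input, extract from the output, materialize new entries, subset) can be sketched with a toy model. This is only an illustration under loud assumptions: a "digest" here is a plain dict from relative path to file contents rather than a real directory digest, and `add_prefix`, `merge`, `split_cache`, and `materialize_new` are hypothetical names, not engine or remexec APIs. It also assumes the cache is append-only, so entries present before the execution never need rewriting.

```python
from pathlib import Path
from typing import Dict, List, Tuple

# Toy stand-in for a directory digest: relative path -> file contents.
Tree = Dict[str, bytes]


def add_prefix(tree: Tree, prefix: str) -> Tree:
    """Place every entry of `tree` under the relative directory `prefix`."""
    return {f"{prefix}/{path}": data for path, data in tree.items()}


def merge(*trees: Tree) -> Tree:
    """Union several trees, rejecting paths whose contents disagree."""
    merged: Tree = {}
    for tree in trees:
        for path, data in tree.items():
            if merged.get(path, data) != data:
                raise ValueError(f"conflicting contents for {path}")
            merged[path] = data
    return merged


def split_cache(output: Tree, prefix: str) -> Tuple[Tree, Tree]:
    """Split an output tree into (cache entries with prefix stripped, other outputs)."""
    marker = prefix + "/"
    cache = {p[len(marker):]: d for p, d in output.items() if p.startswith(marker)}
    rest = {p: d for p, d in output.items() if not p.startswith(marker)}
    return cache, rest


def materialize_new(cache: Tree, before: Tree, local_dir: Path) -> List[str]:
    """Write only entries absent from `before`; assumes an append-only cache."""
    written = []
    for path in sorted(cache.keys() - before.keys()):
        dest = local_dir / path
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_bytes(cache[path])
        written.append(path)
    return written
```

In this model, the request input is `merge(sources, add_prefix(cache_before, ".cache/jvm"))`; after the execution, `split_cache` separates the cache entries from the real outputs, and `materialize_new` writes only the entries the remote process added.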

Possible Extensions

I could also see the above being useful outside of that context. In particular:

  • Anywhere that pantsd needs to be restarted (e.g. in pantsd integration testing) we could write such digest hint files within source file directories to avoid having to trawl the entire pants repo again. This might reduce the time it takes to run such tests.
  • As postulated in the same comment above, it seems possible to avoid the complex and lengthy workaround surrounding the exclusive_spawn argument for local process executions by expanding the number of things we can write symlinks for:

Looking again at the exclusive_spawn argument to CommandRunner::run_in_workdir(), it seems possible that the symlink optimization for materialized files could also be employed whenever we detect that argv[0] points inside the input_digest -- we could extract the file digest from the directory proto, materialize it into a canonical location (if this wasn't already done), then write it out as a symlink into the process execution sandbox before attempting to execute the process.

  • MyPy uses non-append-only cache directories (see Improve MyPy performance #10864). We could extend the approach above to solve this by simply snapshotting the entire cache directory after each local process execution (instead of relying on digest_hint files).
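
For the argv[0] idea in the second bullet, a minimal sketch of "materialize it into a canonical location (if this wasn't already done), then write it out as a symlink" might look like the following. The names `materialize_once` and `link_executable`, the content-addressed store layout, and the use of SHA-256 as the file name are all assumptions for illustration; this is not the behavior of CommandRunner::run_in_workdir() today.

```python
import hashlib
import os
import stat
from pathlib import Path


def materialize_once(store: Path, content: bytes) -> Path:
    """Write `content` to a content-addressed path under `store`, at most once."""
    store.mkdir(parents=True, exist_ok=True)
    dest = store / hashlib.sha256(content).hexdigest()
    if not dest.exists():
        tmp = dest.with_name(dest.name + f".tmp{os.getpid()}")
        tmp.write_bytes(content)
        tmp.chmod(stat.S_IRUSR | stat.S_IXUSR)  # read-only and executable
        os.replace(tmp, dest)  # atomic rename, so readers never see a partial file
    return dest


def link_executable(store: Path, sandbox: Path, argv0: str, content: bytes) -> Path:
    """Symlink the canonical copy of the executable into `sandbox` at `argv0`."""
    link = sandbox / argv0
    link.parent.mkdir(parents=True, exist_ok=True)
    os.symlink(materialize_once(store, content), link)
    return link
```

With this arrangement the executable is written once to the store, and each sandbox only receives a symlink to it rather than a freshly written copy.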
