
using compression for files cache? #5657

Open
ThomasWaldmann opened this issue Jan 27, 2021 · 6 comments

Comments

ThomasWaldmann (Member) commented Jan 27, 2021

the borg files cache can be rather large, because it keeps some information about all files that have been processed recently.

lz4 is a very fast compression / decompression algorithm, so we could try to use it to lower the in-memory footprint of the files cache entries.

before implementing this, we should check how big the savings typically are, to determine whether it is worth doing.

the files cache dictionary maps H(fullpath) --> msgpack(fileinfo).

the msgpack format already lowers the storage requirements a bit, e.g. by encoding integers with only as many bytes as needed.
it also serializes the python data structure to a byte string (not strictly needed for how we use the cache now, but a prerequisite for compressing it).
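For illustration, msgpack's size-dependent integer encoding can be sketched with a minimal encoder for unsigned integers only (this follows the msgpack format spec; borg itself uses the real msgpack library):

```python
import struct

def msgpack_uint(n):
    # minimal msgpack encoder for unsigned integers, per the msgpack spec:
    # small values get 1 byte, larger ones a type tag plus 1/2/4/8 bytes
    if n < 0x80:
        return bytes([n])                        # positive fixint: 1 byte
    if n <= 0xFF:
        return b"\xcc" + bytes([n])              # uint8: 2 bytes
    if n <= 0xFFFF:
        return b"\xcd" + struct.pack(">H", n)    # uint16: 3 bytes
    if n <= 0xFFFFFFFF:
        return b"\xce" + struct.pack(">I", n)    # uint32: 5 bytes
    return b"\xcf" + struct.pack(">Q", n)        # uint64: 9 bytes

assert len(msgpack_uint(5)) == 1
assert len(msgpack_uint(300)) == 3
assert len(msgpack_uint(2**40)) == 9
```

So a small file size or inode number costs far fewer bytes than a fixed 8-byte field would.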

with compression, it could work like H(fullpath) --> compress(msgpack(fileinfo)).
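A minimal sketch of that scheme, with stdlib stand-ins: zlib instead of lz4, a fixed struct layout instead of msgpack(fileinfo), and SHA-256 as the path hash (all names and the field layout here are hypothetical, not borg's actual code):

```python
import hashlib
import struct
import zlib  # stdlib stand-in; the proposal would use lz4 for speed

def pack_fileinfo(size, mtime_ns, inode):
    # hypothetical fixed layout standing in for msgpack(fileinfo)
    return struct.pack("<QQQ", size, mtime_ns, inode)

files_cache = {}  # H(fullpath) -> compress(packed fileinfo)

def put(fullpath, size, mtime_ns, inode):
    key = hashlib.sha256(fullpath.encode()).digest()
    files_cache[key] = zlib.compress(pack_fileinfo(size, mtime_ns, inode))

def get(fullpath):
    key = hashlib.sha256(fullpath.encode()).digest()
    blob = files_cache.get(key)
    return struct.unpack("<QQQ", zlib.decompress(blob)) if blob else None

put("/home/user/file.txt", 1234, 1611700000_000000000, 42)
assert get("/home/user/file.txt") == (1234, 1611700000_000000000, 42)
```

Lookups pay one decompress call per hit, which is why a very fast decompressor like lz4 matters here.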

but we first need some statistics about the overall size of the files cache entries with and without compression.

because msgpacking already removes some of the redundant information, it is unclear how much compressing its output can further reduce the size. also, we would need to compress the cache entries individually, so the amount of data per compression call is relatively small.
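A quick way to see the per-call overhead problem (a sketch with synthetic entries and stdlib zlib standing in for lz4):

```python
import os
import zlib  # stdlib stand-in for lz4 in this sketch

# synthetic ~40-byte cache entries: a random hash part plus a counter,
# purely example data, not borg's real entry layout
entries = [os.urandom(32) + (1000 + i).to_bytes(8, "little")
           for i in range(1000)]

raw = sum(len(e) for e in entries)
per_entry = sum(len(zlib.compress(e)) for e in entries)
whole = len(zlib.compress(b"".join(entries)))

# compressing tiny blobs pays header/checksum overhead on every call
# and can even inflate them; compressing everything in one blob is the
# best case, but not usable for per-entry random access
print(f"raw: {raw}, per-entry: {per_entry}, whole blob: {whole}")
```

That is exactly why on-disk whole-file compression numbers can only be an upper bound for what per-entry compression achieves.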

note: theoretically, we could also use other combinations of serialization algorithm and compression algorithm, if they give a better overall result (compressed size and decompression speed).

jedie (Contributor) commented Jan 28, 2021

What about splitting the full path?

ThomasWaldmann (Member, Author) commented Jan 28, 2021

Not sure what you mean...

jedie (Contributor) commented Feb 2, 2021

I mean: do you store the full path as a complete string? Then it contains a lot of redundant information that could instead be stored as a tree...
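A hypothetical sketch of that idea: a trie keyed on path components, so shared directory prefixes are stored only once (the reply below clarifies that borg does not store paths this way at all, so this is only an illustration of the suggestion):

```python
class PathTrie:
    """Store values per path; each directory name exists once per level."""

    def __init__(self):
        self.children = {}  # component -> PathTrie
        self.value = None

    def put(self, path, value):
        node = self
        for part in path.strip("/").split("/"):
            node = node.children.setdefault(part, PathTrie())
        node.value = value

    def get(self, path):
        node = self
        for part in path.strip("/").split("/"):
            node = node.children.get(part)
            if node is None:
                return None
        return node.value

t = PathTrie()
t.put("/home/user/a.txt", 1)
t.put("/home/user/b.txt", 2)  # "/home/user" prefix is shared, not duplicated
assert t.get("/home/user/b.txt") == 2
```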

ThomasWaldmann (Member, Author) commented Feb 2, 2021

No, I simplified a bit: it stores somehash(fullpath)

nadalle commented Oct 25, 2021

Just running gzip and xz on my files cache with borg 1.1.15:

files gzip: 203304634 -> 173423651 (-15%)
files xz: 203304634 -> 151031764 (-26%)

In comparison, the chunks file was much more compressible:

chunks gzip: 264346166 -> 139486190 (-47%)
chunks xz: 264346166 -> 132012032 (-50%)
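Measurements like these can be reproduced with the stdlib alone; a sketch where `zlib` at level 9 stands in for `gzip -9` and the `lzma` module for `xz` (the demo data below is a placeholder for the real cache file bytes):

```python
import lzma
import zlib

def report(name, data):
    # zlib level 9 approximates gzip -9; lzma defaults approximate xz
    for algo, out in (("gzip", zlib.compress(data, 9)),
                      ("xz", lzma.compress(data))):
        print(f"{name} {algo}: {len(data)} -> {len(out)} "
              f"({len(out) / len(data) - 1:+.0%})")

# placeholder input; pass the raw bytes of your files/chunks cache instead
data = b"/home/user/some/path\x00" * 500
report("demo", data)
```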

ThomasWaldmann (Member, Author) commented Oct 26, 2021

@nadalle what I meant in the top post:

  • the in-memory (RAM) footprint, not the on-disk file size
  • lz4 rather than gzip/xz, for good speed
  • compressing each mapping value separately, not the whole mapping

could be that the on-disk compressibility you determined is an upper bound on what per-entry compression could achieve, so it doesn't look like we should implement this.

@ThomasWaldmann ThomasWaldmann removed this from the hydrogen aka 1.2.0 milestone Jan 20, 2022
@ThomasWaldmann ThomasWaldmann added this to the 1.2.x milestone Jan 20, 2022