
using compression for files cache? #5657

Open
ThomasWaldmann opened this issue Jan 27, 2021 · 6 comments

Comments

ThomasWaldmann (Member) commented Jan 27, 2021

the borg files cache can be rather large, because it keeps some information about all files that have been processed recently.

lz4 is a very fast compression / decompression algorithm, so we could try to use it to lower the in-memory footprint of the files cache entries.

before implementing this, we should check how big the savings typically are, to determine whether it is worth doing.

the files cache dictionary maps H(fullpath) --> msgpack(fileinfo).

the msgpack format already lowers the storage requirements a bit, e.g. by encoding integers with only as many bytes as needed.
it also serializes the python data structure to a byte string (not strictly needed for how we use the cache now, but a prerequisite for compressing it).
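For illustration, msgpack's size-dependent integer encoding can be sketched with a minimal encoder for unsigned integers only (this follows the msgpack format spec; borg itself uses the real msgpack library):

```python
import struct

def msgpack_uint(n):
    # minimal msgpack encoder for unsigned integers, per the msgpack spec:
    # small values get 1 byte, larger ones a type tag plus 1/2/4/8 bytes
    if n < 0x80:
        return bytes([n])                        # positive fixint: 1 byte
    if n <= 0xFF:
        return b"\xcc" + bytes([n])              # uint8: 2 bytes
    if n <= 0xFFFF:
        return b"\xcd" + struct.pack(">H", n)    # uint16: 3 bytes
    if n <= 0xFFFFFFFF:
        return b"\xce" + struct.pack(">I", n)    # uint32: 5 bytes
    return b"\xcf" + struct.pack(">Q", n)        # uint64: 9 bytes

assert len(msgpack_uint(5)) == 1
assert len(msgpack_uint(300)) == 3
assert len(msgpack_uint(2**40)) == 9
```

So a small file size or inode number costs far fewer bytes than a fixed 8-byte field would.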

with compression, it could work like H(fullpath) --> compress(msgpack(fileinfo)).
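A minimal sketch of that scheme, with stdlib stand-ins: zlib instead of lz4, a fixed struct layout instead of msgpack(fileinfo), and SHA-256 as the path hash (all names and the field layout here are hypothetical, not borg's actual code):

```python
import hashlib
import struct
import zlib  # stdlib stand-in; the proposal would use lz4 for speed

def pack_fileinfo(size, mtime_ns, inode):
    # hypothetical fixed layout standing in for msgpack(fileinfo)
    return struct.pack("<QQQ", size, mtime_ns, inode)

files_cache = {}  # H(fullpath) -> compress(packed fileinfo)

def put(fullpath, size, mtime_ns, inode):
    key = hashlib.sha256(fullpath.encode()).digest()
    files_cache[key] = zlib.compress(pack_fileinfo(size, mtime_ns, inode))

def get(fullpath):
    key = hashlib.sha256(fullpath.encode()).digest()
    blob = files_cache.get(key)
    return struct.unpack("<QQQ", zlib.decompress(blob)) if blob else None

put("/home/user/file.txt", 1234, 1611700000_000000000, 42)
assert get("/home/user/file.txt") == (1234, 1611700000_000000000, 42)
```

Lookups pay one decompress call per hit, which is why a very fast decompressor like lz4 matters here.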

but we first need some statistics about the overall size of the files cache entries with and without compression.

because msgpacking already removes some of the redundant information, it is unclear how much compressing its output can further reduce the size. also, we would need to compress the cache entries individually, so the amount of data per compression call is relatively small.
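A quick way to see the per-call overhead problem (a sketch with synthetic entries and stdlib zlib standing in for lz4):

```python
import os
import zlib  # stdlib stand-in for lz4 in this sketch

# synthetic ~40-byte cache entries: a random hash part plus a counter,
# purely example data, not borg's real entry layout
entries = [os.urandom(32) + (1000 + i).to_bytes(8, "little")
           for i in range(1000)]

raw = sum(len(e) for e in entries)
per_entry = sum(len(zlib.compress(e)) for e in entries)
whole = len(zlib.compress(b"".join(entries)))

# compressing tiny blobs pays header/checksum overhead on every call
# and can even inflate them; compressing everything in one blob is the
# best case, but not usable for per-entry random access
print(f"raw: {raw}, per-entry: {per_entry}, whole blob: {whole}")
```

That is exactly why on-disk whole-file compression numbers can only be an upper bound for what per-entry compression achieves.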

note: theoretically, we could also use other combinations of serialization algorithm and compression algorithm, if they give a better overall result (compressed size and decompression speed).

jedie (Contributor) commented Jan 28, 2021

What about splitting the full path?

ThomasWaldmann (Member, Author) commented Jan 28, 2021

Not sure what you mean...

jedie (Contributor) commented Feb 2, 2021

I mean: do you store the full path as a complete string? Then it contains a lot of redundant information that could instead be stored as a tree...
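A hypothetical sketch of that idea: a trie keyed on path components, so shared directory prefixes are stored only once (the reply below clarifies that borg does not store paths this way at all, so this is only an illustration of the suggestion):

```python
class PathTrie:
    """Store values per path; each directory name exists once per level."""

    def __init__(self):
        self.children = {}  # component -> PathTrie
        self.value = None

    def put(self, path, value):
        node = self
        for part in path.strip("/").split("/"):
            node = node.children.setdefault(part, PathTrie())
        node.value = value

    def get(self, path):
        node = self
        for part in path.strip("/").split("/"):
            node = node.children.get(part)
            if node is None:
                return None
        return node.value

t = PathTrie()
t.put("/home/user/a.txt", 1)
t.put("/home/user/b.txt", 2)  # "/home/user" prefix is shared, not duplicated
assert t.get("/home/user/b.txt") == 2
```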

ThomasWaldmann (Member, Author) commented Feb 2, 2021

No, I simplified a bit: it stores somehash(fullpath)

nadalle commented Oct 25, 2021

Just running gzip and xz on my files cache with borg 1.1.15:

files gzip: 203304634 -> 173423651 (-15%)
files xz: 203304634 -> 151031764 (-26%)

In comparison, the chunks file was much more compressible:

chunks gzip: 264346166 -> 139486190 (-47%)
chunks xz: 264346166 -> 132012032 (-50%)
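Measurements like these can be reproduced with the stdlib alone; a sketch where `zlib` at level 9 stands in for `gzip -9` and the `lzma` module for `xz` (the demo data below is a placeholder for the real cache file bytes):

```python
import lzma
import zlib

def report(name, data):
    # zlib level 9 approximates gzip -9; lzma defaults approximate xz
    for algo, out in (("gzip", zlib.compress(data, 9)),
                      ("xz", lzma.compress(data))):
        print(f"{name} {algo}: {len(data)} -> {len(out)} "
              f"({len(out) / len(data) - 1:+.0%})")

# placeholder input; pass the raw bytes of your files/chunks cache instead
data = b"/home/user/some/path\x00" * 500
report("demo", data)
```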

ThomasWaldmann (Member, Author) commented Oct 26, 2021

@nadalle what I meant in the top post:

  • the in-memory (RAM) footprint, not the on-disk file size
  • lz4 rather than gzip/xz, for good speed
  • compressing each mapping value separately, not the whole mapping

could be that the on-disk compressibility you determined is an upper bound on what per-entry compression could achieve, so it doesn't look like we should implement this.

@ThomasWaldmann ThomasWaldmann removed this from the hydrogen aka 1.2.0 milestone Jan 20, 2022
@ThomasWaldmann ThomasWaldmann added this to the 1.2.x milestone Jan 20, 2022