The following suggestion was sent to me by private e-mail (not sure if I can show the sender's name):

Date: Mon, 14 Oct 2013 11:48:41 +1100
Subject: Re: Obnam repo size with .sql dump files seem too big

Lars,

We implemented a strategy for identifying repeated chunks, even in
gzip-compressed files that have changes between versions. It might
work for obnam. The downside is it creates variable-sized chunks,
though you can set upper and lower limits.

Firstly, we identify chunk boundaries by using a rolling checksum
such as Adler32, rolling over, say, a 128-byte contiguous region.
Any other efficient FIR (box-car) checksum algorithm would work.
When the bottom N bits of the checksum are zero, declare a block
boundary. This gives blocks of average size 2**N. If you want
tighter bounds on chunk size, you can enforce a lower limit of
2**(N-3) and an upper limit of 2**N or 2**(N+1), for example
(thereby either creating, or ignoring, boundaries beyond those
defined by the checksum).
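
Here is a minimal sketch of that boundary detection in Python. The
rolling checksum below is a plain box-car sum rather than a true
rolling Adler32, and the function name and default limits are only
illustrative, not from the e-mail:

    def chunk_boundaries(data, window=128, n_bits=12,
                         min_size=None, max_size=None):
        """Yield chunk boundary offsets into data (a bytes object).

        A boundary is declared when the bottom n_bits of a rolling
        box-car sum over the last window bytes are all zero, giving
        chunks of average size 2**n_bits.  The default limits follow
        the example above: 2**(n_bits-3) and 2**(n_bits+1).
        """
        if min_size is None:
            min_size = 2 ** (n_bits - 3)
        if max_size is None:
            max_size = 2 ** (n_bits + 1)
        mask = (1 << n_bits) - 1
        rolling = 0
        chunk_start = 0
        for i, byte in enumerate(data):
            rolling += byte
            if i >= window:
                rolling -= data[i - window]   # slide the window
            size = i + 1 - chunk_start
            if (((rolling & mask) == 0 and size >= min_size)
                    or size >= max_size):
                yield i + 1                   # boundary after byte i
                chunk_start = i + 1
        if chunk_start < len(data):
            yield len(data)                   # final, short chunk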

These variable-sized chunks will re-synchronise after differences
in two streams. To make it work with deflate, we flush the
compression context (the dictionary of history the compressor
maintains) on each block boundary, losing 1-2% of compression… but
I'm not sure that compression is necessary for obnam.
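
A sketch of that flush trick with Python's zlib: Z_FULL_FLUSH
discards the deflate history and byte-aligns the output at each
boundary, so a chunk's compressed form does not depend on the data
before it. The helper name is illustrative:

    import zlib

    def compress_chunks(chunks):
        """Deflate each chunk, flushing the compressor at every
        chunk boundary.  The discarded history is the 1-2% of
        compression lost, as noted above."""
        comp = zlib.compressobj(wbits=-15)   # raw deflate stream
        for chunk in chunks:
            yield comp.compress(chunk) + comp.flush(zlib.Z_FULL_FLUSH)
        yield comp.flush()                   # Z_FINISH ends the stream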

By choosing N appropriately, you get the block size you want. We
used N=12 for internet delivery (using HTTP subrange requests) of
updated files. For each file, we publish a catalog of Adler32 and
SHA-1 block checksums. Clients download that using HTTP, then
analyse their files before making requests for blocks they lack.
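
The catalog can be as simple as a list of (offset, length, weak
checksum, strong checksum) entries. Here is a sketch that reuses the
chunk_boundaries() helper from the earlier sketch (both names are
illustrative):

    import hashlib
    import zlib

    def make_catalog(data, n_bits=12):
        """Return (offset, length, adler32, sha1) for every block
        of data, using chunk_boundaries() from the sketch above."""
        catalog = []
        start = 0
        for end in chunk_boundaries(data, n_bits=n_bits):
            block = data[start:end]
            catalog.append((start, end - start,
                            zlib.adler32(block) & 0xffffffff,
                            hashlib.sha1(block).hexdigest()))
            start = end
        return catalog

A client computes the same catalog over its local copy and then
requests, via HTTP range requests, only the blocks whose strong
checksums it does not already have.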

The idea of doing this inside a deflate stream is due to Tim
Adam. When we discover we already have a given block, we can prime
the decompressor with that history before decompressing a received
update block. Really it should be implemented using 7zip instead
of deflate; a 32KB back-reference history is too small.
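
Python's zlib exposes this priming through the zdict argument to its
compressor and decompressor objects; the sketch below assumes both
ends agree on which already-held block to use as history (the helper
names are mine, not from the e-mail):

    import zlib

    def compress_with_history(new_block, known_block):
        """Let deflate back-reference into a block the receiver
        already has.  Only the most recent 32KB of known_block can
        be used, which is the size limitation complained about."""
        comp = zlib.compressobj(wbits=-15, zdict=known_block)
        return comp.compress(new_block) + comp.flush()

    def decompress_with_history(payload, known_block):
        """Prime the decompressor with the same history before
        decompressing the received update block."""
        decomp = zlib.decompressobj(wbits=-15, zdict=known_block)
        return decomp.decompress(payload) + decomp.flush()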

I haven't tried this yet. It will require a new repository format version, so I've been working on adding that support instead. --liw