Problem: If the chunk size is reasonably large (say, a megabyte), then most files will be smaller than a single chunk, and the repository ends up with a large number of small files.

Idea: collect chunks into groups, called "salsa tins".

  • salsa tin = list of chunks
  • salsa tin has an id
  • chunk id = salsa tin id + suitable number of extra bits for index into list
  • chunk id may be 64 bits total, or 64+32, or whatever seems convenient
  • no chunk gets stored alone, only in salsa tins
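
As a rough sketch of the id scheme in the list above, a chunk id could be built by shifting the salsa tin id left and putting the list index in the low bits. The 32-bit index width and the names `make_chunk_id` and `split_chunk_id` are assumptions made for illustration, not anything fixed by this proposal.

```python
INDEX_BITS = 32  # assumed width of the index part; the note leaves this open


def make_chunk_id(tin_id: int, index: int, index_bits: int = INDEX_BITS) -> int:
    """Combine a salsa tin id and an index into the tin's chunk list."""
    if tin_id < 0:
        raise ValueError("tin id must be non-negative")
    if not 0 <= index < (1 << index_bits):
        raise ValueError("index does not fit in the index bits")
    return (tin_id << index_bits) | index


def split_chunk_id(chunk_id: int, index_bits: int = INDEX_BITS) -> tuple:
    """Recover (tin id, index) from a chunk id made by make_chunk_id."""
    return chunk_id >> index_bits, chunk_id & ((1 << index_bits) - 1)
```

With this layout, finding a chunk means splitting its id, opening the salsa tin, and picking the chunk at the given index.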

This lets a client put things into the repository at will, without synchronisation or locking beyond what the filesystem provides (exclusive creation of files).
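
For illustration, exclusive creation of a file can be done with the `O_CREAT | O_EXCL` flags, which make the open fail if the file already exists. The helper name `create_exclusively` and the file mode are assumptions; this is a sketch of the filesystem primitive, not the repository code.

```python
import os


def create_exclusively(path: str, data: bytes) -> bool:
    """Write data to a new file; return False if the file already exists."""
    try:
        # O_EXCL together with O_CREAT makes creation fail if path exists.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return True
```

A client that gets `False` back would pick another salsa tin id and try again.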


Having multiple chunks in a single file complicates the logic for managing files in the repository and for deleting unused chunks.

Therefore, an alternative idea: instead of shoving multiple chunks into one file, allow files to use parts of chunks. Currently a file's metadata lists the chunks that hold its contents. Change this to a list of (chunk id, offset, length) triplets, where offset and length specify a part of a chunk. This way, a client can create one chunk that contains the data of many small files, and each file refers only to its own part of the chunk. Managing removal of those files is easy: the current code handles it without modification.
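
A minimal sketch of the triplet idea, assuming chunks are plain byte strings kept in a dictionary keyed by chunk id. The names `pack_small_files` and `read_file` are hypothetical and do not come from the actual code.

```python
from typing import Dict, List, Tuple

Part = Tuple[int, int, int]  # (chunk id, offset, length)


def pack_small_files(chunk_id: int, files: Dict[str, bytes]):
    """Put the data of many small files into one chunk.

    Returns the chunk contents and, per file, its list of triplets.
    """
    chunk = bytearray()
    metadata: Dict[str, List[Part]] = {}
    for name, data in files.items():
        # Each file uses only the part of the chunk that holds its data.
        metadata[name] = [(chunk_id, len(chunk), len(data))]
        chunk.extend(data)
    return bytes(chunk), metadata


def read_file(parts: List[Part], chunks: Dict[int, bytes]) -> bytes:
    """Reassemble a file's contents from its (chunk id, offset, length) list."""
    return b"".join(chunks[cid][off:off + length] for cid, off, length in parts)
```

A large file would simply have several triplets, each covering a whole chunk with offset 0.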

--liw

This is implemented in git for FORMAT GREEN ALBATROSS. done --liw