Checksum collisions and safety

Obnam uses the MD5 checksum algorithm to recognise duplicate data chunks. MD5 has a reputation for being unsafe: people have constructed files that are different but have the same MD5 checksum. This is true.

Every checksum algorithm can have collisions. Changing Obnam to, say, SHA1, SHA2, or SHA3 would not remove the chance of collisions. It would reduce the chance of accidental collisions, but that chance is already so small with MD5 that it can be disregarded. Or, put another way, if you care about the chance of accidental MD5 collisions, you should care about accidental SHA1, SHA2, or SHA3 collisions as well.
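To get a feel for how small "so small" is, the following Python sketch computes an upper bound on the chance of an accidental collision. It is an estimate only: it assumes the checksum behaves like a uniformly random function, and the chunk count of one billion is a made-up figure for illustration. It says nothing about collisions that someone constructs on purpose.

```python
from fractions import Fraction


def collision_bound(num_chunks: int, hash_bits: int) -> Fraction:
    """Upper bound on the chance that any two of num_chunks distinct
    chunks accidentally share a checksum of hash_bits bits.

    Assumes the checksum output is uniformly random. Each pair of
    chunks then collides with probability 1 / 2**hash_bits, and the
    chance of at least one collision is at most the number of pairs
    times that value (the union bound).
    """
    pairs = num_chunks * (num_chunks - 1) // 2
    return Fraction(pairs, 2 ** hash_bits)


# Hypothetical repository size: one billion distinct chunks.
NUM_CHUNKS = 10 ** 9

md5_bound = float(collision_bound(NUM_CHUNKS, 128))   # MD5: 128 bits
sha1_bound = float(collision_bound(NUM_CHUNKS, 160))  # SHA1: 160 bits
```

With these assumed numbers the MD5 bound comes out at roughly 1.5e-21, and the SHA1 bound is smaller again by a factor of 2**32. Both are far below the chance of losing data to ordinary hardware failure.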

Apart from accidental collisions, there are two cases where you should worry about checksum collisions (regardless of algorithm).

First, if you research checksum collisions, you are likely to have files that cause checksum collisions. In that case, if you restore after a catastrophe, you probably want to get the files back intact, rather than having Obnam confuse one with the other.

Second, if you have an enemy who wishes to corrupt your backed-up data, they may replace some of the backed-up data with other data that has the same checksum. This way, when you restore, your data is corrupted without Obnam noticing.

For both of these cases, you can instruct Obnam to verify that chunks of data with the same checksum actually are the same data, instead of relying on the checksum alone. This is as safe as it can be, but it has a big performance impact. It forces Obnam to read back from the repository (possibly downloading it from your backup server) all the data you are backing up that is already stored there, so that it can compare the two. You'll still benefit from the de-duplication, however, so your repository size will be smaller.
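The difference between trusting the checksum and verifying the data can be sketched in Python as follows. This is an illustration only, not Obnam's actual code: the `ChunkStore` class, its `verify` flag, and the in-memory dictionaries standing in for the repository are all hypothetical. The `reads` counter shows where the performance cost comes from.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 checksum of data as a hex string."""
    return hashlib.md5(data).hexdigest()


class ChunkStore:
    """Toy in-memory chunk store (hypothetical, not Obnam's repository)."""

    def __init__(self, verify: bool = False, checksum=md5_hex):
        self.verify = verify       # compare actual data on a checksum match?
        self.checksum = checksum   # function mapping bytes to a checksum
        self.chunks = {}           # chunk id -> chunk data
        self.by_checksum = {}      # checksum -> list of chunk ids
        self.reads = 0             # times stored data had to be read back

    def _read(self, chunk_id: int) -> bytes:
        # In a real repository this could mean a download from a server.
        self.reads += 1
        return self.chunks[chunk_id]

    def put(self, data: bytes) -> int:
        """Store data, reusing an existing chunk where possible."""
        digest = self.checksum(data)
        for chunk_id in self.by_checksum.get(digest, []):
            if not self.verify:
                return chunk_id    # trust the checksum alone
            if self._read(chunk_id) == data:
                return chunk_id    # same checksum and same data
        # No usable duplicate found: store the data as a new chunk.
        chunk_id = len(self.chunks)
        self.chunks[chunk_id] = data
        self.by_checksum.setdefault(digest, []).append(chunk_id)
        return chunk_id
```

To see the effect without real MD5 collisions, pass a deliberately weak checksum such as `lambda data: data[:1].hex()`, which only looks at the first byte. A store created with `verify=False` then treats `b"apple"` and `b"avocado"` as the same chunk, while one created with `verify=True` keeps them apart, at the cost of reading stored data back on every checksum match.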