Duplicacy: A New Generation of Cloud Backup Tool Based on Lock-Free Deduplication
Created on 2021-12-09T23:42:40-06:00
Theory
Data to be stored is first run through a chunking algorithm (rsync, Rabin fingerprint) to split it into content-aware chunks.
A hash of the chunk is used to identify the chunk.
The chunk is moved to block storage named by its own content hash.
A "file" is an orderred list of chunks needed to reassemble itself.
A "manifest" is a list of files active in that particular version of the archive.
Uploading
Deduplication is done by checking whether a chunk with the given hash already exists in the storage; if it does, the chunk is skipped rather than uploaded again.
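In code this amounts to one existence check per chunk. A hedged Go sketch follows; the Storage interface and function name are hypothetical stand-ins, not Duplicacy's real API.

```go
package backup

// Storage is a hypothetical chunk store, used only for illustration.
type Storage interface {
	Exists(chunkID string) (bool, error)
	Put(chunkID string, data []byte) error
}

// uploadChunk uploads a chunk only if no chunk with the same hash is already
// present; skipping the upload is the deduplication.
func uploadChunk(s Storage, chunkID string, data []byte) error {
	exists, err := s.Exists(chunkID)
	if err != nil {
		return err
	}
	if exists {
		return nil // identical content already stored, nothing to do
	}
	return s.Put(chunkID, data)
}
```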
Pack and split
Duplicacy treats a backup as one giant tarball. It walks the filesystem in alphabetical order and packs each file into the tarball, which is then fed to the chunking system.
The manifest of chunks needed to re-create this tarball is itself chunked, hashed, and stored.
Pack and split is done to avoid creating too many chunks on the block store; without it, every small file would become at least one tiny chunk.
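Here is a rough Go sketch of the pack-and-split idea, with a fixed chunk size standing in for the variable-size rolling-hash splitter the real tool uses; the function name is mine.

```go
package backup

import (
	"io"
	"os"
	"path/filepath"
	"sort"
)

// packAndSplit concatenates the regular files under root (in sorted path
// order) into one logical stream and cuts that stream into chunks, calling
// emit for each one. emit must copy the slice if it keeps it.
func packAndSplit(root string, chunkSize int, emit func(chunk []byte) error) error {
	var paths []string
	err := filepath.Walk(root, func(p string, info os.FileInfo, err error) error {
		if err == nil && !info.IsDir() {
			paths = append(paths, p)
		}
		return err
	})
	if err != nil {
		return err
	}
	sort.Strings(paths) // deterministic, alphabetical pack order

	var files []*os.File
	var readers []io.Reader
	for _, p := range paths {
		f, err := os.Open(p)
		if err != nil {
			return err
		}
		files = append(files, f)
		readers = append(readers, f)
	}
	defer func() {
		for _, f := range files {
			f.Close()
		}
	}()

	stream := io.MultiReader(readers...) // the "giant tarball" stand-in
	buf := make([]byte, chunkSize)
	for {
		n, err := io.ReadFull(stream, buf)
		if n > 0 {
			if emitErr := emit(buf[:n]); emitErr != nil {
				return emitErr
			}
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return nil // end of the packed stream
		}
		if err != nil {
			return err
		}
	}
}
```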
Garbage collection
Mark
Reassemble all manifest files.
Separate wheat and chaff.
Place all known chunks into the chaff set.
For every chunk referenced by a manifest (the wheat), remove it from the chaff set.
Move chunks still remaining in the chaff set somewhere else (these are now called fossils).
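The mark phase is plain set subtraction over chunk IDs. A minimal Go sketch, with illustrative names:

```go
package backup

// markFossils starts with every known chunk in the chaff set, removes every
// chunk that some manifest still references (the wheat), and returns what is
// left: the fossils.
func markFossils(allChunkIDs []string, manifests [][]string) []string {
	chaff := make(map[string]bool, len(allChunkIDs))
	for _, id := range allChunkIDs {
		chaff[id] = true
	}
	for _, manifestChunkIDs := range manifests {
		for _, id := range manifestChunkIDs {
			delete(chaff, id) // referenced, so it is wheat, not chaff
		}
	}
	fossils := make([]string, 0, len(chaff))
	for id := range chaff {
		fossils = append(fossils, id)
	}
	return fossils // to be moved aside, not yet deleted
}
```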
Sweep
Delete all chunks which are fossils.
The paper also recommends not running a sweep until every backup client has registered a manifest made after the last mark cycle. This is because a client is allowed to reference a fossil and promote it back to the object store without having to upload its contents again.
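A small Go sketch of that safety condition, assuming each client's newest registered manifest carries a timestamp; the types and field names are made up for illustration.

```go
package backup

import "time"

// ClientState records the newest manifest each backup client has registered.
type ClientState struct {
	Name             string
	LatestManifestAt time.Time
}

// safeToSweep reports whether fossils marked at markTime may be deleted:
// every client must have registered a manifest made after the mark, so each
// one has had the chance to promote any fossil it still needs back into the
// object store.
func safeToSweep(markTime time.Time, clients []ClientState) bool {
	for _, c := range clients {
		if !c.LatestManifestAt.After(markTime) {
			return false // this client might still reference a fossil
		}
	}
	return true
}
```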