XDelta compression paper
Created on 2023-06-13T23:26:57-05:00
1. History
1a. RCS/SCCS store files as a series of changes between versions.
2. XDFS
2a. Stores a linear, serialized version history, not an ancestry tree
like RCS/SCCS do.
2b. Forward deltas
2b1. Store a literal version and changes which apply after one another.
2c. Reverse deltas
2c1. Store the latest version as a literal, where changes are kept as
deltas from the latest version back to the old historical revision.
2d. Archive
2d1. Cluster
2d1a. Single literal version of a file (the whole thing)
2d1b. Connected set of deltas from the literal to a future version
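The forward/reverse trade-off in 2b-2c can be sketched as a toy, where a "delta" is just a callable producing an adjacent version (the version labels like "+e2" are made up for illustration, not anything from the paper):

```python
# Toy illustration of forward vs. reverse delta chains. A "delta" here
# is just a callable producing the adjacent version; the real encoding
# does not matter for the access pattern being shown.

def materialize(literal, chain, steps):
    """Apply the first `steps` deltas to the stored literal."""
    version = literal
    for delta in chain[:steps]:
        version = delta(version)
    return version

# Forward: the literal is the OLDEST version; reading the newest
# version costs one application per delta in the chain.
forward_literal = b"v1"
forward_chain = [lambda v: v + b"+e2", lambda v: v + b"+e3"]

# Reverse: the literal is the NEWEST version; it is free to read,
# and older versions cost one application per step back.
reverse_literal = b"v1+e2+e3"
reverse_chain = [lambda v: v[:-3], lambda v: v[:-3]]
```

The point: forward chains pay per-delta cost to read recent versions, reverse chains pay it to read old ones, and the latest version (the common case) is free under reverse deltas.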
3. Types of deltas
3a. Copy/Insert Deltas
3a1. Allows copying from anywhere already inside the source file--with
insertions where needed.
3a2. Emits COPY instructions that pull data from the old version into the
new one, and INSERT instructions to supply data the old version lacks.
3a3. More suitable for binary data.
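A minimal sketch of applying a copy/insert delta. The tuple instruction format here is made up for illustration; it is not XDelta's actual encoding:

```python
def apply_delta(old: bytes, delta: list) -> bytes:
    """Reconstruct the new version from the old one plus a delta."""
    out = bytearray()
    for instr in delta:
        if instr[0] == "COPY":          # pull bytes from the old version
            _, offset, length = instr
            out += old[offset:offset + length]
        elif instr[0] == "INSERT":      # literal bytes absent from the old version
            out += instr[1]
        else:
            raise ValueError("unknown instruction: %r" % (instr[0],))
    return bytes(out)

old = b"the quick brown fox"
delta = [("COPY", 0, 4), ("INSERT", b"slow "), ("COPY", 10, 9)]
new = apply_delta(old, delta)           # b"the slow brown fox"
```

Note that COPY offsets are byte positions, not line numbers, which is why this style handles binary data well.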
3b. Insert/Delete Deltas
3b1. Based on checking what lines to insert and what lines to delete in a
file.
3b2. Uses longest common subsequences to compute an edit script from one
version to another.
3b3. More suitable for human review.
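A sketch of producing a line-based insert/delete edit script, using Python's difflib (whose SequenceMatcher finds longest matching blocks) as a stand-in for a textbook LCS routine; the DELETE/INSERT tuple format is illustrative:

```python
import difflib

def edit_script(old_lines: list, new_lines: list) -> list:
    """Edit script as DELETE/INSERT operations; the integer offsets
    refer to line positions in the OLD version."""
    ops = []
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            ops.append(("DELETE", i1, i2))   # drop old lines i1..i2
        if tag in ("insert", "replace"):
            ops.append(("INSERT", i1, new_lines[j1:j2]))
    return ops
```

For example, edit_script(["a", "b", "c"], ["a", "x", "c"]) yields a DELETE of old line 1 and an INSERT of "x" at that position, which is exactly the kind of output a human can review.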
4. Algorithms
4a. Greedy
4a1. Find longest available matches between files.
4a2. Impractical when implemented naively, but yields optimal compression.
4b. Suffix-trees
4b1. Optimal outputs, but prohibitive storage costs make it unsuitable
for large files.
4b2. bsdiff uses the related suffix-array structure, which is cheaper
to store.
4c. Hashing
4c1. Uses a Rabin rolling hash of some window; the hashes are stored in a
table which is created by rolling the window over the entire target
file.
4c2. The source file is scanned with the same windowing. When a window's
rolling hash is found in the table, the window likely appears in both
files, giving a starting point from which to scan for a longer match.
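The table-building and lookup steps above can be sketched as follows, using a simple polynomial (Rabin-Karp style) rolling hash as a stand-in for a true Rabin fingerprint; the window size S = 4 is illustrative only:

```python
S = 4                          # window size in bytes (illustrative)
BASE = 257
MOD = (1 << 31) - 1
POW = pow(BASE, S - 1, MOD)    # weight of the outgoing byte

def window_hash(data: bytes, start: int) -> int:
    """Hash of the S-byte window beginning at `start`, from scratch."""
    h = 0
    for b in data[start:start + S]:
        h = (h * BASE + b) % MOD
    return h

def roll(h: int, out_byte: int, in_byte: int) -> int:
    """Slide the window one byte: drop out_byte, take in in_byte."""
    return ((h - out_byte * POW) * BASE + in_byte) % MOD

def build_table(target: bytes) -> dict:
    """Roll the window over the whole target, recording each hash."""
    h = window_hash(target, 0)
    table = {h: 0}
    for i in range(1, len(target) - S + 1):
        h = roll(h, target[i - 1], target[i + S - 1])
        table.setdefault(h, i)
    return table

def find_match(source: bytes, table: dict):
    """Scan the source; return (source_offset, target_offset) of the
    first window whose hash appears in the table, or None."""
    h = window_hash(source, 0)
    if h in table:
        return 0, table[h]
    for i in range(1, len(source) - S + 1):
        h = roll(h, source[i - 1], source[i + S - 1])
        if h in table:
            return i, table[h]
    return None
```

Each slide costs O(1) regardless of window size, which is what makes scanning the entire file cheap.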
4d. XDelta
4d1. Designed to be fast, sacrificing the optimal compression of the
earlier algorithms.
4d2. Creates a hash table mapping the hash of a window of data to the
position where it was seen in the buffer. The table stores a single
value per bucket, and collisions replace the previously stored value.
4d3. Fingerprints are scanned from the end of a source file to the front,
so that the hash table ends up with the earliest instances of
spotted windows, since the author believes the earliest matches will
be the longest.
4d4. Once the fingerprint table is made, the input is partitioned into
"pages." Matching does not cross page boundaries, which is supposed
to help stream processing.
4d5. Once a candidate match is found, expand both ends as far as
possible without crossing a page boundary, then emit the match as a
COPY operation.
4d6. If no match is found then produce INSERT operations.
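The loop in 4d2-4d6 can be sketched as below. This is a simplification, not the paper's implementation: paging is ignored, adler32 is recomputed per window rather than rolled incrementally, only forward extension is shown (the paper expands both ends), and the instruction tuples are illustrative:

```python
import zlib

S = 4   # window size; the paper uses s = 16

def fingerprints(src: bytes) -> dict:
    """Map window checksum -> offset. Scanning back to front and
    overwriting on collision leaves the EARLIEST offset in each slot,
    mirroring 4d3 (a dict stands in for the fixed-size bucket table)."""
    table = {}
    for i in range(len(src) - S, -1, -1):
        table[zlib.adler32(src[i:i + S])] = i
    return table

def encode(src: bytes, tgt: bytes) -> list:
    table = fingerprints(src)
    ops, i, pending = [], 0, bytearray()
    while i + S <= len(tgt):
        j = table.get(zlib.adler32(tgt[i:i + S]))
        if j is not None and src[j:j + S] == tgt[i:i + S]:
            # Candidate match: extend forward as far as possible
            # (backward extension omitted for brevity).
            length = S
            while (j + length < len(src) and i + length < len(tgt)
                   and src[j + length] == tgt[i + length]):
                length += 1
            if pending:                      # flush queued literal bytes
                ops.append(("INSERT", bytes(pending)))
                pending.clear()
            ops.append(("COPY", j, length))
            i += length
        else:
            pending.append(tgt[i])           # no match: queue a literal
            i += 1
    pending += tgt[i:]                       # trailing bytes
    if pending:
        ops.append(("INSERT", bytes(pending)))
    return ops
```

For src = b"hello world" and tgt = b"hello brave world" this emits a COPY of "hello ", an INSERT of "brave", and a COPY of " world". The bytes comparison after the table hit is what guards against checksum collisions.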
5. Tunables
5a. Rolling fingerprint hash is adler32
5b. "s," the size of a window, is 16 bytes
5c. "b," the bucket count of the fingerprint hash table, is a prime
between floor(N/s) and 2*floor(N/s), where N is the number of bytes
in the file to be scanned.
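The notes only constrain b to be some prime in [floor(N/s), 2*floor(N/s)]. One simple way to pick such a b (an assumption for illustration, not necessarily the paper's method) is the smallest prime at or above floor(N/s), which Bertrand's postulate guarantees stays inside that range:

```python
def is_prime(n: int) -> bool:
    """Trial division; fine for the table sizes involved here."""
    if n < 2:
        return False
    k = 2
    while k * k <= n:
        if n % k == 0:
            return False
        k += 1
    return True

def bucket_count(n_bytes: int, s: int = 16) -> int:
    """Smallest prime >= floor(N/s). By Bertrand's postulate this is
    below 2*floor(N/s), so it lies in the stated range."""
    b = max(2, n_bytes // s)
    while not is_prime(b):
        b += 1
    return b
```

For a 1 MiB file with s = 16, floor(N/s) = 65536 and the chosen bucket count is 65537, so on average each window of the file maps to about one bucket.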