XDelta compression paper
Created on 2023-06-13T23:26:57-05:00
1. History
  1a. RCS/SCCS store files as a series of changes between versions.
2. XDFS
  2a. Stores a linear, serialized version of history, not an ancestry tree like RCS/SCCS do.
  2b. Forward deltas
    2b1. Store a literal version plus changes which apply one after another.
  2c. Reverse deltas
    2c1. Store the latest version as a literal; changes are kept as deltas from the latest version back to each older historical revision.
  2d. Archive
    2d1. Cluster
      2d1a. A single literal version of a file (the whole thing).
      2d1b. A connected set of deltas from the literal to other versions.
3. Types of deltas
  3a. Copy/Insert deltas
    3a1. Allow copying from anywhere already inside the source file, with insertions where needed.
    3a2. Emit COPY instructions to pull data from the old version into the new one, and INSERT instructions to add missing data.
    3a3. More suitable for binary data.
  3b. Insert/Delete deltas
    3b1. Based on checking which lines to insert and which lines to delete in a file.
    3b2. Use longest common subsequences to compute edit scripts from one version to another.
    3b3. More suitable for human review.
4. Algorithms
  4a. Greedy
    4a1. Finds the longest available matches between files.
    4a2. Impractical when implemented naively, but gives optimal compression.
  4b. Suffix trees
    4b1. Optimal output, but prohibitive storage costs make them unsuitable for large files.
    4b2. Used by bsdiff.
  4c. Hashing
    4c1. Uses a Rabin rolling hash over a window; the hashes are stored in a table built by rolling the window over the entire target file.
    4c2. The source file is then scanned with the same windowing. When a rolling hash is found in the table, that window appears in both files and becomes somewhere you can now scan for similarities.
  4d. XDelta
    4d1. Designed to be fast, sacrificing the optimal compression of the earlier algorithms.
    4d2. Creates a hash table mapping the hash of a window of data to the position it was seen at in the buffer.
         The map stores a single value per bucket; on a collision the older value is replaced.
  4d3. Fingerprints are scanned from the end of the source file to the front, so the hash table ends up holding the earliest instance of each spotted window, since the author believes the earliest matches will be the longest.
  4d4. Once the fingerprint table is made, the input is partitioned into "pages." Checks do not move across page boundaries, which is supposed to help stream processing.
  4d5. Once a candidate match is found, try to expand both ends as far as possible (without walking over a page boundary), then emit the match as a COPY operation.
  4d6. If no match is found, produce INSERT operations.
5. Tunables
  5a. The rolling fingerprint hash is adler32.
  5b. "s," the size of a window, is 16 bytes.
  5c. "b," the bucket count of the fingerprint hash table, is some prime between floor(N/s) and 2*floor(N/s), where N is the number of bytes in the file to be scanned.
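The fingerprint-table scheme in 4c can be sketched roughly as follows. This is a toy illustration, not the paper's algorithm: a polynomial rolling hash stands in for the Rabin fingerprint, the table is a plain dict rather than a prime-sized bucket array, and the 16-byte window is borrowed from 5b.

```python
# Sketch of section 4c: fingerprint every window of one file, then roll a
# window over the other file and probe the table for candidate matches.
# The polynomial hash here is an illustrative stand-in for a Rabin fingerprint.

WINDOW = 16                          # "s": bytes per fingerprint window (5b)
BASE, MOD = 257, (1 << 61) - 1       # toy hash parameters, not from the paper
POW_W = pow(BASE, WINDOW - 1, MOD)   # precomputed so each roll is O(1)

def roll_in(h: int, byte: int) -> int:
    return (h * BASE + byte) % MOD

def roll(h: int, out_byte: int, in_byte: int) -> int:
    """Slide the window one byte: drop out_byte, append in_byte."""
    return ((h - out_byte * POW_W) * BASE + in_byte) % MOD

def fingerprint_table(data: bytes) -> dict[int, int]:
    """Hash every window of `data`, mapping hash -> offset where it was seen."""
    table: dict[int, int] = {}
    if len(data) < WINDOW:
        return table
    h = 0
    for b in data[:WINDOW]:
        h = roll_in(h, b)
    table[h] = 0
    for i in range(1, len(data) - WINDOW + 1):
        h = roll(h, data[i - 1], data[i + WINDOW - 1])
        table[h] = i
    return table

def candidate_matches(source: bytes, table: dict[int, int]):
    """Roll a window over `source`; each table hit is an offset pair where the
    same window appears in both files -- a seed to scan for similarities."""
    if len(source) < WINDOW:
        return
    h = 0
    for b in source[:WINDOW]:
        h = roll_in(h, b)
    if h in table:
        yield 0, table[h]
    for i in range(1, len(source) - WINDOW + 1):
        h = roll(h, source[i - 1], source[i + WINDOW - 1])
        if h in table:
            yield i, table[h]
```

With the table built over one file, probing the other is a single linear pass; each (source offset, table offset) pair is a place both files share a 16-byte window.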
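The XDelta encode loop of 4d can be sketched under heavy simplifications: Python's built-in `hash()` of a bytes slice stands in for the adler32 fingerprint, the single-value-per-bucket table is a dict, matches are only expanded forward (the paper expands both ends), and pages (4d4) are ignored. The `COPY`/`INSERT` tuples and the names `build_table`, `encode`, and `apply` are illustrative, not the paper's format.

```python
# Rough sketch of sections 4d2-4d6: build a fingerprint table over the old
# version back-to-front, then walk the new version emitting COPY where a
# match is found and INSERT for everything else.

S = 16  # window size, "s" from 5b

def build_table(src: bytes) -> dict[int, int]:
    """Scan windows from the end to the front (4d3), so the dict ends up
    keeping the earliest offset seen for each fingerprint."""
    table: dict[int, int] = {}
    for i in range(len(src) - S, -1, -1):
        table[hash(src[i:i + S])] = i
    return table

def encode(src: bytes, tgt: bytes) -> list[tuple]:
    """Produce a list of ("COPY", src_offset, length) / ("INSERT", bytes)
    instructions that rebuild tgt from src."""
    table = build_table(src)
    out: list[tuple] = []
    pending = bytearray()          # unmatched bytes, flushed as INSERT (4d6)
    i = 0
    while i + S <= len(tgt):
        j = table.get(hash(tgt[i:i + S]))
        # Verify the bytes really match, guarding against hash collisions.
        if j is not None and src[j:j + S] == tgt[i:i + S]:
            # Expand the match as far as the data agrees (4d5).
            n = S
            while j + n < len(src) and i + n < len(tgt) and src[j + n] == tgt[i + n]:
                n += 1
            if pending:
                out.append(("INSERT", bytes(pending)))
                pending.clear()
            out.append(("COPY", j, n))
            i += n
        else:
            pending.append(tgt[i])
            i += 1
    pending.extend(tgt[i:])        # tail shorter than one window
    if pending:
        out.append(("INSERT", bytes(pending)))
    return out

def apply(src: bytes, delta: list[tuple]) -> bytes:
    """Replay the instructions against the old version to rebuild the new one."""
    out = bytearray()
    for op in delta:
        if op[0] == "COPY":
            _, j, n = op
            out.extend(src[j:j + n])
        else:
            out.extend(op[1])
    return bytes(out)
```

A round trip `apply(src, encode(src, tgt)) == tgt` holds for any pair of inputs; how much of the delta ends up as COPY rather than INSERT is what the fingerprint table and match expansion buy.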