XDelta compression paper
Created on 2023-06-13T23:26:57-05:00
1. History
1a. RCS/SCCS store files as a series of changes between versions.
2. XDFS
2a. Stores a linear, serialized version history, not an ancestry tree
like RCS/SCCS do.
2b. Forward deltas
2b1. Store a literal version and changes which apply after one another.
2c. Reverse deltas
2c1. Store the latest version as a literal, where changes are kept as
deltas from the latest version back to the old historical revision.
2d. Archive
2d1. Cluster
2d1a. Single literal version of a file (the whole thing)
2d1b. Connected set of deltas from the literal to a future version
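The forward/reverse trade-off in 2b-2c can be sketched as a toy, where a "delta" is just a callable producing an adjacent version (the version labels like "+e2" are made up for illustration, not anything from the paper):

```python
# Toy illustration of forward vs. reverse delta chains. A "delta" here
# is just a callable producing the adjacent version; the real encoding
# does not matter for the access pattern being shown.

def materialize(literal, chain, steps):
    """Apply the first `steps` deltas to the stored literal."""
    version = literal
    for delta in chain[:steps]:
        version = delta(version)
    return version

# Forward: the literal is the OLDEST version; reading the newest
# version costs one application per delta in the chain.
forward_literal = b"v1"
forward_chain = [lambda v: v + b"+e2", lambda v: v + b"+e3"]

# Reverse: the literal is the NEWEST version; it is free to read,
# and older versions cost one application per step back.
reverse_literal = b"v1+e2+e3"
reverse_chain = [lambda v: v[:-3], lambda v: v[:-3]]
```

The point: forward chains pay per-delta cost to read recent versions, reverse chains pay it to read old ones, and the latest version (the common case) is free under reverse deltas.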
3. Types of deltas
3a. Copy/Insert Deltas
3a1. Allows copying from anywhere already inside the source file--with
insertions where needed.
3a2. Emits COPY instructions that pull data from the old version into the
new one, and INSERT instructions to supply data the old version lacks.
3a3. More suitable for binary data.
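A minimal sketch of applying a copy/insert delta. The tuple instruction format here is made up for illustration; it is not XDelta's actual encoding:

```python
def apply_delta(old: bytes, delta: list) -> bytes:
    """Reconstruct the new version from the old one plus a delta."""
    out = bytearray()
    for instr in delta:
        if instr[0] == "COPY":          # pull bytes from the old version
            _, offset, length = instr
            out += old[offset:offset + length]
        elif instr[0] == "INSERT":      # literal bytes absent from the old version
            out += instr[1]
        else:
            raise ValueError("unknown instruction: %r" % (instr[0],))
    return bytes(out)

old = b"the quick brown fox"
delta = [("COPY", 0, 4), ("INSERT", b"slow "), ("COPY", 10, 9)]
new = apply_delta(old, delta)           # b"the slow brown fox"
```

Note that COPY offsets are byte positions, not line numbers, which is why this style handles binary data well.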
3b. Insert/Delete Deltas
3b1. Based on checking what lines to insert and what lines to delete in a
file.
3b2. Uses longest common subsequences to compute an edit script from one
version to another.
3b3. More suitable for human review.
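A sketch of producing a line-based insert/delete edit script, using Python's difflib (whose SequenceMatcher finds longest matching blocks) as a stand-in for a textbook LCS routine; the DELETE/INSERT tuple format is illustrative:

```python
import difflib

def edit_script(old_lines: list, new_lines: list) -> list:
    """Edit script as DELETE/INSERT operations; the integer offsets
    refer to line positions in the OLD version."""
    ops = []
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            ops.append(("DELETE", i1, i2))   # drop old lines i1..i2
        if tag in ("insert", "replace"):
            ops.append(("INSERT", i1, new_lines[j1:j2]))
    return ops
```

For example, edit_script(["a", "b", "c"], ["a", "x", "c"]) yields a DELETE of old line 1 and an INSERT of "x" at that position, which is exactly the kind of output a human can review.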
4. Algorithms
4a. Greedy
4a1. Find longest available matches between files.
4a2. Impractical when implemented naively, but yields optimal compression.
4b. Suffix-trees
4b1. Optimal outputs, but prohibitive storage costs make it unsuitable
for large files.
4b2. bsdiff uses the related suffix-array structure, which is cheaper
to store.
4c. Hashing
4c1. Uses a Rabin rolling hash of some window; the hashes are stored in a
table which is created by rolling the window over the entire target
file.
4c2. The source file is scanned with the same windowing. When a window's
rolling hash is found in the table, the window likely appears in both
files, giving a starting point from which to scan for a longer match.
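The table-building and lookup steps above can be sketched as follows, using a simple polynomial (Rabin-Karp style) rolling hash as a stand-in for a true Rabin fingerprint; the window size S = 4 is illustrative only:

```python
S = 4                          # window size in bytes (illustrative)
BASE = 257
MOD = (1 << 31) - 1
POW = pow(BASE, S - 1, MOD)    # weight of the outgoing byte

def window_hash(data: bytes, start: int) -> int:
    """Hash of the S-byte window beginning at `start`, from scratch."""
    h = 0
    for b in data[start:start + S]:
        h = (h * BASE + b) % MOD
    return h

def roll(h: int, out_byte: int, in_byte: int) -> int:
    """Slide the window one byte: drop out_byte, take in in_byte."""
    return ((h - out_byte * POW) * BASE + in_byte) % MOD

def build_table(target: bytes) -> dict:
    """Roll the window over the whole target, recording each hash."""
    h = window_hash(target, 0)
    table = {h: 0}
    for i in range(1, len(target) - S + 1):
        h = roll(h, target[i - 1], target[i + S - 1])
        table.setdefault(h, i)
    return table

def find_match(source: bytes, table: dict):
    """Scan the source; return (source_offset, target_offset) of the
    first window whose hash appears in the table, or None."""
    h = window_hash(source, 0)
    if h in table:
        return 0, table[h]
    for i in range(1, len(source) - S + 1):
        h = roll(h, source[i - 1], source[i + S - 1])
        if h in table:
            return i, table[h]
    return None
```

Each slide costs O(1) regardless of window size, which is what makes scanning the entire file cheap.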
4d. XDelta
4d1. Designed to be fast, sacrificing the optimal compression of the
earlier algorithms.
4d2. Creates a hash table mapping the hash of a window of data to the
position where it was seen in the buffer. The table stores a single
value per bucket, and collisions replace the previously stored value.
4d3. Fingerprints are scanned from the end of a source file to the front,
so that the hash table ends up with the earliest instances of
spotted windows, since the author believes the earliest matches will
be the longest.
4d4. Once the fingerprint table is made, the input is partitioned into
"pages." Matching does not cross page boundaries, which is supposed
to help stream processing.
4d5. Once a candidate match is found, expand both ends as far as
possible without crossing a page boundary, then emit the match as a
COPY operation.
4d6. If no match is found then produce INSERT operations.
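The loop in 4d2-4d6 can be sketched as below. This is a simplification, not the paper's implementation: paging is ignored, adler32 is recomputed per window rather than rolled incrementally, only forward extension is shown (the paper expands both ends), and the instruction tuples are illustrative:

```python
import zlib

S = 4   # window size; the paper uses s = 16

def fingerprints(src: bytes) -> dict:
    """Map window checksum -> offset. Scanning back to front and
    overwriting on collision leaves the EARLIEST offset in each slot,
    mirroring 4d3 (a dict stands in for the fixed-size bucket table)."""
    table = {}
    for i in range(len(src) - S, -1, -1):
        table[zlib.adler32(src[i:i + S])] = i
    return table

def encode(src: bytes, tgt: bytes) -> list:
    table = fingerprints(src)
    ops, i, pending = [], 0, bytearray()
    while i + S <= len(tgt):
        j = table.get(zlib.adler32(tgt[i:i + S]))
        if j is not None and src[j:j + S] == tgt[i:i + S]:
            # Candidate match: extend forward as far as possible
            # (backward extension omitted for brevity).
            length = S
            while (j + length < len(src) and i + length < len(tgt)
                   and src[j + length] == tgt[i + length]):
                length += 1
            if pending:                      # flush queued literal bytes
                ops.append(("INSERT", bytes(pending)))
                pending.clear()
            ops.append(("COPY", j, length))
            i += length
        else:
            pending.append(tgt[i])           # no match: queue a literal
            i += 1
    pending += tgt[i:]                       # trailing bytes
    if pending:
        ops.append(("INSERT", bytes(pending)))
    return ops
```

For src = b"hello world" and tgt = b"hello brave world" this emits a COPY of "hello ", an INSERT of "brave", and a COPY of " world". The bytes comparison after the table hit is what guards against checksum collisions.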
5. Tunables
5a. Rolling fingerprint hash is adler32
5b. "s," the size of a window, is 16 bytes
5c. "b," the bucket count of the fingerprint hash table, is a prime
between floor(N/s) and 2*floor(N/s), where N is the number of bytes
in the file to be scanned.
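The notes only constrain b to be some prime in [floor(N/s), 2*floor(N/s)]. One simple way to pick such a b (an assumption for illustration, not necessarily the paper's method) is the smallest prime at or above floor(N/s), which Bertrand's postulate guarantees stays inside that range:

```python
def is_prime(n: int) -> bool:
    """Trial division; fine for the table sizes involved here."""
    if n < 2:
        return False
    k = 2
    while k * k <= n:
        if n % k == 0:
            return False
        k += 1
    return True

def bucket_count(n_bytes: int, s: int = 16) -> int:
    """Smallest prime >= floor(N/s). By Bertrand's postulate this is
    below 2*floor(N/s), so it lies in the stated range."""
    b = max(2, n_bytes // s)
    while not is_prime(b):
        b += 1
    return b
```

For a 1 MiB file with s = 16, floor(N/s) = 65536 and the chosen bucket count is 65537, so on average each window of the file maps to about one bucket.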