The VCDIFF Generic Differencing and Compression Data Format
Created on 2023-06-13T06:37:50-05:00
1. Glossary
  1a. Target File
    1a1. The file you want to have.
  1b. Source File
    1b1. The file you already have on hand.
  1c. Deltas
    1c1. The changes you need to make to turn the source file into the target file.
2. Stated goals
  2a. Output compactness
    2a1. Provides a basic encoding format for dealing with patches.
    2a2. Applications can add additional layers to get better compression if needed.
  2b. Data portability
    2b1. Machine byte order and word size issues are worked around.
    2b2. Base unit of measure is the 8-bit byte.
  2c. Algorithm genericity
    2c1. VCDIFF only specifies a language for applying patch data; it deliberately leaves undefined how you arrive at those changes.
  2d. Decoding efficiency
    2d1. Uses only byte-aligned operations, so data never has to be read at the bit level.
3. Integer encoding
  3a. Variable length; each chunk is an 8-bit byte. The most significant bit of each byte indicates whether another byte must be read to complete the integer. The value is stored in the least significant 7 bits of each byte.
4. Windows
  4a. There is a "source" window and a "target" window.
  4b. These windows are put together in a "superstring" called U.
    4b1. U is equivalent to concatenating all bytes of the source window followed by all bytes of the target window.
  4c. The target window is initially empty when reconstructing a file, and is appended to as delta instructions are followed.
5. Instructions
  5a. Instructions apply within the context of a window.
  5b. Instructions may reference indices beyond the end of the source window. In that case the data comes from bytes that have already been emitted to the target window. This is allowed only as long as the referenced bytes have already been added or copied into the target.
  5c. ADD
    5c1. Holds the number of bytes to be added and the payload to be injected directly.
  5d. COPY
    5d1. Holds the number of bytes to be copied and the offset to copy from. Per 5b, that offset may fall in the source window or in the already-written part of the target window.
  5e. RUN
    5e1. As in "run-length encoding."
    5e2. Holds a count and a byte. The byte is repeated `count` times.
6. File layout
  6a. There are exact byte specifications for how instructions should be encoded into the file. I am not providing those here.
  6b. Header
  6c. Windows
    6c1. Each window targets a segment of the source file, given as a size and an offset.
    6c2. Contains the set of instructions to run to perform the transformation.
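
To make the integer encoding (section 3) and the three instructions (section 5) concrete, here is a minimal Python sketch. It is not the on-disk format: the function names and the tuple representation of instructions are hypothetical, chosen only for illustration. The integer routines assume the most significant 7-bit group is written first, which is how RFC 3284 describes it. COPY reads one byte at a time from the superstring U, so an offset past the end of the source window lands in the target bytes already written.

```python
def encode_varint(value):
    """Encode a non-negative integer as base-128 bytes, most significant group first."""
    if value < 0:
        raise ValueError("only non-negative integers can be encoded")
    groups = [value & 0x7F]          # last byte: continuation bit clear
    value >>= 7
    while value:
        groups.append((value & 0x7F) | 0x80)  # earlier bytes: continuation bit set
        value >>= 7
    return bytes(reversed(groups))


def decode_varint(data, pos=0):
    """Decode one integer starting at data[pos]; return (value, next_pos)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)  # low 7 bits carry the value
        if not byte & 0x80:                   # high bit clear: integer is complete
            return value, pos


def apply_instructions(source_window, instructions):
    """Rebuild a target window from hypothetical (op, ...) instruction tuples.

    ("ADD", payload)        append payload bytes directly
    ("COPY", size, offset)  copy size bytes starting at offset in U
    ("RUN", count, byte)    append byte repeated count times
    """
    target = bytearray()
    src_len = len(source_window)
    for inst in instructions:
        op = inst[0]
        if op == "ADD":
            target += inst[1]
        elif op == "RUN":
            _, count, byte = inst
            target += bytes([byte]) * count
        elif op == "COPY":
            _, size, offset = inst
            for i in range(offset, offset + size):
                if i < src_len:
                    target.append(source_window[i])   # still inside the source window
                else:
                    t = i - src_len                   # index into the target part of U
                    if t >= len(target):
                        raise ValueError("COPY references bytes not yet written")
                    target.append(target[t])
        else:
            raise ValueError("unknown instruction: %r" % (op,))
    return bytes(target)


if __name__ == "__main__":
    source = b"abcdefgh"
    delta = [
        ("COPY", 4, 0),      # "abcd" from the source window
        ("ADD", b"XY"),      # literal payload
        ("COPY", 6, 8),      # offset 8 is the first target byte in U
        ("RUN", 3, ord("z")),
    ]
    print(apply_instructions(source, delta))
    print(encode_varint(123456789).hex())
```

The integer 123456789 is the worked example from RFC 3284; under the assumptions above it encodes to the four bytes BA EF 9A 15.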