The VCDIFF Generic Differencing and Compression Data Format
Created on 2023-06-13T06:37:50-05:00
1. Glossary
1a. Target File
1a1. The file you want to have.
1b. Source File
1b1. The file you have on hand already.
1c. Deltas
1c1. The changes you need to make to turn the source file into the
target file.
2. Stated goals
2a. Output compactness
2a1. Provides a basic encoding format for dealing with patches.
2a2. Applications can add further layers to get better compression if
needed.
2b. Data portability
2b1. The format avoids machine byte order and word size issues.
2b2. Base unit of measure is the 8-bit byte.
2c. Algorithm genericity
2c1. VCDIFF only specifies a language for applying patch data; how you
arrive at those changes is left undefined on purpose.
2d. Decoding efficiency
2d1. Uses only byte-aligned operations to avoid the need for bit
operations.
3. Integer encoding
3a. Variable length; each chunk is an 8-bit byte. The most significant
bit of each byte determines if another byte must be read to complete
the integer. The value is stored in the least significant 7 bits of
each byte, most significant chunk first.
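The encoding in 3a can be sketched in a few lines of Python. This is a
minimal illustration, not a reference implementation: the function names
`encode_int` and `decode_int` are made up for these notes, and the sketch
assumes the most significant 7-bit group comes first, which is how the
format orders the chunks.

```python
from typing import Tuple


def encode_int(value: int) -> bytes:
    """Encode a non-negative integer as base-128 chunks, high chunk first."""
    if value < 0:
        raise ValueError("only non-negative integers can be encoded")
    # The last byte carries the low 7 bits and has the continuation bit clear.
    groups = [value & 0x7F]
    value >>= 7
    # Every earlier byte has the continuation bit (0x80) set.
    while value:
        groups.append((value & 0x7F) | 0x80)
        value >>= 7
    return bytes(reversed(groups))


def decode_int(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one integer starting at `pos`; return (value, next position)."""
    value = 0
    while True:
        byte = data[pos]  # raises IndexError if the input is truncated
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:  # continuation bit clear: this was the last byte
            return value, pos
```

With this sketch, `encode_int(123456789)` yields the four bytes BA EF 9A
15, and `decode_int` on those bytes returns the original value along with
the position of the next unread byte.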
4. Windows
4a. There is a "source" window and a "target" window.
4b. These windows are put together in a "superstring" called U.
4b1. The superstring is the equivalent of concatenating all bytes of the
source and target window together.
4c. The target window is initially empty when reconstructing a file,
and is appended to as delta instructions are followed.
5. Instructions
5a. Instructions apply within the context of a Window.
5b. Instructions are allowed to access indices beyond the end of the
source window. Such an index refers to data that has already been
emitted to the target window. This is allowed as long as every
referenced byte has already been added or copied to the target by the
time it is read.
5c. ADD
5c1. Holds the number of bytes to be added, and the payload to be
injected directly.
5d. COPY
5d1. Holds the number of bytes to be copied, and an offset into the
superstring U to copy from. The offset usually falls in the source
window, but may fall in the already-written part of the target window.
5e. RUN
5e1. As in, "run length encoding."
5e2. Holds a count and a byte. The byte is repeated `count` number of
times.
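The three instructions and the superstring addressing from section 4 can
be sketched as a small interpreter. This is an illustration under stated
assumptions, not the real decoder: instructions are represented here as
plain Python tuples (a hypothetical in-memory form, not the byte encoding
used in the file), and the function name `apply_instructions` is made up
for these notes.

```python
def apply_instructions(source: bytes, instructions) -> bytes:
    """Rebuild a target window from a source window and instructions.

    Hypothetical tuple forms used by this sketch:
      ("ADD", payload_bytes)
      ("COPY", size, address_in_U)
      ("RUN", count, byte_value)
    """
    target = bytearray()
    for inst in instructions:
        op = inst[0]
        if op == "ADD":
            # Inject the payload directly into the target.
            target += inst[1]
        elif op == "RUN":
            # Repeat a single byte `count` times.
            _, count, byte_value = inst
            target += bytes([byte_value]) * count
        elif op == "COPY":
            # Addresses index the superstring U = source + target.
            _, size, address = inst
            for i in range(address, address + size):
                if i < len(source):
                    target.append(source[i])
                else:
                    j = i - len(source)
                    if j >= len(target):
                        raise ValueError("COPY reads data not yet written")
                    # Copying byte by byte lets a COPY read bytes that the
                    # same COPY wrote a moment earlier.
                    target.append(target[j])
        else:
            raise ValueError("unknown instruction: %r" % (op,))
    return bytes(target)
```

For example, with the source `b"abcdefghijklmnop"` and the instructions
COPY 4 from 0, ADD `b"wxyz"`, COPY 4 from 4, COPY 12 from 24, RUN 4 of
`z`, the sketch produces `b"abcdwxyzefghefghefghefghzzzz"`. The COPY of 12
bytes from address 24 starts in data the target already holds and keeps
reading what it has just written, which is how a short pattern gets
repeated.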
6. File layout
6a. There are exact byte specifications for how instructions should be
encoded into the file. I am not providing those here.
6b. Header
6c. Windows
6c1. Specifies the size and offset of the segment of the source file to
use as the source window.
6c2. Contains the instructions to run to perform the transformation.