Data Compression Using Long Common Strings
Created on 2023-06-30T15:55:18-05:00
Instead of a sliding window (as in LZ77, the scheme underlying gzip), a rolling hash (Rabin fingerprints) in conjunction with a lookup table is used.
A rolling hash is updated incrementally as it rolls over each byte of the input. Every so many bytes (controlled by the block size B) the hash is checked against a dictionary of previously seen blocks for a match.
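The rolling update can be sketched with a polynomial hash over a fixed-size window (a simple stand-in for a true Rabin fingerprint; the names `roll`, `BASE`, and `MOD`, and the block size of 4 are illustrative choices, not from the note):

```python
BASE = 257              # polynomial base
MOD = (1 << 61) - 1     # large prime modulus

def roll(data: bytes, block: int = 4):
    """Yield (end_index, hash) for every `block`-byte window of `data`.

    Each update is O(1): the outgoing byte's contribution is subtracted,
    then the incoming byte is appended, instead of rehashing the window.
    """
    top = pow(BASE, block - 1, MOD)  # weight of the byte leaving the window
    h = 0
    for i, byte in enumerate(data):
        if i >= block:
            h = (h - data[i - block] * top) % MOD  # drop outgoing byte
        h = (h * BASE + byte) % MOD                # append incoming byte
        if i >= block - 1:
            yield i + 1, h  # hash of data[i - block + 1 : i + 1]

def direct(data: bytes) -> int:
    """Non-rolling hash of `data`, for checking the rolled values."""
    h = 0
    for byte in data:
        h = (h * BASE + byte) % MOD
    return h
```

Every rolled value agrees with hashing the window from scratch, so a match in the dictionary is a candidate duplicate block (still verified byte-for-byte, since hashes can collide).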
When a match is found, greedily expand the block: backwards to include preceding matching bytes, and forwards to consume following matching bytes, but not so far forward as to step over an entire block boundary that would have matched on its own.
Pairs well with gzip (LZ77/DEFLATE): Bentley-McIlroy removes long-range duplications across the whole file, then gzip efficiently condenses repetitions within nearby regions.
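Putting the pieces together, here is a toy end-to-end sketch of the long-range pass. Python's built-in `hash()` stands in for the rolling fingerprint above, the `('copy', src, length)` token format is my own invention for illustration, and a real implementation would serialize the tokens and feed the result to gzip/zlib:

```python
def bm_compress(data: bytes, block: int = 8):
    """Replace long-range duplicate runs with ('copy', src, length) tokens,
    leaving unmatched stretches as literal bytes objects."""
    table = {}   # block hash -> offset where that block was first seen
    tokens = []
    i = lit = 0
    while i + block <= len(data):
        h = hash(data[i:i + block])  # stand-in for the rolling fingerprint
        src = table.get(h)
        if src is not None and data[src:src + block] == data[i:i + block]:
            # greedy forward extension past the matched block
            length = block
            while i + length < len(data) and data[src + length] == data[i + length]:
                length += 1
            if lit < i:
                tokens.append(data[lit:i])   # flush pending literal bytes
            tokens.append(('copy', src, length))
            i += length
            lit = i
        else:
            if i % block == 0:               # remember every B-aligned block
                table.setdefault(h, i)
            i += 1
    if lit < len(data):
        tokens.append(data[lit:])            # trailing literal
    return tokens

def bm_decompress(tokens) -> bytes:
    out = bytearray()
    for t in tokens:
        if isinstance(t, bytes):
            out += t
        else:
            _, src, length = t
            for k in range(length):          # byte-wise copy handles overlap
                out.append(out[src + k])
    return bytes(out)
```

Copies are resolved byte by byte so that self-overlapping references (e.g. a long periodic run pointing just behind itself) decode correctly, the same trick LZ77 decoders use.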