Tunstall Coding

Created on 2023-05-20T21:38:05-05:00

Supposedly is worse than Huffman Trees for general data purposes; is mostly useful when the total range of symbols that can be in the input is narrow.

The number of bits required to store any sequence is ceil(log2(number of words)).

So if the table has 17 entries in total the address requires five bits.

Preparing the Tunstall Trie

Create a tree that holds all symbols of an alphabet that can be in an input document. Calculate the probability of each symbol appearing in the training set.

Pick the node with the highest probability of ocurring. This becomes a new tree, and the algorithm repeats.

The subtrees look for the number of symbols that are prefixed with the parent nodes. So if "A" is the highest probable node, everything under it is "AB" or "AC" and so on.

Repeats until you hit your depth limit.

Flatten the trie so you have a set of variable length data mapped to some (fixed-size) numeric key.

Encoding the Data

Perform greedy matching and replacement of symbols from the input dictionary to their output codes.

Output data is always a fixed amount of bits and these bits are essentially a coordinate in the dictionary to reconstitute.