“Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors
Created on 2023-07-13T18:25:19-05:00
Replaces deep neural networks with off-the-shelf text compression algorithms plus a simple nearest-neighbor classifier.
Defines C(x) as the length, in bytes, of text x after compression.
Defines C(x,y) as the compressed length of x and y concatenated with a space between them.
Defines closeness via the Normalized Compression Distance ("NCD"): NCD(x,y) = (C(x,y) - min(C(x), C(y))) / max(C(x), C(y)). Smaller values mean the two texts share more redundancy.
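A minimal sketch of that NCD computation using Python's built-in gzip module; the function names (compressed_len, ncd) are mine, not from the paper:

```python
import gzip

def compressed_len(text: str) -> int:
    # C(x): number of bytes in the gzip-compressed UTF-8 encoding of the text
    return len(gzip.compress(text.encode("utf-8")))

def ncd(x: str, y: str) -> float:
    # NCD(x,y) = (C(x,y) - min(C(x), C(y))) / max(C(x), C(y))
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + " " + y)  # concatenate with a space, then compress
    return (cxy - min(cx, cy)) / max(cx, cy)
```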
In the paper, GZip is used as the compressor. LZMA works similarly well; BZip2 does not.
The trick depends on the compressor's dictionary coding (LZ77/DEFLATE in GZip's case, not LZW) eliminating redundancy shared between an exemplar and the text being tested, so the exemplar that compresses best alongside the test text (the lowest NCD) is the one that matches; classification is then a nearest-neighbor vote over these distances.
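A sketch of that classification step, building on the ncd() function above. It assumes a plain nearest-neighbor vote (the paper uses k-nearest neighbors; k=1 here for brevity) and a hypothetical train_set of (text, label) pairs:

```python
def classify(test_text: str, train_set: list[tuple[str, str]]) -> str:
    # Score every training exemplar by its NCD to the test text,
    # then return the label of the closest (most redundant) one.
    distances = [(ncd(test_text, train_text), label)
                 for train_text, label in train_set]
    return min(distances, key=lambda pair: pair[0])[1]

# Toy usage:
train_set = [("the cat sat on the mat", "animals"),
             ("stocks fell sharply on Monday", "finance")]
print(classify("a cat chased another cat", train_set))  # expected: "animals"
```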