“Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors

Created on 2023-07-13T18:25:19-05:00

This card pertains to a resource available on the internet.

Replaces deep neural networks with basic text compression algorithms.

Defines C(x) as the length of a specimen of text x after it has been compressed.

Defines C(x,y) as the compressed length of x and y concatenated with a space between them.

Defines the Normalized Compression Distance ("NCD") as C(x,y) minus the minimum(C(x), C(y)), divided by the maximum(C(x), C(y)):

NCD(x, y) = (C(x,y) - min(C(x), C(y))) / max(C(x), C(y))

A smaller NCD means the two texts are closer.
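The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function names compressed_len and ncd are made up for this card, and UTF-8 encoding and the standard-library gzip module at its default settings are assumptions.

```python
import gzip


def compressed_len(text: str) -> int:
    """C(x): number of bytes in the gzip-compressed text (assumes UTF-8)."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(x: str, y: str) -> float:
    """NCD(x, y) = (C(x,y) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx = compressed_len(x)
    cy = compressed_len(y)
    # C(x,y): compress x and y joined by a single space.
    cxy = compressed_len(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Comparing a text with a copy of itself should give a value near zero, and comparing it with unrelated text a noticeably larger one. On very short strings the fixed gzip header overhead dominates, so the numbers are only meaningful for texts of at least a sentence or two.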

In the paper, GZip is used as the compressor. LZMA works similarly well. BZip2 does not work well.

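The trick depends on the compressor eliminating redundancies between the exemplars and the text being tested (GZip's DEFLATE does this with LZ77 back-references plus Huffman coding, rather than LZW). The exemplar that shares the most redundancy with the test text, and so has the smallest NCD, is the one which matches.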
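The matching step can be sketched as a nearest-exemplar lookup: compute the NCD between the test text and every labeled exemplar, then take the label of the closest one. This is a simplified sketch under stated assumptions, not the paper's implementation. The function names, the (label, text) pair format, and the use of a single nearest neighbor are all choices made for this card.

```python
import gzip


def compressed_len(text: str) -> int:
    """Length in bytes of the gzip-compressed text (assumes UTF-8)."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance; smaller means more shared redundancy."""
    cx = compressed_len(x)
    cy = compressed_len(y)
    cxy = compressed_len(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(text: str, exemplars: list[tuple[str, str]]) -> str:
    """Return the label of the exemplar with the smallest NCD to `text`.

    `exemplars` is a hypothetical list of (label, sample_text) pairs.
    """
    label, _ = min(exemplars, key=lambda pair: ncd(text, pair[1]))
    return label


# Hypothetical exemplars, invented for illustration only.
EXEMPLARS = [
    ("sports", "The home team won the football match after scoring "
               "two late goals in the second half."),
    ("finance", "The central bank raised interest rates and the stock "
                "market fell as investors sold shares."),
]

if __name__ == "__main__":
    print(classify("The football team scored two goals in the second half "
                   "and won the match.", EXEMPLARS))
```

Every classification compresses the test text against each exemplar, so the cost grows linearly with the number of exemplars. No training step or tunable parameters are involved.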