Two sparsities are better than one: unlocking the performance benefits of sparse–sparse networks

Created on 2022-09-12T09:56:34-05:00

Sparse connections (weight sparsity): only a small fraction of the weights between neurons are non-zero.

Sparse activations: only a small fraction of the inputs to a cluster of neurons are non-zero.
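
A rough sketch of why combining the two helps: a product w[i, j] * x[j] only contributes to an output when both the weight and the activation are non-zero. The densities below (5% weights, 10% activations), the layer sizes, and the function name are made-up values for illustration, not figures from the paper.

```python
import numpy as np

def count_useful_multiplies(weights, activations):
    """Count products w[i, j] * x[j] where both factors are non-zero."""
    nonzero_w = weights != 0        # shape (n_out, n_in)
    nonzero_x = activations != 0    # shape (n_in,), broadcast across rows
    return int(np.count_nonzero(nonzero_w & nonzero_x))

rng = np.random.default_rng(0)
n_out, n_in = 64, 1000

# Hypothetical densities, chosen only for illustration.
weights = rng.normal(size=(n_out, n_in)) * (rng.random((n_out, n_in)) < 0.05)
activations = rng.normal(size=n_in) * (rng.random(n_in) < 0.10)

dense_multiplies = n_out * n_in
useful = count_useful_multiplies(weights, activations)
print(f"dense: {dense_multiplies}, useful: {useful}, "
      f"fraction: {useful / dense_multiplies:.4f}")
```

With independent random masks the useful fraction lands near the product of the two densities, which is why the two kinds of sparsity multiply rather than add. Actually getting that saving on real hardware is the hard part.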

Sparse-sparse networks showed a 33× speedup over non-sparse (dense) networks.

On average, less than 2% of neurons in the brain fire for any given input. This holds for all sensory modalities as well as for areas that deal with language, abstract thought, planning, etc.

Even for DNNs with high degrees of weight sparsity, the observed performance gains are small. For example, on CPUs, even for weight-sparse networks in which 95% of the neuron weights have been eliminated, the performance improvements are typically less than 4×.

Complementary Sparsity

We achieve this by overlaying multiple sparse matrices to form a single dense structure. An optimal packing can be readily achieved if no two sparse matrices contain a non-zero element at precisely the same location. Given incoming activations, we perform an element-wise product between the packed weights and those activations (a dense operation) and then recreate each individual sum.

Quinn: Basically, if you have 3 active ports out of 20, it becomes possible to overlay other sensors (each also using 3 ports of 20) as long as none of their cells overlap. A single matrix of 20 units can then compute the values of several sparse elements at once.
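
A minimal sketch of the packing idea, assuming 1-D weight vectors with non-overlapping non-zero positions. The function names (`pack_complementary`, `packed_forward`) and the `owner` bookkeeping array are my own for illustration; the paper targets hardware and does not necessarily do it this way.

```python
import numpy as np

def pack_complementary(kernels):
    """Overlay sparse weight vectors whose non-zero positions are disjoint.

    Returns the packed dense vector and an owner array giving, for each
    position, the index of the kernel it came from (-1 if unused).
    """
    kernels = np.asarray(kernels, dtype=float)
    nonzero = kernels != 0
    if (nonzero.sum(axis=0) > 1).any():
        raise ValueError("kernels overlap; they are not complementary")
    packed = kernels.sum(axis=0)  # safe: at most one non-zero per position
    owner = np.where(nonzero.any(axis=0), nonzero.argmax(axis=0), -1)
    return packed, owner

def packed_forward(packed, owner, x, n_kernels):
    """One dense element-wise product, then route each product to its kernel's sum."""
    products = packed * x
    used = owner >= 0
    return np.bincount(owner[used], weights=products[used], minlength=n_kernels)

# Three kernels, each using 3 of 20 positions, none overlapping.
n = 20
kernels = np.zeros((3, n))
kernels[0, [0, 5, 9]] = [1.0, 2.0, 3.0]
kernels[1, [1, 6, 10]] = [0.5, -1.0, 4.0]
kernels[2, [2, 7, 11]] = [2.0, 2.0, -2.0]

x = np.arange(n, dtype=float)
packed, owner = pack_complementary(kernels)
out = packed_forward(packed, owner, x, n_kernels=3)
assert np.allclose(out, kernels @ x)  # same result as three separate dot products
```

The packed vector here is still mostly empty (9 of 20 slots used); up to six such 3-of-20 kernels would fit before it is completely full.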

k-winner-take-all: keep the k highest-valued inputs and drop the rest (set them to zero).

k-winner-take-all is used to turn a dense input into a sparse (SDR-like) array.
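
A small sketch of k-winner-take-all on a 1-D array, assuming "highest" means largest raw value. The function name `k_winners` is mine, and ties are broken arbitrarily here.

```python
import numpy as np

def k_winners(x, k):
    """Keep the k largest entries of a 1-D array and zero out the rest."""
    x = np.asarray(x, dtype=float)
    if not 0 < k <= x.size:
        raise ValueError("k must be between 1 and len(x)")
    winners = np.argpartition(x, -k)[-k:]  # indices of the k largest values
    out = np.zeros_like(x)
    out[winners] = x[winners]
    return out

dense = np.array([0.1, 0.9, 0.3, 0.7, 0.2, 0.8])
sparse = k_winners(dense, k=3)
print(sparse)  # keeps 0.9, 0.7 and 0.8; everything else becomes 0
```

`np.argpartition` avoids a full sort, which matters little at this size but keeps the selection linear-time for wide layers.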