Transformer Neural Networks: A Step-by-Step Breakdown
Created on 2022-10-14T04:04:07-05:00
Loosely speaking, a transformer is a better RNN. The key step on the way there is attention: in an attention-based encoder-decoder, the decoder receives not just the final output of the RNN but all of the encoder's hidden states.
Some mechanism is used to "score" the relative importance of each hidden state against the other hidden states available. The scores are normalized with a softmax, each hidden state is weighted by its normalized score, and the weighted states are summed. The result is the "context vector" that is passed to the decoding layer.
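As a minimal sketch of that idea (simple dot-product scoring is assumed here; early attention papers also used learned additive scoring functions):

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def context_vector(decoder_state, encoder_states):
    """Score each encoder hidden state against the current decoder state,
    softmax the scores, and return the importance-weighted sum.

    decoder_state:  shape (d,)   current decoder hidden state
    encoder_states: shape (T, d) one hidden state per input timestep
    """
    scores = encoder_states @ decoder_state   # (T,) relevance of each state
    weights = softmax(scores)                 # (T,) nonnegative, sums to 1
    return weights @ encoder_states           # (d,) the context vector

# Toy example: 4 input timesteps, hidden size 3.
rng = np.random.default_rng(0)
enc = rng.normal(size=(4, 3))
dec = rng.normal(size=(3,))
print(context_vector(dec, enc))
```

The softmax is what makes the weighting competitive: pushing one score up necessarily pushes the others' weights down.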
This competitive weighting is very similar to the k-winners-take-all mechanism in Numenta's cortical neuron models.