The illustrated attention mechanism
These are my notes for the blog post Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention).
Reference
Summary
The post describes how the attention mechanism works through a series of animations in the context of natural language processing.
Classical Encoder-Decoder Recurrent Model
This is the classical model for the machine translation task. A nice introduction to the subject can be found in these two papers:
- Sequence to Sequence Learning with Neural Networks
- Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation

Encoder
The encoder receives as input a token and a hidden state \(h_i\), and it produces a hidden state \(h_o\). The encoder steps are chained sequentially, so that the hidden state output by the first step is the input hidden state of the second step; step \(n\) receives token \(n\) as input.
After all the tokens have been encoded (normally those in a sentence, in the context of translation), the final hidden state \(h_e\) is used as the input of the decoding stage.
We can think of \(h_e\) as a context vector: it concentrates all the information in the sentence into a single fixed-size vector. There are several ways of combining the encoder states into such a vector; the simplest would be summing or averaging them, for example.
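The encoding loop above can be sketched in a few lines of numpy. This is only a minimal sketch of the idea, not the post's actual code: the function name `rnn_encoder`, the weight matrices `W_x`/`W_h`, and the tanh cell are assumptions standing in for whatever recurrent cell is used.

```python
import numpy as np

def rnn_encoder(tokens, W_x, W_h, h0):
    """Run a simple RNN cell over the input tokens.

    The final hidden state plays the role of the context
    vector h_e that is handed to the decoding stage.
    (W_x, W_h and the tanh cell are illustrative choices.)
    """
    h = h0
    states = []
    for x in tokens:                    # one encoder step per token
        h = np.tanh(W_x @ x + W_h @ h)  # new state from current token + previous state
        states.append(h)
    return h, states                    # h is the context vector h_e
```

Note that the loop also keeps the intermediate states; the classical model discards them and uses only the final one, which is exactly what the attention mechanism later changes.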
Decoder
The decoder takes this context vector to start the translation process. Each decoder step also processes one token at a time (how does it know which token to process? All the input tokens are compressed into the context vector). The output of each decoder step is a translated word and a hidden state, and that hidden state is forwarded as input to the following decoder step.
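The decoding loop can be sketched the same way. Again a minimal numpy sketch under assumptions: `E` is a hypothetical embedding matrix for the target vocabulary, `W_out` a hypothetical output projection, and the next input is chosen greedily (argmax), which is only one of several decoding strategies.

```python
import numpy as np

def rnn_decoder(h_e, E, W_h, W_out, start_id, steps):
    """Decode token by token, starting from the context vector h_e.

    Each step emits a translated word id and a hidden state
    that feeds the next step. All names here are illustrative.
    """
    h, word = h_e, start_id
    out = []
    for _ in range(steps):
        h = np.tanh(W_h @ h + E[word])    # previous word's embedding + hidden state
        word = int(np.argmax(W_out @ h))  # greedy pick of the next translated word
        out.append(word)
    return out
```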
Attention Mechanism
The motivation for the attention mechanism comes from the difficulty that sequence-to-sequence models (usually built from recurrent networks) have in processing long sequences: the whole input must be squeezed into a single fixed-size context vector. This matters particularly in language processing, where sentences can be long and the order of the words is important for understanding.
The idea of the attention mechanism is to emphasize the tokens that are important at the current stage of processing. Attention is an older idea, revived in these two papers:
- Neural Machine Translation by Jointly Learning to Align and Translate
- Effective Approaches to Attention-based Neural Machine Translation

The attention step scores all the hidden states produced by the encoder (note that this differs from the classical RNN setup, where only the last hidden state produced by the encoder is passed to the decoder) and uses these scores to produce a context vector. This vector is concatenated with the hidden state output by the decoder, and the concatenated vector is forwarded to a feed-forward network (FFN) that finally produces the translated token.
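That scoring step can be sketched as follows. This is a minimal sketch assuming the simplest scoring function (a dot product between the decoder state and each encoder state); the papers above also discuss learned scoring functions, and `W_ffn` is a hypothetical weight matrix standing in for the final feed-forward layer.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def attention_step(dec_h, enc_states, W_ffn):
    """One decoder step with attention.

    Score every encoder hidden state against the current decoder
    state, build a weighted context vector, concatenate it with
    the decoder state, and project through a feed-forward layer.
    """
    scores = np.array([dec_h @ h for h in enc_states])       # dot-product scoring
    weights = softmax(scores)                                # attention distribution
    context = np.sum(weights[:, None] * enc_states, axis=0)  # weighted sum of states
    concat = np.concatenate([context, dec_h])                # [context; decoder state]
    return np.tanh(W_ffn @ concat), weights                  # output feeds word prediction
```

The attention weights form a probability distribution over the input tokens, which is exactly what the post's animations visualize: at each output step, the decoder "looks" hardest at the input tokens with the largest weights.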
