Attention

Neural Machine Translation By Jointly Learning to Align and Translate

One problem with the traditional encoder-decoder structure is that the network must compress all the necessary information of a source sentence into a single fixed-length vector \(\mathbf{c}\). Attention does not attempt to encode the whole input sentence into one fixed-length vector. Instead, it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation.

Background: RNN Encoder-Decoder

In the Encoder-Decoder framework, an encoder reads the input sentence, a sequence of vectors \(\mathbf{x} = (\mathbf{x}_1, \ldots, \mathbf{x}_{T_x})\), into a fixed-length context vector \(\mathbf{c}\):

\[\mathbf{h}_t = f(\mathbf{x}_t, \mathbf{h}_{t-1})\] \[\mathbf{c} = q(\mathbf{h}_{1}, \ldots, \mathbf{h}_{T_x})\]

Where \(f\) and \(q\) are non-linear functions. Typically, \(q\) simply returns the last hidden state, i.e. \(q(\mathbf{h}_{1}, \ldots, \mathbf{h}_{T_x}) = \mathbf{h}_{T_x}\).
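
To make the fixed-length bottleneck concrete, here is a minimal NumPy sketch of this encoder, assuming \(f\) is a plain \(\tanh\) RNN cell and \(q\) returns the last hidden state; the parameter names `W_x`, `W_h`, `b` are hypothetical.

```python
import numpy as np

def encode(X, W_x, W_h, b):
    """Run h_t = f(x_t, h_{t-1}) over the source sequence X of shape (T_x, d_x)
    and return all hidden states plus the fixed-length context c = h_{T_x}."""
    d_h = W_h.shape[0]
    h = np.zeros(d_h)
    hs = []
    for t in range(X.shape[0]):
        h = np.tanh(W_x @ X[t] + W_h @ h + b)   # h_t = f(x_t, h_{t-1})
        hs.append(h)
    c = hs[-1]                                   # q(h_1, ..., h_{T_x}) = h_{T_x}
    return np.stack(hs), c

# Tiny usage example with random parameters.
rng = np.random.default_rng(0)
d_x, d_h, T_x = 4, 8, 6
X = rng.normal(size=(T_x, d_x))
H, c = encode(X, rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h)), np.zeros(d_h))
print(H.shape, c.shape)  # (6, 8) (8,)
```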

The decoder is trained to predict the next word \(\mathbf{y}_{t}\) given all the previously predicted words \(\{\hat{\mathbf{y}}_1, \ldots, \hat{\mathbf{y}}_{t - 1}\}\). At training step \(t\), we can instead feed the true target word \(\mathbf{y}_{t-1}\) (teacher forcing). Training the decoder then amounts to maximizing the log joint conditional probability, i.e. minimizing the sum of per-time-step cross entropies:

\[L = \sum^{T_y}_{t=1} L^{t}\] \[-L = \log P_{\mathbf{Y}}(\mathbf{y}) = \sum^{T_y}_{t=1} \log P_{\mathbf{Y}_t \mid \mathbf{Y}_1, \ldots, \mathbf{Y}_{t-1}, \mathbf{c}}(\mathbf{y}_t \mid \mathbf{y}_{1}, \ldots, \mathbf{y}_{t-1}, \mathbf{c})\]

Where, in an RNN, each conditional distribution \(P_{\mathbf{Y}_t \mid \mathbf{Y}_1, \ldots, \mathbf{Y}_{t-1}, \mathbf{c}}\) is modeled as \(g(\mathbf{y}_{t-1}, \mathbf{s}_t, \mathbf{c})\), with \(\mathbf{s}_t\) the decoder hidden state.
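
As a rough illustration of this objective, the sketch below accumulates the per-time-step cross entropies under teacher forcing. It assumes a plain \(\tanh\) decoder cell and a softmax output layer; the parameters `W_s`, `U_s`, `C_s`, `W_o`, `b_o` are hypothetical stand-ins for \(f\) and \(g\), not the exact parameterization used in the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def decoder_nll(Y, E, c, W_s, U_s, C_s, W_o, b_o):
    """Y: target word ids (T_y,); E: target embedding matrix (V, d_e);
    c: fixed-length context vector. Returns L = -sum_t log P(y_t | y_<t, c)."""
    s = np.tanh(C_s @ c)                    # initialize s_0 from the context
    y_prev = np.zeros(E.shape[1])           # embedding of a <bos>-like start token
    loss = 0.0
    for t in range(len(Y)):
        s = np.tanh(W_s @ y_prev + U_s @ s + C_s @ c)  # s_t depends on y_{t-1}, s_{t-1}, c
        p = softmax(W_o @ s + b_o)                     # g(y_{t-1}, s_t, c)
        loss += -np.log(p[Y[t]])                       # per-time-step cross entropy L^t
        y_prev = E[Y[t]]                               # teacher forcing: feed the true y_t
    return loss

# Tiny usage example with random parameters.
rng = np.random.default_rng(1)
V, d_e, d_s, d_c, T_y = 10, 4, 8, 8, 5
print(decoder_nll(rng.integers(0, V, size=T_y), rng.normal(size=(V, d_e)),
                  rng.normal(size=d_c),
                  rng.normal(size=(d_s, d_e)), rng.normal(size=(d_s, d_s)),
                  rng.normal(size=(d_s, d_c)), rng.normal(size=(V, d_s)), np.zeros(V)))
```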

Learning to Align and Translate

The attention model uses a bidirectional RNN as an encoder, together with a decoder that emulates searching through the source sentence while decoding a translation.

Decoder: General Description

In the new model, we define each conditional distribution as:

\[\hat{P}_{\mathbf{Y}_i \mid \mathbf{Y}_1, \ldots, \mathbf{Y}_{i-1}, \mathbf{x}} \triangleq g(\mathbf{y}_{i-1}, \mathbf{s}_i, \mathbf{c}_i)\]

Notice that here we have a different context vector \(\mathbf{c}_i\) for each decoding time step \(i\).

The context vector \(\mathbf{c}_i\) depends on a sequence of annotations \((\mathbf{h}_1, \ldots, \mathbf{h}_{T_x})\). Each annotation \(\mathbf{h}_j\) contains information about the whole input sequence (similar to the encoder hidden states) with a strong focus on the parts surrounding the \(j\)th word of the input.

The context vector \(\mathbf{c}_i\) is then computed as a weighted sum of these annotations \(\mathbf{h}_{j}\):

\[\mathbf{c}_i = \sum^{T_x}_{j=1} \alpha_{ij} \mathbf{h}_j\]

\[\alpha_{ij} = \frac{\exp(e_{ij})}{\sum^{T_x}_{k=1} \exp(e_{ik})}\]

Where

\[e_{ij} = a(\mathbf{s}_{i-1}, \mathbf{h}_j)\]

is an alignment model which scores how well the inputs around position \(j\) and the output at position \(i\) match. The model \(a\) is parameterized as an MLP that is jointly trained with all the other components of the system. The score is based on the RNN decoder's previous hidden state \(\mathbf{s}_{i-1}\) and the \(j\)th annotation \(\mathbf{h}_j\) of the input sentence. It reflects the importance of each annotation vector \(\mathbf{h}_j\), with respect to the previous hidden state \(\mathbf{s}_{i-1}\), in deciding the next state \(\mathbf{s}_i\) and generating the prediction \(\hat{\mathbf{y}}_i\).
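
The sketch below computes \(e_{ij}\), \(\alpha_{ij}\), and \(\mathbf{c}_i\) for one decoding step in NumPy, using the additive form of the alignment model \(e_{ij} = \mathbf{v}_a^\top \tanh(\mathbf{W}_a \mathbf{s}_{i-1} + \mathbf{U}_a \mathbf{h}_j)\); the parameter names \(\mathbf{v}_a\), \(\mathbf{W}_a\), \(\mathbf{U}_a\) and the dimensions are assumptions of this sketch.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_context(s_prev, H, W_a, U_a, v_a):
    """s_prev: previous decoder state s_{i-1} of shape (d_s,);
    H: annotations of shape (T_x, d_h). Returns (c_i, alpha_i)."""
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])  # e_ij
    alpha = softmax(e)            # alpha_ij = exp(e_ij) / sum_k exp(e_ik)
    c = alpha @ H                 # c_i = sum_j alpha_ij h_j
    return c, alpha

# Usage: one context vector over T_x = 5 annotations of dimension d_h = 6.
rng = np.random.default_rng(2)
d_s, d_h, d_a, T_x = 4, 6, 3, 5
c_i, alpha_i = attention_context(
    rng.normal(size=d_s), rng.normal(size=(T_x, d_h)),
    rng.normal(size=(d_a, d_s)), rng.normal(size=(d_a, d_h)), rng.normal(size=d_a))
print(c_i.shape, round(alpha_i.sum(), 6))  # (6,) 1.0
```

Since the weights \(\alpha_{ij}\) sum to one, \(\mathbf{c}_i\) is exactly the expectation over annotations described next.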

We can think of taking a weighted sum of all the annotations as computing an expected annotation, where the expectation is over possible alignments (\(\alpha_{ij}\)). In other words, let \(\alpha_{ij}\) be the probability that the target word \(\mathbf{y}_i\) is aligned to, or translated from, the source word \(\mathbf{x}_{j}\). Then the \(i\)th context vector \(\mathbf{c}_i\) is the expected value of the annotations under the distribution defined by the \(\alpha_{ij}\).

Intuitively, this implements a mechanism of attention in the decoder: the decoder decides which parts of the source sentence to pay attention to. By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all the information in the source sentence into a fixed-length vector.

Encoder: Bidirectional RNN for Annotating Sequences

A bidirectional RNN consists of a forward and a backward RNN. The forward RNN \(\overset{\rightarrow}{f}\) reads the input sequence in the original order (from \(\mathbf{x}_1\) to \(\mathbf{x}_{T_x}\)) and produces a sequence of forward hidden states \((\overset{\rightarrow}{\mathbf{h}}_1, \ldots, \overset{\rightarrow}{\mathbf{h}}_{T_x})\). The backward RNN \(\overset{\leftarrow}{f}\) reads the sequence in the reverse order (from \(\mathbf{x}_{T_x}\) to \(\mathbf{x}_1\)), resulting in a sequence of backward hidden states \((\overset{\leftarrow}{\mathbf{h}}_1, \ldots, \overset{\leftarrow}{\mathbf{h}}_{T_x})\).

The annotation vector \(\mathbf{h}_j\) is then calculated by concatenating the forward hidden state \(\overset{\rightarrow}{\mathbf{h}}_j\) and the backward hidden state \(\overset{\leftarrow}{\mathbf{h}}_j\):

\[\mathbf{h}_j = [\overset{\rightarrow}{\mathbf{h}}_j; \overset{\leftarrow}{\mathbf{h}}_j]\]

This sequence of annotations is later used by the decoder and the alignment model to compute the context vector \(\mathbf{c}_{i}\).
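
A minimal NumPy sketch of this encoder is given below, assuming plain \(\tanh\) RNN cells for \(\overset{\rightarrow}{f}\) and \(\overset{\leftarrow}{f}\) and illustrative parameter names; each annotation is the concatenation of the forward and backward states at the same position.

```python
import numpy as np

def rnn(X, W_x, W_h, b):
    """Simple tanh RNN over X of shape (T_x, d_x); returns all hidden states."""
    h = np.zeros(W_h.shape[0])
    out = []
    for x_t in X:
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        out.append(h)
    return np.stack(out)

def bidirectional_annotations(X, fwd_params, bwd_params):
    """Return annotations of shape (T_x, 2 * d_h): h_j = [h_fwd_j ; h_bwd_j]."""
    H_fwd = rnn(X, *fwd_params)               # reads x_1 ... x_{T_x}
    H_bwd = rnn(X[::-1], *bwd_params)[::-1]   # reads x_{T_x} ... x_1, then re-aligned
    return np.concatenate([H_fwd, H_bwd], axis=1)

# Tiny usage example with random parameters.
rng = np.random.default_rng(3)
d_x, d_h, T_x = 4, 5, 7
params = lambda: (rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h)), np.zeros(d_h))
H = bidirectional_annotations(rng.normal(size=(T_x, d_x)), params(), params())
print(H.shape)  # (7, 10)
```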

Effective Approaches to Attention-based Neural Machine Translation

Long Short-Term Memory-Networks for Machine Reading

Ref

https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html#self-attention