RNN
Recurrent Neural Network
Introduction
Consider the recurrent equation:

$$s^{(t)} = f\!\left(s^{(t-1)}; \theta\right)$$

For a finite number of time steps $\tau$, the recurrence can be unfolded by applying the definition $\tau - 1$ times; for example, for $\tau = 3$:

$$s^{(3)} = f\!\left(s^{(2)}; \theta\right) = f\!\left(f\!\left(s^{(1)}; \theta\right); \theta\right)$$

Then, this expression can now be represented as a DAG because it no longer involves recurrence. Notice here that the same parameters $\theta$ appear at every step. We can see that now the unfolded expression is an ordinary feed-forward computation, so it can be evaluated and differentiated with standard computational-graph techniques. Many recurrent neural networks use a similar idea to express their hidden units:

$$h^{(t)} = f\!\left(h^{(t-1)}, x^{(t)}; \theta\right)$$
Typically, an RNN has output layers that produce predictions at given time steps. When the recurrent network is trained to perform a task that requires predicting the future from the past, it typically learns to use $h^{(t)}$ as a kind of lossy summary of the task-relevant aspects of the past sequence of inputs up to time $t$.
The unfolded structure has several advantages:
- The learned model is defined in terms of the transition from hidden units $h^{(t-1)}$ (input) to $h^{(t)}$ (output) regardless of the value of time $t$. Thus, we can have one model for sequences of different lengths.
- The parameters $\theta$ are shared across all time steps (see the sketch after this list).
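As a concrete illustration, here is a minimal NumPy sketch (not from the original notes) of unrolling the shared transition function $f$ over a short sequence. The weight names `W`, `U`, `b` anticipate the forward-pass notation below; all sizes and the random data are arbitrary assumptions.

```python
import numpy as np

# Minimal sketch of unrolling h^(t) = f(h^(t-1), x^(t); theta) with shared parameters.
rng = np.random.default_rng(0)
hidden_size, input_size, T = 4, 3, 5

W = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden, shared for all t
U = rng.normal(size=(hidden_size, input_size))   # input-to-hidden, shared for all t
b = np.zeros(hidden_size)

def f(h_prev, x_t):
    """One application of the shared transition function."""
    return np.tanh(b + W @ h_prev + U @ x_t)

h = np.zeros(hidden_size)                 # h^(0)
xs = rng.normal(size=(T, input_size))     # an input sequence of length T
for x_t in xs:                            # the same f (same theta) at every step
    h = f(h, x_t)
print(h.shape)                            # (4,) -- a lossy summary of the sequence
```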
Forward Pass
In formulas, the forward pass for an RNN with equal input and output lengths, tanh non-linearity, and probabilities as predictions can be described as follows. Starting from an initial state $h^{(0)}$, for each time step $t = 1, \dots, \tau$:

$$a^{(t)} = b + W h^{(t-1)} + U x^{(t)}$$

$$h^{(t)} = \tanh\!\left(a^{(t)}\right)$$

$$o^{(t)} = c + V h^{(t)}$$

$$\hat{y}^{(t)} = \operatorname{softmax}\!\left(o^{(t)}\right)$$

$W$ is the hidden-to-hidden weight matrix that encodes information about past sequences. $V$ is the hidden-to-output weight matrix that is responsible for prediction from the current hidden units. $U$ is the input-to-hidden weight matrix that parameterizes the input. $b$ is the bias associated with $a^{(t)}$, and $c$ is the bias associated with $o^{(t)}$.

Then the multi-class cross entropy loss of a sequence of $x$ values paired with a sequence of $y$ values is:

$$L = \sum_t L^{(t)}$$

where

$$L^{(t)} = -\log p_{\text{model}}\!\left(y^{(t)} \mid x^{(1)}, \dots, x^{(t)}\right)$$

and $p_{\text{model}}\!\left(y^{(t)} \mid x^{(1)}, \dots, x^{(t)}\right)$ is obtained by reading the entry for $y^{(t)}$ from the output vector $\hat{y}^{(t)}$.
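Below is a minimal NumPy sketch of this forward pass and loss. The sizes, random data, and the `softmax` helper are illustrative assumptions; the variable names follow the equations above.

```python
import numpy as np

rng = np.random.default_rng(1)
input_size, hidden_size, output_size, T = 3, 4, 5, 6

U = rng.normal(size=(hidden_size, input_size))   # input-to-hidden
W = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden
V = rng.normal(size=(output_size, hidden_size))  # hidden-to-output
b = np.zeros(hidden_size)                        # bias for a^(t)
c = np.zeros(output_size)                        # bias for o^(t)

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

xs = rng.normal(size=(T, input_size))
ys = rng.integers(0, output_size, size=T)        # target class index per step

h = np.zeros(hidden_size)                        # h^(0)
loss = 0.0
for t in range(T):
    a = b + W @ h + U @ xs[t]                    # a^(t)
    h = np.tanh(a)                               # h^(t)
    o = c + V @ h                                # o^(t)
    y_hat = softmax(o)                           # yhat^(t)
    loss += -np.log(y_hat[ys[t]])                # L^(t), cross entropy
print(loss)
```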
Teacher Forcing and Networks with Output Recurrence
RNNs are expensive to train. The runtime is $O(\tau)$ and cannot be reduced by parallelization, because the forward propagation graph is inherently sequential: each time step can only be computed after the previous one. The memory cost is also $O(\tau)$, since the states computed in the forward pass must be stored until they are reused in the backward pass.

To reduce the computational and memory cost, one option is to replace the hidden-to-hidden recurrence with a target-to-hidden recurrence. Then, for any loss function based on comparing the prediction at time $t$ with the training target at time $t$, all the time steps are decoupled, and training can be parallelized across time steps.

This technique is called teacher forcing. Teacher forcing is a procedure that emerges from the maximum likelihood criterion, in which during training the model receives the ground truth output $y^{(t)}$ as an input at time $t+1$, instead of its own previous output.
The disadvantage of strict teacher forcing arises if the network is later used in an open-loop mode, with the network's outputs fed back as inputs. In this case there is a gap between training and testing. One approach to close this gap, i.e. to mitigate the mismatch between the inputs seen at train time and the inputs seen at test time, is to randomly choose between model-generated values and actual data values as input during training.
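The following sketch illustrates this idea under simplifying assumptions: `step` is a hypothetical one-step decoder that maps the previous hidden state and previous output token to the next state and prediction, and `p_teacher` controls how often the ground truth is fed back in (a randomized, scheduled-sampling-style mix rather than strict teacher forcing). All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
hidden_size, vocab_size, T = 4, 5, 6

W = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden
E = rng.normal(size=(hidden_size, vocab_size))   # embedding of the previous output token
V = rng.normal(size=(vocab_size, hidden_size))   # hidden-to-output

def step(h_prev, y_prev):
    """Hypothetical one-step decoder: previous state + previous token -> new state, prediction."""
    h_t = np.tanh(W @ h_prev + E[:, y_prev])
    logits = V @ h_t
    return h_t, int(np.argmax(logits))

targets = rng.integers(0, vocab_size, size=T)

h, y_prev = np.zeros(hidden_size), 0
p_teacher = 0.75                                 # probability of using the ground truth
for t in range(T):
    h, y_hat = step(h, y_prev)
    # Strict teacher forcing always feeds the ground truth target back in;
    # the randomized mix sometimes feeds the model's own output instead.
    y_prev = targets[t] if rng.random() < p_teacher else y_hat
```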
Backward Pass
The backward propagation algorithm for RNNs is called back-propagation through time (BPTT):
Assume we have the same RNN structure as in the forward pass (all gradients are treated as column vectors). With softmax outputs and cross-entropy loss, the gradient on the outputs at time step $t$ is:

$$\left(\nabla_{o^{(t)}} L\right)_i = \hat{y}^{(t)}_i - \mathbf{1}_{i = y^{(t)}}$$

Then the gradient for the hidden units at the last time step $\tau$ is:

$$\nabla_{h^{(\tau)}} L = V^{\top} \nabla_{o^{(\tau)}} L$$

Work our way backwards for $t = \tau - 1, \dots, 1$. Since

$$h^{(t+1)} = \tanh\!\left(b + W h^{(t)} + U x^{(t+1)}\right)$$

the Jacobian of $h^{(t+1)}$ with respect to $h^{(t)}$ involves $\operatorname{diag}\!\left(1 - \left(h^{(t+1)}\right)^2\right) W$. Then:

$$\nabla_{h^{(t)}} L = W^{\top} \operatorname{diag}\!\left(1 - \left(h^{(t+1)}\right)^2\right) \nabla_{h^{(t+1)}} L + V^{\top} \nabla_{o^{(t)}} L$$

Recall that in an RNN all parameters are shared across time steps, so the parameter gradients sum contributions over all time steps:

$$\nabla_{c} L = \sum_t \nabla_{o^{(t)}} L$$

$$\nabla_{b} L = \sum_t \operatorname{diag}\!\left(1 - \left(h^{(t)}\right)^2\right) \nabla_{h^{(t)}} L$$

$$\nabla_{V} L = \sum_t \left(\nabla_{o^{(t)}} L\right) {h^{(t)}}^{\top}$$

$$\nabla_{W} L = \sum_t \operatorname{diag}\!\left(1 - \left(h^{(t)}\right)^2\right) \left(\nabla_{h^{(t)}} L\right) {h^{(t-1)}}^{\top}$$

$$\nabla_{U} L = \sum_t \operatorname{diag}\!\left(1 - \left(h^{(t)}\right)^2\right) \left(\nabla_{h^{(t)}} L\right) {x^{(t)}}^{\top}$$
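A minimal NumPy sketch of BPTT using the gradient formulas above. Sizes and data are arbitrary assumptions, and this is an illustration rather than an optimized implementation; it reuses the forward-pass names `U`, `W`, `V`, `b`, `c`.

```python
import numpy as np

rng = np.random.default_rng(3)
input_size, hidden_size, output_size, T = 3, 4, 5, 6

U = rng.normal(size=(hidden_size, input_size))
W = rng.normal(size=(hidden_size, hidden_size))
V = rng.normal(size=(output_size, hidden_size))
b, c = np.zeros(hidden_size), np.zeros(output_size)

xs = rng.normal(size=(T, input_size))
ys = rng.integers(0, output_size, size=T)

# Forward pass, storing h^(t) and yhat^(t) for reuse in the backward pass.
hs, yhats = [np.zeros(hidden_size)], []
for t in range(T):
    h = np.tanh(b + W @ hs[-1] + U @ xs[t])
    o = c + V @ h
    e = np.exp(o - o.max())
    yhats.append(e / e.sum())
    hs.append(h)

# Backward pass (BPTT).
dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
db, dc = np.zeros_like(b), np.zeros_like(c)
dh_next = np.zeros(hidden_size)                   # W^T diag(1-h^2) grad from step t+1
for t in reversed(range(T)):
    do = yhats[t].copy()
    do[ys[t]] -= 1.0                              # nabla_{o^(t)} L
    h, h_prev = hs[t + 1], hs[t]
    dh = V.T @ do + dh_next                       # nabla_{h^(t)} L
    da = (1.0 - h ** 2) * dh                      # through the tanh
    dc += do
    dV += np.outer(do, h)
    db += da
    dW += np.outer(da, h_prev)
    dU += np.outer(da, xs[t])
    dh_next = W.T @ da                            # propagate to step t-1
print(dW.shape, dU.shape, dV.shape)
```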
Bidirectional RNNs
The RNN structures so far only capture information from the past (i.e., the output at time step $t$ depends only on $x^{(1)}, \dots, x^{(t)}$). In many applications, however, the prediction may depend on the whole input sequence. Bidirectional RNNs address this by combining a sub-RNN that moves forward through time with a sub-RNN that moves backward through time.

$h^{(t)}$ is the hidden state of the sub-RNN that moves forward through time. $g^{(t)}$ is the hidden state of the sub-RNN that moves backward through time. The output $o^{(t)}$ is calculated using both $h^{(t)}$ and $g^{(t)}$, and it is most sensitive to input values around time $t$ without having to specify a fixed-size window around $t$.
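A minimal sketch of a bidirectional RNN, assuming arbitrary sizes and random data: one sub-RNN is run forward in time, one backward, and each output reads both states.

```python
import numpy as np

rng = np.random.default_rng(4)
input_size, hidden_size, output_size, T = 3, 4, 2, 5

Uf = rng.normal(size=(hidden_size, input_size))   # forward-in-time sub-RNN
Wf = rng.normal(size=(hidden_size, hidden_size))
Ub = rng.normal(size=(hidden_size, input_size))   # backward-in-time sub-RNN
Wb = rng.normal(size=(hidden_size, hidden_size))
V = rng.normal(size=(output_size, 2 * hidden_size))  # reads both h^(t) and g^(t)

xs = rng.normal(size=(T, input_size))

# Forward-in-time states h^(1..T).
h, hs = np.zeros(hidden_size), []
for t in range(T):
    h = np.tanh(Wf @ h + Uf @ xs[t])
    hs.append(h)

# Backward-in-time states g^(T..1).
g, gs = np.zeros(hidden_size), [None] * T
for t in reversed(range(T)):
    g = np.tanh(Wb @ g + Ub @ xs[t])
    gs[t] = g

# Each output o^(t) uses both directions.
os = [V @ np.concatenate([hs[t], gs[t]]) for t in range(T)]
print(len(os), os[0].shape)                       # T outputs, each of size output_size
```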
Encoder-Decoder
We know that we can use an RNN to encode an input sequence into a fixed-length vector, and we know that we can map a fixed-size vector to a sequence. Previously, one output corresponded to one input, so the input sequence and the output sequence had the same length. Using the encoder-decoder or sequence-to-sequence structure, we can handle input and output sequences of different lengths: we train an encoder RNN to process the input sequence and emit a context vector $C$ (usually a simple function of its final hidden state), which then conditions a decoder RNN that generates the output sequence.

In the sequence-to-sequence structure, the two RNNs are trained jointly to maximize the average of the log conditional probability

$$\log P\!\left(y^{(1)}, \dots, y^{(n_y)} \mid x^{(1)}, \dots, x^{(n_x)}\right)$$

over all pairs of $x$ and $y$ sequences in the training set.

One clear limitation of this architecture is that the context $C$ may be too small to properly summarize a long input sequence. One remedy is to make $C$ a variable-length sequence rather than a fixed-size vector and to add an attention mechanism that learns to associate elements of the sequence $C$ with elements of the output sequence.
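A minimal sketch of the encoder-decoder idea under simplifying assumptions (greedy decoding, a hypothetical embedding matrix `E_dec` for the previous output token, arbitrary sizes): the encoder's final hidden state is used as the context $C$ that initializes the decoder.

```python
import numpy as np

rng = np.random.default_rng(5)
input_size, hidden_size, vocab_size = 3, 4, 6
n_x, n_y = 5, 3                                  # input and output lengths may differ

U_enc = rng.normal(size=(hidden_size, input_size))
W_enc = rng.normal(size=(hidden_size, hidden_size))
W_dec = rng.normal(size=(hidden_size, hidden_size))
E_dec = rng.normal(size=(hidden_size, vocab_size))   # embeds the previous output token
V_dec = rng.normal(size=(vocab_size, hidden_size))

xs = rng.normal(size=(n_x, input_size))

# Encoder: consume the whole input sequence.
h = np.zeros(hidden_size)
for t in range(n_x):
    h = np.tanh(W_enc @ h + U_enc @ xs[t])
C = h                                            # fixed-length context vector

# Decoder: generate an output sequence conditioned on C.
s, y_prev, outputs = C, 0, []
for t in range(n_y):
    s = np.tanh(W_dec @ s + E_dec[:, y_prev])
    y_prev = int(np.argmax(V_dec @ s))           # greedy decoding for the sketch
    outputs.append(y_prev)
print(outputs)
```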
Gated RNN
Gated RNNs include the long short-term memory (LSTM) and networks based on the gated recurrent unit (GRU). Gated RNNs are based on the idea of creating paths through time that have derivatives that neither vanish nor explode. Gated RNNs do this with connection weights that may change at each time step.
LSTM
Instead of a unit that simply applies an element-wise nonlinearity to an affine transformation of the inputs and recurrent units, LSTM recurrent networks have LSTM cells that have an internal recurrence (a self-loop) in addition to the outer recurrence of the RNN. Each cell has more parameters and a system of gating units that controls the flow of information. The most important component is the state unit $c^{(t)}$, which has a linear self-loop whose weight is controlled by a forget gate.

For each time step $t$:

$$f^{(t)} = \sigma\!\left(W_f h^{(t-1)} + U_f x^{(t)} + b_f\right)$$

$$i^{(t)} = \sigma\!\left(W_i h^{(t-1)} + U_i x^{(t)} + b_i\right)$$

$$o^{(t)} = \sigma\!\left(W_o h^{(t-1)} + U_o x^{(t)} + b_o\right)$$

$$\tilde{c}^{(t)} = \tanh\!\left(W_c h^{(t-1)} + U_c x^{(t)} + b_c\right)$$

$$c^{(t)} = f^{(t)} \odot c^{(t-1)} + i^{(t)} \odot \tilde{c}^{(t)}$$

$$h^{(t)} = o^{(t)} \odot \tanh\!\left(c^{(t)}\right)$$

where $f^{(t)}$, $i^{(t)}$, and $o^{(t)}$ are the forget, input, and output gates, $\sigma$ is the logistic sigmoid, and $\odot$ denotes element-wise multiplication.
We can see that if the forget gate $f^{(t)}$ is close to $1$ and the input gate $i^{(t)}$ is close to $0$, then $c^{(t)} \approx c^{(t-1)}$: the cell state is carried forward essentially unchanged, so information can persist over many time steps. Then, considering only the direct path through the cell state:

$$\frac{\partial c^{(t)}}{\partial c^{(t-1)}} = \operatorname{diag}\!\left(f^{(t)}\right)$$

In LSTMs, you have the state $c^{(t)}$ expressed as a sum,

$$c^{(t)} = f^{(t)} \odot c^{(t-1)} + i^{(t)} \odot \tilde{c}^{(t)},$$

where the first term is a gated copy of the previous state rather than a repeated multiplication by a fixed weight matrix followed by a squashing nonlinearity. When the network learns to keep $f^{(t)}$ near $1$, gradients flowing back along this path neither vanish nor explode, which is how the LSTM mitigates the vanishing gradient problem of plain RNNs.
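A minimal single-step LSTM cell sketch following the gate equations above; the weight names (`W_*` for hidden-to-hidden, `U_*` for input-to-hidden, `b_*` for biases) and the sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
input_size, hidden_size = 3, 4

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One (W, U, b) triple per gate/candidate: forget, input, output, candidate.
params = {name: (rng.normal(size=(hidden_size, hidden_size)),
                 rng.normal(size=(hidden_size, input_size)),
                 np.zeros(hidden_size))
          for name in ("f", "i", "o", "c")}

def lstm_step(x_t, h_prev, c_prev):
    Wf, Uf, bf = params["f"]
    Wi, Ui, bi = params["i"]
    Wo, Uo, bo = params["o"]
    Wc, Uc, bc = params["c"]
    f = sigmoid(Wf @ h_prev + Uf @ x_t + bf)      # forget gate
    i = sigmoid(Wi @ h_prev + Ui @ x_t + bi)      # input gate
    o = sigmoid(Wo @ h_prev + Uo @ x_t + bo)      # output gate
    c_tilde = np.tanh(Wc @ h_prev + Uc @ x_t + bc)
    c = f * c_prev + i * c_tilde                  # linear self-loop on the state
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):      # run a short sequence
    h, c = lstm_step(x_t, h, c)
print(h.shape, c.shape)
```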
GRU
The main difference from the LSTM is that a single gating unit simultaneously controls the forgetting factor and the decision to update the state unit, so the GRU has fewer parameters.
The inputs and outputs of a GRU are the same as those of a plain RNN.
The GRU has two gates at each time step $t$:

Reset gate:

$$r^{(t)} = \sigma\!\left(W_r h^{(t-1)} + U_r x^{(t)} + b_r\right)$$

Update gate:

$$z^{(t)} = \sigma\!\left(W_z h^{(t-1)} + U_z x^{(t)} + b_z\right)$$

The hidden unit update equation is:

$$h^{(t)} = z^{(t)} \odot h^{(t-1)} + \left(1 - z^{(t)}\right) \odot \tilde{h}^{(t)}$$

where the candidate state is

$$\tilde{h}^{(t)} = \tanh\!\left(W_h \left(r^{(t)} \odot h^{(t-1)}\right) + U_h x^{(t)} + b_h\right)$$
In summary:
- Reset gates help capture short-term dependencies in sequences (by integrating part of past information with new information to produce the candidate).
- Update gates help capture long-term dependencies in sequences.
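A minimal single-step GRU sketch following the equations above; the weight names (`W_*` for hidden-to-hidden, `U_*` for input-to-hidden, `b_*` for biases) and sizes are again illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
input_size, hidden_size = 3, 4

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Wr = rng.normal(size=(hidden_size, hidden_size)); Ur = rng.normal(size=(hidden_size, input_size)); br = np.zeros(hidden_size)
Wz = rng.normal(size=(hidden_size, hidden_size)); Uz = rng.normal(size=(hidden_size, input_size)); bz = np.zeros(hidden_size)
Wh = rng.normal(size=(hidden_size, hidden_size)); Uh = rng.normal(size=(hidden_size, input_size)); bh = np.zeros(hidden_size)

def gru_step(x_t, h_prev):
    r = sigmoid(Wr @ h_prev + Ur @ x_t + br)               # reset gate
    z = sigmoid(Wz @ h_prev + Uz @ x_t + bz)               # update gate
    h_tilde = np.tanh(Wh @ (r * h_prev) + Uh @ x_t + bh)   # candidate state
    return z * h_prev + (1.0 - z) * h_tilde                # interpolate old and new

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):               # run a short sequence
    h = gru_step(x_t, h)
print(h.shape)
```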
Ref
https://zhuanlan.zhihu.com/p/32085405
https://zhuanlan.zhihu.com/p/32481747
https://stats.stackexchange.com/questions/185639/how-does-lstm-prevent-the-vanishing-gradient-problem
https://d2l.ai/chapter_recurrent-modern/gru.html