LReLU

Rectifier Nonlinearities Improve Neural Network Acoustic Models

Background

A single hidden unit's activation \(h^{(i)}\) is given by:

\[h^{(i)} = \sigma ({\mathbf{w}^i}^T \mathbf{x})\]

where \(\sigma(\cdot)\) is the tanh function.

Some drawbacks of tanh:

  1. The vanishing gradient problem: lower layers of a DNN receive gradients of nearly 0 because units in the higher layers are nearly saturated at \(-1\) or \(1\) (see the sketch after this list).
  2. It does not produce a sparse representation with hard zeros; tanh activations can be close to zero but are never exactly zero.
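As an illustration of the first drawback, here is a minimal NumPy sketch (not from the paper; the `tanh_unit` helper is an assumption for this note) showing that the gradient \(1 - \tanh^2(z)\) collapses toward zero once the pre-activation \(z = {\mathbf{w}^i}^T \mathbf{x}\) is large in magnitude:

```python
import numpy as np

def tanh_unit(w, x):
    """Tanh hidden unit h = tanh(w^T x) and its gradient w.r.t. the pre-activation."""
    z = w @ x
    h = np.tanh(z)
    dh_dz = 1.0 - h ** 2  # approaches 0 as h saturates at -1 or 1
    return h, dh_dz

rng = np.random.default_rng(0)
w = rng.normal(size=100)
x_small = rng.normal(size=100) * 0.01   # small pre-activation: healthy gradient
x_large = rng.normal(size=100) * 10.0   # large pre-activation: unit saturates

for x in (x_small, x_large):
    h, g = tanh_unit(w, x)
    print(f"h = {h:+.4f}, dh/dz = {g:.6f}")
```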

ReLU addresses the vanishing gradient problem because active units always have a gradient of 1. It also produces a sparse representation, which is useful for classification. However, an inactive unit may never activate, because gradient-based optimization will not adjust the weights of a unit that never activates initially. We might therefore expect learning to be slow.
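The following NumPy sketch (the `relu_unit` helper is assumed, not code from the paper) illustrates both points: the hard zeros that give sparsity, and the exactly-zero gradient that can leave a unit permanently inactive:

```python
import numpy as np

def relu_unit(z):
    """ReLU activation and its subgradient w.r.t. the pre-activation z = w^T x."""
    h = np.maximum(z, 0.0)
    dh_dz = (z > 0).astype(float)  # 1 for active units, exactly 0 for inactive ones
    return h, dh_dz

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
h, g = relu_unit(z)
print("h     :", h)   # hard zeros -> sparse representation
print("dh/dz :", g)   # inactive units receive no gradient and may never recover
```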

Leaky ReLU

LReLU allows for a small, non-zero gradient when the unit is saturated and not active:

\[ h^{(i)} = \begin{cases} {\mathbf{w}^i}^T \mathbf{x}, \quad & {\mathbf{w}^i}^T \mathbf{x} > 0\\ 0.01\,{\mathbf{w}^i}^T \mathbf{x}, \quad & {\mathbf{w}^i}^T \mathbf{x} \leq 0\\ \end{cases} \]
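A minimal NumPy sketch of the LReLU follows; the `leaky_relu_unit` helper and its `alpha` parameter are assumptions for this note, with the default slope of 0.01 taken from the definition above:

```python
import numpy as np

def leaky_relu_unit(z, alpha=0.01):
    """Leaky ReLU: identity for positive pre-activations, small slope alpha otherwise."""
    h = np.where(z > 0, z, alpha * z)
    dh_dz = np.where(z > 0, 1.0, alpha)  # never exactly zero, so weights keep updating
    return h, dh_dz

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
h, g = leaky_relu_unit(z)
print("h     :", h)
print("dh/dz :", g)
```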