Momentum

While SGD remains a very popular optimization strategy, learning with it can be slow. The method of momentum is designed to accelerate learning, especially in the face of high curvature (the direction of the surface changes rapidly over a short distance), small but consistent gradients (a flat surface), or noisy gradients (high variance). The momentum algorithm accumulates an exponentially decaying moving average of past gradients and continues to move in their direction.


The momentum algorithm introduces two new quantities:

  1. A vector \(\mathbf{v}\) that plays the role of velocity (it has both a direction and a speed). The velocity is set to an exponentially decaying average of the negative gradients.
  2. A hyperparameter \(\alpha \in [0, 1)\) that determines how quickly the contributions of previous gradients exponentially decay.

The update rule is given by:

\[\mathbf{v} \leftarrow \alpha \mathbf{v} - \epsilon \nabla_{\boldsymbol{\theta}} (\frac{1}{N} \sum^{N}_{i=1} L(\mathbf{f} (\mathbf{x}_i; \; \boldsymbol{\theta}), \mathbf{y}_i))\]

\[\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \mathbf{v}\]
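
As a concrete illustration (the values here are chosen only for the example), take a single parameter with \(\alpha = 0.9\), \(\epsilon = 0.1\), a velocity initialized at zero, and a constant gradient of \(2\):

\[\mathbf{v} \leftarrow 0.9 \cdot 0 - 0.1 \cdot 2 = -0.2, \qquad \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - 0.2\]

\[\mathbf{v} \leftarrow 0.9 \cdot (-0.2) - 0.1 \cdot 2 = -0.38, \qquad \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - 0.38\]

The steps grow as long as the gradients keep pointing in the same direction.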

Previously, in SGD, the step was simply the gradient scaled by the learning rate, so its size was the learning rate multiplied by the norm of the gradient:

\[\epsilon \nabla_{\boldsymbol{\theta}} (\frac{1}{N} \sum^{N}_{i=1} L(\mathbf{f} (\mathbf{x}_i; \; \boldsymbol{\theta}), \mathbf{y}_i))\]

Now the size of the step depends on how large and how aligned a sequence of gradients is. The step size is largest when many successive gradients point in exactly the same direction. If the momentum algorithm always observes the same gradient \(\mathbf{g}\), then it accelerates in the direction of \(-\mathbf{g}\):

\[\mathbf{v}_1 \leftarrow - \epsilon \mathbf{g}\]

\[\mathbf{v}_2 \leftarrow -\mathbf{g} \epsilon(\alpha + 1)\]

\[\mathbf{v}_N \leftarrow -\mathbf{g} \epsilon \sum^{N-1}_{i=0} \alpha^i\]

Since \(\sum^{\infty}_{i=0} \alpha^i = \frac{1}{1 - \alpha}\) for \(\alpha \in [0, 1)\), the velocity converges:

\[\|\mathbf{v}_{\infty}\| = \frac{\epsilon \|\mathbf{g}\|}{1 - \alpha}\]

The terminal velocity has speed \(\frac{\epsilon \|\mathbf{g}\|}{1 - \alpha} \gg \epsilon \|\mathbf{g}\|\); with the common choice \(\alpha = 0.9\), for example, the terminal step is ten times larger than the corresponding SGD step. This makes sense: if we keep receiving small but consistent gradients, we want to take larger steps because we are confident we are moving in the right direction. On the other hand, gradients that keep changing direction (high curvature) partially cancel inside the velocity, which keeps the steps small and allows convergence.
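
A quick numerical check of this limit, reusing the same example values for \(\epsilon\), \(\alpha\), and \(\mathbf{g}\) as above (chosen only for illustration):

# Terminal speed of momentum under a constant gradient.
eps, alpha, g = 0.1, 0.9, 2.0

v = 0.0
for _ in range(200):
    # v <- alpha * v - eps * g
    v = alpha * v - eps * g

print(abs(v))                      # ~2.0 after many steps
print(eps * abs(g) / (1 - alpha))  # 2.0, the predicted terminal speed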

Nesterov Momentum

Nesterov momentum is inspired by Nesterov's accelerated gradient method. The update rules in this case are given by:

\[\mathbf{v} \leftarrow \alpha \mathbf{v} - \epsilon \nabla_{\boldsymbol{\theta}} (\frac{1}{N} \sum^{N}_{i=1} L(\mathbf{f} (\mathbf{x}_i; \; \boldsymbol{\theta} + \alpha \mathbf{v}), \mathbf{y}_i))\]

\[\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \mathbf{v}\]

This is similar to standard momentum, but before computing the gradient we first take a step along the previous velocity; we then evaluate the gradient at that look-ahead point and adjust the velocity accordingly. We can also think of this as adding a correction factor to the standard method of momentum.

Implementation

class SGD:
    def __init__(self, lr, model_vars):
        self.model_vars = model_vars
        self.lr = lr

    def __call__(self, grad):
        # theta <- theta - lr * grad
        for i, v in enumerate(self.model_vars):
            self.model_vars[i] = v - self.lr * grad[i]

        return self.model_vars


class Momentum:
    def __init__(self, lr, model_vars, alpha=0.9):
        self.lr = lr
        self.model_vars = model_vars
        self.alpha = alpha
        # One velocity per variable, initialized to zero.
        self.v = [0.0] * len(model_vars)

    def __call__(self, grad):
        for i in range(len(self.model_vars)):
            # v <- alpha * v - lr * grad
            self.v[i] = self.alpha * self.v[i] - self.lr * grad[i]
            # theta <- theta + v
            self.model_vars[i] = self.model_vars[i] + self.v[i]

        return self.model_vars
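
Nesterov momentum does not fit the same interface directly, because the gradient must be evaluated at the look-ahead point \(\boldsymbol{\theta} + \alpha \mathbf{v}\) rather than at the current parameters. Below is a minimal sketch, assuming the caller passes a grad_fn callable that returns the gradients at any given parameter values (this interface is an assumption for illustration, not part of the code above):

class NesterovMomentum:
    def __init__(self, lr, model_vars, alpha=0.9):
        self.lr = lr
        self.model_vars = model_vars
        self.alpha = alpha
        self.v = [0.0] * len(model_vars)

    def __call__(self, grad_fn):
        # Evaluate the gradient at the look-ahead point theta + alpha * v.
        # grad_fn is an assumed callable: parameters -> gradients.
        lookahead = [p + self.alpha * v for p, v in zip(self.model_vars, self.v)]
        grad = grad_fn(lookahead)

        for i in range(len(self.model_vars)):
            # v <- alpha * v - lr * grad(theta + alpha * v)
            self.v[i] = self.alpha * self.v[i] - self.lr * grad[i]
            # theta <- theta + v
            self.model_vars[i] = self.model_vars[i] + self.v[i]

        return self.model_vars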

Ref