Gradient Descent

For a multi-variable function $F$, defined and differentiable in a neighborhood of a point $a$, the update $a_{n+1} = a_n - \gamma \nabla F(a_n)$, for $\gamma > 0$ small enough,

leads to $F(a_{n+1}) \le F(a_n)$.

If $F$ is convex and $\nabla F$ is Lipschitz continuous, the iterates converge to a minimum of $F$ (for a convex function, every local minimum is global).
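
As a concrete illustration (not part of the original notes), a minimal NumPy sketch of this iteration; the function name `gradient_descent`, the toy objective, and the step size $\gamma = 0.1$ are illustrative choices:

```python
import numpy as np

def gradient_descent(grad_F, a0, gamma=0.1, n_steps=100):
    """Iterate a_{n+1} = a_n - gamma * grad_F(a_n)."""
    a = np.asarray(a0, dtype=float)
    for _ in range(n_steps):
        a = a - gamma * grad_F(a)
    return a

# Toy example: F(x, y) = x^2 + 3y^2, so grad_F(x, y) = (2x, 6y).
grad_F = lambda a: np.array([2 * a[0], 6 * a[1]])
print(gradient_descent(grad_F, [2.0, -1.0]))  # tends to the minimizer (0, 0)
```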

Optimization : Momentum

Let $v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta)$ and update $\theta = \theta - v_t$, where $\gamma$ is the momentum coefficient and $\eta$ the learning rate.

The momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions. As a result, we gain faster convergence and reduced oscillation.
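
A minimal sketch of this update, assuming `grad` returns $\nabla_\theta J(\theta)$; the defaults $\gamma = 0.9$ and $\eta = 0.01$ are common but illustrative:

```python
import numpy as np

def sgd_momentum(grad, theta0, eta=0.01, gamma=0.9, n_steps=1000):
    """v_t = gamma * v_{t-1} + eta * grad(theta); theta = theta - v_t."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)  # accumulated velocity
    for _ in range(n_steps):
        v = gamma * v + eta * grad(theta)
        theta = theta - v
    return theta
```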

Optimization : Nesterov

Version with correction: the gradient is evaluated at the anticipated position, $v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta - \gamma v_{t-1})$, $\theta = \theta - v_t$.

This anticipatory update prevents us from going too fast and results in increased responsiveness, which has significantly increased the performance of RNNs on a number of tasks.
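
The same sketch with the Nesterov correction: the only change is that the gradient is evaluated at the look-ahead point $\theta - \gamma v_{t-1}$ (hyper-parameters again illustrative):

```python
import numpy as np

def nesterov_momentum(grad, theta0, eta=0.01, gamma=0.9, n_steps=1000):
    """Evaluate the gradient at the anticipated position theta - gamma * v."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(n_steps):
        v = gamma * v + eta * grad(theta - gamma * v)  # look-ahead gradient
        theta = theta - v
    return theta
```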

Adagrad

It adapts the learning rate to the parameters, performing larger updates for infrequently updated parameters and smaller updates for frequently updated ones. For this reason, it is well suited for dealing with sparse data.
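
A sketch of this per-parameter scaling: each coordinate's step is divided by the square root of its accumulated squared gradients, so frequently updated coordinates get smaller effective learning rates (defaults illustrative):

```python
import numpy as np

def adagrad(grad, theta0, eta=0.01, eps=1e-8, n_steps=1000):
    theta = np.asarray(theta0, dtype=float)
    G = np.zeros_like(theta)  # per-parameter sum of squared gradients
    for _ in range(n_steps):
        g = grad(theta)
        G += g ** 2
        theta = theta - eta * g / (np.sqrt(G) + eps)  # larger G => smaller step
    return theta
```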

Applications : training Google's large-scale neural nets that learned to recognize cats in YouTube videos; training GloVe word embeddings.

Adadelta

Adadelta is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to some fixed size w.
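
In practice the fixed window is implemented as an exponentially decaying average rather than by storing w past gradients. A sketch under that standard formulation, which also replaces the global learning rate by an RMS of past updates (the values of rho and eps are illustrative):

```python
import numpy as np

def adadelta(grad, theta0, rho=0.9, eps=1e-6, n_steps=1000):
    theta = np.asarray(theta0, dtype=float)
    Eg2 = np.zeros_like(theta)   # decaying average of squared gradients
    Edx2 = np.zeros_like(theta)  # decaying average of squared updates
    for _ in range(n_steps):
        g = grad(theta)
        Eg2 = rho * Eg2 + (1 - rho) * g ** 2
        dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g  # no learning rate needed
        Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
        theta = theta + dx
    return theta
```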

RMSprop

RMSprop and Adadelta were both developed independently around the same time, stemming from the need to resolve Adagrad’s radically diminishing learning rates. RMSprop is in fact identical to the first update vector of Adadelta.
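
A sketch of that first update vector: like Adagrad, but dividing by an exponentially decaying average of squared gradients instead of a monotonically growing sum ($\eta = 0.001$ and $\rho = 0.9$ are the commonly cited defaults, still illustrative here):

```python
import numpy as np

def rmsprop(grad, theta0, eta=0.001, rho=0.9, eps=1e-8, n_steps=1000):
    theta = np.asarray(theta0, dtype=float)
    Eg2 = np.zeros_like(theta)  # decaying average of squared gradients
    for _ in range(n_steps):
        g = grad(theta)
        Eg2 = rho * Eg2 + (1 - rho) * g ** 2
        theta = theta - eta * g / (np.sqrt(Eg2) + eps)
    return theta
```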

Adam

Adaptive Moment Estimation (Adam) is another method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients $v_t$, like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients $m_t$, similar to momentum.
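
A sketch combining the two decaying averages, with the usual bias-correction terms for $m_t$ and $v_t$ ($\beta_1 = 0.9$, $\beta_2 = 0.999$ are the commonly reported defaults; still an illustrative implementation):

```python
import numpy as np

def adam(grad, theta0, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1000):
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)  # decaying average of gradients (first moment)
    v = np.zeros_like(theta)  # decaying average of squared gradients (second moment)
    for t in range(1, n_steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta
```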