Gradient Descent

An overview of gradient descent optimization algorithms

Sebastian Ruder, Insight Centre for Data Analytics, NUI Galway; Aylien Ltd., Dublin, 2017

Gradient Descent

Let $f : \mathbb{R}^{n} \to \mathbb{R}$ be a multivariable function, defined and differentiable in a neighborhood of a point $x_{k} \in \mathbb{R}^{n}$. For a step size $λ \in \mathbb{R}_{+}$ small enough, the update

$x_{k + 1} = x_{k} - λ \nabla f (x_{k})$

leads to $f (x_{k + 1}) \leq f (x_{k})$ .

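If $f$ is convex and $\nabla f$ is Lipschitz continuous, then for $λ$ small enough the values $f (x_{k})$ converge to a local minimum of $f$, which by convexity is also a global minimum.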
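
As a minimal sketch of the update rule above, the following pure-Python loop applies $x_{k + 1} = x_{k} - λ \nabla f (x_{k})$ to a small quadratic. The objective `f`, its gradient `grad_f`, the starting point and the step size are illustrative assumptions, not part of the original notes.

```python
def gradient_descent(grad, x0, lr, steps):
    """Repeat x <- x - lr * grad(x) for a fixed number of steps."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


# Illustrative objective (an assumption): f(x, y) = x^2 + 10 * y^2, minimum at (0, 0).
def f(x):
    return x[0] ** 2 + 10.0 * x[1] ** 2


def grad_f(x):
    return [2.0 * x[0], 20.0 * x[1]]


start = [3.0, -2.0]
one_step = gradient_descent(grad_f, start, lr=0.05, steps=1)
x_min = gradient_descent(grad_f, start, lr=0.05, steps=200)
print(f(start), f(one_step), f(x_min))
```

With this step size the objective value drops after a single step and is close to zero after 200 steps; a step size that is too large for the curvature would break the descent property.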

Optimization : Momentum

Let $γ < 1, v_{0} = 0$ ,

$\begin{cases} x_{k + 1} = x_{k} - v_{k} \\ v_{k} = γ v_{k - 1} + λ \nabla f (x_{k}) \end{cases}$

The momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions. As a result, we gain faster convergence and reduced oscillation.
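
The momentum update can be sketched in a few lines of plain Python. The ill-conditioned quadratic, the starting point and the hyperparameters below are assumptions chosen for illustration; setting `gamma=0.0` reduces the loop to plain gradient descent, which gives a baseline for comparison.

```python
def momentum_descent(grad, x0, lr, gamma, steps):
    """v <- gamma * v + lr * grad(x); x <- x - v, starting from v = 0."""
    x = list(x0)
    v = [0.0] * len(x)
    for _ in range(steps):
        g = grad(x)
        v = [gamma * vi + lr * gi for vi, gi in zip(v, g)]
        x = [xi - vi for xi, vi in zip(x, v)]
    return x


# Illustrative ill-conditioned objective (an assumption): f(x, y) = 0.5 * (x^2 + 100 * y^2).
def f(x):
    return 0.5 * (x[0] ** 2 + 100.0 * x[1] ** 2)


def grad_f(x):
    return [x[0], 100.0 * x[1]]


start = [1.0, 1.0]
# gamma = 0 turns the update into plain gradient descent.
plain = momentum_descent(grad_f, start, lr=0.01, gamma=0.0, steps=100)
heavy = momentum_descent(grad_f, start, lr=0.01, gamma=0.9, steps=100)
print(f(plain), f(heavy))
```

On this example the momentum run ends with a lower objective value than the plain run for the same step size and step count. Other objectives or hyperparameters may behave differently.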

Optimization : Nesterov

Version with correction,

$\begin{cases} x_{k + 1} = x_{k} - v_{k} \\ v_{k} = γ v_{k - 1} + λ \nabla f (x_{k} - γ v_{k - 1}) \end{cases}$

This anticipatory update prevents us from going too fast and results in increased responsiveness, which has significantly increased the performance of RNNs on a number of tasks.
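
A minimal sketch of the Nesterov variant follows. The only change from classical momentum is that the gradient is evaluated at the look-ahead point $x_{k} - γ v_{k - 1}$. The quadratic objective and the hyperparameters are illustrative assumptions.

```python
def nesterov_descent(grad, x0, lr, gamma, steps):
    """Momentum update whose gradient is taken at the look-ahead point x - gamma * v."""
    x = list(x0)
    v = [0.0] * len(x)
    for _ in range(steps):
        lookahead = [xi - gamma * vi for xi, vi in zip(x, v)]
        g = grad(lookahead)
        v = [gamma * vi + lr * gi for vi, gi in zip(v, g)]
        x = [xi - vi for xi, vi in zip(x, v)]
    return x


# Illustrative objective (an assumption): f(x, y) = 0.5 * (x^2 + 100 * y^2).
def f(x):
    return 0.5 * (x[0] ** 2 + 100.0 * x[1] ** 2)


def grad_f(x):
    return [x[0], 100.0 * x[1]]


start = [1.0, 1.0]
x_end = nesterov_descent(grad_f, start, lr=0.01, gamma=0.9, steps=100)
print(f(start), f(x_end))
```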

Adagrad

It adapts the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters. For this reason, it is well-suited for dealing with sparse data.

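$x_{k + 1} = x_{k} - \frac{λ}{\sqrt{G_{k} + ϵ}} \nabla f (x_{k})$

where $G_{k}$ accumulates the squares of all past gradients, element-wise, and $ϵ$ is a small constant that avoids division by zero.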

Applications: Adagrad was used to train large-scale neural networks that learned to recognize cats in YouTube videos, and to train GloVe word embeddings.
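
The per-parameter scaling can be sketched as follows, dividing by the square root of the accumulated squared gradients as in the standard formulation of Adagrad. The gradient function and hyperparameters are illustrative assumptions. The point of the example is that the first step has roughly the same size in every coordinate, even though one gradient component is 100 times larger.

```python
import math


def adagrad(grad, x0, lr, eps=1e-8, steps=1):
    """G <- G + g^2 (element-wise); x <- x - lr / sqrt(G + eps) * g."""
    x = list(x0)
    G = [0.0] * len(x)
    for _ in range(steps):
        g = grad(x)
        G = [Gi + gi * gi for Gi, gi in zip(G, g)]
        x = [xi - lr / math.sqrt(Gi + eps) * gi for xi, Gi, gi in zip(x, G, g)]
    return x


# Illustrative gradient (an assumption): the second coordinate is 100 times steeper.
def grad_f(x):
    return [x[0], 100.0 * x[1]]


after_one = adagrad(grad_f, [1.0, 1.0], lr=0.1, steps=1)
moved = [1.0 - xi for xi in after_one]
print(moved)  # both coordinates move by about lr
```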

Adadelta

Adadelta is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to some fixed size w.
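
In practice the fixed window is usually realized as an exponentially decaying average rather than by storing past gradients. The sketch below contrasts that decaying average with Adagrad's full sum on a constant gradient stream. The decay rate `gamma` and the gradient values are illustrative assumptions.

```python
def accumulators(grads, gamma=0.9):
    """Track Adagrad's full sum and a decaying average of squared gradients."""
    full_sum = 0.0
    decayed = 0.0
    history = []
    for g in grads:
        full_sum += g * g  # grows without bound
        decayed = gamma * decayed + (1 - gamma) * g * g  # forgets old gradients
        history.append((full_sum, decayed))
    return history


# Illustrative input (an assumption): 100 steps with a constant gradient of 1.0.
full_sum, decayed = accumulators([1.0] * 100)[-1]
print(full_sum, decayed)
```

The full sum keeps growing with the number of steps, so a learning rate divided by its square root shrinks monotonically. The decaying average stays bounded by the largest recent squared gradient, so the effective learning rate does not vanish.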

RMSprop

RMSprop and Adadelta have both been developed independently around the same time stemming from the need to resolve Adagrad’s radically diminishing learning rates. RMSprop in fact is identical to the first update vector of Adadelta.

$v (x_{k}, t) : = γ v (x_{k}, t - 1) + (1 - γ) (\nabla f (x_{k}))^{2}$

$x_{k + 1} = x_{k} - \frac{η}{\sqrt{v (x_{k}, t) + ϵ}} \nabla f (x_{k})$
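
A minimal RMSprop sketch follows, using the decaying average of squared gradients and dividing by its square root as in the standard formulation. The quadratic gradient, the starting point and the hyperparameters are illustrative assumptions.

```python
import math


def rmsprop(grad, x0, lr, gamma=0.9, eps=1e-8, steps=500):
    """v <- gamma * v + (1 - gamma) * g^2; x <- x - lr / sqrt(v + eps) * g."""
    x = list(x0)
    v = [0.0] * len(x)
    for _ in range(steps):
        g = grad(x)
        v = [gamma * vi + (1 - gamma) * gi * gi for vi, gi in zip(v, g)]
        x = [xi - lr / math.sqrt(vi + eps) * gi for xi, vi, gi in zip(x, v, g)]
    return x


# Illustrative gradient (an assumption) of f(x, y) = 0.5 * (x^2 + 100 * y^2).
def grad_f(x):
    return [x[0], 100.0 * x[1]]


x_end = rmsprop(grad_f, [1.0, 1.0], lr=0.01)
print(x_end)
```

Because the step is normalized by the recent gradient magnitude, both coordinates approach the minimum at a similar pace. With a constant learning rate the iterates end up hovering in a small band around the minimum rather than converging exactly.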

Adam

Adaptive Moment Estimation (Adam) is another method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients $v_{k}$ like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients $u_{k}$ , similar to momentum.

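$\begin{cases} u_{k} = β_{1} u_{k - 1} + (1 - β_{1}) \nabla f (x_{k}) \\ v_{k} = β_{2} v_{k - 1} + (1 - β_{2}) (\nabla f (x_{k}))^{2} \end{cases}$

$\begin{cases} \hat{u}_{k} = \frac{u_{k}}{1 - β_{1}^{k}} \\ \hat{v}_{k} = \frac{v_{k}}{1 - β_{2}^{k}} \end{cases}$

$x_{k + 1} = x_{k} - \frac{η}{\sqrt{\hat{v}_{k}} + ϵ} \hat{u}_{k}$

Here $β_{1}^{k}$ and $β_{2}^{k}$ denote powers, with the step $k$ counted from 1. The bias-corrected estimates $\hat{u}_{k}$ and $\hat{v}_{k}$ compensate for initializing both averages at zero.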
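
The standard Adam update, with bias-corrected first and second moment estimates, can be sketched as below. The gradient function and hyperparameters are illustrative assumptions. The example checks one consequence of the bias correction: the very first step has a size of about the learning rate in each coordinate, regardless of the gradient scale.

```python
import math


def adam(grad, x0, lr, beta1=0.9, beta2=0.999, eps=1e-8, steps=1):
    """Adam with bias-corrected moment estimates; steps are counted from 1."""
    x = list(x0)
    u = [0.0] * len(x)  # decaying average of gradients
    v = [0.0] * len(x)  # decaying average of squared gradients
    for k in range(1, steps + 1):
        g = grad(x)
        u = [beta1 * ui + (1 - beta1) * gi for ui, gi in zip(u, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        u_hat = [ui / (1 - beta1 ** k) for ui in u]
        v_hat = [vi / (1 - beta2 ** k) for vi in v]
        x = [xi - lr * uh / (math.sqrt(vh) + eps) for xi, uh, vh in zip(x, u_hat, v_hat)]
    return x


# Illustrative gradient (an assumption): the second coordinate is 100 times steeper.
def grad_f(x):
    return [x[0], 100.0 * x[1]]


first = adam(grad_f, [1.0, 1.0], lr=0.01, steps=1)
moved = [1.0 - xi for xi in first]
print(moved)  # both coordinates move by about lr
```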

PyTorch Implementation

SGD, Momentum

The PyTorch implementation of SGD with Momentum/Nesterov differs subtly from Sutskever et al. and from the implementations in some other frameworks.

Considering the specific case of Momentum, the update can be written as

$\begin{cases} v_{t + 1} = μ * v_{t} + g_{t + 1}, \\ p_{t + 1} = p_{t} - \mathrm{lr} * v_{t + 1}, \end{cases}$

where $p$ , $g$ , $v$ and $μ$ denote the parameters, gradient, velocity, and momentum respectively.

This is in contrast to Sutskever et al. and other frameworks, which employ an update of the form

$\begin{cases} v_{t + 1} = μ * v_{t} + \mathrm{lr} * g_{t + 1}, \\ p_{t + 1} = p_{t} - v_{t + 1} . \end{cases}$

The Nesterov version is analogously modified.
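
The practical consequence of the two conventions can be checked with a small scalar sketch. With a constant learning rate both forms produce the same parameters, because the two velocities differ only by a factor of lr. Once the learning rate changes during training, the trajectories diverge. The objective $f (p) = 0.5 p^{2}$, the schedules and the function name `run` are illustrative assumptions.

```python
def run(schedule, mu, pytorch_style, p0=1.0):
    """Minimize f(p) = 0.5 * p^2 (gradient g = p) with one of the two momentum conventions."""
    p, v = p0, 0.0
    for lr in schedule:
        g = p  # gradient of the illustrative objective at the current parameter
        if pytorch_style:
            v = mu * v + g  # learning rate applied when updating the parameter
            p = p - lr * v
        else:
            v = mu * v + lr * g  # learning rate baked into the velocity
            p = p - v
    return p


constant = [0.1, 0.1, 0.1]
decayed = [0.1, 0.1, 0.01]  # learning rate dropped before the last step

same = (run(constant, 0.9, True), run(constant, 0.9, False))
diff = (run(decayed, 0.9, True), run(decayed, 0.9, False))
print(same, diff)
```

In the second schedule the PyTorch form rescales the whole accumulated velocity by the new learning rate, while the other form only rescales the newest gradient contribution.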

# torch/optim/sgd.py

from typing import List, Optional

import torch
from torch import Tensor

def _single_tensor_sgd(params: List[Tensor],
                       d_p_list: List[Tensor],
                       momentum_buffer_list: List[Optional[Tensor]],
                       *,
                       weight_decay: float,
                       momentum: float,
                       lr: float,
                       dampening: float,
                       nesterov: bool,
                       maximize: bool,
                       has_sparse_grad: bool):

    for i, param in enumerate(params):

        d_p = d_p_list[i]
        if weight_decay != 0:
            d_p = d_p.add(param, alpha=weight_decay)

        if momentum != 0:
            buf = momentum_buffer_list[i]

            if buf is None:
                buf = torch.clone(d_p).detach()
                momentum_buffer_list[i] = buf
            else:
                buf.mul_(momentum).add_(d_p, alpha=1 - dampening)

            if nesterov:
                d_p = d_p.add(buf, alpha=momentum)
            else:
                d_p = buf

        alpha = lr if maximize else -lr
        param.add_(d_p, alpha=alpha)

Adagrad

Algorithm

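$\begin{aligned}
&\textbf{input}: γ \text{ (lr)}, θ_{0} \text{ (params)}, f (θ) \text{ (objective)}, λ \text{ (weight decay)}, \\
&\qquad τ \text{ (initial accumulator value)}, η \text{ (lr decay)} \\
&\textbf{initialize}: \mathrm{state\_sum}_{0} \leftarrow τ \\
&\textbf{for } t = 1 \textbf{ to } \dots \textbf{ do} \\
&\quad g_{t} \leftarrow \nabla_{θ} f_{t} (θ_{t - 1}) \\
&\quad \tilde{γ} \leftarrow γ / (1 + (t - 1) η) \\
&\quad \textbf{if } λ \neq 0 \\
&\qquad g_{t} \leftarrow g_{t} + λ θ_{t - 1} \\
&\quad \mathrm{state\_sum}_{t} \leftarrow \mathrm{state\_sum}_{t - 1} + g_{t}^{2} \\
&\quad θ_{t} \leftarrow θ_{t - 1} - \tilde{γ} \frac{g_{t}}{\sqrt{\mathrm{state\_sum}_{t}} + ϵ} \\
&\textbf{return } θ_{t}
\end{aligned}$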

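# torch/optim/adagrad.py

from typing import List

import torch
from torch import Tensor

# _make_sparse is a private helper defined elsewhere in PyTorch's optimizer sources (not shown in this excerpt).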

def _single_tensor_adagrad(params: List[Tensor],
                           grads: List[Tensor],
                           state_sums: List[Tensor],
                           state_steps: List[Tensor],
                           *,
                           lr: float,
                           weight_decay: float,
                           lr_decay: float,
                           eps: float,
                           has_sparse_grad: bool):

    for (param, grad, state_sum, step_t) in zip(params, grads, state_sums, state_steps):
        # update step
        step_t += 1
        step = step_t.item()

        if weight_decay != 0:
            if grad.is_sparse:
                raise RuntimeError("weight_decay option is not compatible with sparse gradients")
            grad = grad.add(param, alpha=weight_decay)

        clr = lr / (1 + (step - 1) * lr_decay)

        if grad.is_sparse:
            grad = grad.coalesce()  # the update is non-linear so indices must be unique
            grad_indices = grad._indices()
            grad_values = grad._values()
            size = grad.size()

            state_sum.add_(_make_sparse(grad, grad_indices, grad_values.pow(2)))
            std = state_sum.sparse_mask(grad)
            std_values = std._values().sqrt_().add_(eps)
            param.add_(_make_sparse(grad, grad_indices, grad_values / std_values), alpha=-clr)
        else:
            is_complex = torch.is_complex(param)
            if is_complex:
                grad = torch.view_as_real(grad)
                state_sum = torch.view_as_real(state_sum)
                param = torch.view_as_real(param)
            state_sum.addcmul_(grad, grad, value=1)
            std = state_sum.sqrt().add_(eps)
            param.addcdiv_(grad, std, value=-clr)
            if is_complex:
                param = torch.view_as_complex(param)
                state_sum = torch.view_as_complex(state_sum)
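
To make the dense branch easier to follow, here is a scalar mirror of it in plain Python. It is a sketch, not PyTorch code: the function name `adagrad_dense_step` and the constant gradient are assumptions made for illustration. It shows the decayed learning rate `clr` and the fact that `eps` is added after the square root.

```python
import math


def adagrad_dense_step(param, grad, state_sum, step, lr, lr_decay, eps, weight_decay=0.0):
    """Scalar mirror of the dense, real-valued branch for a single parameter."""
    if weight_decay != 0:
        grad = grad + weight_decay * param
    clr = lr / (1 + (step - 1) * lr_decay)  # decayed learning rate for this step
    state_sum = state_sum + grad * grad
    std = math.sqrt(state_sum) + eps  # eps is added after the square root
    param = param - clr * grad / std
    return param, state_sum, clr


param, state_sum = 0.0, 0.0
clrs = []
for step in range(1, 4):
    # A constant gradient of 2.0 is an illustrative assumption.
    param, state_sum, clr = adagrad_dense_step(
        param, 2.0, state_sum, step, lr=1.0, lr_decay=0.5, eps=1e-10
    )
    clrs.append(clr)
print(clrs, param, state_sum)
```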

Adam

Algorithm

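$\begin{aligned}
&\textbf{input}: γ \text{ (lr)}, β_{1}, β_{2} \text{ (betas)}, θ_{0} \text{ (params)}, f (θ) \text{ (objective)}, \\
&\qquad λ \text{ (weight decay)}, \mathit{amsgrad}, \mathit{maximize} \\
&\textbf{initialize}: m_{0} \leftarrow 0 \text{ (first moment)}, v_{0} \leftarrow 0 \text{ (second moment)}, \hat{v}_{0}^{max} \leftarrow 0 \\
&\textbf{for } t = 1 \textbf{ to } \dots \textbf{ do} \\
&\quad \textbf{if } \mathit{maximize}: \\
&\qquad g_{t} \leftarrow - \nabla_{θ} f_{t} (θ_{t - 1}) \\
&\quad \textbf{else} \\
&\qquad g_{t} \leftarrow \nabla_{θ} f_{t} (θ_{t - 1}) \\
&\quad \textbf{if } λ \neq 0 \\
&\qquad g_{t} \leftarrow g_{t} + λ θ_{t - 1} \\
&\quad m_{t} \leftarrow β_{1} m_{t - 1} + (1 - β_{1}) g_{t} \\
&\quad v_{t} \leftarrow β_{2} v_{t - 1} + (1 - β_{2}) g_{t}^{2} \\
&\quad \hat{m}_{t} \leftarrow m_{t} / (1 - β_{1}^{t}) \\
&\quad \hat{v}_{t} \leftarrow v_{t} / (1 - β_{2}^{t}) \\
&\quad \textbf{if } \mathit{amsgrad} \\
&\qquad \hat{v}_{t}^{max} \leftarrow \max (\hat{v}_{t}^{max}, \hat{v}_{t}) \\
&\qquad θ_{t} \leftarrow θ_{t - 1} - γ \hat{m}_{t} / (\sqrt{\hat{v}_{t}^{max}} + ϵ) \\
&\quad \textbf{else} \\
&\qquad θ_{t} \leftarrow θ_{t - 1} - γ \hat{m}_{t} / (\sqrt{\hat{v}_{t}} + ϵ) \\
&\textbf{return } θ_{t}
\end{aligned}$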

# torch/optim/adam.py

import math
from typing import List

import torch
from torch import Tensor

def _single_tensor_adam(params: List[Tensor],
                        grads: List[Tensor],
                        exp_avgs: List[Tensor],
                        exp_avg_sqs: List[Tensor],
                        max_exp_avg_sqs: List[Tensor],
                        state_steps: List[Tensor],
                        *,
                        amsgrad: bool,
                        beta1: float,
                        beta2: float,
                        lr: float,
                        weight_decay: float,
                        eps: float,
                        maximize: bool):

    for i, param in enumerate(params):

        grad = grads[i] if not maximize else -grads[i]
        exp_avg = exp_avgs[i]
        exp_avg_sq = exp_avg_sqs[i]
        step_t = state_steps[i]
        # update step
        step_t += 1
        step = step_t.item()

        bias_correction1 = 1 - beta1 ** step
        bias_correction2 = 1 - beta2 ** step

        if weight_decay != 0:
            grad = grad.add(param, alpha=weight_decay)

        # Decay the first and second moment running average coefficient
        exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
        exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
        if amsgrad:
            # Maintains the maximum of all 2nd moment running avg. till now
            torch.maximum(max_exp_avg_sqs[i], exp_avg_sq, out=max_exp_avg_sqs[i])
            # Use the max. for normalizing running avg. of gradient
            denom = (max_exp_avg_sqs[i].sqrt() / math.sqrt(bias_correction2)).add_(eps)
        else:
            denom = (exp_avg_sq.sqrt() / math.sqrt(bias_correction2)).add_(eps)



        step_size = lr / bias_correction1
        # param = param - step_size * (exp_avg / denom)
        # element-wise division
        param.addcdiv_(exp_avg, denom, value=-step_size)
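
The listing never forms bias-corrected moments explicitly. For the non-amsgrad branch it folds the corrections into `step_size` and `denom` instead. The scalar sketch below checks that both ways of writing the update agree. The function names and the moment values are assumptions chosen for illustration.

```python
import math


def step_textbook(m, v, t, lr, beta1, beta2, eps):
    """Update computed from explicitly bias-corrected moments."""
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return lr * m_hat / (math.sqrt(v_hat) + eps)


def step_folded(m, v, t, lr, beta1, beta2, eps):
    """Same update with the corrections folded into step_size and denom."""
    bias_correction1 = 1 - beta1 ** t
    bias_correction2 = 1 - beta2 ** t
    denom = math.sqrt(v) / math.sqrt(bias_correction2) + eps
    step_size = lr / bias_correction1
    return step_size * m / denom


# Illustrative moment values (assumptions) after t = 5 steps.
args = dict(m=0.03, v=4e-4, t=5, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8)
a, b = step_textbook(**args), step_folded(**args)
print(a, b, math.isclose(a, b, rel_tol=1e-9))
```

Folding the scalar corrections into `step_size` and `denom` avoids allocating extra tensors for the corrected moments.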

AdamW is Adam with decoupled weight decay: the decay is applied directly to the parameters instead of being added to the gradient. When weight decay is 0, there is no difference between Adam and AdamW.
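
The difference can be sketched on a single scalar parameter. In the Adam variant the decay term is added to the gradient and then passes through the adaptive normalization. In the AdamW variant the parameter is shrunk directly, before the Adam step. The function name `adam_like`, the objective $f (p) = 0.5 p^{2}$ and the hyperparameters are illustrative assumptions.

```python
import math


def adam_like(p0, steps, lr, weight_decay, decoupled, beta1=0.9, beta2=0.999, eps=1e-8):
    """Scalar Adam on f(p) = 0.5 * p^2; `decoupled` switches to AdamW-style decay."""
    p, m, v = p0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = p  # gradient of the illustrative objective
        if weight_decay != 0:
            if decoupled:
                p = p * (1 - lr * weight_decay)  # AdamW: shrink the parameter directly
            else:
                g = g + weight_decay * p  # Adam: fold the decay into the gradient
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p


adam_wd = adam_like(1.0, steps=1, lr=0.01, weight_decay=0.1, decoupled=False)
adamw_wd = adam_like(1.0, steps=1, lr=0.01, weight_decay=0.1, decoupled=True)
adam_0 = adam_like(1.0, steps=10, lr=0.01, weight_decay=0.0, decoupled=False)
adamw_0 = adam_like(1.0, steps=10, lr=0.01, weight_decay=0.0, decoupled=True)
print(adam_wd, adamw_wd, adam_0 == adamw_0)
```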

Let's get to work