NAME

AI::MXNet::Optimizer - Common Optimization algorithms with regularizations.

DESCRIPTION

Common Optimization algorithms with regularizations.

create_optimizer

Create an optimizer with specified name.

Parameters
----------
name: str
    Name of required optimizer. Should be the name
    of a subclass of Optimizer. Case insensitive.

rescale_grad : float
    Rescaling factor on gradient. Normally should be 1/batch_size.

kwargs: dict
    Parameters for optimizer

Returns
-------
opt : Optimizer
    The resulting optimizer.
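
For example, a minimal sketch of the factory call, assuming the optimizer name is passed first followed by constructor key/value pairs (the parameter values below are illustrative):

    use AI::MXNet qw(mx);

    # Create an optimizer by name; remaining key/value pairs are handed to
    # the subclass constructor (values here are illustrative only).
    my $opt = AI::MXNet::Optimizer->create_optimizer(
        'adam',
        learning_rate => 0.001,
        rescale_grad  => 1/128,   # typically 1/batch_size
    );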

set_lr_mult

Set individual learning rate multipliers for parameters.

Parameters
----------
args_lr_mult : dict of string/int to float
    Sets the learning rate multiplier for the given name/index to the float value.
    Setting the multiplier by index is supported for backward compatibility,
    but we recommend using the name and symbol instead.

set_wd_mult

Set individual weight decay multipliers for parameters.
By default the wd multiplier is 0 for all parameters whose names don't
end with _weight, if param_idx2name is provided.

Parameters
----------
args_wd_mult : dict of string/int to float
    Sets the weight decay multiplier for the given name/index to the float value.
    Setting the multiplier by index is supported for backward compatibility,
    but we recommend using the name and symbol instead.
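
A brief sketch of both calls, assuming $opt is an optimizer created as above; the parameter names below are hypothetical and must match names used in your model:

    # Hypothetical parameter names; the multipliers are applied on top of the
    # optimizer's global learning rate and weight decay.
    $opt->set_lr_mult({ 'fc1_weight' => 0.1, 'fc1_bias' => 0.1 });
    $opt->set_wd_mult({ 'fc1_bias'   => 0 });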

NAME

AI::MXNet::SGD - A very simple SGD optimizer with momentum and weight regularization.

DESCRIPTION

A very simple SGD optimizer with momentum and weight regularization.

If the storage types of weight and grad are both 'row_sparse', and 'lazy_update' is true,
**lazy updates** are applied by:

    for row in grad.indices:
        rescaled_grad[row] = lr * rescale_grad * clip(grad[row], clip_gradient) + wd * weight[row]
        state[row] = momentum[row] * state[row] + rescaled_grad[row]
        weight[row] = weight[row] - state[row]

The sparse update only updates the momentum for the weights whose row_sparse
gradient indices appear in the current batch, rather than updating it for all
indices. Compared with the original update, it can provide large
improvements in model training throughput for some applications. However, it
provides slightly different semantics than the original update, and
may lead to different empirical results.

Otherwise, **standard updates** are applied by:

    rescaled_grad = lr * rescale_grad * clip(grad, clip_gradient) + wd * weight
    state = momentum * state + rescaled_grad
    weight = weight - state
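
As a concrete illustration of the standard update above, here is a scalar walk-through in plain Perl (not the library's own updater; gradient clipping is omitted and the numbers are illustrative):

    # One standard SGD-with-momentum step on scalars, following the formulas above.
    my ($lr, $momentum, $wd, $rescale_grad) = (0.1, 0.9, 1e-4, 1.0);
    my ($weight, $grad, $state) = (0.5, 0.2, 0.0);

    my $rescaled_grad = $lr * $rescale_grad * $grad + $wd * $weight;
    $state  = $momentum * $state + $rescaled_grad;
    $weight = $weight - $state;   # 0.5 - (0.02 + 0.00005) = 0.47995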

Parameters
----------
learning_rate : float, optional
    learning_rate of SGD

momentum : float, optional
   momentum value

wd : float, optional
    L2 regularization coefficient added to all the weights

rescale_grad : float, optional
    rescaling factor of gradient. Normally should be 1/batch_size.

clip_gradient : float, optional
    clip gradient in range [-clip_gradient, clip_gradient]

param_idx2name : hash ref of string/int to float, optional
    special treatment of weight decay for parameters whose names end with bias, gamma, or beta

multi_precision: bool, optional
    Flag to control the internal precision of the optimizer.
    False results in using the same precision as the weights (default),
    True makes an internal 32-bit copy of the weights and applies gradients
    in 32-bit precision even if the actual weights used in the model have lower precision.
    Turning this on can improve convergence and accuracy when training with float16.

lazy_update : bool, optional, default true
    Whether to apply the lazy updates described above when weight and grad are both 'row_sparse'.
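
Putting the parameters together, a hedged construction sketch; it assumes the class exposes the usual Mouse-style named-argument constructor, and the values are illustrative:

    # Illustrative construction; multi_precision is mainly useful when the
    # model weights are float16.
    my $sgd = AI::MXNet::SGD->new(
        learning_rate   => 0.1,
        momentum        => 0.9,
        wd              => 1e-4,
        multi_precision => 1,
        lazy_update     => 1,
    );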

NAME

AI::MXNet::Signum - The Signum optimizer that takes the sign of gradient or momentum.

DESCRIPTION

The optimizer updates the weight by:

    rescaled_grad = rescale_grad * clip(grad, clip_gradient) + wd * weight
    state = momentum * state + (1 - momentum) * rescaled_grad
    weight = (1 - lr * wd_lh) * weight - lr * sign(state)
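
As a concrete illustration of the formulas above, a scalar sketch (not the library's updater; gradient clipping omitted, values illustrative):

    # One Signum step on scalars. Only the sign of the momentum buffer sets
    # the direction of the step; lr and wd_lh set its magnitude.
    my ($lr, $momentum, $wd, $wd_lh, $rescale_grad) = (0.01, 0.9, 0.0, 1e-4, 1.0);
    my ($weight, $grad, $state) = (0.5, -0.2, 0.0);

    my $rescaled_grad = $rescale_grad * $grad + $wd * $weight;
    $state  = $momentum * $state + (1 - $momentum) * $rescaled_grad;
    $weight = (1 - $lr * $wd_lh) * $weight - $lr * ($state <=> 0);   # <=> acts as sign()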

See the original paper at: https://jeremybernste.in/projects/amazon/signum.pdf

For details of the update algorithm see the signsgd_update and signum_update
operators in mxnet.ndarray.

This optimizer accepts the following parameters in addition to those accepted
by AI::MXNet::Optimizer.

Parameters
----------
momentum : float, optional
   The momentum value.
wd_lh : float, optional
   The amount of decoupled weight decay regularization; see details in the original paper at
   https://arxiv.org/abs/1711.05101

NAME

AI::MXNet::FTML - The FTML optimizer.

DESCRIPTION

This class implements the optimizer described in
*FTML - Follow the Moving Leader in Deep Learning*,
available at http://proceedings.mlr.press/v70/zheng17a/zheng17a.pdf.

This optimizer accepts the following parameters in addition to those accepted
by AI::MXNet::Optimizer.

Parameters
----------
beta1 : float, optional
    0 < beta1 < 1. Generally close to 0.5.
beta2 : float, optional
    0 < beta2 < 1. Generally close to 1.
epsilon : float, optional
    Small value to avoid division by 0.

NAME

AI::MXNet::LBSGD - The Large Batch SGD optimizer with momentum and weight decay.

DESCRIPTION

The optimizer updates the weight by:

    state = momentum * state + lr * rescale_grad * clip(grad, clip_gradient) + wd * weight
    weight = weight - state

Parameters
----------
momentum : float, optional
   The momentum value.
multi_precision: bool, optional
   Flag to control the internal precision of the optimizer.
   ``False`` results in using the same precision as the weights (default),
   ``True`` makes an internal 32-bit copy of the weights and applies gradients
   in 32-bit precision even if the actual weights used in the model have lower precision.
   Turning this on can improve convergence and accuracy when training with float16.
warmup_strategy : string ('linear', 'power2', 'sqrt' or 'lars'), default: 'linear'
warmup_epochs : unsigned, default: 5
batch_scale : unsigned, default: 1 (same as batch_size * num_workers)
updates_per_epoch : unsigned, default: 32
    Number of updates per epoch, used for warmup. The default might not reflect the true number of batches per epoch.
begin_epoch : unsigned, default: 0
    The starting epoch.

NAME

AI::MXNet::DCASGD - DCASGD optimizer with momentum and weight regularization.

DESCRIPTION

DCASGD optimizer with momentum and weight regularization.

Implements the paper "Asynchronous Stochastic Gradient Descent with
Delay Compensation for Distributed Deep Learning".

Parameters
----------
learning_rate : float, optional
    learning_rate of SGD

momentum : float, optional
   momentum value

lamda : float, optional
   Scale of the delay compensation (DC) term.

wd : float, optional
    L2 regularization coefficient added to all the weights

rescale_grad : float, optional
    rescaling factor of gradient. Normally should be 1/batch_size.

clip_gradient : float, optional
    clip gradient in range [-clip_gradient, clip_gradient]

param_idx2name : hash ref of string/int to float, optional
    special treatment of weight decay for parameters whose names end with bias, gamma, or beta

NAME

AI::MXNet::NAG - SGD with Nesterov momentum (NAG).

DESCRIPTION

This optimizer implements SGD with Nesterov momentum according to
https://github.com/torch/optim/blob/master/sgd.lua

NAME

AI::MXNet::SGLD - Stochastic Gradient Riemannian Langevin Dynamics.

DESCRIPTION

Stochastic Gradient Riemannian Langevin Dynamics.

This class implements the optimizer described in the paper *Stochastic Gradient
Riemannian Langevin Dynamics on the Probability Simplex*, available at
https://papers.nips.cc/paper/4883-stochastic-gradient-riemannian-langevin-dynamics-on-the-probability-simplex.pdf.

Parameters
----------
learning_rate : float, optional
    learning_rate of SGD

wd : float, optional
    L2 regularization coefficient added to all the weights

rescale_grad : float, optional
    rescaling factor of gradient. Normally should be 1/batch_size.

clip_gradient : float, optional
    clip gradient in range [-clip_gradient, clip_gradient]

NAME

AI::MXNet::Adam - Adam optimizer as described in [King2014].

DESCRIPTION

Adam optimizer as described in [King2014].

[King2014] Diederik Kingma, Jimmy Ba,
   *Adam: A Method for Stochastic Optimization*,
   http://arxiv.org/abs/1412.6980

The code in this class was adapted from
https://github.com/mila-udem/blocks/blob/master/blocks/algorithms/__init__.py#L765

Parameters
----------
learning_rate : float, optional
    Step size.
    Default value is set to 0.001.
beta1 : float, optional
    Exponential decay rate for the first moment estimates.
    Default value is set to 0.9.
beta2 : float, optional
    Exponential decay rate for the second moment estimates.
    Default value is set to 0.999.
epsilon : float, optional
    Default value is set to 1e-8.

wd : float, optional
    L2 regularization coefficient added to all the weights
rescale_grad : float, optional
    rescaling factor of gradient. Normally should be 1/batch_size.

clip_gradient : float, optional
    clip gradient in range [-clip_gradient, clip_gradient]

NAME

AI::MXNet::AdaGrad - AdaGrad optimizer of Duchi et al., 2011

DESCRIPTION

AdaGrad optimizer of Duchi et al., 2011.

This code follows the version in http://arxiv.org/pdf/1212.5701v1.pdf  Eq(5)
by Matthew D. Zeiler, 2012. AdaGrad will help the network to converge faster
in some cases.

Parameters
----------
learning_rate : float, optional
    Step size.
    Default value is set to 0.05.

wd : float, optional
    L2 regularization coefficient added to all the weights

rescale_grad : float, optional
    rescaling factor of gradient. Normally should be 1/batch_size.

eps: float, optional
    A small float number to make the update process stable.
    Default value is set to 1e-7.

clip_gradient : float, optional
    clip gradient in range [-clip_gradient, clip_gradient]

NAME

AI::MXNet::RMSProp - RMSProp optimizer of Tieleman & Hinton, 2012.

DESCRIPTION

RMSProp optimizer of Tieleman & Hinton, 2012.

For centered=False, the code follows the version in
http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf by
Tieleman & Hinton, 2012

For centered=True, the code follows the version in
http://arxiv.org/pdf/1308.0850v5.pdf Eq(38) - Eq(45) by Alex Graves, 2013.

Parameters
----------
learning_rate : float, optional
    Step size.
    Default value is set to 0.001.
gamma1: float, optional
    decay factor of moving average for gradient^2.
    Default value is set to 0.9.
gamma2: float, optional
    "momentum" factor.
    Default value is set to 0.9.
    Only used if centered=True.
epsilon : float, optional
    Default value is set to 1e-8.
centered : bool, optional
    Whether to use Graves' (centered) or Tieleman & Hinton's version of RMSProp.
wd : float, optional
    L2 regularization coefficient added to all the weights
rescale_grad : float, optional
    rescaling factor of gradient.
clip_gradient : float, optional
    clip gradient in range [-clip_gradient, clip_gradient]
clip_weights : float, optional
    clip weights in range [-clip_weights, clip_weights]

NAME

AI::MXNet::AdaDelta - AdaDelta optimizer.

DESCRIPTION

AdaDelta optimizer as described in
Zeiler, M. D. (2012).
*ADADELTA: An adaptive learning rate method.*

http://arxiv.org/abs/1212.5701

Parameters
----------
rho: float
    Decay rate for both squared gradients and delta x
epsilon : float
    The constant as described in the paper
wd : float
    L2 regularization coefficient added to all the weights
rescale_grad : float, optional
    rescaling factor of gradient. Normally should be 1/batch_size.
clip_gradient : float, optional
    clip gradient in range [-clip_gradient, clip_gradient]

NAME

AI::MXNet::Ftrl - the Follow the Regularized Leader (FTRL) optimizer.

DESCRIPTION

Referenced from *Ad Click Prediction: a View from the Trenches*, available at
http://dl.acm.org/citation.cfm?id=2488200.

eta :
    The per-coordinate learning rate:

    \eta_{t,i} = \frac{learning\_rate}{\beta + \sqrt{\sum_{s=1}^{t} g_{s,i}^2}}

The optimizer updates the weight by:

    rescaled_grad = clip(grad * rescale_grad, clip_gradient)
    z += rescaled_grad - (sqrt(n + rescaled_grad**2) - sqrt(n)) * weight / learning_rate
    n += rescaled_grad**2
    w = (sign(z) * lamda1 - z) / ((beta + sqrt(n)) / learning_rate + wd) * (abs(z) > lamda1)
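
As a concrete illustration, a scalar walk-through of one dense step following the formulas above (not the library's ftrl_update kernel; clipping omitted, values illustrative). Note how the weight is set to exactly 0 whenever abs(z) stays below lamda1, which is what makes FTRL produce sparse solutions:

    # One FTRL step on scalars.
    my ($lr, $lamda1, $beta, $wd, $rescale_grad) = (0.1, 0.01, 1.0, 0.0, 1.0);
    my ($weight, $grad, $z, $n) = (0.0, 0.3, 0.0, 0.0);

    my $rescaled_grad = $grad * $rescale_grad;
    $z += $rescaled_grad - (sqrt($n + $rescaled_grad**2) - sqrt($n)) * $weight / $lr;
    $n += $rescaled_grad**2;
    $weight = (($z <=> 0) * $lamda1 - $z)
            / (($beta + sqrt($n)) / $lr + $wd)
            * (abs($z) > $lamda1 ? 1 : 0);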

If the storage types of weight, state, and grad are all 'row_sparse',
**sparse updates** are applied by:

    for row in grad.indices:
        rescaled_grad[row] = clip(grad[row] * rescale_grad, clip_gradient)
        z[row] += rescaled_grad[row] - (sqrt(n[row] + rescaled_grad[row]**2) - sqrt(n[row])) * weight[row] / learning_rate
        n[row] += rescaled_grad[row]**2
        w[row] = (sign(z[row]) * lamda1 - z[row]) / ((beta + sqrt(n[row])) / learning_rate + wd) * (abs(z[row]) > lamda1)

The sparse update only updates the z and n for the weights whose row_sparse
gradient indices appear in the current batch, rather than updating it for all
indices. Compared with the original update, it can provide large
improvements in model training throughput for some applications. However, it
provides slightly different semantics than the original update, and
may lead to different empirical results.

For details of the update algorithm, see the ftrl_update operator in mxnet.ndarray.

This optimizer accepts the following parameters in addition to those accepted
by AI::MXNet::Optimizer.

Parameters
----------
lamda1 : float, optional
    L1 regularization coefficient.
learning_rate : float, optional
    The initial learning rate.
beta : float, optional
    Per-coordinate learning rate correlation parameter.

NAME

AI::MXNet::Adamax - the AdaMax optimizer.

DESCRIPTION

AdaMax is a variant of Adam based on the infinity norm,
described in Section 7 of http://arxiv.org/abs/1412.6980.

This optimizer accepts the following parameters in addition to those accepted
by AI::MXNet::Optimizer.

Parameters
----------
beta1 : float, optional
    Exponential decay rate for the first moment estimates.
beta2 : float, optional
    Exponential decay rate for the second moment estimates.

NAME

AI::MXNet::Nadam - the Nesterov Adam optimizer.

DESCRIPTION

The Nesterov Adam optimizer.

Much like Adam is essentially RMSProp with momentum, Nadam is Adam with
Nesterov momentum; see http://cs229.stanford.edu/proj2015/054_report.pdf.

This optimizer accepts the following parameters in addition to those accepted
by AI::MXNet::Optimizer.

Parameters
----------
beta1 : float, optional
    Exponential decay rate for the first moment estimates.
beta2 : float, optional
    Exponential decay rate for the second moment estimates.
epsilon : float, optional
    Small value to avoid division by 0.
schedule_decay : float, optional
    Exponential decay rate for the momentum schedule

set_states

Sets updater states.

get_states

Gets updater states.

Parameters
----------
dump_optimizer : bool, default False
    Whether to also save the optimizer itself. This would also save optimizer
    information such as learning rate and weight decay schedules.
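
A hedged usage sketch of the two methods; the $updater object and the named dump_optimizer calling convention are assumptions, and the returned state blob should be treated as opaque:

    # Illustrative checkpoint/restore of updater state.
    my $states = $updater->get_states(dump_optimizer => 1);
    # ... later, possibly in another process ...
    $updater->set_states($states);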