NAME
AI::MXNet::Optimizer - Common optimization algorithms with regularization.
DESCRIPTION
Common optimization algorithms with regularization.
create_optimizer
Creates an optimizer with the specified name.
Parameters
----------
name: str
Name of the required optimizer. Should be the name
of a subclass of Optimizer. Case insensitive.
rescale_grad : float
Rescaling factor on gradient. Normally should be 1/batch_size.
kwargs: hash
Additional parameters passed to the optimizer's constructor.
Returns
-------
opt : Optimizer
The resulting optimizer.
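A minimal usage sketch (assuming create_optimizer is invoked as a class method, with the optimizer name followed by constructor arguments as key/value pairs; the batch size of 128 is only an example):

    use AI::MXNet qw(mx);

    # Create an SGD optimizer by (case-insensitive) name; the remaining
    # key/value pairs are forwarded to the optimizer's constructor.
    my $opt = AI::MXNet::Optimizer->create_optimizer(
        'sgd',
        learning_rate => 0.01,
        momentum      => 0.9,
        rescale_grad  => 1.0 / 128,   # 1/batch_size
    );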
set_lr_mult
Sets individual learning rate multipliers for parameters.
Parameters
----------
args_lr_mult : hash ref of str/int to float
Sets the learning rate multiplier for the given parameter name or index.
Setting the multiplier by index is supported for backward compatibility,
but we recommend using the parameter name and symbol
(see the usage sketch after set_wd_mult below).
set_wd_mult
Set individual weight decay multipler for parameters.
By default wd multipler is 0 for all params whose name doesn't
end with _weight, if param_idx2name is provided.
Parameters
----------
args_wd_mult : hash ref of str/int to float
Sets the weight decay multiplier for the given parameter name or index.
Setting the multiplier by index is supported for backward compatibility,
but we recommend using the parameter name and symbol.
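A brief sketch of both multiplier setters (assuming they accept a hash ref keyed by parameter name; the parameter names fc1_weight and fc1_bias are only examples):

    # Scale the learning rate of selected parameters and disable weight
    # decay for one of them, addressing parameters by name.
    $opt->set_lr_mult({ fc1_weight => 0.1, fc1_bias => 0.1 });
    $opt->set_wd_mult({ fc1_weight => 0 });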
NAME
AI::MXNet::SGD - A very simple SGD optimizer with momentum and weight regularization.
DESCRIPTION
A very simple SGD optimizer with momentum and weight regularization.
Parameters
----------
learning_rate : float, optional
The learning rate of SGD.
momentum : float, optional
The momentum value.
wd : float, optional
L2 regularization coefficient added to all the weights.
rescale_grad : float, optional
Rescaling factor of the gradient. Normally should be 1/batch_size.
clip_gradient : float, optional
Clip the gradient to the range [-clip_gradient, clip_gradient].
param_idx2name : hash ref of int to string, optional
A mapping from parameter index to parameter name; used to apply special
weight decay treatment to parameters whose names end with bias, gamma, or beta.
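To clarify how these parameters interact, below is a per-element sketch of the momentum SGD update with weight decay in plain Perl. This is an illustration of the common formulation, not AI::MXNet's actual implementation:

    use strict;
    use warnings;
    use List::Util qw(min max);

    # One momentum-SGD step for a single scalar weight.
    # $mom_ref holds the accumulated momentum, initially 0.
    sub sgd_step {
        my ($weight, $grad, $mom_ref, %p) = @_;
        $grad *= $p{rescale_grad} // 1;
        $grad  = max(-$p{clip_gradient}, min($p{clip_gradient}, $grad))
            if defined $p{clip_gradient};
        $grad += ($p{wd} // 0) * $weight;              # L2 regularization
        $$mom_ref = ($p{momentum} // 0) * $$mom_ref
                  - $p{learning_rate} * $grad;
        return $weight + $$mom_ref;
    }

    my $mom    = 0;
    my $weight = sgd_step(0.5, 0.2, \$mom,
        learning_rate => 0.01, momentum => 0.9, wd => 1e-4);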
NAME
AI::MXNet::DCASGD - DCASGD optimizer with momentum and weight regularization.
DESCRIPTION
DCASGD optimizer with momentum and weight regularization.
Implements the paper "Asynchronous Stochastic Gradient Descent with
Delay Compensation for Distributed Deep Learning".
Parameters
----------
learning_rate : float, optional
The learning rate of SGD.
momentum : float, optional
The momentum value.
lamda : float, optional
Scale of the delay-compensation (DC) term.
wd : float, optional
L2 regularization coefficient added to all the weights.
rescale_grad : float, optional
Rescaling factor of the gradient. Normally should be 1/batch_size.
clip_gradient : float, optional
Clip the gradient to the range [-clip_gradient, clip_gradient].
param_idx2name : hash ref of int to string, optional
A mapping from parameter index to parameter name; used to apply special
weight decay treatment to parameters whose names end with bias, gamma, or beta.
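Below is a per-element sketch of the delay-compensated update described in the paper, in plain Perl (gradient rescaling and clipping omitted for brevity). The term lamda * grad * grad * (weight - previous weight) approximates the gradient at the current weights from a delayed gradient. This is an illustration only, not AI::MXNet's actual implementation:

    use strict;
    use warnings;

    # One DCASGD step for a scalar weight. $prev_ref stores the weight
    # from which the (possibly delayed) gradient was computed, $mom_ref
    # the accumulated momentum.
    sub dcasgd_step {
        my ($weight, $grad, $mom_ref, $prev_ref, %p) = @_;
        $grad += $p{wd} * $weight;
        my $compensated = $grad
            + $p{lamda} * $grad * $grad * ($weight - $$prev_ref);
        $$mom_ref  = $p{momentum} * $$mom_ref
                   - $p{learning_rate} * $compensated;
        $weight   += $$mom_ref;
        $$prev_ref = $weight;    # snapshot for the next delay compensation
        return $weight;
    }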
NAME
AI::MXNet::NAG - SGD with Nesterov weight handling.
DESCRIPTION
It is implemented according to
https://github.com/torch/optim/blob/master/sgd.lua
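Below is a per-element sketch of one common Nesterov-momentum formulation in plain Perl (rescaling and clipping omitted); an illustration only, not necessarily the exact update used by AI::MXNet::NAG:

    use strict;
    use warnings;

    # Nesterov accelerated gradient in its "momentum look-ahead" form.
    sub nag_step {
        my ($weight, $grad, $mom_ref, %p) = @_;
        $grad    += $p{wd} * $weight;                  # L2 regularization
        $$mom_ref = $p{momentum} * $$mom_ref + $grad;  # update the velocity
        my $step  = $grad + $p{momentum} * $$mom_ref;  # look-ahead correction
        return $weight - $p{learning_rate} * $step;
    }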
NAME
AI::MXNet::SLGD - Stochastic Langevin Dynamics Updater to sample from a distribution.
DESCRIPTION
Stochastic Langevin Dynamics Updater to sample from a distribution.
Parameters
----------
learning_rate : float, optional
The learning rate.
wd : float, optional
L2 regularization coefficient added to all the weights.
rescale_grad : float, optional
Rescaling factor of the gradient. Normally should be 1/batch_size.
clip_gradient : float, optional
Clip the gradient to the range [-clip_gradient, clip_gradient].
param_idx2name : hash ref of int to string, optional
A mapping from parameter index to parameter name; used to apply special
weight decay treatment to parameters whose names end with bias, gamma, or beta.
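Below is a per-element sketch of the usual SGLD rule in plain Perl, using a Box-Muller draw for the Gaussian noise (rescaling and clipping omitted); an illustration only, not AI::MXNet's actual implementation:

    use strict;
    use warnings;

    # Standard-normal sample via the Box-Muller transform.
    sub randn {
        my ($u1, $u2) = (1 - rand(), rand());
        return sqrt(-2 * log($u1)) * cos(2 * atan2(0, -1) * $u2);
    }

    # One SGLD step: half a gradient step plus Gaussian noise whose
    # standard deviation is sqrt(learning_rate).
    sub sgld_step {
        my ($weight, $grad, %p) = @_;
        $grad += $p{wd} * $weight;
        return $weight
             - $p{learning_rate} / 2 * $grad
             + sqrt($p{learning_rate}) * randn();
    }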
NAME
AI::MXNet::Adam - Adam optimizer as described in Kingma and Ba, 2014.
DESCRIPTION
Adam optimizer as described in Kingma and Ba, 2014:
Diederik Kingma, Jimmy Ba,
"Adam: A Method for Stochastic Optimization",
http://arxiv.org/abs/1412.6980
The code in this class was adapted from
https://github.com/mila-udem/blocks/blob/master/blocks/algorithms/__init__.py#L765
Parameters
----------
learning_rate : float, optional
Step size.
Default value is set to 0.001.
beta1 : float, optional
Exponential decay rate for the first moment estimates.
Default value is set to 0.9.
beta2 : float, optional
Exponential decay rate for the second moment estimates.
Default value is set to 0.999.
epsilon : float, optional
A small constant for numerical stability.
Default value is set to 1e-8.
decay_factor : float, optional
Default value is set to 1 - 1e-8.
wd : float, optional
L2 regularization coefficient added to all the weights.
rescale_grad : float, optional
Rescaling factor of the gradient. Normally should be 1/batch_size.
clip_gradient : float, optional
Clip the gradient to the range [-clip_gradient, clip_gradient].
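Below is a per-element sketch of the update from the Kingma & Ba paper in plain Perl, with bias-corrected first and second moment estimates (rescaling and clipping omitted); an illustration only, not AI::MXNet's actual implementation:

    use strict;
    use warnings;

    # One Adam step for a scalar weight. $s is a hash ref holding the
    # state: first moment m, second moment v, and time step t.
    sub adam_step {
        my ($weight, $grad, $s, %p) = @_;
        $grad   += $p{wd} * $weight;
        $s->{t} += 1;
        $s->{m}  = $p{beta1} * $s->{m} + (1 - $p{beta1}) * $grad;
        $s->{v}  = $p{beta2} * $s->{v} + (1 - $p{beta2}) * $grad ** 2;
        my $m_hat = $s->{m} / (1 - $p{beta1} ** $s->{t});  # bias correction
        my $v_hat = $s->{v} / (1 - $p{beta2} ** $s->{t});
        return $weight
             - $p{learning_rate} * $m_hat / (sqrt($v_hat) + $p{epsilon});
    }

    my $state = { m => 0, v => 0, t => 0 };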
NAME
AI::MXNet::AdaGrad - AdaGrad optimizer of Duchi et al., 2011
DESCRIPTION
AdaGrad optimizer of Duchi et al., 2011.
This code follows the version in http://arxiv.org/pdf/1212.5701v1.pdf, Eq. (5),
by Matthew D. Zeiler, 2012. AdaGrad can help the network converge faster
in some cases.
Parameters
----------
learning_rate : float, optional
Step size.
Default value is set to 0.05.
wd : float, optional
L2 regularization coefficient added to all the weights.
rescale_grad : float, optional
Rescaling factor of the gradient. Normally should be 1/batch_size.
eps: float, optional
A small float number to make the update numerically stable.
Default value is set to 1e-7.
clip_gradient : float, optional
Clip the gradient to the range [-clip_gradient, clip_gradient].
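Below is a per-element sketch of the AdaGrad rule in plain Perl: the accumulated squared-gradient history shrinks the step size of frequently updated weights (rescaling and clipping omitted); an illustration only, not AI::MXNet's actual implementation:

    use strict;
    use warnings;

    # One AdaGrad step. $hist_ref accumulates squared gradients.
    sub adagrad_step {
        my ($weight, $grad, $hist_ref, %p) = @_;
        $$hist_ref += $grad ** 2;
        my $step = $grad / sqrt($$hist_ref + $p{eps}) + $p{wd} * $weight;
        return $weight - $p{learning_rate} * $step;
    }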
NAME
AI::MXNet::RMSProp - RMSProp optimizer of Tieleman & Hinton, 2012.
DESCRIPTION
RMSProp optimizer of Tieleman & Hinton, 2012.
For centered=False, the code follows the version in
http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf by
Tieleman & Hinton, 2012.
For centered=True, the code follows the version in
http://arxiv.org/pdf/1308.0850v5.pdf, Eq. (38) - Eq. (45), by Alex Graves, 2013.
Parameters
----------
learning_rate : float, optional
Step size.
Default value is set to 0.001.
gamma1: float, optional
Decay factor of the moving average of the squared gradient.
Default value is set to 0.9.
gamma2: float, optional
"Momentum" factor.
Default value is set to 0.9.
Only used if centered=True.
epsilon : float, optional
A small constant for numerical stability.
Default value is set to 1e-8.
centered : bool, optional
Whether to use Graves' (centered) or Tieleman & Hinton's (non-centered)
version of RMSProp.
wd : float, optional
L2 regularization coefficient added to all the weights.
rescale_grad : float, optional
Rescaling factor of the gradient.
clip_gradient : float, optional
Clip the gradient to the range [-clip_gradient, clip_gradient].
clip_weights : float, optional
Clip the weights to the range [-clip_weights, clip_weights].
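Below is a per-element sketch of the non-centered (Tieleman & Hinton) variant in plain Perl; the centered (Graves) variant additionally tracks the mean gradient and a gamma2-weighted momentum term. Rescaling, gradient clipping, and weight clipping are omitted; an illustration only, not AI::MXNet's actual implementation:

    use strict;
    use warnings;

    # One non-centered RMSProp step. $n_ref holds the moving average of
    # squared gradients, decayed by gamma1.
    sub rmsprop_step {
        my ($weight, $grad, $n_ref, %p) = @_;
        $grad  += $p{wd} * $weight;
        $$n_ref = $p{gamma1} * $$n_ref + (1 - $p{gamma1}) * $grad ** 2;
        return $weight
             - $p{learning_rate} * $grad / sqrt($$n_ref + $p{epsilon});
    }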
NAME
AI::MXNet::AdaDelta - AdaDelta optimizer.
DESCRIPTION
AdaDelta optimizer as described in
Zeiler, M. D. (2012).
*ADADELTA: An adaptive learning rate method.*
http://arxiv.org/abs/1212.5701
Parameters
----------
rho: float
Decay rate for both squared gradients and delta x
epsilon : float
The epsilon constant as described in the paper.
wd : float
L2 regularization coefficient added to all the weights.
rescale_grad : float, optional
Rescaling factor of the gradient. Normally should be 1/batch_size.
clip_gradient : float, optional
Clip the gradient to the range [-clip_gradient, clip_gradient].
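Below is a per-element sketch of the update from the ADADELTA paper in plain Perl: two exponentially decayed accumulators (squared gradients and squared updates) replace the explicit learning rate. Rescaling and clipping are omitted; an illustration only, not AI::MXNet's actual implementation:

    use strict;
    use warnings;

    # One AdaDelta step. $s holds the accumulators: acc_g for squared
    # gradients and acc_dx for squared parameter updates.
    sub adadelta_step {
        my ($weight, $grad, $s, %p) = @_;
        $grad      += $p{wd} * $weight;
        $s->{acc_g} = $p{rho} * $s->{acc_g} + (1 - $p{rho}) * $grad ** 2;
        my $delta   = -sqrt(($s->{acc_dx} + $p{epsilon})
                          / ($s->{acc_g}  + $p{epsilon})) * $grad;
        $s->{acc_dx} = $p{rho} * $s->{acc_dx} + (1 - $p{rho}) * $delta ** 2;
        return $weight + $delta;
    }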