objax.optimizer package
| Adam | Adam optimizer. |
| ExponentialMovingAverage | Maintains exponential moving averages for each variable from provided VarCollection. |
| Momentum | Momentum optimizer. |
| SGD | Stochastic Gradient Descent (SGD) optimizer. |
class objax.optimizer.Adam(vc, beta1=0.9, beta2=0.999, eps=1e-08)[source]
Adam optimizer.
Adam is an adaptive learning rate optimization algorithm originally presented in Adam: A Method for Stochastic Optimization. Specifically, when optimizing a loss function \(f\) parameterized by model weights \(w\), the update rule is as follows:
\[\begin{split}\begin{eqnarray} v_{k} &=& \beta_1 v_{k-1} + (1 - \beta_1) \nabla f (.; w_{k-1}) \nonumber \\ s_{k} &=& \beta_2 s_{k-1} + (1 - \beta_2) (\nabla f (.; w_{k-1}))^2 \nonumber \\ \hat{v_{k}} &=& \frac{v_{k}}{(1 - \beta_{1}^{k})} \nonumber \\ \hat{s_{k}} &=& \frac{s_{k}}{(1 - \beta_{2}^{k})} \nonumber \\ w_{k} &=& w_{k-1} - \eta \frac{\hat{v_{k}}}{\sqrt{\hat{s_{k}}} + \epsilon} \nonumber \end{eqnarray}\end{split}\]
Adam updates exponential moving averages of the gradient \((v_{k})\) and the squared gradient \((s_{k})\), where the hyperparameters \(\beta_1\) and \(\beta_2 \in [0, 1)\) control the exponential decay rates of these moving averages. The constant \(\eta\) in the weight update rule is the learning rate and is passed as a parameter to the __call__ method. Note that the implementation uses the approximation \(\sqrt{\hat{s_{k}} + \epsilon} \approx \sqrt{\hat{s_{k}}} + \epsilon\).
__init__(vc, beta1=0.9, beta2=0.999, eps=1e-08)[source]
Constructor for Adam optimizer class.
- Parameters
vc (objax.variable.VarCollection) – collection of variables to optimize.
beta1 (float) – value of Adam’s beta1 hyperparameter. Defaults to 0.9.
beta2 (float) – value of Adam’s beta2 hyperparameter. Defaults to 0.999.
eps (float) – value of Adam’s epsilon hyperparameter. Defaults to 1e-8.
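A minimal usage sketch: the toy objax.nn.Linear model, mean-squared-error loss, and objax.GradValues gradient helper below are assumptions drawn from elsewhere in the library, not part of this class. Gradients are computed with respect to the same VarCollection passed to the optimizer, and the learning rate is supplied at call time.

```python
import objax
import jax.numpy as jn

# Toy model and loss, for illustration only.
model = objax.nn.Linear(3, 1)
opt = objax.optimizer.Adam(model.vars(), beta1=0.9, beta2=0.999)

def loss(x, y):
    return ((model(x) - y) ** 2).mean()

# GradValues returns (gradients, loss values) for the given variables.
gv = objax.GradValues(loss, model.vars())

def train_op(x, y, lr):
    g, v = gv(x, y)
    opt(lr, g)  # the learning rate eta is passed at call time
    return v

x, y = jn.ones((4, 3)), jn.zeros((4, 1))
print(train_op(x, y, lr=1e-3))
```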
class objax.optimizer.ExponentialMovingAverage(vc, momentum=0.999, debias=False, eps=1e-06)[source]
Maintains exponential moving averages for each variable from provided VarCollection.
When training a model, it is often beneficial to maintain exponential moving averages (EMA) of the trained parameters. Evaluations that use averaged parameters sometimes produce significantly better results than the final trained values (see Acceleration of Stochastic Approximation by Averaging).
This maintains an EMA of the parameters passed in the VarCollection vc. The EMA update rule for weights \(w\) and the EMA \(m\) at step \(t\), when using a momentum \(\mu\), is:
\[m_t = \mu m_{t-1} + (1 - \mu) w_t\]
The EMA weights \(\hat{w_t}\) are simply \(m_t\) when debias=False. When debias=True, the EMA weights are defined as:
\[\hat{w_t} = \frac{m_t}{1 - (1 - \epsilon)\mu^t}\]
where \(\epsilon\) is a small constant to avoid division by zero.
__init__(vc, momentum=0.999, debias=False, eps=1e-06)[source]
Creates ExponentialMovingAverage instance with given hyperparameters.
- Parameters
vc (objax.variable.VarCollection) – collection of variables for which to maintain exponential moving averages.
momentum (float) – the decay factor for the moving average.
debias (bool) – bool indicating whether to use initialization bias correction.
eps (float) – small adjustment to prevent division by zero.
refs_and_values()[source]
Returns the VarCollection of variables affected by Exponential Moving Average (EMA) and their corresponding EMA values.
- Return type
Tuple[objax.variable.VarCollection, List[Union[jax.numpy.lax_numpy.ndarray, jax.interpreters.xla.DeviceArray, jax.interpreters.pxla.ShardedDeviceArray]]]
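A sketch of one way to use it alongside the training step from the Adam example above (reusing model, opt, and gv from there): the averages are refreshed after every weight update, and refs_and_values() exposes the averaged values. The copy-back loop assumes VarCollection iterates like a dict of variables with an .assign() method, which is an assumption here rather than something documented in this section.

```python
# Track EMAs of all trainable weights (momentum and debias values are illustrative).
ema = objax.optimizer.ExponentialMovingAverage(model.vars(), momentum=0.999, debias=True)

def train_op(x, y, lr):
    g, v = gv(x, y)
    opt(lr, g)
    ema()  # refresh the moving averages after the weight update
    return v

# Copy the averaged values into the model, e.g. before evaluation.
# Assumes VarCollection behaves like a dict of variables supporting .assign().
refs, averaged = ema.refs_and_values()
for var, value in zip(refs.values(), averaged):
    var.assign(value)
```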
class objax.optimizer.Momentum(vc, momentum=0.9, nesterov=False)[source]
Momentum optimizer.
The momentum optimizer (expository article) introduces a tweak to standard gradient descent. Specifically, when optimizing a loss function \(f\) parameterized by model weights \(w\), the update rule is as follows:
\[\begin{split}\begin{eqnarray} v_{k} &=& \mu v_{k-1} + \nabla f (.; w_{k-1}) \nonumber \\ w_{k} &=& w_{k-1} - \eta v_{k} \nonumber \end{eqnarray}\end{split}\]
The term \(v\) is the velocity: it accumulates past gradients through a weighted moving average. The parameters \(\mu\) and \(\eta\) are the momentum and the learning rate, respectively.
The momentum class also implements Nesterov’s Accelerated Gradient (NAG) (see Sutskever et al.). Like momentum, NAG is a first-order optimization method with a better convergence rate than gradient descent in certain situations. The NAG update can be written as:
\[\begin{split}\begin{eqnarray} v_{k} &=& \mu v_{k-1} + \nabla f(.; w_{k-1} + \mu v_{k-1}) \nonumber \\ w_{k} &=& w_{k-1} - \eta v_{k} \nonumber \end{eqnarray}\end{split}\]
The implementation uses the simplification presented by Bengio et al.
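Momentum is constructed and called the same way as the other optimizers; a brief sketch reusing the model and gv helper assumed in the Adam example, with nesterov=True selecting the NAG update:

```python
opt = objax.optimizer.Momentum(model.vars(), momentum=0.9, nesterov=True)

def train_op(x, y, lr):
    g, v = gv(x, y)
    opt(lr, g)  # momentum / NAG update applied at learning rate lr
    return v
```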
class objax.optimizer.SGD(vc)[source]
Stochastic Gradient Descent (SGD) optimizer.
The stochastic gradient optimizer performs Stochastic Gradient Descent (SGD). It uses the following update rule for a loss \(f\) parameterized with model weights \(w\) and a user-provided learning rate \(\eta\):
\[w_k = w_{k-1} - \eta\nabla f(.; w_{k-1})\]
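A single SGD step is just this rule applied to every variable in vc; a brief sketch reusing the toy setup assumed in the Adam example:

```python
opt = objax.optimizer.SGD(model.vars())
g, v = gv(x, y)  # gradients of the loss at the current weights
opt(0.1, g)      # each weight w becomes w - 0.1 * gradient
```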