Optimizers in TensorFlow with formulas

To understand optimizers in TensorFlow, we first denote by C(\theta) a cost function, where \theta = (\theta_{1},\dots,\theta_{n})^{T} represents the parameters (weights) of the model.

1) Gradient descent optimizer

The rule for updating parameters follows:

\theta^{(t+1)} = \theta^{(t)} - \eta \nabla C(\theta^{(t)}),

where \eta is the learning_rate.

The Python code for the gradient descent optimizer is:

optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
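As a sanity check, the update rule above can be sketched in a few lines of plain Python outside TensorFlow. The quadratic cost C(\theta) = (\theta - 3)^{2} and the function names here are purely illustrative:

```python
# Plain-Python sketch of the gradient descent update
# theta_{t+1} = theta_t - eta * dC/dtheta,
# on the toy cost C(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).

def grad_C(theta):
    return 2.0 * (theta - 3.0)

def gradient_descent(theta, eta=0.1, steps=100):
    for _ in range(steps):
        theta = theta - eta * grad_C(theta)  # the update rule above
    return theta
```

Starting from theta = 0, the iterates converge toward the minimum at theta = 3.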

2) Adagrad optimizer

Parameters are updated by the following rule:

\theta_{i}^{(t+1)} = \theta_{i}^{(t)} - \frac{\eta}{\sqrt{\sum_{\tau = 1}^{t} \big( \frac{\partial C^{(\tau)}}{\partial \theta_{i}} \big)^{2} + \epsilon}} \frac{\partial C^{(t)}}{\partial \theta_{i}},

where the sum in the denominator accumulates the squared gradients of parameter \theta_{i} over all past steps. The TensorFlow code for this optimizer is

optimizer = tf.train.AdagradOptimizer(learning_rate=0.001, initial_accumulator_value=0.1).minimize(cost)
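The per-parameter accumulator can be sketched in plain Python on the same kind of toy quadratic cost (the cost, the function names, and the step counts are assumptions for illustration):

```python
# Adagrad sketch: accumulate squared gradients, then divide each step
# by the square root of the running sum, so the effective learning
# rate shrinks over time.

def grad_C(theta):
    # toy cost C(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

def adagrad(theta, eta=1.0, steps=500, eps=1e-8):
    accum = 0.0
    for _ in range(steps):
        g = grad_C(theta)
        accum += g ** 2                         # sum of squared gradients
        theta -= eta * g / (accum + eps) ** 0.5
    return theta
```

Note that `accum` only grows, which is exactly the fast learning-rate decay that RMSprop and Adadelta were designed to fix.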

3) RMSprop optimizer

RMSprop is an improvement on Rprop (follow this link if you want to see the Rprop algorithm). Because RMSprop is designed for mini-batch training, we replace the \theta notation with w_{ij}. The scheme for RMSprop is:

g_{ij}^{(t)} = \alpha g_{ij}^{(t-1)} + (1 - \alpha) \big( \frac{\partial C^{(t)}}{\partial w_{ij}} \big)^{2}

w_{ij}^{(t+1)} = w_{ij}^{(t)} - \frac{\eta}{\sqrt{g_{ij}^{(t)} + \epsilon}} \frac{\partial C^{(t)}}{\partial w_{ij}}

TensorFlow supports RMSprop optimization with the following code:

optimizer = tf.train.RMSPropOptimizer(learning_rate=0.001, decay=0.9, momentum=0.0, epsilon=1e-10).minimize(cost)
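The decaying average that distinguishes RMSprop from Adagrad can be sketched like this (again a toy quadratic cost with illustrative names and constants, not TensorFlow internals):

```python
# RMSprop sketch: g_avg is a decaying average of squared gradients
# (the g_ij of the scheme above), so old gradients are gradually
# forgotten instead of accumulating forever as in Adagrad.

def grad_C(w):
    # toy cost C(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

def rmsprop(w, eta=0.05, alpha=0.9, steps=500, eps=1e-8):
    g_avg = 0.0
    for _ in range(steps):
        g = grad_C(w)
        g_avg = alpha * g_avg + (1 - alpha) * g ** 2
        w -= eta * g / (g_avg + eps) ** 0.5
    return w
```

With a constant learning rate the iterate hovers near the minimum at w = 3 (the normalized step size does not vanish), which is the expected behavior of this update.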

4) Adadelta Optimizer

Adadelta optimization is similar to RMSprop, because both optimizers address the fast learning-rate decay problem of Adagrad. The rule for updating weights is:

g_{ij}^{(t)} = \gamma g_{ij}^{(t-1)} + (1 - \gamma) \big( \frac{\partial C^{(t)}}{\partial w_{ij}} \big)^{2}

w_{ij}^{(t+1)} = w_{ij}^{(t)} - \frac{\eta}{\sqrt{g_{ij}^{(t)} + \epsilon}} \frac{\partial C^{(t)}}{\partial w_{ij}}

The code pattern for the Adadelta optimizer in Python is

optimizer = tf.train.AdadeltaOptimizer(learning_rate=0.001, rho=0.95, epsilon=1e-08).minimize(cost)

where rho, learning_rate, and epsilon represent \gamma, \eta, and \epsilon, respectively.

5) Adam optimizer

Adam (Adaptive Moment Estimation) has the following update rule:

m_{ij}^{(t)} = \beta_{1} m_{ij}^{(t-1)} + (1 - \beta_{1}) \frac{\partial C^{(t)}}{\partial w_{ij}}

v_{ij}^{(t)} = \beta_{2} v_{ij}^{(t-1)} + (1 - \beta_{2}) \big( \frac{\partial C^{(t)}}{\partial w_{ij}} \big)^{2}

\hat{m}_{ij}^{(t)} = \frac{ m_{ij}^{(t)}}{(1 - \beta_{1}^{t})}

\hat{v}_{ij}^{(t)} = \frac{ v_{ij}^{(t)}}{(1 - \beta_{2}^{t})}

w_{ij}^{(t+1)} = w_{ij}^{(t)} - \frac{\eta}{\sqrt{\hat{v}_{ij}^{(t)} + \epsilon}} \frac{\partial C^{(t)}}{\partial w_{ij}}

The usage code for the Adam optimizer is the following:

optimizer = tf.train.AdamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08).minimize(cost)
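The five equations above translate almost line by line into plain Python. This is a sketch on the same toy quadratic cost (function names and hyperparameter values are illustrative):

```python
# Adam sketch: first (m) and second (v) moment estimates with
# bias correction, following the update rule above.

def grad_C(w):
    # toy cost C(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

def adam(w, eta=0.1, b1=0.9, b2=0.999, steps=500, eps=1e-8):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad_C(w)
        m = b1 * m + (1 - b1) * g            # first moment estimate
        v = b2 * v + (1 - b2) * g ** 2       # second moment estimate
        m_hat = m / (1 - b1 ** t)            # bias correction
        v_hat = v / (1 - b2 ** t)
        w -= eta * m_hat / ((v_hat + eps) ** 0.5)
    return w
```

The bias correction matters early on: at t = 1 the raw m and v are scaled down by (1 - \beta_{1}) and (1 - \beta_{2}), and dividing by (1 - \beta_{1}^{t}) and (1 - \beta_{2}^{t}) undoes exactly that, so the very first step has magnitude close to \eta.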

6) Momentum optimizer and Nesterov algorithm

Weights with the momentum optimizer are updated by the following rule:

v_{i}^{(t+1)} = \alpha v_{i}^{(t)} - \eta \frac{\partial C^{(t)}}{\partial w_{i}}\big( w_{i}^{(t)} \big)

w_{i}^{(t+1)} = w_{i}^{(t)} + v_{i}^{(t+1)}.

With the Nesterov accelerated gradient technique, the scheme for the momentum optimizer is:

\theta^{\big (t + \frac {1}{2}\big )} = \theta^{(t)} + \alpha v^{(t)}

v^{(t + 1)} = \alpha v^{(t)} - \eta \nabla C^{(t)}\big[ \theta = \theta^{\big(t + \frac {1}{2}\big)} \big]

\theta^{(t + 1)} = \theta^{(t)} + v^{(t + 1)}

One can use the following Python code for momentum optimization:

optimizer = tf.train.MomentumOptimizer(learning_rate=0.001, momentum=0.9, use_nesterov=False).minimize(cost)
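Both variants fit in one short sketch: the only difference is whether the gradient is evaluated at the current point or at the look-ahead point \theta^{(t)} + \alpha v^{(t)}. As before, the quadratic cost and the names are assumptions for illustration:

```python
# Momentum / Nesterov sketch, following the update rules above.

def grad_C(w):
    # toy cost C(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

def momentum(w, eta=0.1, alpha=0.9, steps=500, nesterov=False):
    v = 0.0
    for _ in range(steps):
        if nesterov:
            # gradient at the look-ahead point w + alpha * v
            v = alpha * v - eta * grad_C(w + alpha * v)
        else:
            # classical momentum: gradient at the current point
            v = alpha * v - eta * grad_C(w)
        w = w + v
    return w
```

On this toy cost both variants converge to the minimum at w = 3; the look-ahead evaluation gives Nesterov a partial correction for the velocity it is about to apply, which is why it typically damps oscillations faster than classical momentum.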
