• During the history of deep learning, many researchers, including some very well-known ones, have proposed optimization algorithms and shown that they work well on a few problems.

• But those optimization algorithms subsequently were shown not to really generalize that well to the wide range of neural networks you might want to train.

• So over time, the deep learning community actually developed some amount of skepticism about new optimization algorithms.

• And a lot of people felt that gradient descent with momentum works really well, and that it was difficult to propose anything that works much better.

• So RMSprop and the Adam optimization algorithm, which we'll discuss in this section, are among those rare algorithms that have really stood up and have been shown to work well across a wide range of deep learning architectures.

• So, this is one of the algorithms that we wouldn't hesitate to recommend you try because many people have tried it and seen it work well on many problems.

• And the Adam optimization algorithm is basically taking momentum and RMSprop and putting them together.

• So, let's see how that works. Below is RMSprop, which we saw in the last section.

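• On iteration t, compute dW, db on the current mini-batch, then (the squares are element-wise):

$$S_{dW} = \beta S_{dW} + (1 - \beta)dW^2 \\ S_{db} = \beta S_{db} + (1 - \beta)db^2 \\ W := W - \alpha \frac{dW}{\sqrt{S_{dW}} + \epsilon} \\ b := b - \alpha \frac{db}{\sqrt{S_{db}} + \epsilon}$$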
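
• The RMSprop update can be sketched in NumPy as follows. This is only a minimal illustration: the function name `rmsprop_step`, the default values for `alpha` and `beta`, and the toy gradients are assumptions made for the example, not something fixed by these notes.

```python
import numpy as np

def rmsprop_step(W, b, dW, db, S_dW, S_db, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMSprop update (hypothetical helper; defaults are illustrative)."""
    # Exponentially weighted average of the element-wise squared gradients.
    S_dW = beta * S_dW + (1 - beta) * dW ** 2
    S_db = beta * S_db + (1 - beta) * db ** 2
    # Divide each gradient component by the root of its running average.
    W = W - alpha * dW / (np.sqrt(S_dW) + eps)
    b = b - alpha * db / (np.sqrt(S_db) + eps)
    return W, b, S_dW, S_db

# Toy usage: one step with made-up gradients of very different scales.
W, b = np.array([1.0, -2.0]), np.array([0.5])
S_dW, S_db = np.zeros_like(W), np.zeros_like(b)
dW, db = np.array([0.1, -10.0]), np.array([1.0])
W, b, S_dW, S_db = rmsprop_step(W, b, dW, db, S_dW, S_db)
```

• Notice that the two components of W move by the same amount here even though their gradients differ by a factor of 100, because each one is divided by the root of its own running average.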



• Gradient descent with momentum: $$V_{dW} = \beta V_{dW} + (1 - \beta)dW \\ V_{db} = \beta V_{db} + (1 - \beta)db$$
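
• For comparison, here is a small NumPy sketch of the momentum averages. The formula above only shows the averages; the parameter update `W - alpha * V_dW` used here is the usual momentum step, and the function name `momentum_step` and the default values are assumptions for this example.

```python
import numpy as np

def momentum_step(W, b, dW, db, V_dW, V_db, alpha=0.01, beta=0.9):
    """One momentum update (hypothetical helper; defaults are illustrative)."""
    # Exponentially weighted average of the gradients themselves.
    V_dW = beta * V_dW + (1 - beta) * dW
    V_db = beta * V_db + (1 - beta) * db
    # Step along the averaged gradient instead of the raw gradient.
    W = W - alpha * V_dW
    b = b - alpha * V_db
    return W, b, V_dW, V_db

# Toy usage: two steps with a constant gradient of 1.
W, b = np.array([0.0]), np.array([0.0])
V_dW, V_db = np.zeros_like(W), np.zeros_like(b)
for _ in range(2):
    W, b, V_dW, V_db = momentum_step(
        W, b, np.array([1.0]), np.array([1.0]), V_dW, V_db
    )
```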

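• To implement Adam, you would initialize $$V_{dW} = 0$$, $$S_{dW} = 0$$, and similarly $$V_{db} = 0$$, $$S_{db} = 0$$.

• And then on iteration t, you would compute the derivatives: compute dW, db using the current mini-batch.

• And then you do the momentum exponentially weighted average using $$\beta_1$$, the RMSprop-style average of the squared derivatives using $$\beta_2$$, and bias correction on both.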

• This is shown below:

• $$V_{dW} = 0, V_{db} = 0, S_{dW} = 0, S_{db} = 0 \\\\ \text{On iteration t:} \\\\ V_{dW} = \beta_1 V_{dW} + (1 - \beta_1)dW \\\\ V_{db} = \beta_1 V_{db} + (1 - \beta_1)db \\\\ S_{dW} = \beta_2 S_{dW} + (1 - \beta_2)dW^2 \\\\ S_{db} = \beta_2 S_{db} + (1 - \beta_2)db^2 \\\\ V_{dW}^{corrected} = \frac{V_{dW}}{1 - \beta_1^t}\\\\ V_{db}^{corrected} = \frac{V_{db}}{1 - \beta_1^t}\\\\ S_{dW}^{corrected} = \frac{S_{dW}}{1 - \beta_2^t}\\\\ S_{db}^{corrected} = \frac{S_{db}}{1 - \beta_2^t}\\\\ W := W - \alpha \frac{V_{dW}^{corrected}}{\sqrt{S_{dW}^{corrected}}+\epsilon}\\\\ b := b - \alpha \frac{V_{db}^{corrected}}{\sqrt{S_{db}^{corrected}}+\epsilon}\\\\$$

• So if you were just implementing momentum, you'd update using $$V_{dW}$$, or maybe $$V_{dW}^{corrected}$$.

• But now, we add in the RMSprop portion of this.

• So we're also going to divide by the square root of $$S_{dW}^{corrected}$$, plus $$\epsilon$$.

• And similarly, b gets updated by a similar formula: $$V_{db}^{corrected}$$ divided by the square root of $$S_{db}^{corrected}$$, plus $$\epsilon$$.

• And so, this algorithm combines the effect of gradient descent with momentum together with gradient descent with RMSprop.
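
• Putting the pieces together, one Adam step might look like the NumPy sketch below. It handles a single parameter array, so you would call it once for W and once for b. The function name `adam_step`, the toy loss, and the learning rate used in the loop are assumptions for this example. The iteration counter t has to start at 1, otherwise the bias correction divides by zero.

```python
import numpy as np

def adam_step(param, grad, V, S, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter array (hypothetical helper)."""
    # Momentum part: exponentially weighted average of the gradients.
    V = beta1 * V + (1 - beta1) * grad
    # RMSprop part: exponentially weighted average of the squared gradients.
    S = beta2 * S + (1 - beta2) * grad ** 2
    # Bias correction; t counts iterations starting from 1.
    V_corrected = V / (1 - beta1 ** t)
    S_corrected = S / (1 - beta2 ** t)
    # Combined update: momentum direction, RMSprop scaling.
    param = param - alpha * V_corrected / (np.sqrt(S_corrected) + eps)
    return param, V, S

# Toy usage: minimise f(w) = sum(w^2), whose gradient is 2w.
w = np.array([3.0, -4.0])
V, S = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 2001):
    grad = 2 * w
    w, V, S = adam_step(w, grad, V, S, t, alpha=0.01)
```

• One thing the bias correction buys you: on the very first iteration the corrected averages equal the gradient and its square, so each component moves by roughly $$\alpha$$ in the direction opposite to its gradient, whatever the gradient's scale.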

• And this is a commonly used learning algorithm that has been shown to be very effective for many different neural networks with a wide variety of architectures.

• So, this algorithm has a number of hyperparameters.

• The learning rate hyperparameter $$\alpha$$ is still important and usually needs to be tuned.

• So you just have to try a range of values and see what works.

• A common choice, really the default choice, for $$\beta_1$$ is 0.9.

• For the hyperparameter $$\beta_2$$, the authors of the Adam paper recommend 0.999.

• And then for $$\epsilon$$, the choice doesn't matter very much, but the authors of the Adam paper recommend $$10^{-8}$$.

• In practice you rarely need to tune this parameter, and it has very little effect on performance.
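
• As a rough illustration of "try a range of values", the sketch below fixes $$\beta_1$$, $$\beta_2$$ and $$\epsilon$$ at the values quoted above and sweeps only $$\alpha$$ on a log scale. The toy loss f(w) = w^2, the starting point, the number of steps, the candidate learning rates, and the helper name `run_adam` are all assumptions chosen for this example; on a real network you would compare validation performance instead.

```python
# Values quoted in the text; alpha is the one that usually needs tuning.
BETA1, BETA2, EPS = 0.9, 0.999, 1e-8

def run_adam(alpha, steps=500):
    """Minimise the toy loss f(w) = w^2 from w = 5 and return the final loss."""
    w, V, S = 5.0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = 2.0 * w
        V = BETA1 * V + (1 - BETA1) * grad
        S = BETA2 * S + (1 - BETA2) * grad ** 2
        V_corrected = V / (1 - BETA1 ** t)
        S_corrected = S / (1 - BETA2 ** t)
        w -= alpha * V_corrected / (S_corrected ** 0.5 + EPS)
    return w ** 2

# Try learning rates on a log scale and compare the final loss for each.
final_loss = {alpha: run_adam(alpha) for alpha in (1e-4, 1e-3, 1e-2, 1e-1)}
best_alpha = min(final_loss, key=final_loss.get)
```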

• So, where does the term 'Adam' come from? Adam stands for Adaptive Moment Estimation. So $$\beta_1$$ is used to compute the exponentially weighted mean of the derivatives. This is called the first moment.

• And $$\beta_2$$ is used to compute the exponentially weighted average of the squared derivatives, and that's called the second moment.

• So that gives rise to the name adaptive moment estimation. But everyone just calls it the Adam optimization algorithm.