Which notation would you use to denote the 3rd layer’s activations when the input is the 7th example from the 8th mini-batch?
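For reference, the superscript convention assumed here (the one these questions typically rely on; the actual answer options are not reproduced) uses [l] for the layer, {t} for the mini-batch, and (i) for the example within that mini-batch, so the activation in question would be written as:

```latex
a^{[3]\{8\}(7)}
```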
Which of these statements about mini-batch gradient descent do you agree with?
Why is the best mini-batch size usually not 1 and not m, but instead something in-between?
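To make the trade-off behind the two mini-batch questions concrete, here is a minimal numpy sketch of mini-batch gradient descent on a toy linear model; the shapes, cost, and hyperparameters are illustrative assumptions, not anything taken from the quiz. With batch_size = 1 every update forfeits the vectorization speedup, while batch_size = m makes each single update process the entire training set.

```python
# Minimal sketch of mini-batch gradient descent on a toy linear model.
import numpy as np

rng = np.random.default_rng(0)
m, n = 1024, 5                     # m training examples, n features (illustrative)
X = rng.normal(size=(m, n))
true_w = rng.normal(size=(n, 1))
y = X @ true_w + 0.1 * rng.normal(size=(m, 1))

w = np.zeros((n, 1))
alpha = 0.1                        # learning rate
batch_size = 64                    # an in-between choice: not 1, not m

for epoch in range(20):
    perm = rng.permutation(m)      # shuffle the examples once per epoch
    for start in range(0, m, batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # Vectorized gradient over the whole mini-batch:
        # batch_size = 1 would lose this speedup (stochastic GD),
        # batch_size = m would make every single update very expensive.
        grad = Xb.T @ (Xb @ w - yb) / len(idx)
        w -= alpha * grad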
Suppose your learning algorithm’s cost J, plotted as a function of the number of iterations, looks like this:
Which of these is NOT a good learning rate decay scheme? Here, t is the epoch number.
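As a reminder of what a reasonable scheme looks like, a few commonly used learning rate decay rules (listed here as standard examples, not as the quiz’s actual answer options) all shrink the learning rate as the epoch number t grows:

```latex
\alpha = \frac{1}{1 + \text{decay\_rate} \cdot t}\,\alpha_0, \qquad
\alpha = 0.95^{\,t}\,\alpha_0, \qquad
\alpha = \frac{k}{\sqrt{t}}\,\alpha_0
```

A scheme that keeps the learning rate constant or makes it grow with t would not qualify as learning rate decay.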
Which of the following statements about Adam is False?
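For context, a minimal numpy sketch of the Adam update is shown below; the hyperparameter values (alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8) are the commonly cited defaults and are assumptions here, not quoted from the quiz.

```python
# Minimal sketch of one Adam parameter update.
import numpy as np

def adam_step(w, grad, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter array w given its gradient."""
    v = beta1 * v + (1 - beta1) * grad          # momentum-style average of gradients
    s = beta2 * s + (1 - beta2) * grad ** 2     # RMSprop-style average of squared gradients
    v_hat = v / (1 - beta1 ** t)                # bias correction (t = update count, starting at 1)
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

# Example usage with hypothetical values.
w = np.array([1.0, -2.0])
v, s = np.zeros_like(w), np.zeros_like(w)
w, v, s = adam_step(w, grad=np.array([0.1, -0.3]), v=v, s=s, t=1)
```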
Suppose batch gradient descent in a deep network is taking excessively long to find a value of the parameters that achieves a small value for the cost function J(W^{[1]}, b^{[1]}, ..., W^{[L]}, b^{[L]}). Which of the following techniques could help find parameter values that attain a small value for J?
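One family of techniques this question points at is replacing plain batch gradient descent with an optimizer that uses gradient history. As an illustration only, here is a minimal numpy sketch of gradient descent with momentum; the beta=0.9 default and the dictionary-of-parameters layout are assumptions, not quiz content.

```python
# Minimal sketch of gradient descent with momentum.
import numpy as np

def momentum_step(params, grads, velocities, alpha=0.01, beta=0.9):
    """Update every parameter using an exponentially weighted average of its gradients."""
    for key in params:
        velocities[key] = beta * velocities[key] + (1 - beta) * grads[key]
        params[key] = params[key] - alpha * velocities[key]
    return params, velocities

# Example usage with hypothetical layer-1 parameters W1 and b1.
params = {"W1": np.ones((3, 2)), "b1": np.zeros((3, 1))}
grads = {"W1": 0.1 * np.ones((3, 2)), "b1": 0.1 * np.ones((3, 1))}
velocities = {k: np.zeros_like(v) for k, v in params.items()}
params, velocities = momentum_step(params, grads, velocities)
```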