Which notation would you use to denote the 3rd layer’s activations when the input is the 7th example from the 8th mini-batch?
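For reference, the superscript convention assumed here (the one these questions typically rely on; the actual answer options are not reproduced) uses [l] for the layer, {t} for the mini-batch, and (i) for the example within that mini-batch, so the activation in question would be written as:

```latex
a^{[3]\{8\}(7)}
```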
Which of these statements about mini-batch gradient descent do you agree with?
Why is the best mini-batch size usually not 1 and not m, but instead something in-between?
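To make the trade-off behind the two mini-batch questions concrete, here is a minimal numpy sketch of mini-batch gradient descent on a toy linear model; the shapes, cost, and hyperparameters are illustrative assumptions, not anything taken from the quiz. With batch_size = 1 every update forfeits the vectorization speedup, while batch_size = m makes each single update process the entire training set.

```python
# Minimal sketch of mini-batch gradient descent on a toy linear model.
import numpy as np

rng = np.random.default_rng(0)
m, n = 1024, 5                     # m training examples, n features (illustrative)
X = rng.normal(size=(m, n))
true_w = rng.normal(size=(n, 1))
y = X @ true_w + 0.1 * rng.normal(size=(m, 1))

w = np.zeros((n, 1))
alpha = 0.1                        # learning rate
batch_size = 64                    # an in-between choice: not 1, not m

for epoch in range(20):
    perm = rng.permutation(m)      # shuffle the examples once per epoch
    for start in range(0, m, batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # Vectorized gradient over the whole mini-batch:
        # batch_size = 1 would lose this speedup (stochastic GD),
        # batch_size = m would make every single update very expensive.
        grad = Xb.T @ (Xb @ w - yb) / len(idx)
        w -= alpha * grad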
Suppose your learning algorithm’s cost J, plotted as a function of the number of iterations, looks like this:
Which of these is NOT a good learning rate decay scheme? Here, t is the epoch number.
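As a reminder of what a reasonable scheme looks like, a few commonly used learning rate decay rules (listed here as standard examples, not as the quiz’s actual answer options) all shrink the learning rate as the epoch number t grows:

```latex
\alpha = \frac{1}{1 + \text{decay\_rate} \cdot t}\,\alpha_0, \qquad
\alpha = 0.95^{\,t}\,\alpha_0, \qquad
\alpha = \frac{k}{\sqrt{t}}\,\alpha_0
```

A scheme that keeps the learning rate constant or makes it grow with t would not qualify as learning rate decay.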
Which of the following statements about Adam is False?
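For context, a minimal numpy sketch of the Adam update is shown below; the hyperparameter values (alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8) are the commonly cited defaults and are assumptions here, not quoted from the quiz.

```python
# Minimal sketch of one Adam parameter update.
import numpy as np

def adam_step(w, grad, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter array w given its gradient."""
    v = beta1 * v + (1 - beta1) * grad          # momentum-style average of gradients
    s = beta2 * s + (1 - beta2) * grad ** 2     # RMSprop-style average of squared gradients
    v_hat = v / (1 - beta1 ** t)                # bias correction (t = update count, starting at 1)
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

# Example usage with hypothetical values.
w = np.array([1.0, -2.0])
v, s = np.zeros_like(w), np.zeros_like(w)
w, v, s = adam_step(w, grad=np.array([0.1, -0.3]), v=v, s=s, t=1)
```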
Suppose batch gradient descent in a deep network is taking excessively long to find a value of the parameters that achieves a small value for the cost function J(W^{[1]}, b^{[1]}, ..., W^{[L]}, b^{[L]}). Which of the following techniques could help find parameter values that attain a small value for J?
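One family of techniques this question points at is replacing plain batch gradient descent with an optimizer that uses gradient history. As an illustration only, here is a minimal numpy sketch of gradient descent with momentum; the beta=0.9 default and the dictionary-of-parameters layout are assumptions, not quiz content.

```python
# Minimal sketch of gradient descent with momentum.
import numpy as np

def momentum_step(params, grads, velocities, alpha=0.01, beta=0.9):
    """Update every parameter using an exponentially weighted average of its gradients."""
    for key in params:
        velocities[key] = beta * velocities[key] + (1 - beta) * grads[key]
        params[key] = params[key] - alpha * velocities[key]
    return params, velocities

# Example usage with hypothetical layer-1 parameters W1 and b1.
params = {"W1": np.ones((3, 2)), "b1": np.zeros((3, 1))}
grads = {"W1": 0.1 * np.ones((3, 2)), "b1": 0.1 * np.ones((3, 1))}
velocities = {k: np.zeros_like(v) for k, v in params.items()}
params, velocities = momentum_step(params, grads, velocities)
```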