What is the "cache" used for in our implementation of forward propagation and backward propagation?

**Explanation :** Correct. The "cache" records values from the forward propagation units and passes them to the backward propagation units, because they are needed to compute the chain-rule derivatives.
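
As a minimal sketch of this idea, assume a single sigmoid unit with scalar inputs. The function names `unit_forward` and `unit_backward` and the cache layout are hypothetical, chosen only for illustration; they are not the course's own code.

```python
import math

def unit_forward(a_prev, w, b):
    """Forward step for one sigmoid unit; returns the activation and a cache."""
    z = w * a_prev + b
    a = 1.0 / (1.0 + math.exp(-z))
    # Hypothetical cache layout: the values the backward step will need.
    cache = (a_prev, w, a)
    return a, cache

def unit_backward(da, cache):
    """Backward step: applies the chain rule using values saved in the cache."""
    a_prev, w, a = cache
    dz = da * a * (1.0 - a)   # sigmoid'(z) = a * (1 - a)
    dw = dz * a_prev          # gradient with respect to w
    db = dz                   # gradient with respect to b
    da_prev = dz * w          # gradient passed back to the previous layer
    return da_prev, dw, db

a, cache = unit_forward(a_prev=0.5, w=2.0, b=-1.0)
da_prev, dw, db = unit_backward(da=1.0, cache=cache)
```

Without the cache, the backward step would have to recompute `a_prev`, `w`, and `a` or receive them some other way.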

Which ones are "hyperparameters"?

Among the following, which ones are "hyperparameters"?

Which of the following statements is true?

Vectorization allows you to compute forward propagation in an L-layer neural network without an explicit for-loop (or any other explicit iterative loop) over the layers l = 1, 2, ..., L. True/False?

**Explanation :** Forward propagation propagates the input through the layers one after another. For a shallow network we may simply write out each layer's lines by hand, but in a deeper network we cannot avoid a for-loop iterating over the layers.
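
A rough sketch of such a loop follows. It assumes ReLU in every layer and a hypothetical `params` list of (W, b) pairs. Plain Python lists stand in for NumPy arrays so that the sketch has no dependencies; with NumPy the body of `linear` would be a single vectorized `np.dot(W, a) + b`, yet the loop over layers would remain.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(W, a, b):
    """Compute W a + b for a weight matrix W (list of rows) and vectors a, b."""
    return [sum(w_ij * a_j for w_ij, a_j in zip(row, a)) + b_i
            for row, b_i in zip(W, b)]

def forward(x, params):
    """params is a list of (W, b) pairs, one per layer (hypothetical layout)."""
    a = x
    for W, b in params:          # explicit loop over layers l = 1, ..., L
        a = relu(linear(W, a, b))
    return a

params = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 1.0]),   # layer 1: 2 inputs -> 2 units
    ([[1.0, 1.0]], [-0.5]),                    # layer 2: 2 inputs -> 1 unit
]
y = forward([2.0, 1.0], params)
```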

Assume we store the values for n^{[l]} in an array called layer_dims, as follows: layer_dims = [n_{x}, 4, 3, 2, 1]. So layer 1 has four hidden units, layer 2 has three hidden units, and so on. Which of the following for-loops will allow you to initialize the parameters for the model?
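
One possible shape for such a loop is sketched below. The function name `initialize_parameters`, the 0.01 scale, the seed, and the choice n_x = 5 are assumptions for illustration. Nested lists stand in for arrays; with NumPy each weight matrix would typically come from `np.random.randn(n_l, n_prev) * 0.01`.

```python
import random

def initialize_parameters(layer_dims, scale=0.01, seed=0):
    rng = random.Random(seed)
    parameters = {}
    # Layer 0 is the input, so parameters exist for l = 1, ..., len(layer_dims) - 1.
    for l in range(1, len(layer_dims)):
        n_l, n_prev = layer_dims[l], layer_dims[l - 1]
        # W^{[l]} has shape (n^{[l]}, n^{[l-1]}): n_l rows of n_prev entries.
        parameters["W" + str(l)] = [
            [rng.gauss(0.0, 1.0) * scale for _ in range(n_prev)]
            for _ in range(n_l)
        ]
        # b^{[l]} has shape (n^{[l]}, 1): n_l rows of one entry each.
        parameters["b" + str(l)] = [
            [rng.gauss(0.0, 1.0) * scale] for _ in range(n_l)
        ]
    return parameters

layer_dims = [5, 4, 3, 2, 1]  # n_x = 5 is an arbitrary choice for illustration
parameters = initialize_parameters(layer_dims)
```

The loop starts at 1 and indexes `layer_dims[l - 1]` for the column count, which is the detail the quiz options are designed to test.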

Consider the following neural network. How many layers does this network have?

**Explanation :** Yes. As seen in lecture, the number of layers is counted as the number of hidden layers + 1. The input and output layers are not counted as hidden layers.

During forward propagation, the forward function for layer l needs to know which activation function that layer uses (sigmoid, tanh, ReLU, etc.). During backpropagation, the corresponding backward function also needs to know the activation function for layer l, since the gradient depends on it. True/False?

**Explanation :** Yes. As you saw in week 3, each activation has a different derivative. Thus, during backpropagation you need to know which activation was used in forward propagation to be able to compute the correct derivative.
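
To make this concrete, here is a small sketch for scalar z. The `ACTIVATIONS` table and the `activation_backward` helper are hypothetical names, and the ReLU derivative at z = 0 is taken to be 0 by convention.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Each entry pairs an activation g(z) with its derivative g'(z).
ACTIVATIONS = {
    "sigmoid": (sigmoid, lambda z: sigmoid(z) * (1.0 - sigmoid(z))),
    "tanh": (math.tanh, lambda z: 1.0 - math.tanh(z) ** 2),
    "relu": (lambda z: max(0.0, z), lambda z: 1.0 if z > 0 else 0.0),
}

def activation_backward(da, z, activation):
    """Return dz = da * g'(z) for the activation used in the forward pass."""
    _, g_prime = ACTIVATIONS[activation]
    return da * g_prime(z)

# Same upstream gradient and same z, but a different dz per activation.
dz = {name: activation_backward(1.0, 0.0, name) for name in ACTIVATIONS}
```

Passing the wrong activation name to the backward helper would silently produce a wrong gradient, which is why the backward function must know what the forward function used.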

There are certain functions with the following properties: (i) To compute the function using a shallow network circuit, you will need a large network (where we measure size by the number of logic gates in the network), but (ii) To compute it using a deep network circuit, you need only an exponentially smaller network. True/False?
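
One standard illustration, offered here as an assumption since the question does not name a specific function, is the parity (XOR) of n input bits. A tree of two-input XOR gates needs n - 1 gates and depth of about log2(n), whereas a depth-2 circuit in sum-of-products form needs one AND term for each of the 2^(n-1) odd-parity inputs. The sketch below only counts gates under those assumptions; the helper names are hypothetical.

```python
from itertools import product

def xor_tree(bits):
    """Reduce bits pairwise with XOR; return (parity, gates_used, depth)."""
    level, gates, depth = list(bits), 0, 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] ^ level[i + 1])
            gates += 1
        if len(level) % 2:        # an unpaired bit passes through unchanged
            nxt.append(level[-1])
        level, depth = nxt, depth + 1
    return level[0], gates, depth

def shallow_parity_terms(n):
    """Count the AND terms in the sum-of-products form of n-bit parity."""
    return sum(1 for bits in product((0, 1), repeat=n) if sum(bits) % 2 == 1)

n = 8
parity, deep_gates, depth = xor_tree([1, 0, 1, 1, 0, 0, 1, 0])
shallow_terms = shallow_parity_terms(n)
```

For n = 8 the deep tree uses 7 gates across 3 levels, while the shallow form needs 128 terms, and the gap doubles with every additional input bit.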

Consider the following 2 hidden layer neural network:

**Explanation :** W^{[l]} has shape (n^{[l]}, n^{[l-1]}).

Consider the following 2 hidden layer neural network:

**Explanation :** b^{[l]} has shape (n^{[l]}, 1).

Whereas the previous question used a specific network, in the general case what is the dimension of W^{[l]}, the weight matrix associated with layer l?
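
The general shape follows from requiring the linear step of layer l to be dimensionally consistent. A short derivation for a single example, assuming the usual linear step z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]} (the symbols z and a are not defined in the questions above):

```latex
z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}, \qquad
a^{[l-1]} \in \mathbb{R}^{n^{[l-1]} \times 1}, \quad
z^{[l]},\, b^{[l]} \in \mathbb{R}^{n^{[l]} \times 1}
\;\Longrightarrow\;
W^{[l]} \in \mathbb{R}^{n^{[l]} \times n^{[l-1]}}
```

This agrees with the shapes stated in the explanations above: W^{[l]} is (n^{[l]}, n^{[l-1]}) and b^{[l]} is (n^{[l]}, 1).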