Is the following true? a_{4}^{[2]} is the activation output by the 4th neuron of the 2nd layer

**Explanation :**yes

Is the following true? X is a matrix in which each column is one training example.

**Explanation :**yes

X is a matrix in which each row is one training example.

**Explanation :**false

a_{4}^{[2]} is the activation output for the 4th training example of the 2nd layer

a^{[2](12)}
denotes activation vector of the 12th layer on the 2nd
training example.

a^{[2]}
denotes the activation vector of the 2nd layer.

a^{[2](12)}
denotes the activation vector of the 2nd layer for the 12th
training example.

The tanh activation usually works better than sigmoid activation function for hidden units because the mean of its output is closer to zero, and so it centers the data better for the next layer. True/False?

**Explanation :**Yes. As seen in lecture the output of the tanh is between -1 and 1, it thus centers the data which makes the learning simpler for the next layer.

Which of these is a correct vectorized implementation of forward propagation for layer l, where 1 ≤ l ≤ L?

You are building a binary classifier for recognizing cucumbers (y=1) vs. watermelons (y=0). Which one of these activation functions would you recommend using for the output layer?

**Explanation :**Yes. Sigmoid outputs a value between 0 and 1 which makes it a very good choice for binary classification. You can classify as 0 if the output is less than 0.5 and classify as 1 if the output is more than 0.5. It can be done with tanh as well but it is less convenient as the output is between -1 and 1.

Consider the following code:

A = np.random.randn(4,3)

B = np.sum(A, axis = 1, keepdims = True)

What will be B.shape? (If you’re not sure, feel free to run this in python to find out).

**Explanation :**Yes, we use (keepdims = True) to make sure that A.shape is (4,1) and not (4, ). It makes our code more rigorous.

Suppose you have built a neural network. You decide to initialize the weights and biases to be zero. Which of the following statements is true?

Logistic regression’s weights w should be initialized randomly rather than to all zeros, because if you initialize to all zeros, then logistic regression will fail to learn a useful decision boundary because it will fail to “break symmetry”, True/False?

**Explanation :**No, Logistic Regression doesn't have a hidden layer. If you initialize the weights to zeros, the first example x fed in the logistic regression will output zero but the derivatives of the Logistic Regression depend on the input x (because there's no hidden layer) which is not zero. So at the second iteration, the weights values follow x's distribution and are different from each other if x is not a constant vector.

You have built a network using the tanh activation for all the hidden units. You initialize the weights to relative large values, using np.random.randn(..,..)*1000. What will happen?

**Explanation :**Yes. tanh becomes flat for large values, this leads its gradient to be close to zero. This slows down the optimization algorithm.

Consider the following 1 hidden layer neural network:

W^{[1]} will have shape (2, 4)

Consider the following 1 hidden layer neural network:

b^{[1]} will have shape (4,1)

Consider the following 1 hidden layer neural network:

W^{[1]} will have shape (4,2)

Consider the following 1 hidden layer neural network:

W^{[2]} will have shape (1,4)

Consider the following 1 hidden layer neural network:

b^{[2]} will have shape (1,1)

Consider the following 1 hidden layer neural network:

Z^{[1]} will have shape (4,1)

Consider the following 1 hidden layer neural network:

A^{[1]} will have shape (4,1)