1. Explanation :
Correct. The "cache" records values from the forward propagation units and passes them to the corresponding backward propagation units, because they are needed to compute the chain-rule derivatives.
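A minimal sketch of this idea (function and variable names are illustrative, not the course's exact code): the forward step stores the values the backward step will need, and the backward step retrieves them from the cache to apply the chain rule.

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def forward(A_prev, W, b):
    # Linear step plus activation; cache the values the
    # backward pass will need for the chain rule.
    Z = W @ A_prev + b
    A = sigmoid(Z)
    cache = (A_prev, W, b, Z)   # stored for backpropagation
    return A, cache

def backward(dA, cache):
    # Retrieve cached forward values to compute the derivatives.
    A_prev, W, b, Z = cache
    s = sigmoid(Z)
    dZ = dA * s * (1 - s)                # chain rule through the activation
    dW = dZ @ A_prev.T                   # needs the cached A_prev
    db = dZ.sum(axis=1, keepdims=True)
    dA_prev = W.T @ dZ                   # needs the cached W
    return dA_prev, dW, db

rng = np.random.default_rng(0)
A0 = rng.standard_normal((3, 5))         # 3 features, 5 examples
W1 = rng.standard_normal((4, 3))
b1 = np.zeros((4, 1))
A1, cache1 = forward(A0, W1, b1)
dA0, dW1, db1 = backward(np.ones_like(A1), cache1)
```

Without the cache, the backward pass would have to recompute Z and A_prev, or could not compute dW and dA_prev at all.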

1. Explanation :
Forward propagation propagates the input through the layers. For a shallow network we could write out every layer's computation explicitly, but for a deeper network we cannot avoid a for loop iterating over the layers.
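A small sketch of that loop (the parameter-dictionary naming is an assumption, chosen to mirror the n[l] notation used below): since the number of layers L is a variable, the layer-by-layer computation has to be a loop, not unrolled lines.

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def L_layer_forward(X, params, L):
    # Iterate over the L layers; for arbitrary L this loop
    # cannot be replaced by explicitly written-out lines.
    A = X
    for l in range(1, L + 1):
        W = params["W" + str(l)]
        b = params["b" + str(l)]
        A = relu(W @ A + b)
    return A

layer_dims = [3, 5, 4, 1]                # n[0] .. n[3]
rng = np.random.default_rng(1)
params = {}
for l in range(1, len(layer_dims)):
    params["W" + str(l)] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
    params["b" + str(l)] = np.zeros((layer_dims[l], 1))

X = rng.standard_normal((3, 7))          # 3 features, 7 examples
AL = L_layer_forward(X, params, L=3)
```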

1. Explanation :
Yes. As seen in lecture, the number of layers is counted as the number of hidden layers + 1. The input and output layers are not counted as hidden layers.

1. Explanation :
Yes. As you saw in week 3, each activation function has a different derivative. Thus, during backpropagation you need to know which activation was used in the forward propagation in order to compute the correct derivative.
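A quick illustration (helper names are mine, not from the assignment): the same upstream gradient dA yields different dZ depending on whether sigmoid or ReLU was used in the forward pass.

```python
import numpy as np

def sigmoid_backward(dA, Z):
    # Sigmoid derivative: g'(Z) = g(Z) * (1 - g(Z))
    s = 1 / (1 + np.exp(-Z))
    return dA * s * (1 - s)

def relu_backward(dA, Z):
    # ReLU derivative: 1 where Z > 0, else 0
    dZ = dA.copy()
    dZ[Z <= 0] = 0
    return dZ

Z = np.array([[-1.0, 0.0, 2.0]])
dA = np.ones_like(Z)
dZ_sig = sigmoid_backward(dA, Z)   # smooth, nonzero everywhere
dZ_relu = relu_backward(dA, Z)     # zeroed where Z <= 0
```

Picking the wrong backward function here would silently produce wrong gradients, which is why the cache typically records which activation each layer used.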

1. Explanation :
W[l] has shape (n[l], n[l-1])

1. Explanation :
b[l] has shape (n[l], 1)

1. Explanation :
b[l] has shape (n[l], 1)

1. Explanation :
b[l] has shape (n[l], 1)

1. Explanation :
W[l] has shape (n[l], n[l-1])

1. Explanation :
W[l] has shape (n[l], n[l-1])
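The shape rules above can be checked with a short initialization sketch (the layer sizes are arbitrary examples): each W[l] maps n[l-1] activations to n[l] units, and each b[l] is a column vector broadcast across the examples.

```python
import numpy as np

layer_dims = [2, 4, 3, 1]   # n[0], n[1], n[2], n[3]
params = {}
for l in range(1, len(layer_dims)):
    # W[l] has shape (n[l], n[l-1]); b[l] has shape (n[l], 1).
    params["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
    params["b" + str(l)] = np.zeros((layer_dims[l], 1))
```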