• In addition to L2 regularization, another very powerful regularization technique is called "dropout."


  • Let's say you train a neural network like the one on the left and there's over-fitting.


  • Dropout Regularization


  • With dropout, what we're going to do is go through each of the layers of the network and set some probability of eliminating a node in the neural network.


  • Let's say that for each of these layers, we're going to do the following: for each node, toss a coin so that there's a 0.5 chance of keeping the node and a 0.5 chance of removing it.


  • So, after the coin tosses, maybe we'll decide to eliminate those nodes, and then what you do is actually remove all the outgoing links from those nodes as well.


  • So you end up with a much smaller, diminished network.


  • And then you do back propagation training.


  • This is done for one training example on this much-diminished network. Then, on different examples, you would toss a set of coins again, keep a different set of nodes, and drop out (eliminate) different nodes.


  • And so for each training example, you would train it using one of these smaller, diminished networks.


  • So, maybe it seems like a slightly crazy technique. But this actually works.


  • But you can imagine that, because you're training a much smaller network on each example, this ends up having a regularizing effect on the network (a conceptual sketch follows below).
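  • To make the coin-tossing idea concrete, here is a minimal conceptual sketch (not the lecture's code; the layer size, seed, and 0.5 keep probability are just illustrative values) that draws a separate random keep/drop decision for each hidden node on each training example:

    Python code:

    import numpy as np

    np.random.seed(1)                                   # illustrative seed for reproducibility
    n_nodes, n_examples = 4, 3                          # a tiny hidden layer and a few examples
    keep = np.random.rand(n_nodes, n_examples) < 0.5    # 0.5 chance of keeping each node

    # Each training example (column) ends up with a different thinned-out network.
    for i in range(n_examples):
        kept = np.where(keep[:, i])[0]
        print(f"example {i}: kept hidden nodes {kept.tolist()}")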






  • The most common implementation of dropout is a technique called inverted dropout.


  • Let's say we want to illustrate this with layer l = 3. So, in the code, we are going to write a few lines of Python.


  • The lines below illustrate how to represent dropout in a single layer.


  • In them, what we are going to do is set a vector d, so \( d^{[3]} \) is going to be the dropout vector for layer 3. \( d^{[3]} \) and \( a^{[3]} \) (the activations for layer 3) are going to have the same shape.


  • Then we initialize a variable keep_prob, which is the probability that a given hidden unit will be kept. So if keep_prob = 0.8, then there's a 0.2 chance of eliminating any hidden unit.


  • Python code:

    import numpy as np

    keep_prob = 0.8   # probability of keeping each hidden unit

    # d3: dropout mask for layer 3, same shape as the activations a3 (computed earlier)
    d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob

  • This generates a random matrix. So \( d^{[3]} \) will be a matrix.


  • The entries of the random matrix lie in [0, 1), and the comparison < keep_prob sets each entry of \( d^{[3]} \) that is less than 0.8 to true / 1 (which happens with probability 0.8) and the rest to false / 0.


  • And then what you are going to do is take your activations from the third layer, \( a^{[3]} \), and set \( a^{[3]} \) equal to the old \( a^{[3]} \) times \( d^{[3]} \), where the multiplication is element-wise. You can also write this as a3 *= d3 (in Python: a3 = np.multiply(a3, d3)).


  • What this does is, for every element of d3 that's equal to zero (and there was a 20% chance of each element being zero), the multiply operation ends up zeroing out the corresponding element of \( a^{[3]} \).


  • If you do this in Python, technically \( d^{[3]} \) will be a boolean array whose values are True and False, rather than one and zero.


  • But the multiply operation works and will interpret the true and false values as one and zero.
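  • As a tiny illustration of that point (a standalone demo, not part of the lecture code), multiplying by a boolean mask in NumPy behaves exactly like multiplying by ones and zeros:

    Python code:

    import numpy as np

    a = np.array([0.5, 0.2, 0.8])
    mask = np.array([True, False, True])   # boolean dropout-style mask
    print(a * mask)                        # [0.5 0.  0.8] -> False entries are zeroed out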


  • Then finally, we're going to take a3 and scale it up by dividing by 0.8, or really by our keep_prob parameter (the complete three-line snippet is sketched below).
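  • Putting the three steps together, a minimal sketch of inverted dropout for layer 3 might look like this (the stand-in shape for a3 is just for illustration; in a real network a3 would come from forward propagation):

    Python code:

    import numpy as np

    keep_prob = 0.8                                   # probability of keeping each hidden unit
    a3 = np.random.randn(50, 10)                      # stand-in for the layer-3 activations

    d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob   # dropout mask for layer 3
    a3 = np.multiply(a3, d3)                          # zero out roughly 20% of the units
    a3 = a3 / keep_prob                               # inverted dropout: keep E[a3] unchanged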


  • The reason we need this final step is as follows. Let's say, for the sake of argument, that you have 50 units or 50 neurons in the third hidden layer, so a3 is 50 by 1 dimensional.


  • So, if you have an 80% chance of keeping each activation and a 20% chance of eliminating it, this means that on average you end up with 10 units shut off or zeroed out.


  • And so now, if you look at the value of \( z^{[4]} \): \( z^{[4]} = w^{[4]} a^{[3]} + b^{[4]} \). On expectation, \( a^{[3]} \) will be reduced by 20%, by which I mean that 20% of the elements of \( a^{[3]} \) will be zeroed out.


  • So, in order to not reduce the expected value of \( z^{[4]} \), what you do is take \( a^{[3]} \) and divide it by 0.8, because this will correct or just bump \( a^{[3]} \) back up by roughly the 20% that you need, so that the expected value of \( a^{[3]} \) is not changed.
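  • Written out for a single element of \( a^{[3]} \) (using the 0.8 keep probability from this example, and treating the pre-dropout activation as fixed), the expectation calculation behind that correction is:

    \[
    \mathbb{E}\!\left[\frac{d^{[3]}_{ij}\, a^{[3]}_{ij}}{0.8}\right]
    = \frac{0.8 \cdot a^{[3]}_{ij} + 0.2 \cdot 0}{0.8}
    = a^{[3]}_{ij}
    \]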


  • And so this division is what's called the inverted dropout technique. Its effect is that, no matter what you set keep_prob to, whether it's 0.8, or 0.9, or even 1 (if it's set to 1 then there's no dropout, because everything is kept), or 0.5, or whatever, dividing by keep_prob ensures that the expected value of \( a^{[3]} \) remains the same.
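  • As a quick numerical sanity check of that claim (a sketch with made-up activations, not lecture code), the mean activation stays roughly the same for any choice of keep_prob once you divide by it:

    Python code:

    import numpy as np

    np.random.seed(0)
    a3 = np.random.rand(50, 1000)                     # stand-in activations: 50 units, 1000 examples

    for keep_prob in [0.5, 0.8, 0.9, 1.0]:
        d3 = np.random.rand(*a3.shape) < keep_prob
        a3_dropout = (a3 * d3) / keep_prob            # inverted dropout
        print(keep_prob, round(a3.mean(), 3), round(a3_dropout.mean(), 3))   # means stay close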


  • And it turns out that at test time, when you're trying to evaluate a neural network, this inverted dropout technique makes test time easier because you have less of a scaling problem.


  • By far the most common implementation of dropout today is inverted dropout.


  • At test time, you're given some x for which you want to make a prediction. Using our standard notation, we are going to use \( a^{[0]} \), the activations of the zeroth layer, to denote the test example x.


  • So what we're going to do at test time is not use dropout. In particular, you compute \( z^{[1]} = w^{[1]} a^{[0]} + b^{[1]} \), \( a^{[1]} = g^{[1]}(z^{[1]}) \), \( z^{[2]} = w^{[2]} a^{[1]} + b^{[2]} \), \( a^{[2]} = g^{[2]}(z^{[2]}) \), and so on, until you get to the last layer and make a prediction \( \hat{y} \).


  • But notice that at test time you're not using dropout explicitly; you're not tossing coins at random or flipping coins to decide which hidden units to eliminate (a sketch of this test-time forward pass follows below).
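  • A minimal sketch of that test-time forward pass for a two-layer network (the parameter names w1, b1, w2, b2 and the ReLU/sigmoid activation choices are assumptions for illustration, not the lecture's code):

    Python code:

    import numpy as np

    def predict(x, w1, b1, w2, b2):
        # Test-time forward pass: no dropout mask and no division by keep_prob.
        a0 = x                            # a[0] is just the test example x
        z1 = np.dot(w1, a0) + b1
        a1 = np.maximum(0, z1)            # g[1]: ReLU (assumed)
        z2 = np.dot(w2, a1) + b2
        a2 = 1 / (1 + np.exp(-z2))        # g[2]: sigmoid (assumed) -> y_hat
        return a2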


  • And that's because when you're making predictions at test time, you don't really want your output to be random. If you were implementing dropout at test time, that would just add noise to your predictions.


  • In theory, one thing you could do is run the prediction process many times with different hidden units randomly dropped out and average across them. But that's computationally inefficient, and it would give you roughly the same result as this simpler procedure anyway.
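  • Purely to illustrate that idea (the lecture does not recommend it), a sketch of averaging predictions over many random dropout masks could look like this, reusing the kind of two-layer forward pass assumed above:

    Python code:

    import numpy as np

    def predict_with_random_dropout(x, w1, b1, w2, b2, keep_prob=0.8):
        # Hypothetical forward pass that keeps dropout (with inverted scaling) at test time.
        a1 = np.maximum(0, np.dot(w1, x) + b1)
        d1 = np.random.rand(*a1.shape) < keep_prob
        a1 = (a1 * d1) / keep_prob
        return 1 / (1 + np.exp(-(np.dot(w2, a1) + b2)))

    def monte_carlo_predict(x, w1, b1, w2, b2, n_samples=100):
        # Run the random-dropout prediction many times and average: roughly the same
        # answer as the plain no-dropout forward pass, but far more expensive.
        preds = [predict_with_random_dropout(x, w1, b1, w2, b2) for _ in range(n_samples)]
        return np.mean(preds, axis=0)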


  • Also, you don't need to add in an extra funny scaling parameter at test time; that's different from what you have at training time. And when you implement this in this week's programming exercise, you'll gain more firsthand experience with it as well.