Other than convolutional layers, ConvNets often also use pooling layers to reduce the size of the representation, to speed up computation, and to make some of the features they detect a bit more robust.
Let's take a look. Let's go through an example of pooling, and then we'll talk about why you might want to do this. Suppose you have a four by four input, and you want to apply a type of pooling called max pooling.
And the output of this particular implementation of max pooling will be a two by two output.
And the way you do that is quite simple. Take your four by four input and break it into four different regions, and I'm going to color the four regions as follows.
And then, in the output, which is two by two, each of the outputs will just be the max from the corresponding shaded region.
So in the upper left, the max of those four numbers is nine. In the upper right, the max of the blue numbers is two. In the lower left, the biggest number is six, and in the lower right, the biggest number is three. So to compute each of the numbers on the right, we took the max over a two by two region.
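If it helps to see this concretely, here is a minimal NumPy sketch of that two by two max pooling. The input values are made up (the original grid isn't reproduced here), chosen so the region maxes come out to nine, two, six, and three:

```python
import numpy as np

# The 4x4 values below are made up so that the four region maxes come out
# to nine, two, six, and three, matching the numbers described above.
x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]])

f, s = 2, 2                          # filter size and stride
out = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        # take the max over the corresponding 2x2 region
        out[i, j] = x[i*s:i*s+f, j*s:j*s+f].max()

print(out)    # [[9. 2.]
              #  [6. 3.]]
```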
So, this is as if you apply a filter size of two, because you're taking two by two regions, and you're taking a stride of two.
So, these are actually the hyperparameters of max pooling. It's like a two by two region that gives you the nine. And then, you step over two steps to look at this region, to give you the two, and then for the next row, you step down two steps to give you the six, and then step to the right by two steps to give you the three.
So because the squares are two by two, f is equal to two, and because you stride by two, s is equal to two. So here's the intuition behind what max pooling is doing.
If you think of this four by four region as some set of features, the activations in some layer of the neural network, then a large number means that it has maybe detected a particular feature.
So, the upper left-hand quadrant has this particular feature. It might be a vertical edge, or maybe a whisker if you're trying to detect a cat.
Clearly, that feature exists in the upper left-hand quadrant.
So, what the max operation does is really say: if this feature is detected anywhere in this filter, then keep a high number.
But if this feature is not detected, so maybe this feature doesn't exist in the upper right-hand quadrant, then the max of all those numbers is still quite small.
So maybe that's the intuition behind max pooling.
But I have to admit, I think the main reason people use max pooling is because it's been found in a lot of experiments to work well, and as for the intuition I just described, despite it being often cited, I don't know that anyone fully knows if that is the real underlying reason that max pooling works well in ConvNets.
One interesting property of max pooling is that it has a set of hyperparameters but it has no parameters to learn.
There's actually nothing for gradient descent to learn.
Once you fix f and s, it's just a fixed computation and gradient descent doesn't change anything.
Let's go through an example with some different hyperparameters.
Here, suppose you have a five by five input, and we're going to apply max pooling with a filter size that's three by three. So f is equal to three, and let's use a stride of one.
So in this case, the output size is going to be three by three.
And the formulas we had developed in the previous section for figuring out the output size of a conv layer also work for max pooling.
So, that's \( n_W^{[l]} = \left\lfloor \frac{n_W^{[l-1]} + 2p^{[l]} - f^{[l]}}{s^{[l]}} + 1 \right\rfloor \) and \( n_H^{[l]} = \left\lfloor \frac{n_H^{[l-1]} + 2p^{[l]} - f^{[l]}}{s^{[l]}} + 1 \right\rfloor \).
That formula also works for figuring out the output size of max pooling.
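As a quick sanity check, here is that formula as a small Python helper (the function name is my own, just for illustration):

```python
def pool_output_size(n, f, s, p=0):
    # floor((n + 2p - f) / s) + 1, the same formula as for a conv layer
    return (n + 2 * p - f) // s + 1

print(pool_output_size(4, f=2, s=2))   # 2 -- the first example
print(pool_output_size(5, f=3, s=1))   # 3 -- this example
```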
And in this example, each of the elements of this three by three output is computed by taking the max over a three by three region of the input.
Now, so far, I've shown max pooling on a 2D input. If you have a 3D input, then the output will have the same number of channels.
So for example, if you have five by five by two, then the output will be three by three by two and the way you compute max pooling is you perform the computation we just described on each of the channels independently.
And more generally, if this was five by five by some number of channels, the output would be three by three by that same number of channels. And the max pooling computation is done independently on each of these \( n_C \) channels. So, that's max pooling.
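Here is one way you might write that channel-by-channel computation in NumPy, as a plain-loop sketch rather than an efficient implementation (the name max_pool is mine):

```python
import numpy as np

def max_pool(a, f, s):
    """Max pooling over an (n_H, n_W, n_C) volume, each channel handled independently."""
    n_h, n_w, n_c = a.shape
    out_h = (n_h - f) // s + 1       # no padding, so p = 0
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w, n_c))
    for i in range(out_h):
        for j in range(out_w):
            for c in range(n_c):
                out[i, j, c] = a[i*s:i*s+f, j*s:j*s+f, c].max()
    return out

a = np.random.randn(5, 5, 2)
print(max_pool(a, f=3, s=1).shape)   # (3, 3, 2)
```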
There is one other type of pooling that isn't used very often, but I'll mention it briefly, which is average pooling. It does pretty much what you'd expect, which is, instead of taking the max within each filter, you take the average.
So these days, max pooling is used much more often than average pooling, with one exception, which is sometimes very deep in a neural network you might use average pooling to collapse your representation from, say, 7 by 7 by 1,000 down to 1 by 1 by 1,000 by averaging over the 7 by 7 spatial positions.
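As a rough sketch of that collapsing step, assuming the 7 by 7 by 1,000 volume is stored as a NumPy array:

```python
import numpy as np

# Average each channel over its 7x7 spatial grid, collapsing
# 7 x 7 x 1000 down to 1 x 1 x 1000 (often called global average pooling).
a = np.random.randn(7, 7, 1000)
collapsed = a.mean(axis=(0, 1), keepdims=True)
print(collapsed.shape)   # (1, 1, 1000)
```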
So just to summarize, the hyperparameters for pooling are f, the filter size, and s, the stride, and common choices of hyperparameters might be f equals two, s equals two. This is used quite often, and it has the effect of roughly shrinking the height and width by a factor of two.
If you want, you can add an extra hyperparameter for the padding, although this is very, very rarely used. When you do max pooling, usually you do not use any padding, so the most common value of p by far is p equals zero.
So if the input to the pooling layer has dimensions \( n_W^{[l-1]} \times n_H^{[l-1]} \times n_C^{[l-1]} \), then after max pooling we get \( \left\lfloor \frac{n_W^{[l-1]} + 2p^{[l]} - f^{[l]}}{s^{[l]}} + 1 \right\rfloor \times \left\lfloor \frac{n_H^{[l-1]} + 2p^{[l]} - f^{[l]}}{s^{[l]}} + 1 \right\rfloor \times n_C^{[l-1]} \).
So the number of input channels is equal to the number of output channels, because pooling applies to each of your channels independently. One thing to note about pooling is that there are no parameters to learn. So, when we implement backprop, you find that there are no parameters that backprop will adapt through max pooling.
Instead, there are just these hyperparameters that you set once, maybe set once by hand or set using cross-validation. And then beyond that, you are done. It's just a fixed function that the neural network computes in one of the layers, and there is actually nothing to learn.
So, that's it for pooling. You now know how to build convolutional layers and pooling layers. In the next section, let's see a more complex example of a ConvNet, one that will also allow us to introduce fully connected layers.