In the last section, you saw how sampling at random, over the range of hyperparameters, can allow you to search over the space of hyperparameters more efficiently.
But it turns out that sampling at random doesn't mean sampling uniformly at random, over the range of valid values.
Instead, it's important to pick the appropriate scale on which to explore the hyperparameters.
In this section, I want to show you how to do that. Let's say that you're trying to choose the number of hidden units, n[l], for a given layer l.
And let's say that you think a good range of values is somewhere from 50 to 100. In that case, if you look at the number line from 50 to 100, picking values at random within this range is a perfectly reasonable way to search for this particular hyperparameter. Or suppose you're trying to decide on the number of layers in your neural network, which we're calling capital L.
Maybe you think the total number of layers should be somewhere between 2 and 4. Then sampling uniformly at random among the values 2, 3 and 4 might be reasonable.
Or even using a grid search, where you explicitly evaluate the values 2, 3 and 4 might be reasonable.
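As a minimal sketch of these two integer cases, assuming NumPy's random generator (the variable names n_hidden and num_layers are mine, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Uniform random sampling of hidden units in [50, 100].
n_hidden = int(rng.integers(50, 101))   # high end is exclusive, so 101

# Uniform random sampling of the number of layers in {2, 3, 4}.
num_layers = int(rng.integers(2, 5))

# Or an explicit grid search over the layer counts.
for L in [2, 3, 4]:
    pass  # train and evaluate a model with L layers here
```

Either approach works here because the candidate values span a narrow range on a linear scale.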
So these were a couple examples where sampling uniformly at random over the range you're contemplating, might be a reasonable thing to do.
But this is not true for all hyperparameters.
Let's look at another example. Say you're searching for the hyperparameter alpha, the learning rate. And let's say that you suspect 0.0001 might be on the low end, or maybe it could be as high as 1. Now suppose you draw the number line from 0.0001 to 1, and sample values uniformly at random over this number line.
Well about 90% of the values you sample would be between 0.1 and 1. So you're using 90% of the resources to search between 0.1 and 1, and only 10% of the resources to search between 0.0001 and 0.1.
So that doesn't seem right.
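You can verify this skew empirically; this is a quick check of the 90% figure, not part of the lecture's own code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample many learning rates uniformly at random on a linear scale.
samples = rng.uniform(0.0001, 1, size=100_000)

# Fraction landing in [0.1, 1]: about 0.9, since that interval
# covers ~90% of the length of [0.0001, 1].
frac_high = float(np.mean(samples >= 0.1))
print(f"{frac_high:.2f}")
```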
Instead, it seems more reasonable to search for hyperparameters on a log scale: rather than a linear scale, you'd have 0.0001 here, then 0.001, 0.01, 0.1, and then 1.
You then sample uniformly at random on this logarithmic scale. Now you have equal resources dedicated to searching between 0.0001 and 0.001, between 0.001 and 0.01, and so on. In Python, the way you implement this is to set r = -4 * np.random.rand().
And then the randomly chosen value of alpha would be alpha = 10**r.
So after this first line, r will be a random number between -4 and 0. And so alpha will be between 10^-4 and 10^0, that is, between 0.0001 and 1.
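Putting those two lines together as a self-contained snippet (using NumPy's newer Generator API in place of np.random.rand):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample the learning rate log-uniformly between 10^-4 and 10^0.
r = -4 * rng.random()   # r is uniform in [-4, 0]
alpha = 10 ** r         # alpha lies in [0.0001, 1], spread evenly per decade
```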
In the more general case, suppose you're trying to sample between 10^a and 10^b on the log scale.
What you do is sample r uniformly at random between a and b. So in this case, r would be between -4 and 0.
And you can set alpha, your randomly sampled hyperparameter value, to 10^r.
So just to recap, to sample on the log scale: take the low value and take its log base 10 to figure out what a is; take the high value and take its log to figure out what b is. You're now trying to sample from 10^a to 10^b on a log scale, so you pick r uniformly at random between a and b.
And then you set the hyperparameter to 10^r. So that's how you implement sampling on this logarithmic scale.
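The recipe above can be wrapped in a small helper; sample_log_scale is a hypothetical name of mine, not something defined in the lecture:

```python
import numpy as np

def sample_log_scale(low, high, rng=None):
    """Sample a value log-uniformly at random between low and high."""
    rng = rng or np.random.default_rng()
    a = np.log10(low)      # e.g. log10(0.0001) = -4
    b = np.log10(high)     # e.g. log10(1)      =  0
    r = rng.uniform(a, b)  # r is uniform in [a, b]
    return 10.0 ** r

alpha = sample_log_scale(0.0001, 1)
```

This gives each decade of the range an equal chance of being explored, which is exactly the behavior we wanted for the learning rate.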