• In this section we examine the relationship between the degree of the polynomial d and the underfitting or overfitting of our hypothesis.

  • We need to distinguish whether bias or variance is the problem contributing to bad predictions.

  • High bias is underfitting and high variance is overfitting. Ideally, we need to find a golden mean between these two.

  • The training error will tend to decrease as we increase the degree d of the polynomial.

  • At the same time, the cross validation error will tend to decrease as we increase d up to a point, and then it will increase as d is increased, forming a convex curve.

  • High bias (underfitting): both \( J_{train}(\Theta) \) and \( J_{CV}(\Theta) \) will be high. Also, \( J_{CV}(\Theta) \approx J_{train}(\Theta) \).

  • High variance (overfitting): \( J_{train}(\Theta) \) will be low and \( J_{CV}(\Theta) \) will be much greater than \( J_{train}(\Theta) \).

  • This is summarized in the figure below:

  • [Figure: bias-variance; training and cross validation error as a function of the polynomial degree d]
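  • The relationship between the degree d and the two errors can be reproduced numerically. The following is a minimal sketch, not a reference implementation: the synthetic sine data, the noise level, and the helper names (`solve`, `poly_features`, `fit`, `cost`, `make_data`) are all assumptions introduced for illustration. It fits polynomials of degree 1 to 8 by least squares and prints \( J_{train}(\Theta) \) and \( J_{CV}(\Theta) \) for each.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def poly_features(x, d):
    """Map a scalar x to [1, x, x^2, ..., x^d]."""
    return [x ** k for k in range(d + 1)]

def fit(xs, ys, d):
    """Least-squares fit of a degree-d polynomial via the normal equations."""
    X = [poly_features(x, d) for x in xs]
    n = d + 1
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    b = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(n)]
    return solve(A, b)

def cost(theta, xs, ys):
    """Squared-error cost J = 1/(2m) * sum((h(x) - y)^2)."""
    d = len(theta) - 1
    errors = [sum(t * f for t, f in zip(theta, poly_features(x, d))) - y
              for x, y in zip(xs, ys)]
    return sum(e * e for e in errors) / (2 * len(xs))

def make_data(m, rng):
    """Synthetic data (an assumption for this demo): sine curve plus noise."""
    xs = [rng.uniform(-1, 1) for _ in range(m)]
    ys = [math.sin(math.pi * x) + rng.gauss(0, 0.2) for x in xs]
    return xs, ys

rng = random.Random(0)
x_train, y_train = make_data(30, rng)
x_cv, y_cv = make_data(30, rng)

j_train, j_cv = {}, {}
for d in range(1, 9):
    theta = fit(x_train, y_train, d)
    j_train[d] = cost(theta, x_train, y_train)
    j_cv[d] = cost(theta, x_cv, y_cv)
    print(f"d={d}  J_train={j_train[d]:.4f}  J_cv={j_cv[d]:.4f}")

best_d = min(j_cv, key=j_cv.get)
print("degree with lowest cross validation error:", best_d)
```

  • Because each degree-d model contains the lower-degree ones, the training error cannot go up as d grows, while the cross validation error is free to rise again once the model starts fitting noise.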

  • [Figure: bias-variance; effect of the regularization parameter \( \lambda \) on the fit]

  • In the figure above, we see that as \( \lambda \) increases, our fit becomes more rigid.

  • On the other hand, as \( \lambda \) approaches 0, we tend to overfit the data.

  • So how do we choose our parameter \( \lambda \) to get it 'just right'? To choose the model and the regularization parameter \( \lambda \), we need to:

  • Create a list of lambdas (e.g. \( \lambda \in \{0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24\} \)).

  • Create a set of models with different degrees or any other variants.

  • Iterate through the \( \lambda \) and for each \( \lambda \) go through all the models to learn some \( \Theta \).

  • Compute the cross validation error \( J_{CV}(\Theta) \) using the learned \( \Theta \) (computed with \( \lambda \)), but evaluate it without the regularization term, i.e. with \( \lambda = 0 \).

  • Select the combination of model and \( \lambda \) that produces the lowest error on the cross validation set.

  • Using the best combination of \( \Theta \) and \( \lambda \), compute \( J_{test}(\Theta) \) to see whether it generalizes well to new data.
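  • The selection procedure above can be sketched as a grid search. This is only an illustration under stated assumptions: the models are polynomials of degree 2, 4 and 8, the data is a synthetic noisy sine curve, and the helpers (`solve`, `poly_features`, `fit`, `cost`, `make_data`) are hypothetical names written for this example. Training uses regularized least squares, while the training, cross validation and test errors are all evaluated without the regularization term.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def poly_features(x, d):
    """Map a scalar x to [1, x, x^2, ..., x^d]."""
    return [x ** k for k in range(d + 1)]

def fit(xs, ys, d, lam):
    """Regularized least squares: solve (X^T X + lam * L) theta = X^T y,
    where L is the identity with L[0][0] = 0 (the bias term is not penalized)."""
    X = [poly_features(x, d) for x in xs]
    n = d + 1
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    b = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(n)]
    for i in range(1, n):
        A[i][i] += lam
    return solve(A, b)

def cost(theta, xs, ys):
    """Unregularized squared-error cost J = 1/(2m) * sum((h(x) - y)^2)."""
    d = len(theta) - 1
    errors = [sum(t * f for t, f in zip(theta, poly_features(x, d))) - y
              for x, y in zip(xs, ys)]
    return sum(e * e for e in errors) / (2 * len(xs))

def make_data(m, rng):
    """Synthetic data (an assumption for this demo): sine curve plus noise."""
    xs = [rng.uniform(-1, 1) for _ in range(m)]
    ys = [math.sin(math.pi * x) + rng.gauss(0, 0.2) for x in xs]
    return xs, ys

rng = random.Random(0)
x_train, y_train = make_data(30, rng)
x_cv, y_cv = make_data(30, rng)
x_test, y_test = make_data(30, rng)

lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24]
degrees = [2, 4, 8]  # the candidate models (an assumption)

# results[(d, lam)] = (theta, J_train, J_cv); both errors exclude the penalty
results = {}
for lam in lambdas:
    for d in degrees:
        theta = fit(x_train, y_train, d, lam)
        results[(d, lam)] = (theta,
                             cost(theta, x_train, y_train),
                             cost(theta, x_cv, y_cv))

# Pick the combination with the lowest cross validation error
best_d, best_lam = min(results, key=lambda k: results[k][2])
best_theta = results[(best_d, best_lam)][0]
j_test = cost(best_theta, x_test, y_test)
print(f"best degree={best_d}, best lambda={best_lam}, J_test={j_test:.4f}")
```

  • The test set is touched only once, after the choice has been made, so \( J_{test}(\Theta) \) is not biased by the selection itself.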

  • Training an algorithm on very few data points (such as 1, 2 or 3) will easily give 0 error, because we can always find a quadratic curve that passes exactly through that many points. Hence:

  • As the training set gets larger, the error for a quadratic function increases.

  • The error value will plateau after a certain training set size m.

  • Experiencing high bias:

  • Low training set size: causes \( J_{train}(\Theta) \) to be low and \( J_{CV}(\Theta) \) to be high.

  • Large training set size: causes both \( J_{train}(\Theta) \) and \( J_{CV}(\Theta) \) to be high with \( J_{train}(\Theta) \approx J_{CV}(\Theta) \).

  • If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much.

  • [Figure: bias-variance; learning curve for high bias]

  • Experiencing high variance:

  • Low training set size: \( J_{train}(\Theta) \) will be low and \( J_{CV}(\Theta) \) will be high.

  • Large training set size: \( J_{train}(\Theta) \) increases with training set size and \( J_{CV}(\Theta) \) continues to decrease without leveling off. Also, \( J_{train}(\Theta) < J_{CV}(\Theta) \) but the difference between them remains significant.

  • If a learning algorithm is suffering from high variance, getting more training data is likely to help.

  • [Figure: bias-variance; learning curve for high variance]
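  • A learning curve of this kind can be computed by training on the first m examples for growing m and recording both errors. The sketch below is illustrative only: it assumes a quadratic model on synthetic noisy sine data, and `learning_curve` and the other helpers are hypothetical names made up for this example. The training inputs come from a fixed grid visited in a scrambled order, so that all points are distinct.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def poly_features(x, d):
    """Map a scalar x to [1, x, x^2, ..., x^d]."""
    return [x ** k for k in range(d + 1)]

def fit(xs, ys, d):
    """Least-squares fit of a degree-d polynomial via the normal equations."""
    X = [poly_features(x, d) for x in xs]
    n = d + 1
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    b = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(n)]
    return solve(A, b)

def cost(theta, xs, ys):
    """Squared-error cost J = 1/(2m) * sum((h(x) - y)^2)."""
    d = len(theta) - 1
    errors = [sum(t * f for t, f in zip(theta, poly_features(x, d))) - y
              for x, y in zip(xs, ys)]
    return sum(e * e for e in errors) / (2 * len(xs))

def learning_curve(x_train, y_train, x_cv, y_cv, d):
    """Return a list of (m, J_train, J_cv) for training subsets of size m.
    Starts at m = d + 1, the smallest size with a unique least-squares fit."""
    curve = []
    for m in range(d + 1, len(x_train) + 1):
        theta = fit(x_train[:m], y_train[:m], d)
        curve.append((m,
                      cost(theta, x_train[:m], y_train[:m]),  # error on the m points used
                      cost(theta, x_cv, y_cv)))               # error on the full CV set
    return curve

rng = random.Random(0)
# 40 distinct grid points in [-1, 1], visited in a scrambled order
x_train = [-1 + 2 * ((i * 17) % 40) / 39 for i in range(40)]
y_train = [math.sin(math.pi * x) + rng.gauss(0, 0.2) for x in x_train]
x_cv = [rng.uniform(-1, 1) for _ in range(40)]
y_cv = [math.sin(math.pi * x) + rng.gauss(0, 0.2) for x in x_cv]

curve = learning_curve(x_train, y_train, x_cv, y_cv, d=2)
for m, jt, jcv in curve[::6]:
    print(f"m={m:2d}  J_train={jt:.4f}  J_cv={jcv:.4f}")
```

  • With d=2 the first entry uses exactly three points, so the quadratic passes through them and the training error is essentially zero. Calling `learning_curve` with a larger d is one way to explore the high variance case.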

  • Our decision process can be broken down as follows:

    1. Getting more training examples: Fixes high variance

    2. Trying smaller sets of features: Fixes high variance

    3. Adding features: Fixes high bias

    4. Adding polynomial features: Fixes high bias

    5. Decreasing λ: Fixes high bias

    6. Increasing λ: Fixes high variance
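  • As a rough illustration, the decision process can be written as a small lookup. This is a heuristic sketch, not a rule from the source: the `acceptable_error` threshold and the function name `diagnose` are assumptions, and in practice the diagnosis is usually made by inspecting the learning curves rather than by a single cutoff.

```python
# Remedies taken from the list above, grouped by the problem they address
REMEDIES = {
    "high bias": [
        "add features",
        "add polynomial features",
        "decrease lambda",
    ],
    "high variance": [
        "get more training examples",
        "try a smaller set of features",
        "increase lambda",
    ],
    "ok": [],
}

def diagnose(j_train, j_cv, acceptable_error):
    """Classify a model from its training and cross validation errors.

    acceptable_error is a hypothetical, problem-specific threshold:
    - training error above it           -> the model underfits (high bias)
    - training error fine, CV error not -> the model overfits (high variance)
    """
    if j_train > acceptable_error:
        return "high bias"
    if j_cv > acceptable_error:
        return "high variance"
    return "ok"

# Example values are made up for illustration
for j_train, j_cv in [(0.50, 0.55), (0.01, 0.50), (0.05, 0.06)]:
    label = diagnose(j_train, j_cv, acceptable_error=0.1)
    print(j_train, j_cv, label, REMEDIES[label])
```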

  • Diagnosing Neural Networks

    1. A neural network with fewer parameters is prone to underfitting. It is also computationally cheaper.

    2. A large neural network with more parameters is prone to overfitting. It is also computationally expensive. In this case you can use regularization (increase λ) to address the overfitting.

    3. Using a single hidden layer is a good starting default.

    4. You can train neural networks with different numbers of hidden layers and evaluate each one on your cross validation set. You can then select the one that performs best.

  • Lower-order polynomials (low model complexity) have high bias and low variance.

  • In this case, the model fits poorly consistently.

  • Higher-order polynomials (high model complexity) fit the training data extremely well and the test data extremely poorly.

  • These have low bias on the training data, but very high variance.

  • In reality, we would want to choose a model somewhere in between, that can generalize well but also fits the data reasonably well.