Introductory Video





  • The Machine Learning course includes several programming assignments which you’ll need to finish to complete the course. The assignments require the Octave scientific computing language.


  • Octave is a free, open-source application available for many platforms. It has a text interface and an experimental graphical one. Octave is distributed under the GNU General Public License, which means that it is always free to download and distribute.


  • Use the Download link to install Octave for Windows. Warning: do not install Octave 4.0.0.


  • Installing Octave on GNU/Linux: on Ubuntu, you can use: sudo apt-get update && sudo apt-get install octave. On Fedora, you can use: sudo yum install octave-forge






  • In this exercise, you will implement linear regression and get to see it work on data.


  • To get started with the exercise, you will need to download the exercise files from the Download link and unzip their contents to the directory where you wish to complete the exercise.


  • If needed, use the cd command in Octave/MATLAB to change to this directory before starting this exercise.
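
  • As a minimal sketch of that step, the commands below change into a working directory and confirm the move. The temporary directory returned by tempdir is only a stand-in so the snippet runs anywhere; substitute the folder where you unzipped the exercise.

```octave
% Stand-in for the folder where the exercise was unzipped (an assumption
% for illustration only; replace it with your own path).
exercise_dir = tempdir();

old_dir = pwd();        % remember the starting directory
cd(exercise_dir);       % change to the exercise directory
current_dir = pwd()     % display the directory Octave is now using
cd(old_dir);            % return to the starting directory
```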


  • You can also find instructions for installing Octave/MATLAB in the “Environment Setup Instructions” of the course website.






  • ex1_multi.m - Octave/MATLAB script for the later parts of the exercise


  • ex1data2.txt - Dataset for linear regression with multiple variables


  • warmUpExercise.m - Simple example function in Octave/MATLAB


  • plotData.m - Function to display the dataset


  • computeCostMulti.m - Cost function for multiple variables


  • gradientDescentMulti.m - Gradient descent for multiple variables


  • featureNormalize.m - Function to normalize features


  • normalEqn.m - Function to compute the normal equations


  • Throughout the exercise, you will be using the script ex1_multi.m. This script sets up the dataset for the problems and makes calls to functions that you will write. You do not need to modify it. You are only required to modify functions in other files, by following the instructions in this assignment.


  • For this programming exercise, you are only required to complete the first part of the exercise, which implements linear regression with one variable.






  • The ex1_multi.m script will start by loading and displaying some values from this dataset. By looking at the values, note that house sizes are about 1000 times the number of bedrooms. When features differ by orders of magnitude, first performing feature scaling can make gradient descent converge much more quickly.


  • Your task here is to complete the code in featureNormalize.m to:


    1. Subtract the mean value of each feature from the dataset.


    2. After subtracting the mean, additionally scale (divide) the feature values by their respective "standard deviations."


  • The standard deviation is a way of measuring how much variation there is in the range of values of a particular feature (most data points will lie within ± 2 standard deviations of the mean); this is an alternative to taking the range of values (max-min).


  • In Octave/MATLAB, you can use the "std" function to compute the standard deviation.


  • For example, inside featureNormalize.m, the quantity X(:,1) contains all the values of x1 (house sizes) in the training set, so std(X(:,1)) computes the standard deviation of the house sizes.
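
  • To make the definition concrete, the sketch below compares std against a hand-written sample standard deviation and against the range (max-min). The vector x holds illustrative values, not data from the exercise, and the variable names are assumptions made for this example.

```octave
x = [2; 4; 4; 6];                 % illustrative feature values (not from ex1data2.txt)
n = numel(x);

% Sample standard deviation written out by hand (divides by n - 1).
manual_std = sqrt(sum((x - mean(x)) .^ 2) / (n - 1));

builtin_std = std(x)              % std uses the n - 1 normalization by default
value_range = max(x) - min(x)     % the alternative measure mentioned above
```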


  • At the time that featureNormalize.m is called, the extra column of 1's corresponding to x0 = 1 has not yet been added to X (see ex1_multi.m for details).


  • You will do this for all the features and your code should work with datasets of all sizes (any number of features / examples). Note that each column of the matrix X corresponds to one feature.
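
  • Because each column is one feature, mean and std can be applied to the whole matrix at once. The sketch below uses a small illustrative matrix (the values are assumptions, not exercise data) to show that the results are row vectors with one entry per column.

```octave
X = [1 10; 2 20; 3 60];   % illustrative data: 3 examples, 2 features

mu = mean(X)              % 1 x 2 row vector: one mean per column
sigma = std(X)            % 1 x 2 row vector: one standard deviation per column

% The first entry matches the single-column computation.
sigma_first = std(X(:, 1));
```

    One caveat worth knowing: if X has only a single row, mean(X) and std(X) operate along that row instead. Passing the dimension explicitly, as in mean(X, 1) and std(X, 0, 1), forces column-wise behavior.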


  • Implementation Note: When normalizing the features, it is important to store the values used for normalization - the mean value and the standard deviation used for the computations.


  • After learning the parameters from the model, we often want to predict the prices of houses we have not seen before.


  • Given a new x value (living room area and number of bedrooms), we must first normalize x using the mean and standard deviation that we had previously computed from the training set.
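
  • A minimal sketch of that prediction-time step, assuming mu and sigma were stored when the training set was normalized. The matrix X and the example x_new are illustrative values, and the name x_new_with_bias is hypothetical.

```octave
X = [1 2; 3 4; 5 9];        % illustrative training features (no column of ones yet)
mu = mean(X);               % stored at training time
sigma = std(X);             % stored at training time

x_new = [4 5];              % a new, unseen example with the same two features

% Reuse the training-set statistics; do not recompute them from x_new.
x_new_norm = (x_new - mu) ./ sigma

% Add the x0 = 1 term only after normalizing, so it is left unscaled.
x_new_with_bias = [1, x_new_norm];
```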






  • You can use the mean() and std() functions to get the mean and standard deviation for each column of X. These are returned as row vectors (1 x n).


  • Now you want to apply those values to each element in every row of the X matrix. One way to do this is to duplicate these vectors for each row in X, so they're the same size.


  • One method to do this is to create a column vector of all-ones - size (m x 1) - and multiply it by the mu or sigma row vector (1 x n). Dimensionally, (m x 1) * (1 x n) gives you a (m x n) matrix, and every row of the resulting matrix will be identical.


  • Now that X and the duplicated mu and sigma matrices are all the same size, you can use element-wise operators to compute X_normalized.


  • Try these commands in your workspace:


  • X = [1 2 3; 4 5 6]        % creates a test matrix
    mu = mean(X)              % returns a row vector
    sigma = std(X)            % returns a row vector
    m = size(X, 1)            % returns the number of rows in X
    mu_matrix = ones(m, 1) * mu  
    sigma_matrix = ones(m, 1) * sigma
    

  • Now you can subtract the mu matrix from X, and divide element-wise by the sigma matrix, and arrive at X_normalized.
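
  • Putting the commands above together, one possible way to finish the computation looks like this. It is a sketch on the same test matrix, not the only valid implementation.

```octave
X = [1 2 3; 4 5 6];                    % same test matrix as above
mu = mean(X);                          % 1 x n row vector of column means
sigma = std(X);                        % 1 x n row vector of column standard deviations
m = size(X, 1);                        % number of rows (examples)

mu_matrix = ones(m, 1) * mu;           % m x n, every row equal to mu
sigma_matrix = ones(m, 1) * sigma;     % m x n, every row equal to sigma

% Element-wise subtraction and division now line up entry by entry.
X_normalized = (X - mu_matrix) ./ sigma_matrix
```

    After normalization each column has mean 0 and standard deviation 1, which is a quick sanity check for your own implementation.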






  •   
    % ---------------
    [Xn mu sigma] = featureNormalize([1 ; 2 ; 3])
    
    % result
    
    Xn =
      -1
       0
       1
    
    mu =  2
    sigma =  1
    
    %----------------
    [Xn mu sigma] = featureNormalize(magic(3))
    
    % result
    
    Xn =
       1.13389  -1.00000   0.37796
      -0.75593   0.00000   0.75593
      -0.37796   1.00000  -1.13389
    
    mu =
       5   5   5
    sigma =
       2.6458   4.0000   2.6458
    
    %--------------
    [Xn mu sigma] = featureNormalize([-ones(1,3); magic(3)])
    
    % results
    
    Xn =
      -1.21725  -1.01472  -1.21725
       1.21725  -0.56373   0.67625
      -0.13525   0.33824   0.94675
       0.13525   1.24022  -0.40575
    
    mu =
       3.5000   3.5000   3.5000
    
    sigma =
       3.6968   4.4347   3.6968
    
    





  • Previously, you implemented gradient descent on a univariate regression problem.


  • The only difference now is that there is one more feature in the matrix X.


  • The hypothesis function and the batch gradient descent update rule remain unchanged.
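
  • For reference, the unchanged hypothesis and update rule can be written in vectorized form as follows. The notation is the standard one and is an assumption here: X is the (m x n) matrix of examples, y the (m x 1) vector of targets, theta the (n x 1) parameter vector, and alpha the learning rate.

```latex
\begin{align}
h &= X\theta \\
\theta &:= \theta - \frac{\alpha}{m}\, X^{T} (X\theta - y)
\end{align}
```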


  • You should complete the code in computeCostMulti.m and gradientDescentMulti.m to implement the cost function and gradient descent for linear regression with multiple variables.


  • If your code in the previous part (single variable) already supports multiple variables, you can use it here too.


  • Make sure your code supports any number of features and is well-vectorized.


  • You can use size(X, 2) to find out how many features are present in the dataset.
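
  • A short sketch of how size can drive the dimensions of the other variables, so that nothing is hard-coded. The matrix is illustrative; note that size(X, 2) counts every column of X, including a column of ones if one has already been added.

```octave
X = [2 1 3; 7 1 9; 1 8 1; 3 7 4];   % illustrative matrix: 4 examples, 3 columns

m = size(X, 1)                      % number of training examples (rows)
n = size(X, 2)                      % number of columns in X

theta = zeros(n, 1);                % a parameter vector sized to match X
```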






  • With a text editor (NOT a word processor), open up the computeCost.m file. Scroll down until you find the "====== YOUR CODE HERE =====" section. Below this section is where you're going to add your lines of code. Just skip over the lines that start with the '%' sign - those are instructive comments.


  • The first line of code will compute a vector 'h' containing all of the hypothesis values - one for each training example (i.e. for each row of X).


  • The hypothesis (also called the prediction) is simply the product of X and theta. So your first line of code is...


  • h = {multiply X and theta, in the proper order so that the inner dimensions match}
    

  • Since X is size (m x n) and theta is size (n x 1), you arrange the order of the operands so the result is size (m x 1).


  • The second line of code will compute the difference between the hypothesis and y - that's the error for each training example. Difference means subtract.


  • error = {the difference between h and y}
    

  • The third line of code will compute the square of each of those error terms (using element-wise exponentiation).


  • An example of using element-wise exponentiation - try this in your workspace command line so you see how it works.


  • v = [-2 3]
    v_sqr = v.^2
    

  • So, now you should compute the squares of the error terms:


  • error_sqr = {use what you have learned}
    

  • Next, here's an example of how the sum function works (try this from your command line)


  • q = sum([1 2 3])
    

  • Now, we'll finish the last two steps all in one line of code. You need to compute the sum of the error_sqr vector, and scale the result (multiply) by 1/(2*m). That completed sum is the cost value J.


  • J = {multiply 1/(2*m) times the sum of the error_sqr vector}
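
  • The steps above can be sketched end to end as a standalone script. This is one possible way to fill in the placeholders, not the official solution; the data values are illustrative, and the variable is named err rather than error to avoid shadowing Octave's built-in error function.

```octave
X = [2 1 3; 7 1 9; 1 8 1; 3 7 4];   % illustrative (m x n) matrix
y = [2; 5; 5; 6];                   % illustrative (m x 1) targets
theta = [0.4; 0.6; 0.8];            % illustrative (n x 1) parameters
m = length(y);                      % number of training examples

h = X * theta;                      % hypothesis: (m x n) * (n x 1) = (m x 1)
err = h - y;                        % error for each training example
err_sqr = err .^ 2;                 % element-wise square
J = 1 / (2 * m) * sum(err_sqr)      % cost value; displays J = 5.2950
```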
     





  •   
    X = [ 2 1 3; 7 1 9; 1 8 1; 3 7 4 ];
    y = [2 ; 5 ; 5 ; 6];
    theta_test = [0.4 ; 0.6 ; 0.8];
    computeCostMulti( X, y, theta_test )
    % result
    ans =  5.2950
     









  • Perform all of these steps within the provided for-loop from 1 to the number of iterations. Note that the code template provides you this for-loop - you only have to complete the body of the for-loop. The steps below go immediately below where the script template says "======= YOUR CODE HERE ======".


  • The hypothesis is a vector, formed by multiplying the X matrix and the theta vector. X has size (m x n), and theta is (n x 1), so the product is (m x 1). That's good, because it's the same size as 'y'. Call this hypothesis vector 'h'.


  • The "errors vector" is the difference between the 'h' vector and the 'y' vector.


  • The change in theta (the "gradient") is the sum of the product of X and the "errors vector", scaled by alpha and 1/m. Since X is (m x n), the errors vector is (m x 1), and the result you want is the same size as theta (which is (n x 1)), you need to transpose X before you can multiply it by the errors vector.


  • The vector multiplication automatically includes calculating the sum of the products.
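
  • To see why no explicit sum is needed, the sketch below compares the matrix product with a loop that sums the products column by column. The errors values are made up for illustration.

```octave
X = [2 1 3; 7 1 9; 1 8 1; 3 7 4];   % illustrative (m x n) matrix
errors = [1; -2; 0.5; 3];           % made-up (m x 1) errors vector

% Vectorized: (n x m) * (m x 1) gives an (n x 1) result.
vectorized = X' * errors;

% Equivalent loop: for each column, multiply element-wise and sum.
n = size(X, 2);
looped = zeros(n, 1);
for j = 1:n
  looped(j) = sum(X(:, j) .* errors);
end

difference = max(abs(vectorized - looped))   % zero up to rounding
```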


  • When you're scaling by alpha and 1/m, be sure you use enough sets of parentheses to get the factors correct.


  • Subtract this "change in theta" from the original value of theta. A line of code like this will do it:


  • theta = theta - theta_change;
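
  • The loop body described above can be sketched as a standalone script. This is a hedged illustration rather than the official solution: the data, alpha, and iteration count are illustrative, and the cost is recorded inline after each update instead of through the template's cost function.

```octave
X = [2 1 3; 7 1 9; 1 8 1; 3 7 4];   % illustrative (m x n) matrix
y = [2; 5; 5; 6];                   % illustrative (m x 1) targets
theta = zeros(3, 1);                % initial parameters
alpha = 0.01;                       % learning rate
num_iters = 10;                     % number of iterations
m = length(y);
J_hist = zeros(num_iters, 1);       % cost after each update

for iter = 1:num_iters
  h = X * theta;                               % (m x 1) hypothesis vector
  errors = h - y;                              % (m x 1) errors vector
  theta_change = (alpha / m) * (X' * errors);  % (n x 1) scaled gradient
  theta = theta - theta_change;                % update all parameters at once
  J_hist(iter) = sum((X * theta - y) .^ 2) / (2 * m);
end

theta
J_hist
```

    The parentheses around (alpha / m) and (X' * errors) keep the scaling unambiguous, and theta is updated in a single assignment so that every parameter uses the same errors vector.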
    





  • gradientDescentMulti() with zeros for initial_theta


  • X = [ 2 1 3; 7 1 9; 1 8 1; 3 7 4 ];
    y = [2 ; 5 ; 5 ; 6];
    [theta J_hist] = gradientDescentMulti(X, y, zeros(3,1), 0.01, 10);
    
    % results
    
    >> theta
    theta =
    
       0.25175
       0.53779
       0.32282
    
    >> J_hist
    J_hist =
    
       2.829855
       0.825963
       0.309163
       0.150847
       0.087853
       0.055720
       0.036678
       0.024617
       0.016782
       0.011646
    
    >>
    


  • gradientDescentMulti() with non-zero initial_theta


  • X = [ 2 1 3; 7 1 9; 1 8 1; 3 7 4 ];
    y = [2 ; 5 ; 5 ; 6];
    [theta J_hist] = gradientDescentMulti(X, y, [0.1 ; -0.2 ; 0.3], 0.01, 10);
    
    % results
    >> theta
    theta =
    
       0.18556
       0.50436
       0.40137
    
    >> J_hist
    J_hist =
    
       3.632547
       1.766095
       1.021517
       0.641008
       0.415306
       0.272296
       0.179384
       0.118479
       0.078429
       0.052065
    
    >>