• Why normalization? Normalization is the process of transforming the values of several variables into a similar range. Typical normalizations include scaling a variable so its average is 0, scaling it so its variance is 1, or scaling it so its values range from 0 to 1.


  • Example: To demonstrate normalization, let's say we want to scale the columns "length", "width" and "height".


  • Target: we would like to normalize those variables so their values range from 0 to 1.


  • Approach: replace each original value by (original value)/(maximum value).


  • 
    # replace (original value) by (original value)/(maximum value)
    df['length'] = df['length']/df['length'].max()
    df['width'] = df['width']/df['width'].max()
     
    
    
    # apply the same scaling to "height"
    df['height'] = df['height']/df['height'].max()
     
    
  • We have now normalized "length", "width" and "height" to the range [0, 1].
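
  • The three kinds of normalization mentioned at the start can be compared side by side. This is a minimal sketch on a made-up column (the `toy` data frame and its values are assumptions, not part of the dataset). Note that dividing by the maximum only lands in [0, 1] when the values are non-negative, and it does not map the minimum to 0; min-max scaling does.

```python
import pandas as pd

# Toy data; the values are made up for illustration only.
toy = pd.DataFrame({"length": [150.0, 175.0, 200.0]})
col = toy["length"]

# Simple feature scaling: divide by the maximum, so the largest value becomes 1.
simple = col / col.max()

# Min-max scaling: the smallest value maps to 0 and the largest to 1.
min_max = (col - col.min()) / (col.max() - col.min())

# Z-score: subtract the mean and divide by the sample standard deviation,
# which gives a mean of 0 and a sample variance of 1.
z_score = (col - col.mean()) / col.std()

print(pd.DataFrame({"simple": simple, "min_max": min_max, "z_score": z_score}))
```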


  • Why binning? Binning is a process of transforming continuous numerical variables into discrete categorical 'bins', for grouped analysis.


  • Example: In our dataset, "horsepower" is a real-valued variable ranging from 48 to 288, with 57 unique values.


  • What if we only care about the price difference between cars with high horsepower, medium horsepower, and low horsepower (3 types)? Can we rearrange them into three 'bins' to simplify analysis?


  • We will use the pandas method 'cut' to segment the 'horsepower' column into 3 bins.


  • Example of Binning Data In Pandas


  • Convert data to correct format


  • 
    df["horsepower"]=df["horsepower"].astype(int, copy=True)
     
    
  • Let's plot a histogram of horsepower to see what its distribution looks like.


  • 
    %matplotlib inline
    import matplotlib.pyplot as plt

    # histogram of horsepower with matplotlib's default number of bins
    plt.hist(df["horsepower"])
    
    # set x/y labels and plot title
    plt.xlabel("horsepower")
    plt.ylabel("count")
    plt.title("horsepower bins")
     
    
  • We would like 3 bins of equal width, so we use numpy's linspace(start_value, end_value, numbers_generated) function.


  • Since we want to include the minimum value of horsepower, we set start_value=min(df["horsepower"]).


  • Since we want to include the maximum value of horsepower, we set end_value=max(df["horsepower"]).


  • Since we are building 3 bins of equal width, there should be 4 dividers, so numbers_generated=4.


  • We build a bin array that runs from the minimum value to the maximum value in equally spaced steps. The values in the array mark where one bin ends and the next begins.


  • 
    import numpy as np

    # 4 equally spaced dividers between the minimum and maximum give 3 bins
    bins = np.linspace(min(df["horsepower"]), max(df["horsepower"]), 4)
    bins
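
  • To make the divider arithmetic concrete, here is a small sketch that plugs in the 48 to 288 range quoted earlier instead of reading it from the data frame. The names `hp_min`, `hp_max`, `toy_bins` and `bin_width` are illustrative only.

```python
import numpy as np

# Bounds taken from the horsepower range quoted in the text (48 to 288).
hp_min, hp_max = 48, 288

# 4 evenly spaced dividers delimit 3 equal-width bins.
toy_bins = np.linspace(hp_min, hp_max, 4)

# Each bin spans one third of the full range.
bin_width = (hp_max - hp_min) / 3

print(toy_bins)
print(bin_width)
```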
     
    
  • We set group names:


  • 
    group_names = ['Low', 'Medium', 'High']
     
    
  • We apply the function "cut" to determine which bin each value of "df['horsepower']" belongs to.


  • 
    df['horsepower-binned'] = pd.cut(df['horsepower'], bins, labels=group_names, include_lowest=True)
    
    
    df[['horsepower','horsepower-binned']].head(20)
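
  • The role of include_lowest is easiest to see on a few hand-picked values. This is a sketch on a made-up series (`toy_hp` and the other `toy_` names are assumptions for illustration): the intervals produced by cut are closed on the right, so without include_lowest=True the minimum value falls outside the first bin.

```python
import numpy as np
import pandas as pd

# Made-up horsepower values that sit on and around the bin edges.
toy_hp = pd.Series([48, 100, 128, 129, 208, 288])
toy_edges = np.linspace(toy_hp.min(), toy_hp.max(), 4)  # 48, 128, 208, 288
toy_labels = ["Low", "Medium", "High"]

# Bins are right-closed: (48, 128], (128, 208], (208, 288].
# include_lowest=True also closes the first bin on the left,
# so the minimum (48) is labelled "Low".
with_lowest = pd.cut(toy_hp, toy_edges, labels=toy_labels, include_lowest=True)

# Without it, the minimum matches no interval and becomes NaN.
without_lowest = pd.cut(toy_hp, toy_edges, labels=toy_labels)

print(pd.DataFrame({"hp": toy_hp, "with": with_lowest, "without": without_lowest}))
```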
     
    
  • Let's see the number of vehicles in each bin.


  • 
    df["horsepower-binned"].value_counts()
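
  • One thing to keep in mind when reading or plotting these counts: value_counts() orders its result by frequency, not by bin order. The sketch below uses a made-up categorical series (`toy_binned` is an assumption for illustration) to show one way of restoring the Low/Medium/High order.

```python
import pandas as pd

toy_labels = ["Low", "Medium", "High"]

# Made-up bin assignments in which "Medium" is the most frequent label.
toy_binned = pd.Series(
    pd.Categorical(
        ["Medium", "Low", "Medium", "High", "Medium", "Low"],
        categories=toy_labels,
        ordered=True,
    )
)

# Default behaviour: most frequent label first.
by_count = toy_binned.value_counts()

# Reindexing by the label list puts the counts back in bin order.
by_label = by_count.reindex(toy_labels)

print(by_count)
print(by_label)
```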
     
    
  • Let's plot the distribution of each bin.


  • 
    %matplotlib inline
    import matplotlib.pyplot as plt

    # value_counts() sorts by frequency, so reorder the counts to match group_names
    counts = df["horsepower-binned"].value_counts().reindex(group_names)
    plt.bar(group_names, counts)
    
    # set x/y labels and plot title
    plt.xlabel("horsepower")
    plt.ylabel("count")
    plt.title("horsepower bins")
    
     
    
  • Check the dataframe above carefully: the last column provides the bins for "horsepower" with 3 categories ("Low", "Medium" and "High").


  • We have successfully reduced 57 unique values to 3 categories!


  • Normally, a histogram is used to visualize the distribution of bins we created above.


  • 
    
    %matplotlib inline
    import matplotlib.pyplot as plt
    
    # draw histogram of attribute "horsepower" with bins = 3
    plt.hist(df["horsepower"], bins=3)
    
    # set x/y labels and plot title
    plt.xlabel("horsepower")
    plt.ylabel("count")
    plt.title("horsepower bins")
     
    
  • What is an indicator variable? An indicator variable (or dummy variable) is a numerical variable used to label categories. They are called 'dummies' because the numbers themselves don't have inherent meaning.


  • Why do we use indicator variables? So that we can use categorical variables for regression analysis in the later modules.


  • Example: The column "fuel-type" has two unique values, "gas" and "diesel". Regression doesn't understand words, only numbers. To use this attribute in regression analysis, we convert "fuel-type" into indicator variables.


  • We will use the pandas method 'get_dummies' to assign numerical values to the different categories of fuel type.


  • 
    df.columns
     
    
  • Get the indicator variables and assign them to the data frame "dummy_variable_1".


  • 
    dummy_variable_1 = pd.get_dummies(df["fuel-type"])
    dummy_variable_1.head()
     
    
    
    # change column names for clarity
    # (get_dummies names the new columns after the category values, "diesel" and "gas")
    dummy_variable_1.rename(columns={'gas':'fuel-type-gas', 'diesel':'fuel-type-diesel'}, inplace=True)
    dummy_variable_1.head()
     
    
  • Each fuel type now has its own indicator column: the diesel column holds 1 for diesel cars and 0 for gas cars, and the gas column holds the reverse. We will now insert these columns back into our original dataset.


  • 
    # merge data frame "df" and "dummy_variable_1" 
    df = pd.concat([df, dummy_variable_1], axis=1)
    
    
    # drop original column "fuel-type" from "df"
    df.drop("fuel-type", axis = 1, inplace=True)
    df.head()
    
    # get indicator variables of aspiration and assign it to data frame "dummy_variable_2"
    dummy_variable_2 = pd.get_dummies(df['aspiration'])
    
    # change column names for clarity
    dummy_variable_2.rename(columns={'std':'aspiration-std', 'turbo': 'aspiration-turbo'}, inplace=True)
    
    # show first 5 rows of data frame "dummy_variable_2"
    dummy_variable_2.head()
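
  • The whole indicator-variable workflow can also be sketched on a tiny made-up frame. The `toy` data and its prices are assumptions for illustration. Two optional get_dummies arguments are used here that the cells above do not use: `prefix` names the new columns directly (with an underscore separator by default), and `dtype=int` forces 0/1 output, since recent pandas versions return True/False by default.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "fuel-type": ["gas", "diesel", "gas"],
    "price": [13495, 16500, 13950],
})

# One indicator column per category, named "fuel-type_diesel" and "fuel-type_gas".
toy_dummies = pd.get_dummies(toy["fuel-type"], prefix="fuel-type", dtype=int)

# Append the indicator columns and drop the original text column.
toy = pd.concat([toy, toy_dummies], axis=1).drop(columns="fuel-type")

print(toy)
```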