• In this section, we'll be going through some data preprocessing techniques. If you're unfamiliar with the term, data preprocessing is a necessary step in data analysis.


  • It is the process of converting or mapping data from one raw form into another format to make it ready for further analysis.


  • Data preprocessing is also often called data cleaning or data wrangling, among other terms.


  • First, we'll show you how to identify and handle missing values. A missing value condition occurs whenever a data entry is left empty.


  • Then we'll cover data formats. Data from different sources may be in various formats, in different units, or in various conventions.


  • We will introduce some methods in Python Pandas that can standardize the values into the same format, or unit, or convention.


  • After that, we'll cover data normalization. Different columns of numerical data may have very different ranges and direct comparison is often not meaningful.


  • Normalization is a way to bring all data into a similar range for more useful comparison.


  • Specifically, we'll focus on the techniques of centering and scaling. Then, we'll introduce data binning.
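As a preview of centering and scaling (z-score normalization), here is a minimal sketch on a toy DataFrame; the column names and values are illustrative, not taken from the actual car dataset:

```python
import pandas as pd

# Toy numeric columns with very different ranges (illustrative values)
df = pd.DataFrame({"length": [168.8, 171.2, 176.6],
                   "price": [13495.0, 16500.0, 21500.0]})

# Centering and scaling: subtract the mean, then divide by the
# standard deviation, column by column
normalized = (df - df.mean()) / df.std()

print(normalized.round(3))
```

After this transformation, each column has mean 0 and standard deviation 1, so the columns can be compared on a similar scale.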


  • Binning creates bigger categories from a set of numerical values. It is particularly useful for comparison between groups of data.
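A minimal binning sketch using Pandas' `pd.cut`, with a few illustrative price values (not the full dataset):

```python
import pandas as pd

prices = pd.Series([5118, 7775, 9495, 13950, 17450, 23875, 36880, 45400])

# Bin the continuous prices into three equal-width categories
bins = pd.cut(prices, bins=3, labels=["low", "medium", "high"])
print(bins.value_counts())
```

Each numeric price is mapped into one of three broader categories, which makes group-level comparisons straightforward.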


  • Lastly, we'll talk about categorical variables and show you how to convert categorical values into numeric variables to make statistical modeling easier.
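One common way to do this conversion is one-hot encoding with `pd.get_dummies`; here is a small sketch on a toy fuel-type column (the values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"fuel-type": ["gas", "diesel", "gas", "gas"]})

# One-hot encode the categorical column: one new indicator column
# per category value
dummies = pd.get_dummies(df["fuel-type"])
print(dummies)
```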


  • In Python, we usually perform operations along columns. Each row of a column represents a sample, i.e., a different used car in the dataset.


  • You access a column by specifying the name of the column. For example, you can access "symboling" and "body-style". Each of these columns is a Pandas Series.


  • There are many ways to manipulate DataFrames in Python. For example, you can add a value to each entry of a column. To add 1 to each "symboling" entry, use the command df["symboling"] = df["symboling"] + 1. This changes each value of the DataFrame column by adding 1 to the current value.
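Putting the column operations together on a toy stand-in for the dataset (the column names match the course data, but the values here are illustrative):

```python
import pandas as pd

# Small stand-in for the car dataset
df = pd.DataFrame({"symboling": [3, 1, 2],
                   "body-style": ["convertible", "sedan", "hatchback"]})

# Each column is a Pandas Series
print(type(df["symboling"]))

# Add 1 to every entry of the "symboling" column
df["symboling"] = df["symboling"] + 1
print(df["symboling"].tolist())
```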


  • In this section, we will introduce the pervasive problem of missing values, as well as strategies for handling them when you encounter them in your data.


  • When no data value is stored for a feature in a particular observation, we say this feature has a missing value.


  • Usually, a missing value in a dataset appears as a question mark, an N/A, a zero, or just a blank cell.


  • But how can you deal with missing data? There are many ways to deal with missing values and this is regardless of Python, R or whatever tool you use. Of course, each situation is different and should be judged differently. However, these are the typical options you can consider.


  •      
    #import libraries and files
    
    import pandas as pd
    import matplotlib.pylab as plt
    filename = "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/auto.csv"
    headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
             "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
             "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
             "peak-rpm","city-mpg","highway-mpg","price"]
    df = pd.read_csv(filename, names = headers)    
        
      
    
  • The first is to check if the person or group that collected the data can go back and find what the actual value should be. Another possibility is just to remove the data where that missing value is found.


  • When you drop data, you could either drop the whole variable or just the single data entry with the missing value. If you don't have many observations with missing data, dropping the particular entries is usually the best option.


  • If you're removing data, you want to choose the approach with the least impact. Replacing data is better, since no data is wasted. However, it is less accurate, since we need to replace missing data with a guess of what the data should be.


  • One standard replacement technique is to replace missing values with the average value of the entire variable.


  • But what if the values cannot be averaged as with categorical variables?


  • In this case, one possibility is to use the mode, the most common value, such as gasoline for the fuel-type variable.
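A small sketch of mode replacement on a toy fuel-type column (the values here are illustrative, not the full dataset):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"fuel-type": ["gas", "gas", np.nan, "diesel", "gas"]})

# Find the most frequent value (the mode) of the column
most_common = df["fuel-type"].value_counts().idxmax()

# Replace missing entries with the mode
df["fuel-type"] = df["fuel-type"].replace(np.nan, most_common)
print(most_common)
```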


  • Finally, sometimes we may find another way to estimate the missing data, usually because the data gatherer knows something additional about it. And of course, in some cases you may simply want to leave the missing data as missing; for one reason or another, it may be useful to keep the observation even if some features are missing.


  • Now, let's go into how to drop or replace missing values in Python. To remove data that contains missing values, the Pandas library has a built-in method called dropna(). Essentially, with the dropna() method, you can choose to drop rows or columns that contain missing values like NaN.


  •   
     
    # To see what the data set looks like, we'll use the head() method.
    df.head()
    
     
    
  • So you'll need to specify axis=0 to drop the rows, or axis=1 to drop the columns, that contain the missing values.


  • It can simply be done in one line of code using df.dropna(). Setting the argument inplace=True allows the modification to be done on the dataset directly; inplace=True just writes the result back into the DataFrame.
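For example, on a toy DataFrame (not the full car dataset), dropping the rows with a missing price could look like this:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"price": [13495.0, np.nan, 16500.0],
                   "horsepower": [111.0, 154.0, np.nan]})

# Drop rows (axis=0) that have a missing value in the "price" column,
# writing the result back into df with inplace=True
df.dropna(subset=["price"], axis=0, inplace=True)
print(df)
```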


  •  
    '''
    In the car dataset, missing data comes with the question mark "?". We replace "?" with NaN (Not a Number), which is Python's default missing value marker, for reasons of computational speed and convenience. Here we use the function:
    '''    
        
    import numpy as np
    
    # replace "?" to NaN
    df.replace("?", np.nan, inplace = True)
    df.head(5)  
     
    
     
    '''
    We use Pandas' built-in methods to identify these missing values. There are two methods to detect missing data:
    
    .isnull()
    .notnull()
    The output is a DataFrame of boolean values indicating whether each entry is missing data (True) or not (False).
    '''
    
    
    
    missing_data = df.isnull()
    missing_data.head(5)
    
     
    
     
    #Count missing values in each column
    
    for column in missing_data.columns.values.tolist():
        print(column)
        print(missing_data[column].value_counts())
        print("")
        
     
    
  • You should always check the documentation if you are not familiar with the function or method. The pandas web page has lots of useful resources.


  • To replace missing values like NaNs with actual values, the Pandas library has a built-in method called replace(), which can be used to fill in the missing values with newly calculated values.


  • As an example, assume that we want to replace the missing values of the variable by the mean value of the variable. Therefore, the missing value should be replaced by the average of the entries within that column.


  •      
    #Calculate the average of the column
    
    avg_norm_loss = df["normalized-losses"].astype("float").mean(axis=0)
    print("Average of normalized-losses:", avg_norm_loss)    
        
    #Replace "NaN" by mean value in "normalized-losses" column
    
    df["normalized-losses"] = df["normalized-losses"].replace(np.nan, avg_norm_loss)
        
     
    
  • In Python, first we calculate the mean of the column. Then we use the method replace to specify the value we would like to be replaced as the first parameter, in this case NaN.


  • The second parameter is the value we would like to replace it with, i.e., the mean in this example. This is a fairly simplified way of replacing missing values.


  • There are of course other techniques, such as replacing missing values with the average of the group instead of the entire dataset.
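A sketch of that group-based replacement, assuming a toy DataFrame where each missing price is filled with the mean price of its body-style group:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"body-style": ["sedan", "sedan", "hatchback", "hatchback"],
                   "price": [15000.0, np.nan, 8000.0, np.nan]})

# Compute the mean price within each body-style group, aligned to the
# original rows, and use it to fill the missing prices
group_means = df.groupby("body-style")["price"].transform("mean")
df["price"] = df["price"].fillna(group_means)
print(df)
```

Here the missing sedan price becomes the sedan group's average rather than the overall average, which respects systematic price differences between body styles.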