• The last step in data cleaning is to check that every column is stored in the correct format (int, float, text, or other).


  • In Pandas, we use


    1. .dtypes to check the data type of each column (it is an attribute, so it takes no parentheses)


    2. .astype() to change the data type
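
  • As a minimal sketch of these two tools, the toy DataFrame below (hypothetical values, not taken from the dataset) stores numbers as strings, inspects the types, and converts one column. Note that astype() returns a new object, so the result has to be assigned back.

```python
import pandas as pd

# Toy data (hypothetical): numbers stored as strings, as can happen after a raw CSV read
toy = pd.DataFrame({"bore": ["3.47", "2.68"], "make": ["audi", "bmw"]})

print(toy.dtypes)         # data type of every column
print(toy["bore"].dtype)  # data type of a single column

# astype() does not modify in place; assign the result back to keep the change
toy["bore"] = toy["bore"].astype("float")
print(toy.dtypes)
```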


     
    
    import pandas as pd
    import matplotlib.pyplot as plt
    
    filename = "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/auto.csv"
    
    
    headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
             "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
             "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
             "peak-rpm","city-mpg","highway-mpg","price"]
             
    df = pd.read_csv(filename, names = headers)
    
    # List the data type of each column
    df.dtypes
      
    
  • As we can see above, some columns are not of the correct data type. Numerical variables should have type 'float' or 'int', and variables with strings such as categories should have type 'object'.


  • For example, 'bore' and 'stroke' variables are numerical values that describe the engines, so we should expect them to be of the type 'float' or 'int'; however, they are shown as type 'object'.


  • We have to convert data types into a proper format for each column using the "astype()" method.


    df[["bore", "stroke"]] = df[["bore", "stroke"]].astype("float")
    df[["normalized-losses"]] = df[["normalized-losses"]].astype("int")
    df[["price"]] = df[["price"]].astype("float")
    df[["peak-rpm"]] = df[["peak-rpm"]].astype("float")
      
    
     
    
    # List the column data types after the conversion
    df.dtypes
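
  • The astype() calls above only succeed when every value in the column can be parsed as a number. As a hedged sketch, assuming a column that still contains a non-numeric placeholder such as "?" (an assumption for illustration, with made-up values), pd.to_numeric with errors="coerce" converts what it can and marks the rest as NaN, so leftover problems can be counted before converting.

```python
import pandas as pd

# Hypothetical toy column: "?" stands in for a missing value (an assumption)
raw = pd.Series(["3.47", "?", "2.68"], name="bore")

# raw.astype("float") would fail here because "?" is not a number.
# to_numeric with errors="coerce" turns unparseable entries into NaN instead.
bore = pd.to_numeric(raw, errors="coerce")

print(bore.isna().sum())  # number of entries that could not be converted
print(bore.dtype)
```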
      
    
  • We now have a cleaned dataset with no missing values and every column in its proper format.


  • Data is usually collected from different agencies in different formats. (The term "data standardization" is also used for a particular type of data normalization, in which we subtract the mean and divide by the standard deviation.)


  • What is Standardization? Standardization is the process of transforming data into a common format that allows the researcher to make meaningful comparisons.


  • Example: transforming mpg to L/100km. In our dataset, the fuel consumption columns "city-mpg" and "highway-mpg" are expressed in mpg (miles per gallon). Assume we are developing an application for a country that reports fuel consumption in L/100km.


  • We need to apply a data transformation to convert mpg into L/100km.


  • The formula for unit conversion is L/100km = 235 / mpg (235 is a rounded constant; for US gallons the factor is about 235.21).


  • We can do many mathematical operations directly in Pandas.


    # Check the first few rows
    df.head()
    
    
    # Convert mpg to L/100km by mathematical operation (235 divided by mpg)
    df['city-L/100km'] = 235/df["city-mpg"]
    
    # Check the transformed data
    df.head()
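
  • The same conversion can be applied to both fuel consumption columns. The sketch below is one possible way to do it, not the only one: the helper name mpg_to_l_per_100km and the toy values are hypothetical, and the constant is derived from US gallons (3.785411784 L) and miles (1.609344 km) rather than rounded to 235.

```python
import pandas as pd

# Litres per US gallon times 100, divided by kilometres per mile (about 235.21)
MPG_TO_L_PER_100KM = 100 * 3.785411784 / 1.609344

def mpg_to_l_per_100km(mpg):
    """Convert miles per US gallon to litres per 100 km (scalars or Series)."""
    return MPG_TO_L_PER_100KM / mpg

# Toy frame with hypothetical values, reusing the column names from the dataset
toy = pd.DataFrame({"city-mpg": [21, 24], "highway-mpg": [27, 30]})

for col in ["city-mpg", "highway-mpg"]:
    new_col = col.replace("mpg", "L/100km")  # e.g. "city-L/100km"
    toy[new_col] = mpg_to_l_per_100km(toy[col])

print(toy.round(2))
```

  • Because the operation is vectorized, the division is applied to every row of the column at once, with no explicit loop over rows.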