• Introduction


  • The aim is to introduce you to data visualization with Python as concretely and as consistently as possible.


  • Speaking of consistency, because there is no single best data visualization library available for Python - at least as of the creation of these labs - we introduce several different libraries and show their benefits as we discuss new visualization concepts.


  • In doing so, we hope to make students well-rounded in visualization libraries and concepts so that they are able to judge and decide on the best visualization technique and tool for a given problem and audience.


  • Please make sure that you have completed the prerequisites for this course, namely Python for Data Science and Data Analysis with Python, which are part of this specialization.


  • Note: The majority of the plots and visualizations will be generated using data stored in pandas dataframes. Therefore, in this lab, we provide a brief crash course on pandas.


  • However, if you are interested in learning more about the pandas library, a detailed description and explanation of how to use it and how to clean, munge, and process data stored in a pandas dataframe are provided in our course Data Analysis with Python, which is also part of this specialization.


  • pandas is an essential data analysis toolkit for Python.


  • From their website: pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.


  • It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.


  • The course heavily relies on pandas for data wrangling, analysis, and visualization.


  • We encourage you to spend some time and familiarize yourself with the pandas API Reference: http://pandas.pydata.org/pandas-docs/stable/api.html.


  • Dataset Source: International migration flows to and from selected countries - The 2015 revision.


  • The dataset contains annual data on the flows of international immigrants as recorded by the countries of destination.


  • The data presents both inflows and outflows according to the place of birth, citizenship, or place of previous/next residence, for both foreigners and nationals.


  • The current version presents data pertaining to 45 countries.


  • In this lab, we will focus on the Canadian immigration data.


  • The first thing we'll do is import two key data analysis modules: pandas and NumPy.


  • 
    import numpy as np  # useful for scientific computing in Python
    import pandas as pd # primary data structure library
      
    
  • Let's download and import our primary Canadian Immigration dataset using the pandas read_excel() method.


  • Normally, before we can do that, we would need to install a module that pandas requires to read in Excel files. This module is xlrd.


  • You would need to run the following line of code to install the xlrd module:


  • 
    !conda install -c anaconda xlrd --yes
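

  • If conda is not available in your environment, pip can usually be used instead. This is a hedged alternative, not part of the original lab setup:


  • 
    # alternative to conda: install xlrd with pip
    !pip install xlrd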
    
      
    
  • Now we are ready to read in our data.


  • 
    df_can = pd.read_excel('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DV0101EN/labs/Data_Files/Canada.xlsx',
                           sheet_name='Canada by Citizenship',
                           skiprows=range(20),
                           skipfooter=2)
    
    
    print('Data read into a pandas dataframe!')
      
    
  • Let's view the top 5 rows of the dataset using the head() method.


  • 
    df_can.head()
    # tip: You can specify the number of rows you'd like to see as follows: df_can.head(10) 
      
    
  • We can also view the bottom 5 rows of the dataset using the tail() method.


  • 
    df_can.tail()
      
    
  • When analyzing a dataset, it's always a good idea to start by getting basic information about your dataframe. We can do this by using the info() method.


  • 
    df_can.info()
      
    
  • To get the list of column headers, we can call upon the dataframe's .columns attribute.


  • 
    df_can.columns.values 
      
    
  • Similarly, to get the list of indices, we use the .index attribute.


  • 
    df_can.index.values
      
    
  • Note: The default type of index and columns is NOT list.


  • 
    print(type(df_can.columns))
    print(type(df_can.index))
      
    
  • To get the index and columns as lists, we can use the tolist() method.


  • 
    df_can.columns.tolist()
    df_can.index.tolist()
    
    
    print (type(df_can.columns.tolist()))
    print (type(df_can.index.tolist()))
      
    
  • To view the dimensions of the dataframe, we use the .shape attribute.


  • 
    # size of dataframe (rows, columns)
    df_can.shape    
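

  • As a quick sketch (not part of the original lab), the (rows, columns) tuple returned by .shape can be unpacked to report each dimension separately:


  • 
    # unpack the dimensions of the dataframe
    num_rows, num_columns = df_can.shape
    print('Number of rows:', num_rows)
    print('Number of columns:', num_columns)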
      
    
  • Note: The main types stored in pandas objects are float, int, bool, datetime64[ns] and datetime64[ns, tz] (in >= 0.17.0), timedelta[ns], category (in >= 0.15.0), and object (string). In addition, these dtypes have item sizes, e.g. int64 and int32.
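

  • To see which of these dtypes each column actually holds, we can inspect the dataframe's .dtypes attribute (a quick check, not part of the original lab):


  • 
    # data type stored in each column
    df_can.dtypes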


  • Let's clean the data set to remove a few unnecessary columns. We can use the pandas drop() method as follows:


  • 
    # in pandas axis=0 represents rows (default) and axis=1 represents columns.
    
    df_can.drop(['AREA','REG','DEV','Type','Coverage'], axis=1, inplace=True)
    df_can.head(2)
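

  • Note that re-running the drop cell would raise a KeyError, since those columns are already gone. As a hedged alternative (not part of the original lab), drop() accepts errors='ignore' to skip columns that are no longer present:


  • 
    # safe to re-run: silently skip any columns that have already been dropped
    df_can.drop(['AREA','REG','DEV','Type','Coverage'], axis=1, inplace=True, errors='ignore')
    df_can.head(2)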
      
    
  • Let's rename the columns so that they make sense. We can use the rename() method by passing in a dictionary of old and new names as follows:


  • 
    df_can.rename(columns={'OdName':'Country', 'AreaName':'Continent', 'RegName':'Region'}, inplace=True)
    df_can.columns
      
    
  • We will also add a 'Total' column that sums up the total number of immigrants by country over the entire period 1980 - 2013, as follows:


  • 
    df_can['Total'] = df_can.sum(axis=1)
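

  • Note: on newer pandas versions, summing across rows that mix string columns (Country, Continent, Region) with numeric ones can raise an error or silently concatenate strings. A minimal, hedged alternative - assuming the yearly counts are the only numeric columns at this point - is to restrict the sum to numeric columns:


  • 
    # restrict the row-wise sum to numeric columns only
    df_can['Total'] = df_can.sum(axis=1, numeric_only=True)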
      
    
  • We can check to see how many null objects we have in the dataset as follows:


  • 
    df_can.isnull().sum()
      
    
  • Finally, let's view a quick summary of each column in our dataframe using the describe() method.


  • 
    df_can.describe()
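

  • By default, describe() summarizes only the numeric columns. If you would also like summary statistics for the non-numeric columns (e.g. Country, Continent, Region), you can pass include='all':


  • 
    # include non-numeric columns in the summary
    df_can.describe(include='all')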