• Dataset: Immigration to Canada from 1980 to 2013 - International migration flows to and from selected countries - The 2015 revision, from the United Nations' website.


  • The dataset contains annual data on the flows of international migrants as recorded by the countries of destination.


  • The data presents both inflows and outflows according to the place of birth, citizenship or place of previous / next residence both for foreigners and nationals.


  • For this lesson, we will focus on the Canadian immigration data.


  • Import Primary Modules. The first thing we'll do is import two key data analysis modules: pandas and NumPy.


  • 
    import numpy as np  # useful for many scientific computing tasks in Python
    import pandas as pd # primary data structure library
     
    
  • Let's download and import our primary Canadian Immigration dataset using pandas read_excel() method.


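  • Normally, before we can do that, we would need to install a module that pandas requires to read Excel files. This lab uses xlrd. Note that xlrd 2.0 and later reads only the legacy .xls format, so recent pandas versions rely on openpyxl for .xlsx files such as this one.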


  • You would need to run the following line of code to install the xlrd module:


  • 
    !conda install -c anaconda xlrd --yes
     
    
  • Download the dataset and read it into a pandas dataframe.


  • 
    df_can = pd.read_excel('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DV0101EN/labs/Data_Files/Canada.xlsx',
                           sheet_name='Canada by Citizenship',
                           skiprows=range(20),
                           skipfooter=2
                          )
    
    print('Data downloaded and read into a dataframe!')
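  • To see what skiprows and skipfooter do without downloading anything, here is a minimal sketch. It uses read_csv on an in-memory string rather than read_excel (both accept these two parameters), and the file contents, df_demo, and all values are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical miniature of a spreadsheet export: two title rows on top,
# a header row, two data rows, and one footnote row at the bottom.
raw = io.StringIO(
    "Some report title\n"
    "Generated for illustration only\n"
    "OdName,1980,1981\n"
    "CountryA,10,12\n"
    "CountryB,3,4\n"
    "Footnote: made-up numbers\n"
)

df_demo = pd.read_csv(
    raw,
    skiprows=range(2),   # skip the two title rows that precede the header
    skipfooter=1,        # skip the footnote row at the end
    engine='python',     # read_csv needs the Python engine for skipfooter
)

print(df_demo.shape)          # rows and columns that survived the trimming
print(list(df_demo.columns))  # the header row is now the column labels
```

  • In the lab's call, skiprows=range(20) and skipfooter=2 play the same role for the notes above and below the table in the Excel sheet.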
     
    
  • Let's take a look at the first five items in our dataset.


  • 
    df_can.head()
     
    
  • Let's find out how many entries there are in our dataset.


  • 
    # print the dimensions of the dataframe
    print(df_can.shape)
     
    
  • Clean up data. We will make some modifications to the original dataset to make it easier to create our visualizations. Refer to the Introduction to Matplotlib and Line Plots lab for the rationale and a detailed description of the changes.


  • Clean up the dataset to remove columns that are not informative to us for visualization (e.g. Type, AREA, REG).


  • 
    df_can.drop(['AREA', 'REG', 'DEV', 'Type', 'Coverage'], axis=1, inplace=True)
     
    
    
    # let's view the first five elements and see how the dataframe was changed
    df_can.head()
     
    
  • Notice how the columns Type, Coverage, AREA, REG, and DEV got removed from the dataframe.


  • Rename some of the columns so that they make sense.


  • 
    df_can.rename(columns={'OdName':'Country', 'AreaName':'Continent','RegName':'Region'}, inplace=True)
     
    
    
    # let's view the first five elements and see how the dataframe was changed
    df_can.head()
     
    
  • Notice how the column names now make much more sense, even to an outsider.


  • For consistency, ensure that all column labels are of type string.


  • 
    # let's examine the types of the column labels
    all(isinstance(column, str) for column in df_can.columns)
     
    
  • Notice how the above line of code returned False when we tested if all the column labels are of type string. So let's change them all to string type.


  • 
    df_can.columns = list(map(str, df_can.columns))
    
    
    # let's check the column labels types now
    all(isinstance(column, str) for column in df_can.columns)
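  • The check returns False presumably because the year headers are read as integers rather than strings. A minimal sketch on a made-up frame (df_demo and its values are hypothetical) shows the check before and after the conversion:

```python
import pandas as pd

# Hypothetical frame: one text label and two integer year labels.
df_demo = pd.DataFrame([['CountryA', 10, 12]], columns=['Country', 1980, 1981])

# Before: the integer labels make the check fail.
before = all(isinstance(c, str) for c in df_demo.columns)

# Convert every label to a string, as in the lab.
df_demo.columns = list(map(str, df_demo.columns))

# After: every label is a string, so '1980' (not 1980) selects the column.
after = all(isinstance(c, str) for c in df_demo.columns)

print(before, after)
print(df_demo['1980'].tolist())
```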
     
    
  • Set the country name as the index. This is useful for quickly looking up countries using the .loc method.


  • 
    df_can.set_index('Country', inplace=True)
     
    
    
    # let's view the first five elements and see how the dataframe was changed
    df_can.head()
     
    
  • Notice how the country names now serve as indices.
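  • With the country as index, .loc can select by label. A small sketch on a made-up frame (df_demo, the country names, and the values are hypothetical):

```python
import pandas as pd

# Hypothetical two-country frame indexed by country name.
df_demo = pd.DataFrame(
    {'Country': ['CountryA', 'CountryB'], '1980': [10, 3], '1981': [12, 4]}
).set_index('Country')

row = df_demo.loc['CountryB']                        # whole row as a Series
cell = df_demo.loc['CountryB', '1981']               # a single value
subset = df_demo.loc['CountryA', ['1980', '1981']]   # selected columns of one row

print(row.tolist(), cell, subset.tolist())
```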


  • Add a Total column.


  •  
    df_can['Total'] = df_can.sum(axis=1, numeric_only=True)  # sum only the numeric (year) columns
    
    
    # let's view the first five elements and see how the dataframe was changed
    df_can.head()
     
    
  • Now the dataframe has an extra column that presents the total number of immigrants from each country in the dataset from 1980 to 2013. So if we print the dimensions of the data, we get:


  • 
    print('data dimensions:', df_can.shape)
     
    
  • So now our dataframe has 38 columns instead of 37 columns that we had before.
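  • The row-wise sum has to leave out the remaining text columns (such as Continent and Region). A minimal sketch on a made-up frame, assuming numeric_only=True is used to restrict the sum to numeric columns (df_demo and its values are hypothetical):

```python
import pandas as pd

# Hypothetical frame with one text column and two year columns.
df_demo = pd.DataFrame(
    {'Continent': ['X', 'Y'], '1980': [10, 3], '1981': [12, 4]},
    index=['CountryA', 'CountryB'],
)

cols_before = df_demo.shape[1]

# Sum across each row, using only the numeric columns.
df_demo['Total'] = df_demo.sum(axis=1, numeric_only=True)

cols_after = df_demo.shape[1]
print(cols_before, cols_after)     # the column count grows by one
print(df_demo['Total'].tolist())
```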


  • 
    # finally, let's create a list of years from 1980 - 2013
    # this will come in handy when we start plotting the data
    years = list(map(str, range(1980, 2014)))
    
    
    years
     
    
  • A bar plot is a way of representing data where the length of the bars represents the magnitude/size of the feature/variable. Bar graphs usually represent numerical and categorical variables grouped in intervals.


  • To create a bar plot, we can pass one of two arguments via the kind parameter of plot():


    1. kind='bar' creates a vertical bar plot


    2. kind='barh' creates a horizontal bar plot
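  • The two kinds can be compared side by side. This is a sketch on a made-up Series (s and its values are hypothetical), using the non-interactive Agg backend so that it runs without a display:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values for three categories.
s = pd.Series([5, 9, 2], index=['a', 'b', 'c'])

fig, (ax_v, ax_h) = plt.subplots(1, 2, figsize=(8, 3))
s.plot(kind='bar', ax=ax_v)    # categories on the x-axis, bar height = value
s.plot(kind='barh', ax=ax_h)   # categories on the y-axis, bar width = value

# Each bar is a rectangle patch; read back the dimension that encodes the value.
heights = [p.get_height() for p in ax_v.patches]
widths = [p.get_width() for p in ax_h.patches]
print(heights, widths)
```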


  • Vertical bar plot. In vertical bar graphs, the x-axis is used for labelling, and the length of bars on the y-axis corresponds to the magnitude of the variable being measured. Vertical bar graphs are particularly useful in analyzing time series data. One disadvantage is that they lack space for text labelling at the foot of each bar.


  • Let's start off by analyzing the effect of Iceland's Financial Crisis: The 2008 - 2011 Icelandic Financial Crisis was a major economic and political event in Iceland. Relative to the size of its economy, Iceland's systemic banking collapse was the largest experienced by any country in economic history. The crisis led to a severe economic depression in 2008 - 2011 and significant political unrest.


  • Question: Let's compare the number of Icelandic immigrants (country = 'Iceland') to Canada from year 1980 to 2013.


  • 
    # step 1: get the data
    df_iceland = df_can.loc['Iceland', years]
    df_iceland.head()
     
    
    
    # step 2: plot data
    import matplotlib.pyplot as plt  # scripting layer; needed for the plt.* calls below
    df_iceland.plot(kind='bar', figsize=(10, 6))
    
    plt.xlabel('Year') # add x-label to the plot
    plt.ylabel('Number of immigrants') # add y-label to the plot
    plt.title('Icelandic immigrants to Canada from 1980 to 2013') # add title to the plot
    
    plt.show()
     
    
  • The bar plot above shows the number of Icelandic immigrants for each year. We can clearly see the impact of the financial crisis; the number of Icelandic immigrants to Canada started increasing rapidly after 2008.


  • Let's annotate this on the plot using the annotate method of the scripting layer or the pyplot interface. We will pass in the following parameters:


    1. s: str, the text of the annotation (this parameter is named text in Matplotlib 3.3 and later).


    2. xy: Tuple specifying the (x,y) point to annotate (in this case, end point of arrow).


    3. xytext: Tuple specifying the (x,y) point to place the text (in this case, start point of arrow).


    4. xycoords: The coordinate system that xy is given in - 'data' uses the coordinate system of the object being annotated (default).


    5. arrowprops: Takes a dictionary of properties to draw the arrow:


       • arrowstyle: Specifies the arrow style; '->' is a standard arrow.


       • connectionstyle: Specifies the connection type; 'arc3' is a straight line.


       • color: Specifies the color of the arrow.


       • lw: Specifies the line width.


    
    df_iceland.plot(kind='bar', figsize=(10, 6), rot=90) # rotate the x-axis tick labels by 90 degrees
    
    plt.xlabel('Year')
    plt.ylabel('Number of Immigrants')
    plt.title('Icelandic Immigrants to Canada from 1980 to 2013')
    
    # Annotate arrow
    plt.annotate('',                      # s: str. Will leave it blank for no text
                 xy=(32, 70),             # place head of the arrow at point (year 2012 , pop 70)
                 xytext=(28, 20),         # place base of the arrow at point (year 2008 , pop 20)
                 xycoords='data',         # will use the coordinate system of the object being annotated 
                 arrowprops=dict(arrowstyle='->', connectionstyle='arc3', color='blue', lw=2)
                )
    
    plt.show()
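  • The x values 28 and 32 in the annotation are bar positions, not years: a pandas bar plot places the categories at integer positions 0, 1, 2, and so on. A short sketch of how those positions can be looked up from the years list defined earlier (x_2008 and x_2012 are names introduced here for illustration):

```python
# Same list of year labels as defined earlier in the lab.
years = list(map(str, range(1980, 2014)))

# The position of a bar is the index of its label in the list.
x_2008 = years.index('2008')   # base of the arrow
x_2012 = years.index('2012')   # head of the arrow

print(len(years), x_2008, x_2012)
```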
     
    
  • Let's also add a text annotation that runs along the arrow. We will pass in the following additional parameters:


    1. rotation: rotation angle of text in degrees (counterclockwise)


    2. va: vertical alignment of text [‘center’ | ‘top’ | ‘bottom’ | ‘baseline’]


    3. ha: horizontal alignment of text [‘center’ | ‘right’ | ‘left’]


    
    df_iceland.plot(kind='bar', figsize=(10, 6), rot=90) 
    
    plt.xlabel('Year')
    plt.ylabel('Number of Immigrants')
    plt.title('Icelandic Immigrants to Canada from 1980 to 2013')
    
    # Annotate arrow
    plt.annotate('',                      # s: str. will leave it blank for no text
                 xy=(32, 70),             # place head of the arrow at point (year 2012 , pop 70)
                 xytext=(28, 20),         # place base of the arrow at point (year 2008 , pop 20)
                 xycoords='data',         # will use the coordinate system of the object being annotated 
                 arrowprops=dict(arrowstyle='->', connectionstyle='arc3', color='blue', lw=2)
                )
    
     
    
    
    # Annotate Text
    plt.annotate('2008 - 2011 Financial Crisis', # text to display
                 xy=(28, 30),                    # start the text at point (year 2008 , pop 30)
                 rotation=72.5,                  # based on trial and error to match the arrow
                 va='bottom',                    # want the text to be vertically 'bottom' aligned
                 ha='left',                      # want the text to be horizontally 'left' aligned
                )
    
    plt.show()
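  • The rotation of 72.5 degrees was found by trial and error. As an alternative, the on-screen angle of the arrow can be estimated by transforming its end points from data to display coordinates. The sketch below uses made-up bar heights and the Agg backend (both assumptions), so its result depends on the figure size and axis limits and will not necessarily equal 72.5.

```python
import math
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))
ax.bar(range(34), [i * 2 for i in range(34)])  # hypothetical bar heights
fig.canvas.draw()  # finalize axis limits and layout before transforming

# Arrow end points in data coordinates (same points as the arrow annotation),
# converted to display (pixel) coordinates.
(x0, y0), (x1, y1) = ax.transData.transform([(28, 20), (32, 70)])

# Angle of the arrow as it appears on screen, in degrees counterclockwise.
angle = math.degrees(math.atan2(y1 - y0, x1 - x0))
print(round(angle, 1))
```

  • The computed value could then be passed as rotation=angle to the text annotation instead of a hand-tuned number.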
     
    
  • Horizontal bar plot. Sometimes it is more practical to represent the data horizontally, especially if you need more room for labelling the bars. In horizontal bar graphs, the y-axis is used for labelling, and the length of bars on the x-axis corresponds to the magnitude of the variable being measured. As you will see, there is more room on the y-axis to label categorical variables.


  • Question: Using the scripting layer and the df_can dataset, create a horizontal bar plot showing the total number of immigrants to Canada from the top 15 countries, for the period 1980 - 2013. Label each country with the total immigrant count.

  • Step 1: Get the data pertaining to the top 15 countries.


  • 
    ### type your answer here
    df_can.sort_values(by='Total', ascending=True, inplace=True)
    df_top15 = df_can['Total'].tail(15)
    df_top15
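  • Sorting in ascending order and taking the tail puts the largest total last, which a horizontal bar plot draws at the top. A sketch on made-up totals (totals and its values are hypothetical) shows this, along with nlargest as one alternative that does not reorder the original data:

```python
import pandas as pd

# Hypothetical totals for five countries.
totals = pd.Series({'A': 50, 'B': 900, 'C': 120, 'D': 700, 'E': 10})

# Approach used in the lab: sort ascending, keep the last n entries.
top3_sorted = totals.sort_values(ascending=True).tail(3)

# Alternative: pick the n largest, then sort ascending for plotting order.
top3_nlargest = totals.nlargest(3).sort_values(ascending=True)

print(list(top3_sorted.index))
print(top3_sorted.equals(top3_nlargest))
```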
     
    
  • Step 2: Plot data: Use kind='barh' to generate a bar chart with horizontal bars. Choose a good size for the plot, label your axes, and give the plot a title.


  • Loop through the countries and annotate the immigrant population using the annotate function of the scripting interface.


  • 
    ### type your answer here
    df_top15.plot(kind='barh', figsize=(12, 12), color='steelblue')
    plt.xlabel('Number of Immigrants')
    plt.title('Top 15 Countries Contributing to the Immigration to Canada between 1980 - 2013')
    
    # annotate value labels to each country
    for index, value in enumerate(df_top15): 
        label = format(int(value), ',') # format int with commas
        
        # place text at the end of bar (subtracting 47000 from x, and 0.1 from y to make it fit within the bar)
        plt.annotate(label, xy=(value - 47000, index - 0.10), color='white')
    
    plt.show()
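  • The labels in the loop above come from format(int(value), ','), which inserts thousands separators, while enumerate supplies the bar position used as the y coordinate. A short sketch on made-up totals (demo_totals and its values are hypothetical):

```python
# Hypothetical totals, in the same bottom-to-top order as the bars.
demo_totals = [12345, 678901, 2500000]

labels = []
for index, value in enumerate(demo_totals):
    label = format(int(value), ',')   # integer with comma thousands separators
    labels.append((index, label))     # index is the bar's position on the y-axis

print(labels)
```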