In this lab, we will continue exploring the Matplotlib library and will learn how to create additional plots, namely area plots, histograms, and bar charts.

Toolkits: The course relies heavily on pandas and NumPy for data wrangling, analysis, and visualization.

The primary plotting library that we are exploring in the course is Matplotlib.

Dataset: Immigration to Canada from 1980 to 2013 - International migration flows to and from selected countries - The 2015 revision, from the United Nations' website.

The dataset contains annual data on the flows of international migrants as recorded by the countries of destination.

The data presents both inflows and outflows according to the place of birth, citizenship or place of previous / next residence both for foreigners and nationals.

For this lesson, we will focus on the Canadian immigration data.

Import Primary Modules. The first thing we'll do is import two key data analysis modules: pandas and NumPy.


import numpy as np  # useful for scientific computing in Python
import pandas as pd # primary data structure library

Let's download and import our primary Canadian Immigration dataset using pandas read_excel() method.

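Normally, before we can do that, we need to install a module that pandas requires to read Excel files. Older pandas versions use xlrd for this; xlrd 2.0 and later reads only the legacy .xls format, so recent pandas versions rely on openpyxl for .xlsx files such as this one.

You would need to run the following line of code to install both modules:


!conda install -c anaconda xlrd openpyxl --yes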

Download the dataset and read it into a pandas dataframe.


df_can = pd.read_excel('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DV0101EN/labs/Data_Files/Canada.xlsx',
                       sheet_name='Canada by Citizenship',
                       skiprows=range(20),
                       skipfooter=2
                      )

print('Data downloaded and read into a dataframe!')
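The skiprows and skipfooter arguments drop the descriptive rows that sit above and below the actual table in the spreadsheet. The following is a minimal sketch of the same idea on a small in-memory CSV rather than the real Excel file. The sample text and the names raw and sample are made up for illustration; read_csv accepts the same two arguments, although skipfooter there requires the Python parsing engine.

```python
import io
import pandas as pd

# Hypothetical file content: two note lines, the table, then two footer lines.
raw = "\n".join([
    "Notes: made-up sample",
    "Source: illustration only",
    "Country,1980,1981",
    "India,10,20",
    "China,30,40",
    "Footer line 1",
    "Footer line 2",
])

sample = pd.read_csv(
    io.StringIO(raw),
    skiprows=range(2),  # skip the first two lines (positions 0 and 1)
    skipfooter=2,       # ignore the last two lines
    engine="python",    # read_csv supports skipfooter only with this engine
)
print(sample)
```

Only the header row and the two data rows survive, which is the effect the skiprows=range(20) and skipfooter=2 arguments are meant to have on the Canada sheet.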

Let's take a look at the first five items in our dataset.


df_can.head()

Let's find out how many entries there are in our dataset.


# print the dimensions of the dataframe
print(df_can.shape)

Clean up data. We will make some modifications to the original dataset to make it easier to create our visualizations. Refer to the Introduction to Matplotlib and Line Plots lab for the rationale and a detailed description of the changes.

Clean up the dataset to remove columns that are not informative to us for visualization (e.g. Type, AREA, REG).


df_can.drop(['AREA', 'REG', 'DEV', 'Type', 'Coverage'], axis=1, inplace=True)


# let's view the first five elements and see how the dataframe was changed
df_can.head()

Notice how the columns Type, Coverage, AREA, REG, and DEV got removed from the dataframe.

Rename some of the columns so that they make sense.


df_can.rename(columns={'OdName':'Country', 'AreaName':'Continent','RegName':'Region'}, inplace=True)


# let's view the first five elements and see how the dataframe was changed
df_can.head()

Notice how the column names now make much more sense, even to an outsider.

For consistency, ensure that all column labels are of type string.


# let's examine the types of the column labels
all(isinstance(column, str) for column in df_can.columns)

Notice how the above line of code returned False when we tested if all the column labels are of type string. So let's change them all to string type.


df_can.columns = list(map(str, df_can.columns))


# let's check the column labels types now
all(isinstance(column, str) for column in df_can.columns)
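To see why mixed label types are worth fixing, here is a small sketch on a made-up frame (the name demo and its values are assumptions for illustration). When year labels are integers, a lookup with the string '1980' fails; after the conversion, every label can be addressed as a string.

```python
import pandas as pd

# Made-up frame whose year columns are integers, as can happen when a
# spreadsheet header contains numbers.
demo = pd.DataFrame([["India", 10, 20]], columns=["Country", 1980, 1981])

before = all(isinstance(c, str) for c in demo.columns)
has_string_label_before = "1980" in demo.columns

# Convert every label to a string, exactly as done for df_can above.
demo.columns = list(map(str, demo.columns))

after = all(isinstance(c, str) for c in demo.columns)
has_string_label_after = "1980" in demo.columns

print(before, has_string_label_before, after, has_string_label_after)
```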

Set the country name as the index. This is useful for quickly looking up countries using the .loc method.


df_can.set_index('Country', inplace=True)


# let's view the first five elements and see how the dataframe was changed
df_can.head()

Notice how the country names now serve as indices.
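As a brief sketch of what the index makes possible, label-based lookups with .loc might look like this. The frame demo and its numbers are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up two-country frame with string year labels.
demo = pd.DataFrame(
    {"Country": ["India", "China"], "1980": [10, 30], "1981": [20, 40]}
)
demo.set_index("Country", inplace=True)

row = demo.loc["India"]            # the whole row for one country (a Series)
value = demo.loc["China", "1981"]  # a single cell: row label, column label

print(row)
print(value)
```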

Add total column.

 
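# sum only the numeric (year) columns; text columns such as Continent are skipped
df_can['Total'] = df_can.sum(axis=1, numeric_only=True)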


# let's view the first five elements and see how the dataframe was changed
df_can.head()

Now the dataframe has an extra column that presents the total number of immigrants from each country in the dataset from 1980 to 2013. So if we print the dimensions of the data, we get:


print('data dimensions:', df_can.shape)

So now our dataframe has 38 columns instead of the 37 columns that we had before.
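The row-wise total can be checked by hand on a tiny example. The sketch below uses a made-up frame (demo and its values are assumptions) with one text column and two year columns, and passes numeric_only=True so that the text column is left out of the sum.

```python
import pandas as pd

# Made-up frame: one text column plus two numeric year columns.
demo = pd.DataFrame(
    {"Continent": ["Asia", "Asia"], "1980": [10, 30], "1981": [20, 40]},
    index=["India", "China"],
)

n_before = demo.shape[1]

# axis=1 sums across columns, producing one total per row (country).
demo["Total"] = demo.sum(axis=1, numeric_only=True)

print(demo)
print("columns before:", n_before, "after:", demo.shape[1])
```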


# finally, let's create a list of years from 1980 - 2013
# this will come in handy when we start plotting the data
years = list(map(str, range(1980, 2014)))


years



Import Matplotlib.

# use the inline backend to generate the plots within the browser
%matplotlib inline 


import matplotlib as mpl
import matplotlib.pyplot as plt


mpl.style.use('ggplot') # optional: for ggplot-like style


# check the installed version of Matplotlib
print('Matplotlib version: ', mpl.__version__) # >= 2.0.0

In the last module, we created a line plot that visualized the top 5 countries that contributed the most immigrants to Canada from 1980 to 2013.

With a little modification to the code, we can visualize this plot as a cumulative plot, also known as a Stacked Line Plot or Area plot.


df_can.sort_values(['Total'], ascending=False, axis=0, inplace=True)


# get the top 5 entries
df_top5 = df_can.head()


# transpose the dataframe
df_top5 = df_top5[years].transpose() 


df_top5.head()
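The reshaping step is easier to follow on a small frame. In this sketch (demo, years_demo, and the values are made up for illustration), selecting the year columns and transposing turns countries-as-rows into years-as-rows, which is the layout that the plot call expects: one column per plotted series. Converting the index to integers lets the x-axis be treated as numbers.

```python
import pandas as pd

years_demo = list(map(str, range(1980, 1983)))  # ['1980', '1981', '1982']

# Made-up frame: countries as rows, a text column plus year columns.
demo = pd.DataFrame(
    {
        "Continent": ["Asia", "Asia"],
        "1980": [10, 30],
        "1981": [20, 40],
        "1982": [25, 35],
    },
    index=["India", "China"],
)

by_country = demo[years_demo]          # keep only the year columns
by_year = by_country.transpose()       # years become rows, countries columns
by_year.index = by_year.index.map(int) # '1980' -> 1980 for a numeric x-axis

print(by_year)
```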

Area plots are stacked by default. To produce a stacked area plot, each column must contain either all positive or all negative values (any NaN values default to 0). To produce an unstacked plot, pass stacked=False.


df_top5.index = df_top5.index.map(int) # let's change the index values of df_top5 to type integer for plotting
df_top5.plot(kind='area', 
             stacked=False,
             figsize=(20, 10), # pass a tuple (x, y) size
             )


plt.title('Immigration Trend of Top 5 Countries')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')


plt.show()
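The difference between the two modes can be sketched with a made-up two-column frame (demo and its values are assumptions for illustration). In a stacked plot, each band sits on top of the previous one, so its upper edge is the running total across columns; in an unstacked plot, every area starts from zero and the areas overlap.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for running as a script; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data: two series over three years.
demo = pd.DataFrame({"A": [1, 2, 3], "B": [2, 2, 2]}, index=[1980, 1981, 1982])

# Stacked (the default): band B is drawn on top of band A.
ax_stacked = demo.plot(kind="area")
ax_stacked.set_title("Stacked")

# Unstacked: both areas start at zero and overlap, hence the transparency.
ax_unstacked = demo.plot(kind="area", stacked=False)
ax_unstacked.set_title("Unstacked")

# The upper edge of each stacked band is the cumulative sum across columns.
stacked_tops = demo.cumsum(axis=1)
print(stacked_tops)
```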

The unstacked plot has a default transparency (alpha value) at 0.5. We can modify this value by passing in the alpha parameter.


df_top5.plot(kind='area', 
             alpha=0.25, # between 0 and 1; the default for unstacked plots is 0.5
             stacked=False,
             figsize=(20, 10),
            )


plt.title('Immigration Trend of Top 5 Countries')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')


plt.show()

There are two styles/options of plotting with Matplotlib: plotting using the Artist layer and plotting using the scripting layer.

*Option 1: Scripting layer (procedural method) - using matplotlib.pyplot as 'plt'*

You can use plt, i.e. matplotlib.pyplot, and add more elements by calling different methods procedurally; for example, plt.title(...) to add a title or plt.xlabel(...) to add a label to the x-axis.


    # Option 1: This is what we have been using so far
    df_top5.plot(kind='area', alpha=0.35, figsize=(20, 10)) 
    plt.title('Immigration trend of top 5 countries')
    plt.ylabel('Number of immigrants')
    plt.xlabel('Years')

*Option 2: Artist layer (object-oriented method) - using an Axes instance from Matplotlib (preferred)*

You can use an Axes instance of your current plot and store it in a variable (e.g. ax). You can add more elements by calling methods with a little change in syntax (by adding "set_" to the previous methods).

For example, use ax.set_title() instead of plt.title() to add a title, or ax.set_xlabel() instead of plt.xlabel() to add a label to the x-axis.

This option is sometimes more transparent and flexible to use for advanced plots (in particular when having multiple plots, as you will see later).

In this course, we will stick to the scripting layer, except for some advanced visualizations where we will need to use the artist layer to manipulate advanced aspects of the plots.


# option 2: preferred option with more flexibility
ax = df_top5.plot(kind='area', alpha=0.35, figsize=(20, 10))


ax.set_title('Immigration Trend of Top 5 Countries')
ax.set_ylabel('Number of Immigrants')
ax.set_xlabel('Years')
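The flexibility of the artist layer shows up as soon as a figure holds more than one plot. The sketch below uses a made-up frame (demo, ax_left, and ax_right are names chosen for illustration): the scripting-layer call acts on whichever Axes is current, while the artist-layer calls address each Axes explicitly.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for running as a script; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data: two series over three years.
demo = pd.DataFrame({"A": [1, 2, 3], "B": [2, 2, 2]}, index=[1980, 1981, 1982])

# Scripting layer: plt.title() applies to the current Axes, which is the
# one that the plot call just created.
demo.plot(kind="area", alpha=0.35)
plt.title("Scripting layer")
scripting_title = plt.gca().get_title()

# Artist layer: two panels in one figure, each labelled through its own Axes.
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))
demo["A"].plot(kind="area", alpha=0.35, ax=ax_left)
demo["B"].plot(kind="area", alpha=0.35, ax=ax_right)
ax_left.set_title("Series A")
ax_right.set_title("Series B")
ax_left.set_xlabel("Years")
ax_right.set_xlabel("Years")
```

With plt.title() alone, it would not be obvious which of the two panels receives the title; holding a reference to each Axes removes that ambiguity.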


# scripting-layer solution: stacked area plot with alpha=0.45
# get the 5 countries with the least contribution
df_least5 = df_can.tail(5)



# transpose the dataframe
df_least5 = df_least5[years].transpose() 
df_least5.head()



df_least5.index = df_least5.index.map(int) # let's change the index values of df_least5 to type integer for plotting
df_least5.plot(kind='area', alpha=0.45, figsize=(20, 10)) 



plt.title('Immigration Trend of 5 Countries with Least Contribution to Immigration')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')



plt.show()


# artist-layer solution: unstacked area plot with alpha=0.55
# get the 5 countries with the least contribution
df_least5 = df_can.tail(5)



# transpose the dataframe
df_least5 = df_least5[years].transpose() 
df_least5.head()


df_least5.index = df_least5.index.map(int) # let's change the index values of df_least5 to type integer for plotting


ax = df_least5.plot(kind='area', alpha=0.55, stacked=False, figsize=(20, 10))


ax.set_title('Immigration Trend of 5 Countries with Least Contribution to Immigration')
ax.set_ylabel('Number of Immigrants')
ax.set_xlabel('Years')

Question: Use the scripting layer to create a stacked area plot of the 5 countries that contributed the least to immigration to Canada from 1980 to 2013. Use a transparency value of 0.45.

Question: Use the artist layer to create an unstacked area plot of the 5 countries that contributed the least to immigration to Canada from 1980 to 2013. Use a transparency value of 0.55.