Introduction
The aim of these labs is to introduce you to data visualization with Python as concretely and consistently as possible.
Speaking of consistency: because there is no single best data visualization library for Python, we introduce several libraries and highlight their strengths as we discuss new visualization concepts.
In doing so, we hope to make you well-rounded in visualization libraries and concepts, so that you can judge and decide on the best visualization technique and tool for a given problem and audience.
Please make sure that you have completed the prerequisites for this course, namely Python for Data Science and Data Analysis with Python, which are part of this specialization.
Note: The majority of the plots and visualizations will be generated using data stored in pandas dataframes. Therefore, in this lab, we provide a brief crash course on pandas.
However, if you are interested in learning more about the pandas library, detailed description and explanation of how to use it and how to clean, munge, and process data stored in a pandas dataframe are provided in our course Data Analysis with Python, which is also part of this specialization.
pandas is an essential data analysis toolkit for Python.
From their website: pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.
It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.
The course heavily relies on pandas for data wrangling, analysis, and visualization.
We encourage you to spend some time and familiarize yourself with the pandas API Reference: http://pandas.pydata.org/pandas-docs/stable/api.html.
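Before diving into the immigration data, here is a minimal sketch of the core pandas idea: a DataFrame is a table with labeled columns and a labeled row index. The country names and counts below are hypothetical, chosen only to mimic the shape of the dataset we are about to load.

```python
import pandas as pd

# Build a tiny dataframe with labeled columns (hypothetical values).
df = pd.DataFrame(
    {'Country': ['Albania', 'Algeria'],
     '1980': [1450, 4556],
     '1981': [1223, 3857]}
)

# Use the country names as the row index (labels).
df = df.set_index('Country')

# Label-based lookup: row 'Albania', column '1980'.
print(df.loc['Albania', '1980'])  # 1450

# Column-wise aggregation.
print(df['1980'].sum())  # 6006
```

The labeled index is what lets us later look up a country by name instead of by row position.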
Dataset Source: International migration flows to and from selected countries - The 2015 revision.
The dataset contains annual data on the flows of international immigrants as recorded by the countries of destination.
The data presents both inflows and outflows according to place of birth, citizenship, or place of previous/next residence, both for foreigners and nationals.
The current version presents data pertaining to 45 countries.
In this lab, we will focus on the Canadian immigration data.
The first thing we'll do is import two key data analysis modules: pandas and NumPy.
import numpy as np  # useful for scientific computing in Python
import pandas as pd  # primary data structure library
Let's download and import our primary Canadian Immigration dataset using pandas read_excel() method.
Normally, before we can do that, we would need to install a module that pandas requires in order to read Excel files. This module is xlrd.
You would need to run the following line of code to install the xlrd module:
!conda install -c anaconda xlrd --yes
Now we are ready to read in our data.
df_can = pd.read_excel('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DV0101EN/labs/Data_Files/Canada.xlsx',
sheet_name='Canada by Citizenship',
skiprows=range(20),
skipfooter=2)
print('Data read into a pandas dataframe!')
Let's view the top 5 rows of the dataset using the head() function.
df_can.head()
# tip: You can specify the number of rows you'd like to see as follows: df_can.head(10)
We can also view the bottom 5 rows of the dataset using the tail() function.
df_can.tail()
When analyzing a dataset, it's always a good idea to start by getting basic information about your dataframe. We can do this by using the info() method.
df_can.info()
To get the list of column headers, we can access the dataframe's .columns attribute.
df_can.columns.values
Similarly, to get the list of indices we use the .index attribute.
df_can.index.values
Note: The default type of index and columns is NOT list.
print(type(df_can.columns))
print(type(df_can.index))
To get the index and columns as lists, we can use the tolist() method.
df_can.columns.tolist()
df_can.index.tolist()
print(type(df_can.columns.tolist()))
print(type(df_can.index.tolist()))
To view the dimensions of the dataframe, we use the .shape attribute.
# size of dataframe (rows, columns)
df_can.shape
Note: The main types stored in pandas objects are float, int, bool, datetime64[ns] and datetime64[ns, tz] (in >= 0.17.0), timedelta[ns], category (in >= 0.15.0), and object (string). In addition, these dtypes have item sizes, e.g. int64 and int32.
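To see these dtypes in action, here is a small sketch on a hypothetical dataframe: each column carries its own dtype, which you can inspect with the .dtypes attribute.

```python
import pandas as pd

# Hypothetical mini-frame with one column of each common dtype.
df = pd.DataFrame(
    {'a': [1, 2],          # int64
     'b': [1.5, 2.5],      # float64
     'c': ['x', 'y'],      # object (string)
     'd': [True, False]}   # bool
)

# One dtype per column.
print(df.dtypes)
```

Checking dtypes early helps catch columns that were read in as strings when you expected numbers.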
Let's clean the data set to remove a few unnecessary columns. We can use pandas drop() method as follows:
# in pandas axis=0 represents rows (default) and axis=1 represents columns.
df_can.drop(['AREA','REG','DEV','Type','Coverage'], axis=1, inplace=True)
df_can.head(2)
Let's rename the columns so that they make sense. We can use rename() method by passing in a dictionary of old and new names as follows:
df_can.rename(columns={'OdName':'Country', 'AreaName':'Continent', 'RegName':'Region'}, inplace=True)
df_can.columns
We will also add a 'Total' column that sums up the total immigrants by country over the entire period 1980 - 2013, as follows:
# sum across the year columns; numeric_only skips the string columns
# (Country, Continent, Region), which recent pandas versions would otherwise reject
df_can['Total'] = df_can.sum(axis=1, numeric_only=True)
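As a sanity check, the same row-wise total can be sketched on a hypothetical mini-frame that mimics the immigration layout (a string column followed by per-year counts). The numbers here are invented for illustration only.

```python
import pandas as pd

# Hypothetical mini-frame: one string column plus per-year counts.
df = pd.DataFrame(
    {'Country': ['Albania', 'Algeria'],
     '1980': [1450, 4556],
     '1981': [1223, 3857]}
)

# axis=1 sums across each row; numeric_only=True skips the string column.
df['Total'] = df.sum(axis=1, numeric_only=True)

print(df['Total'].tolist())  # [2673, 8413]
```

Each 'Total' entry is just the sum of that country's yearly counts.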
We can check to see how many null objects we have in the dataset as follows:
df_can.isnull().sum()
Finally, let's view a quick summary of each column in our dataframe using the describe() method.
df_can.describe()
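By default, describe() summarizes only the numeric columns. Passing include='all' extends the summary to non-numeric columns as well (adding rows such as count, unique, top, and freq). A minimal sketch on a hypothetical two-column frame:

```python
import pandas as pd

# Hypothetical mini-frame with one string and one numeric column.
df = pd.DataFrame(
    {'Country': ['Albania', 'Algeria'],
     '1980': [1450, 4556]}
)

print(df.describe())               # numeric columns only
print(df.describe(include='all'))  # all columns, including 'Country'
```

On the real dataset, include='all' is a quick way to also see how many distinct continents or regions appear.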