• Select Column There are two ways to filter on a column name:


  • Method 1: Quick and easy, but only works if the column name does NOT have spaces or special characters.


  •   
        df.column_name 
            (returns series)
            
    
  • More robust, and can filter on multiple columns.


  •   
        df['column']  
            (returns series)
        df[['column 1', 'column 2']] 
            (returns dataframe)
            
    
  • Example: Let's try filtering on the list of countries ('Country').


  •   
    df_can.Country  # returns a series
     
    
  • Let's try filtering on the list of countries ('OdName') and the data for years: 1980 - 1985.


  •   
    df_can[['Country', 1980, 1981, 1982, 1983, 1984, 1985]] # returns a dataframe
    # notice that 'Country' is string, and the years are integers. 
    # for the sake of consistency, we will convert all column names to string later on
     
    
  • Select Row There are main 3 ways to select rows:


  •   
        df.loc[label]        
            #filters by the labels of the index/column
        df.iloc[index]       
            #filters by the positions of the index/column
             
    
  • Before we proceed, notice that the defaul index of the dataset is a numeric range from 0 to 194. This makes it very difficult to do a query by a specific country. For example to search for data on Japan, we need to know the corressponding index value.


  • This can be fixed very easily by setting the 'Country' column as the index using set_index() method.


  •   
    df_can.set_index('Country', inplace=True)
    # tip: The opposite of set is reset. So to reset the index, we can use df_can.reset_index()
    df_can.head(3)
    
    
    
    # optional: to remove the name of the index
    df_can.index.name = None
     
    
  • Example: Let's view the number of immigrants from Japan (row 87) for the following scenarios: 1. The full row data (all columns) 2. For year 2013 3. For years 1980 to 1985


  •   
    # 1. the full row data (all columns)
    print(df_can.loc['Japan'])
    
    
    # alternate methods
    print(df_can.iloc[87])
    print(df_can[df_can.index == 'Japan'].T.squeeze())
    
    
    # 2. for year 2013
    print(df_can.loc['Japan', 2013])
    
    # alternate method
    print(df_can.iloc[87, 36]) # year 2013 is the last column, with a positional index 
    
    
    # 3. for years 1980 to 1985
    print(df_can.loc['Japan', [1980, 1981, 1982, 1983, 1984, 1984]])
    print(df_can.iloc[87, [3, 4, 5, 6, 7, 8]])
     
    
  • Column names that are integers (such as the years) might introduce some confusion. For example, when we are referencing the year 2013, one might confuse that when the 2013th positional index.


  • To avoid this ambuigity, let's convert the column names into strings: '1980' to '2013'.


  •   
    df_can.columns = list(map(str, df_can.columns))
    [print (type(x)) for x in df_can.columns.values] #uncomment to check type of column headers
     
    
  • Since we converted the years to string, let's declare a variable that will allow us to easily call upon the full range of years:


  •   
    # useful for plotting later on
    years = list(map(str, range(1980, 2014)))
    years
     
    
  • To filter the dataframe based on a condition, we simply pass the condition as a boolean vector.


  • For example, Let's filter the dataframe to show the data on Asian countries (AreaName = Asia).


  •   
    # 1. create the condition boolean series
    condition = df_can['Continent'] == 'Asia'
    print(condition)
    
    
    # 2. pass this condition into the dataFrame
    df_can[condition]
        
    # we can pass mutliple criteria in the same line. 
    # let's filter for AreaNAme = Asia and RegName = Southern Asia
    
    df_can[(df_can['Continent']=='Asia') & (df_can['Region']=='Southern Asia')]
    
    # note: When using 'and' and 'or' operators, pandas requires we use '&' and '|' instead of 'and' and 'or'
    # don't forget to enclose the two conditions in parentheses
     
    
  • Before we proceed: let's review the changes we have made to our dataframe.


  •   
    print('data dimensions:', df_can.shape)
    print(df_can.columns)
    df_can.head(2)
     
    
  • Matplotlib: Standard Python Visualization Library The primary plotting library we will explore in the course is Matplotlib. As mentioned on their website:


  • Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shell, the jupyter notebook, web application servers, and four graphical user interface toolkits.


  • If you are aspiring to create impactful visualization with python, Matplotlib is an essential tool to have at your disposal.


  • Matplotlib.Pyplot One of the core aspects of Matplotlib is matplotlib.pyplot. It is Matplotlib's scripting layer which we studied in details in the videos about Matplotlib. Recall that it is a collection of command style functions that make Matplotlib work like MATLAB.


  • Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.


  • In this lab, we will work with the scripting layer to learn how to generate line plots. In future labs, we will get to work with the Artist layer as well to experiment first hand how it differs from the scripting layer.


  • Let's start by importing Matplotlib and Matplotlib.pyplot as follows:


  •   
    # we are using the inline backend
    %matplotlib inline 
    
    
    import matplotlib as mpl
    import matplotlib.pyplot as plt
     
    
  • *optional: check if Matplotlib is loaded.


  •  
    print ('Matplotlib version: ', mpl.__version__) # >= 2.0.0
    
     
    
  • *optional: apply a style to Matplotlib.


  •  
    print(plt.style.available)
    mpl.style.use(['ggplot']) # optional: for ggplot-like style
     
    
  • Fortunately, pandas has a built-in implementation of Matplotlib that we can use. Plotting in pandas is as simple as appending a .plot() method to a series or dataframe.


  • What is a line plot and why use it? A line chart or line plot is a type of plot which displays information as a series of data points called 'markers' connected by straight line segments. It is a basic type of chart common in many fields. Use line plot when you have a continuous data set. These are best suited for trend-based visualizations of data over a period of time.


  • Let's start with a case study: In 2010, Haiti suffered a catastrophic magnitude 7.0 earthquake. The quake caused widespread devastation and loss of life and aout three million people were affected by this natural disaster. As part of Canada's humanitarian effort, the Government of Canada stepped up its effort in accepting refugees from Haiti. We can quickly visualize this effort using a Line plot:


  • Question: Plot a line graph of immigration from Haiti using df.plot().


  • First,we will extract the data series for Haiti.


  •  
    haiti = df_can.loc['Haiti', years] # passing in years 1980 - 2013 to exclude the 'total' column
    haiti.head()
     
    
  • Next, we will plot a line plot by appending .plot() to the haiti dataframe.


  •  
    haiti.plot()
     
    
  • pandas automatically populated the x-axis with the index values (years), and the y-axis with the column values (population). However, notice how the years were not displayed because they are of type string. Therefore, let's change the type of the index values to integer for plotting.


  • Also, let's label the x and y axis using plt.title(), plt.ylabel(), and plt.xlabel() as follows:


  •  
    haiti.index = haiti.index.map(int) # let's change the index values of Haiti to type integer for plotting
    haiti.plot(kind='line')
    
    
    plt.title('Immigration from Haiti')
    plt.ylabel('Number of immigrants')
    plt.xlabel('Years')
    
    
    plt.show() # need this line to show the updates made to the figure
     
    
  • We can clearly notice how number of immigrants from Haiti spiked up from 2010 as Canada stepped up its efforts to accept refugees from Haiti. Let's annotate this spike in the plot by using the plt.text() method.


  •  
    haiti.plot(kind='line')
    
    
    plt.title('Immigration from Haiti')
    plt.ylabel('Number of Immigrants')
    plt.xlabel('Years')
    
    
    # annotate the 2010 Earthquake. 
    # syntax: plt.text(x, y, label)
    plt.text(2000, 6000, '2010 Earthquake') # see note below
    
    
    plt.show() 
     
    
  • With just a few lines of code, you were able to quickly identify and visualize the spike in immigration!


  • Quick note on x and y values in plt.text(x, y, label):


  • Since the x-axis (years) is type 'integer', we specified x as a year. The y axis (number of immigrants) is type 'integer', so we can just specify the value y = 6000.


  •  
        plt.text(2000, 6000, '2010 Earthquake') # years stored as type int
         
    
  • If the years were stored as type 'string', we would need to specify x as the index position of the year. Eg 20th index is year 2000 since it is the 20th year with a base year of 1980.


  •  
        plt.text(20, 6000, '2010 Earthquake') # years stored as type int
         
    
  • We will cover advanced annotation methods in later modules.


  • We can easily add more countries to line plot to make meaningful comparisons immigration from different countries


  • Step 2: Plot graph. We will explicitly specify line plot by passing in kind parameter to plot().


  •  
    df_CI.plot(kind='line')
     
    
  • Plot from above doesn't look right...


  • Recall that pandas plots the indices on the x-axis and the columns as individual lines on the y-axis.


  • Since df_CI is a dataframe with the country as the index and years as the columns, we must first transpose the dataframe using transpose() method to swap the row and columns.


  •  
    
    df_CI = df_CI.transpose()
    df_CI.head()
     
    
  • pandas will auomatically graph the two countries on the same graph. Go ahead and plot the new transposed dataframe. Make sure to add a title to the plot and label the axes.


  •  
    df_CI.index = df_CI.index.map(int) # let's change the index values of df_CI to type integer for plotting
    df_CI.plot(kind='line')
    plt.title('Immigrants from China and India')
    plt.ylabel('Number of Immigrants')
    plt.xlabel('Years')
    plt.show()
     
    
  • From the above plot, we can observe that the China and India have very similar immigration trends through the years.


  • Note: How come we didn't need to transpose Haiti's dataframe before plotting (like we did for df_CI)?


  • That's because haiti is a series as opposed to a dataframe, and has the years as its indices as shown below.


  •  
    print(type(haiti))
    print(haiti.head(5))
    
     
    
  • Line plot is a handy tool to display several dependent variables against one independent variable. However, it is recommended that no more than 5-10 lines on a single graph; any more than that and it becomes difficult to interpret.


  • Congratulations! you have learned how to wrangle data with python and create a line plot with Matplotlib.


  • There are many other plotting styles available other than the default Line plot, all of which can be accessed by passing kind keyword to plot().


  • The full list of available plots are as follows:


    1. bar for vertical bar plots


    2. barh for horizontal bar plots


    3. hist for histogram


    4. box for boxplot


    5. kde or density for density plots


    6. area for area plots


    7. pie for pie plots


    8. scatter for scatter plots


    9. hexbin for hexbin plot