The Python tutorials are written as Jupyter notebooks and run directly in Google Colab—a hosted notebook environment that requires no setup. Click the Run in Google Colab button.
Pandas is an open-source, BSD-licensed library that provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. You can think of pandas as an extremely powerful version of Excel, with a lot more features.
About IPython Notebooks - IPython notebooks (now known as Jupyter notebooks) are interactive coding environments embedded in a webpage. You will be using them in this class. You only need to write code between the ### START CODE HERE ### and ### END CODE HERE ### comments. After writing your code, run the cell either by pressing Shift+Enter or by clicking "Run Cell" (the play symbol) in the left bar of the cell.
In this notebook you will learn:
Series
DataFrames
Missing Data
GroupBy
Merging, Joining and Concatenating
Operations
Data Input and Output
Importing Pandas
To import pandas (along with NumPy, which is used below to generate sample data), write the following:
import pandas as pd
import numpy as np
DataFrames
DataFrames are the workhorse of pandas and are directly inspired by the R programming language. You can think of a DataFrame as a collection of Series objects that share the same index.
You can also go through this tutorial video: https://www.youtube.com/watch?v=e60ItwlZTKM
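The description of a DataFrame as Series objects sharing one index can be sketched as follows. The names `w`, `x`, and `frame` and the values are made up for illustration and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Two hypothetical Series that carry the same index labels
w = pd.Series([1.0, 2.0, 3.0], index=['A', 'B', 'C'])
x = pd.Series([10.0, 20.0, 30.0], index=['A', 'B', 'C'])

# Putting them side by side yields a DataFrame with one shared index
frame = pd.DataFrame({'W': w, 'X': x})
print(frame)

# Pulling a column back out gives a Series aligned on that same index
print(type(frame['W']))
print(frame.index.equals(w.index))
```

Building a DataFrame from a dict of Series aligns the columns on their index labels, which is why every column of a DataFrame can be looked up by the same row label.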
from numpy.random import randn
np.random.seed(101)
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
df
Selection and Indexing
Let's learn the various methods to grab data from a DataFrame.
df['W']
Pass a list of column names to select several columns at once:
df[['W','Z']]
# Attribute (dot) access also works, but is NOT RECOMMENDED: a column name can clash with a DataFrame method or attribute
df.W
DataFrame Columns are just Series.
type(df['W'])
Creating a new column:
df['new'] = df['W'] + df['Y']
df
Removing Columns:
Note: axis=0 refers to rows and axis=1 refers to columns.
df.drop('new',axis=1)
# Not inplace unless specified!
df
df.drop('new',axis=1,inplace=True)
df
Can also drop rows this way:
df.drop('E',axis=0)
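As a small sketch of the axis convention, `drop` also accepts the keyword arguments `columns=` and `index=`, which some readers find easier to remember than `axis=1` and `axis=0`. The frame `demo` and its values are hypothetical, chosen only for this illustration.

```python
import pandas as pd

# Hypothetical frame with a column named 'new', similar to the one above
demo = pd.DataFrame({'W': [1, 2], 'X': [3, 4], 'new': [5, 6]}, index=['A', 'B'])

without_col = demo.drop(columns='new')  # equivalent to demo.drop('new', axis=1)
without_row = demo.drop(index='B')      # equivalent to demo.drop('B', axis=0)

print(without_col)
print(without_row)

# drop returns a new DataFrame; demo itself still has all rows and columns
print(demo.shape)
```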
Selecting Rows:
df.loc['A']
Or select by integer position instead of by label:
df.iloc[2]
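The difference between label-based and position-based selection is easiest to see when the index labels are themselves integers. The following is a minimal sketch with a made-up frame `demo` whose labels run in reverse order.

```python
import pandas as pd

# Hypothetical frame whose integer labels do not match the row positions
demo = pd.DataFrame({'W': [0.1, 0.2, 0.3]}, index=[2, 1, 0])

by_label = demo.loc[2]      # the row whose index label is 2 (the first row)
by_position = demo.iloc[2]  # the third row by position (its label is 0)

print(by_label['W'])
print(by_position['W'])
```

Here `loc` and `iloc` return different rows for the same number, because one looks up the label and the other counts positions from zero.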
Selecting subset of rows and columns
df.loc['B','Y']
df.loc[['A','B'],['W','Y']]
Exercise 2.1: Write a Pandas program to divide a DataFrame into two parts in a given ratio (here 70% and 30%).
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 2))
print("Original DataFrame:")
print(df)
part_70 = df.sample(frac=0.7,random_state=10)
part_30 = df.drop(part_70.index)
print("\n70% of the said DataFrame:")
print(part_70)
print("\n30% of the said DataFrame:")
print(part_30)
Exercise 2.2: Write a Pandas program to count the number of columns of a DataFrame.
import pandas as pd
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11]}
df = pd.DataFrame(data=d)
print("Original DataFrame")
print(df)
print("\nNumber of columns:")
print(len(df.columns))
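Conditional Selection
An important feature of pandas is conditional selection using bracket notation, very similar to NumPy:
# The exercises above reassigned df, so rebuild the original 5x4 DataFrame first
np.random.seed(101)
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
df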
df>0
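Here you can see that the comparison returns boolean values instead of the actual numbers. To get the numbers back, pass this boolean result into df using bracket notation. With a boolean DataFrame, entries where the condition is False become NaN; with a boolean Series built from one column, only the rows where the condition is True are kept.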
df[df>0]
df[df['W']>0]
df[df['W']>0]['Y']
df[df['W']>0][['Y','X']]
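The chained form above filters rows and then picks columns in two steps. A sketch of the same selection in a single step uses `.loc` with a boolean mask; the frame `demo`, the name `mask`, and the values are assumptions made for this example.

```python
import pandas as pd

# Hypothetical frame with one negative value in column W
demo = pd.DataFrame(
    {'W': [1.0, -1.0, 2.0], 'X': [5.0, 6.0, 7.0], 'Y': [0.5, 0.6, 0.7]},
    index=['A', 'B', 'C'],
)

mask = demo['W'] > 0                    # boolean Series, one entry per row

chained = demo[mask][['Y', 'X']]        # filter rows, then select columns
single = demo.loc[mask, ['Y', 'X']]     # rows and columns in one call

print(single)

# A boolean DataFrame keeps the shape and fills non-matching cells with NaN
masked = demo[demo > 0]
print(masked)
```

Both forms return the same rows and columns here. The single `.loc` call is usually preferred when you later want to assign to the selection, since assignment through a chained selection may not modify the original frame.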
For two conditions, use | (or) and & (and), wrapping each condition in parentheses. Python's own and / or keywords do not work on Series:
df[(df['W']>0) & (df['Y'] > 1)]
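A short sketch of both operators, and of what happens when the `and` keyword is used instead. The frame `demo` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical data: four rows covering each combination of the two conditions
demo = pd.DataFrame({'W': [1.0, -1.0, 2.0, -3.0], 'Y': [2.0, 3.0, 0.5, -1.0]})

both = demo[(demo['W'] > 0) & (demo['Y'] > 1)]    # rows meeting both conditions
either = demo[(demo['W'] > 0) | (demo['Y'] > 1)]  # rows meeting at least one

print(both)
print(either)

# The keyword `and` asks for a single True/False from a whole Series,
# which pandas refuses to guess, so it raises ValueError
raised = False
try:
    demo[(demo['W'] > 0) and (demo['Y'] > 1)]
except ValueError:
    raised = True
print(raised)
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators such as `>`, so without them the expression is grouped differently from what was intended.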
More Index Details
Let's discuss some more features of indexing, including resetting the index or setting it to something else. We'll also talk about index hierarchy!
df
# Reset to default 0,1...n index
df.reset_index()
newind = 'CA NY WY OR CO'.split()
df['States'] = newind
df
df.set_index('States')
df
df.set_index('States',inplace=True)
df
Multi-Index and Index Hierarchy
Let us go over how to work with a Multi-Index. First, we'll create a quick example of what a Multi-Indexed DataFrame looks like:
# Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)
hier_index
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df
Now let's show how to index this! For an index hierarchy on the rows we use df.loc[]; if the hierarchy were on the columns axis, you would just use normal bracket notation df[]. Calling one level of the index returns the sub-DataFrame:
df.loc['G1']
df.loc['G1'].loc[1]
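For comparison, here is a minimal sketch of the column-axis case mentioned above, where the hierarchy sits on the columns and plain bracket notation is used. The frame `wide`, the name `col_index`, and the values are invented for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical two-level column index: outer level G1/G2, inner level 1/2
col_index = pd.MultiIndex.from_tuples([('G1', 1), ('G1', 2), ('G2', 1), ('G2', 2)])
wide = pd.DataFrame(np.arange(8).reshape(2, 4), index=['A', 'B'], columns=col_index)

g1 = wide['G1']          # sub-DataFrame holding the inner columns 1 and 2
g1_first = wide['G1'][1] # a single column, returned as a Series

print(g1)
print(g1_first)
```

This mirrors `df.loc['G1']` and `df.loc['G1'].loc[1]` on the row axis: the first selection strips the outer level, and the second picks one entry from the inner level.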
df.index.names
df.index.names = ['Group','Num']
df
df.xs('G1')
# Pass a tuple to select across several levels; recent pandas versions reject a list here
df.xs(('G1',1))
df.xs(1,level='Num')
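The three `xs` calls above can be compared side by side in a small sketch. The frame `demo` and its values are hypothetical; only the level names 'Group' and 'Num' are taken from the notebook.

```python
import pandas as pd

# Hypothetical two-level row index using the same level names as above
idx = pd.MultiIndex.from_tuples(
    [('G1', 1), ('G1', 2), ('G2', 1), ('G2', 2)], names=['Group', 'Num']
)
demo = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]}, index=idx)

outer = demo.xs('G1')            # all rows of group G1, like demo.loc['G1']
one_row = demo.xs(('G1', 1))     # a single row, returned as a Series
inner = demo.xs(1, level='Num')  # rows with Num == 1 from every group

print(outer)
print(one_row)
print(inner)
```

The last call is where `xs` is most convenient: it selects on an inner level directly, which plain `df.loc[]` with a single label cannot do.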