Python Introduction to Pandas - UPSCFEVER

The Python tutorials are written as Jupyter notebooks and run directly in Google Colab—a hosted notebook environment that requires no setup. Click the Run in Google Colab button.

Pandas is an open-source, BSD-licensed Python library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. You can think of pandas as an extremely powerful version of Excel, with a lot more features.

About IPython Notebooks: IPython notebooks are interactive coding environments embedded in a webpage. You will be using IPython notebooks in this class. You only need to write code between the ### START CODE HERE ### and ### END CODE HERE ### comments. After writing your code, you can run the cell either by pressing "SHIFT"+"ENTER" or by clicking "Run Cell" (denoted by a play symbol) in the left bar of the cell.

In this notebook you will learn:

Series

DataFrames

Missing Data

GroupBy

Merging, Joining and Concatenating

Operations

Data Input and Output

Importing Pandas

To import pandas, simply write the following:

 
import pandas as pd
import numpy as np

DataFrames

DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index.

You can also go through the tutorial video: https://www.youtube.com/watch?v=e60ItwlZTKM
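To make the "Series sharing one index" idea concrete, here is a minimal sketch. The names `heights`, `weights` and `people`, and the values in them, are made up for illustration and are not part of the tutorial's data.

```python
import pandas as pd

# Two hypothetical Series whose labels only partly overlap.
heights = pd.Series([1.70, 1.82, 1.65], index=['ann', 'bob', 'cy'])
weights = pd.Series([68.0, 59.5], index=['bob', 'cy'])

# A dict of Series becomes a DataFrame whose rows are the union of the labels.
people = pd.DataFrame({'height': heights, 'weight': weights})

print(people)
print(type(people['height']))          # each column is itself a Series
print(people['weight'].isna().sum())   # 'ann' has no weight, so that cell is NaN
```

Each column keeps the shared row labels, which is why a label with no value in one Series shows up as missing data rather than shifting the other rows.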

 
from numpy.random import randn
np.random.seed(101)

df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())

df

Selection and Indexing

Let's learn the various methods to grab data from a DataFrame.

 
df['W']

Pass a list of column names to select several columns at once:

 
df[['W','Z']]

# Attribute (dot) access (NOT RECOMMENDED!)
df.W
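One likely reason dot access is discouraged is that it breaks for some column names. The sketch below uses a made-up frame called `demo` whose columns are chosen to trigger the problem: one clashes with a DataFrame method and one contains a space.

```python
import pandas as pd

# Hypothetical frame: 'count' is also a DataFrame method, 'unit price' has a space.
demo = pd.DataFrame({'count': [3, 5], 'unit price': [1.5, 2.0]})

# Bracket access always returns the column.
print(demo['count'].tolist())
print(demo['unit price'].tolist())

# Dot access resolves to the DataFrame.count method, not the column.
print(callable(demo.count))
```

A name like `unit price` cannot be written with a dot at all, so bracket notation is the form that works for every column.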

DataFrame Columns are just Series.

 
type(df['W'])

Creating a new column:

 

df['new'] = df['W'] + df['Y']

df

Removing Columns:

Note: axis=0 refers to rows and axis=1 refers to columns.

 
df.drop('new',axis=1)

# Not inplace unless specified!
df

df.drop('new',axis=1,inplace=True)

df

Can also drop rows this way:

 
df.drop('E',axis=0)
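The following sketch restates the axis rule on a small made-up frame (`demo` is illustrative only). It also shows the `columns=` and `index=` keywords of `drop`, which some readers find easier to remember than axis numbers.

```python
import pandas as pd

demo = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['r1', 'r2'])

without_col = demo.drop('b', axis=1)    # equivalent to demo.drop(columns='b')
without_row = demo.drop('r1', axis=0)   # equivalent to demo.drop(index='r1')

print(demo.shape)                  # the original is unchanged: drop returned new frames
print(list(without_col.columns))
print(list(without_row.index))
```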

Selecting Rows:

 

df.loc['A']

Or select by integer position instead of by label:

 
df.iloc[2]

Selecting subset of rows and columns

 

df.loc['B','Y']

df.loc[['A','B'],['W','Y']]
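As a quick side-by-side of label-based and position-based selection, here is a sketch on a small made-up frame (`demo` and its values are assumptions for illustration).

```python
import pandas as pd

demo = pd.DataFrame({'W': [10, 20, 30], 'Y': [1, 2, 3]}, index=['A', 'B', 'C'])

by_label = demo.loc['B', 'Y']      # row label 'B', column label 'Y'
by_position = demo.iloc[1, 1]      # second row, second column
print(by_label, by_position)

# Label slices include the end label; positional slices exclude the end position.
print(len(demo.loc['A':'B']))
print(len(demo.iloc[0:1]))
```

The slicing difference is a common source of off-by-one surprises when switching between `loc` and `iloc`.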

Exercise 2.1: Write a Pandas program to divide a DataFrame into two parts in a given ratio (here 70% and 30%).

 
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 2))
print("Original DataFrame:")
print(df)
part_70 = df.sample(frac=0.7,random_state=10)
part_30 = df.drop(part_70.index)
print("\n70% of the said DataFrame:")
print(part_70)
print("\n30% of the said DataFrame:")
print(part_30)

Exercise 2.2: Write a Pandas program to count number of columns of a DataFrame.

 
import pandas as pd
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11]}
df = pd.DataFrame(data=d)
print("Original DataFrame")
print(df)
print("\nNumber of columns:")
print(len(df.columns))
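An alternative way to count columns is the `shape` attribute, which reports rows and columns together. A small sketch, using a made-up frame named `demo`:

```python
import pandas as pd

demo = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

# shape is a (number_of_rows, number_of_columns) tuple.
n_rows, n_cols = demo.shape
print(n_rows, n_cols)
```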

Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to NumPy:

 
df

df>0

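Here you can see that the comparison gives boolean values instead of the actual numbers. Passing that boolean DataFrame back inside the brackets returns the original values where the condition is True and NaN everywhere else. Passing a boolean Series built from a single column, such as df['W']>0, keeps only the rows where the condition is True.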

 
df[df>0]

df[df['W']>0]

df[df['W']>0]['Y']

df[df['W']>0][['Y','X']]
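The chained form above filters rows first and then picks columns. A single `.loc` call can do both steps at once, which is often suggested as the safer habit, particularly when you later want to assign to the selection. The frame `demo` below is a made-up example.

```python
import pandas as pd

demo = pd.DataFrame({'W': [1.5, -0.3, 0.8], 'X': [4, 5, 6], 'Y': [7, 8, 9]},
                    index=['A', 'B', 'C'])

mask = demo['W'] > 0                  # boolean Series, one value per row
chained = demo[mask][['Y', 'X']]      # filter rows, then select columns
single = demo.loc[mask, ['Y', 'X']]   # rows and columns in one indexing call

print(mask.tolist())
print(chained.equals(single))
```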

For two conditions you can use | (or) and & (and), wrapping each condition in parentheses:

df[(df['W']>0) & (df['Y'] > 1)]
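The sketch below rounds this out with `|` and `~` (negation) and shows why the Python keyword `and` cannot be used here. The frame `demo` and its values are made up for illustration.

```python
import pandas as pd

demo = pd.DataFrame({'W': [1.5, -0.3, 0.8], 'Y': [0.2, 0.5, 1.4]},
                    index=['A', 'B', 'C'])

either = demo[(demo['W'] > 0) | (demo['Y'] > 1)]      # OR: at least one condition holds
neither = demo[~((demo['W'] > 0) | (demo['Y'] > 1))]  # NOT: negate the combined mask

# Python's `and` needs a single truth value, which a whole Series cannot provide.
try:
    (demo['W'] > 0) and (demo['Y'] > 1)
    ambiguous = False
except ValueError:
    ambiguous = True

print(list(either.index), list(neither.index), ambiguous)
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators such as `>`.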

More Index Details

Let's discuss some more features of indexing, including resetting the index or setting it to something else. We'll also talk about index hierarchy!

 
df

# Reset to default 0,1...n index
df.reset_index()

newind = 'CA NY WY OR CO'.split()

df['States'] = newind

df

df.set_index('States')

df

df.set_index('States',inplace=True)

df
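To summarize how `set_index` and `reset_index` behave, here is a compact sketch on a made-up frame (`demo`, `by_state`, `restored` and `dropped` are illustrative names).

```python
import pandas as pd

demo = pd.DataFrame({'W': [1, 2, 3]}, index=['A', 'B', 'C'])
demo['States'] = ['CA', 'NY', 'WY']

by_state = demo.set_index('States')         # new frame; the old 'A','B','C' index is discarded
restored = by_state.reset_index()           # 'States' becomes a column again, index is 0..n-1
dropped = by_state.reset_index(drop=True)   # discard the index instead of keeping it as a column

print(list(demo.index))        # unchanged, because inplace was not requested
print(list(by_state.index))
print(list(restored.columns))
print(list(dropped.columns))
```

Note that `set_index` replaces the existing index rather than keeping it, so call `reset_index()` first if the old labels are still needed.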

Multi-Index and Index Hierarchy

Let us go over how to work with a Multi-Index. First we'll create a quick example of what a Multi-Indexed DataFrame looks like:

 
# Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)

hier_index

df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df

Now let's show how to index this! For an index hierarchy on the rows we use df.loc[]; if the hierarchy were on the columns axis, you would just use normal bracket notation, df[]. Selecting one level of the index returns the sub-DataFrame:

 
df.loc['G1']

df.loc['G1'].loc[1]

df.index.names

df.index.names = ['Group','Num']

df

df.xs('G1')

# Pass a tuple to select across levels; newer pandas versions no longer accept a list key here
df.xs(('G1',1))

df.xs(1,level='Num')
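The sketch below gathers the common Multi-Index selections in one place. It builds its own small frame with `pd.MultiIndex.from_product` and fixed values (an assumption for illustration, so the results are predictable) rather than reusing the random data above.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]], names=['Group', 'Num'])
demo = pd.DataFrame(np.arange(12).reshape(6, 2), index=index, columns=['A', 'B'])

row = demo.loc[('G1', 2)]          # one row addressed by a (Group, Num) tuple
same_row = demo.xs(('G1', 2))      # xs accepts the same tuple key
inner = demo.xs(2, level='Num')    # every group's row where Num == 2

print(row.tolist())
print(list(inner.index))
print(inner['A'].tolist())
```

`xs` with the `level` argument is the convenient choice when the value you want lives on an inner level, since `loc` would need the outer level spelled out as well.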