Pandas Module in Python

Create and manipulate Dataframes easily!

What can the Pandas module do?

The pandas module is a massive Python library for creating and manipulating dataframes (tables). It is an essential module to learn for entering the data science field, thanks to its ability to handle millions of cells quickly and efficiently.


Installing the required modules

We need both pandas and numpy to use the methods inside the pandas module. You can run these two commands in the command line (CMD) to install them:

pip install pandas
pip install numpy

Importing the required modules

import pandas as pd
import numpy as np

Create a series in Pandas

A series is a one-dimensional array with indexed elements. The indexes are consecutive numbers by default, or they can be specified using a parallel list.

# prepare some data 
labels = ['a','b','c','d']
my_list = [100,200,300,400]

# create a series (indexed vector) from the list
pd.Series(my_list) 
# 0    100
# 1    200
# 2    300
# 3    400

# create a series with data and indexes respectively
pd.Series(my_list, labels)  
# a    100
# b    200
# c    300
# d    400

# use a dictionary for both indexes and data (keys become indexes, values become data)
dic = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
pd.Series(dic) 
# a    1
# b    2
# c    3
# d    4

# use a numpy array just like a list
arr = np.array([1, 2, 3])
pd.Series(arr) 
# 0    1
# 1    2
# 2    3

# change the name of the series 
pd.Series(dic, name="my list")

Processing the Pandas series

# prepare some data 
ser1 = pd.Series(data=[1,2,3], index=['a','b','c'])
ser2 = pd.Series(data=[1,2,3,4], index=['b','c','a','d'])
# we can perform mathematical operations on the data, 
# but it will gather them by their corresponding (index) attribute, NOT their order

# add the (a)s together, the (b)s, and the (c)s, 
# but the (d) will be assigned to (NaN) because it has no identical pair in the other series
# you can perform any other math operation 
ser1 + ser2
# a    4.0
# b    3.0
# c    5.0
# d    NaN

# --- info about a series ---
# return the values stored in the series (as a numpy array)
ser1.values 
# return the index value of the series 
ser1.index 
# return the data type of items in the series 
ser1.dtype
# return the name of the series
ser1.name

Creating a Pandas dataframe

A dataframe is a two-dimensional array with indexed (labeled) elements for both rows and columns.

Using only a dictionary

We use a dictionary where the keys represent the columns' labels and the values hold the data of each column. Row labels are assigned by default to (0, 1, 2, …), but column labels must be given.

# preparing data 
data = { 'x': [1, 2, 3], 'y': [ 2, 4, 8 ], 'z': 100 }

# create a dataframe
pd.DataFrame(data)
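
Note that the scalar (z) is broadcast down its whole column. The result looks like this:

#    x  y    z
# 0  1  2  100
# 1  2  4  100
# 2  3  8  100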

Using a dictionary and lists

We can specify labels for rows and columns using lists.

# preparing data 
data = { 'x': [1, 2, 3], 'y': [ 2, 4, 8 ], 'z': 100 }
rows_labels = [100, 200, 300]
columns_labels = ['z', 'y', 'x']

# create a dataframe
pd.DataFrame(data, index=rows_labels, columns=columns_labels)
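
The columns now follow the order given in (columns_labels), and the rows take the custom labels:

#        z  y  x
# 100  100  2  1
# 200  100  4  2
# 300  100  8  3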

Using a numpy array and lists

We can use a 2-dimensional numpy array to fill the data.

# preparing data 
data = np.array([["China",1393000000,"Beijing"], 
            ["United States",328200000,"Washington"],
            ["Russia",144500000,"Moscow"],
            ["Japan",126500000,"Tokyo"],
            ["India",1353000000,"New Delhi"],
            ["Germany",83020000,"Berlin"]])

rows_labels = ['1','2','3','4','5','6']
columns_labels = ['a','b','c']

# create a dataframe
df = pd.DataFrame(data, index=rows_labels, columns=columns_labels)

Extract data from a dataframe

Prepare the dataframe:

# preparing data 
myData = { 
 'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
 'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai','Manchester', 'Cairo', 'Osaka'],
 'age': [41, 28, 33, 34, 38, 31, 37],
 'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]  }  

rowLabels = [101, 102, 103, 104, 105, 106, 107]  

# create a dataframe
myArray = pd.DataFrame(data=myData, index=rowLabels)
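
For reference, the dataframe that the following examples slice into looks like this:

#        name         city  age  py-score
# 101  Xavier  Mexico City   41      88.0
# 102     Ann      Toronto   28      79.0
# 103    Jana       Prague   33      81.0
# 104      Yi     Shanghai   34      80.0
# 105   Robin   Manchester   38      68.0
# 106    Amal        Cairo   31      61.0
# 107    Nori        Osaka   37      84.0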

Extracting labels

# return an Index object with the rows' labels
myArray.index
# return an Index object with the columns' labels
myArray.columns

Extracting cells' values

# return a numpy array with the cells' values (an attribute, not a method)
myArray.values

Extracting rows

# return a sub dataframe with the first two rows
myArray.head(n=2) 
# return a sub dataframe with the last two rows
myArray.tail(n=2)

Extracting columns

# return a series with the values of that column
myCities = myArray["city"]

# another method to get a column's values
myCities = myArray.city

# return a dataframe with the specified columns
myArray[["city", "age"]] 

# reach the corresponding cell
myCities[102]

Extracting by (loc)

(loc) is used to extract slices of data by providing the labels of rows and columns. Its borders are both inclusive, which means the range [5, 10] covers the numbers from 5 up to and including 10.

loc['IndexLabel', 'ColumnName']

myCities = myArray["city"]

# return the value labeled by (103) from the series
myCities.loc[103]

# return rows from 101 to 105
myArray.loc[101:105] 
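
Both borders are inclusive, so all five rows labeled 101 through 105 come back:

#        name         city  age  py-score
# 101  Xavier  Mexico City   41      88.0
# 102     Ann      Toronto   28      79.0
# 103    Jana       Prague   33      81.0
# 104      Yi     Shanghai   34      80.0
# 105   Robin   Manchester   38      68.0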

# return rows from the beginning to 105 (inclusive)
myArray.loc[:105]

# return rows from 104 to the end
myArray.loc[104:]

# return all rows
myArray.loc[:]

# return all rows, but only the columns (name) and (city)
myArray.loc[:, ['name', 'city']] 

# return rows where the corresponding attribute (age) is less than 35
myArray.loc[myArray["age"] < 35]

# return rows whose (city) value is in the list ["Cairo", "Osaka"]
myArray.loc[myArray.city.isin(["Cairo", "Osaka"])] 

# return rows whose (city) value is in 
# the list ["Cairo", "Osaka"], for just the columns (city, age)
myArray.loc[myArray.city.isin(["Cairo", "Osaka"]), ["city", "age"]] 

# return rows whose index is within the specified range
myArray.loc[myArray.index.isin(range(102, 107))]

Extracting by (iloc)

(iloc) is used to extract slices of data by providing the integer positions of rows and columns. Its lower border is inclusive but its upper border is exclusive, which means the range [5, 10] covers the numbers from 5 to 9.

# return the third row
myArray.iloc[2]

# return the third cell of that column (city)
myArray["city"].iloc[2]

# return the rows (2, 3, 4)
myArray.iloc[2:5]
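
Note the exclusive upper border: the rows at positions 2, 3, and 4 come back, but not position 5:

#       name        city  age  py-score
# 103   Jana      Prague   33      81.0
# 104     Yi    Shanghai   34      80.0
# 105  Robin  Manchester   38      68.0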

# return all rows
myArray.iloc[:]

# return rows from the beginning to the row 4
myArray.iloc[:5]

# return rows from 2 to the end
myArray.iloc[2:] 

# return the rows at positions 1 through 4 and the columns at positions (0) and (1)
myArray.iloc[1:5, [0, 1]]

# return rows 1 through 6 with a step of 2 between each two, and the column indexed by (0)
myArray.iloc[1:6:2, 0]

# rows 1 through 6 with a step of 2
myArray.iloc[slice(1, 6, 2), 0]

# rows 1 through 6 with a step of 2
myArray.iloc[pd.IndexSlice[1:6:2], 0]

Extracting a cell by (at & iat)

We use (at) to get a cell by its labels, and we use (iat) to get a cell by its integer positions.

# return the cell at the row labeled by (102) and the column labeled by (name) 
myArray.at[102, 'name']

# return the cell at the row indexed by (2) and the column indexed by (0)
myArray.iat[2, 0]
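
With the prepared data, for example:

myArray.at[102, 'name']
# 'Ann'
myArray.iat[2, 0]
# 'Jana'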

Modify the data in the dataframe

Modifying a single cell

# change the value at the cell in row position (3) and column position (2) to 49
myArray.iat[3, 2] = 49

Modifying columns

# modify the cells under (py-score) corresponding to the rows 
# from 101 to 104 to be [40, 50, 60, 70] 
myArray.loc[:104, 'py-score'] = [40, 50, 60, 70]

# modify cells under (py-score) and rows from 105 to the end to be zeros
myArray.loc[105:, 'py-score'] = 0

# modify all rows (:) of the last column (-1) with the entered array
myArray.iloc[:, -1] = np.array([88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0])

Adding rows

# creating the row as a pandas series
john = pd.Series(data=['John', 'Boston', 34, 79], index=myArray.columns, name=108)

# append the series (john) to the dataframe (myArray) as a row labeled (108) 
myArray = myArray.append(john)
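
Note that (DataFrame.append) was deprecated and then removed in pandas 2.0. On recent versions, the same row can be added with (pd.concat):

# equivalent on pandas >= 2.0: turn the series into a one-row 
# dataframe and concatenate it below the original
myArray = pd.concat([myArray, john.to_frame().T])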

Adding columns

# prepare the data 
column_data = np.array([71.0, 95.0, 88.0, 79.0, 91.0, 91.0, 80.0]) 

# add a new column labeled by (js-score) with the given values
myArray['js-score'] = column_data

# insert a new column at index (3), labeled by (django-score), and shift the other columns
myArray.insert(loc=3, column='django-score', value=column_data)

Deleting rows

# return a new dataframe without the row labeled by (108)
myArray.drop(labels=[108]) 

# (axis=0) means that we want to delete a row
myArray = myArray.drop(labels=[108], axis=0)

Deleting columns

# delete the entire column labeled by (django-score)
del myArray["django-score"] 

# delete the column (age) 
# (axis=1) means that we want to delete a column
myArray = myArray.drop(labels="age", axis=1)

Applying arithmetic operations

# add each value from (py-score) to its corresponding value from (js-score) 
# then return a series of the resultant cells 
myArray['py-score'] + myArray['js-score'] 
# (that also works for every other operation: - , * , ** , / )

# divide each element of the column by 100
myArray['py-score'] / 100 
# note: the attribute syntax (myArray.column) only works for column names 
# that are valid Python identifiers, so hyphenated names like (py-score) 
# must use the bracket syntax above

Sorting data

# sort the rows in a descending order according to the values of the column (js-score)
myArray.sort_values(by='js-score', ascending=False)

# sort the rows in an ascending order according to the values of the column (js-score)
myArray.sort_values(by='js-score', ascending=True)

Data Filtration

# return a series of boolean values representing the results of the condition 
# (is the value larger than or equal to 80?)
myFilter = myArray['django-score'] >= 80

# apply the filter on the Dataframe, returning only the rows whose 
# corresponding elements in (myFilter) have the value (True) 
myArray[myFilter] 

# --- only for numeric dataframes (full of numbers) ---

# return new dataframe with only values that satisfy the condition
# and replace others with (NaN)
myArray[myArray > 50] 

# return rows where values of the column (u) satisfy the condition 
myArray[myArray['u'] > 50] 

# we can cascade multiple conditions using AND ( & ) , OR ( | )
myArray[(myArray['z'] > 50) | (myArray['w'] > 50)]

Manipulating the dataframe

Manipulating the whole DF

# turn the data from the table into a numpy array, 
# BUT it will only take the data without the rows' and columns' labels
myArray.to_numpy()

# convert the whole column’s cells into strings 
myArray.age.astype("str") 

# replace each (city) cell that has the value (Cairo) with (US)
myArray.city.replace("Cairo", "US") 

# the (apply) method works much like the built-in (map) function: 
# it applies a specified function to every element of the series
myArray.age.apply(lambda x: x + 5) 
# we can pass the whole table row by row to the (apply) function 
# then get the cell values within the lambda
myArray.apply(lambda row: row.age + 5, axis='columns')

Renaming labels, indexes, and axes

# rename the column (city) to be (country), and the column (name) to be (person)
myArray.rename(columns={"city": "country", "name": "person"})

# rename the index (0) to be (firstEntry), and the index (1) to be (secondEntry)
myArray.rename(index={0: 'firstEntry', 1: 'secondEntry'}) 

# rename the whole y-axis (that holds the indexes) to be (samples), 
# and the whole x-axis (that holds the columns’ labels) to (info)
myArray.rename_axis("samples", axis='rows').rename_axis("info", axis='columns')

Joining Dataframes

Let's say we have two dataframes, DF1 and DF2:
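
As a minimal sketch, assume two hypothetical dataframes that share the key columns (title) and (col3), plus an overlapping (sales) column that the join suffixes below will disambiguate:

# hypothetical sample data for the joining examples
DF1 = pd.DataFrame({'title': ['A', 'B'], 'col3': [1, 2], 'sales': [100, 200]})
DF2 = pd.DataFrame({'title': ['A', 'B'], 'col3': [1, 2], 'sales': [150, 250]})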

# using concat
# combine the two Dataframes by stacking them on top of each other 
pd.concat([DF1, DF2]) 

# using join
# set the common indexes for both dataframes 
left = DF1.set_index(['title', 'col3']) 
right = DF2.set_index(['title', 'col3'])
# join them and specify the suffix for the left and right 
# sides of the combined dataframe as strings
left.join(right, lsuffix='_CANADA', rsuffix='_UK')

Dealing with NaN (null) values

If we have some missing data, we can represent it with the element (np.nan), which appears in the table as (NaN):

newData = pd.DataFrame({'x': [1, 2, np.nan, 4]})

Check for NaNs

# return a Boolean dataframe that indicates whether each value is NaN or not
newData.isnull()

Investigate with skipping NaNs

# calculate the mean, skipping the missing values (the default)
newData.mean() 
# with (skipna=False), any column that contains a missing value yields (NaN)
newData.mean(skipna=False)

Filling the NaNs

# fill the missing data with a specific value (0.0)
newData.fillna(value=0.0 ) 
# fill each missing value with the previous (preceding) one  
newData.fillna(method='ffill') 
# fill each missing value with the next (following) one 
newData.fillna(method='bfill') 
# fill the missing data with the value interpolated between the values above and below it
newData.interpolate()
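
For the column (x) = [1, 2, NaN, 4] above, the fills behave like this:

# fillna(method='ffill')  ->  1.0, 2.0, 2.0, 4.0
# fillna(method='bfill')  ->  1.0, 2.0, 4.0, 4.0
# interpolate()           ->  1.0, 2.0, 3.0, 4.0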

Deleting the NaNs

# delete the rows that have missing data 
newData.dropna() 
# the same row deletion with the axis made explicit (axis=0 means rows)
newData.dropna(axis=0) 
# delete the columns that have missing data
newData.dropna(axis=1) 
# set a threshold of (2), which keeps only the rows 
# that have at least 2 non-NaN (non-null) values
newData.dropna(thresh=2)

Investigating the dataframe

Type and dimensions

# return the number of dimensions of the array 
myArray.ndim 

# return the data type of each column's data
myArray.dtypes 
# return the data type for that column's data
myArray["py-score"].dtype

Describing the data

# calculate some statistics for the numeric columns of the array, like:
# count of the values, mean, std, min, max, and some percentiles 
myArray.describe()
# only describe the column (city)
myArray["city"].describe() 
# another way to describe the column (city)
myArray.city.describe()
# invert the orientation of the description table
myArray.describe().transpose()  
# describe a specific numeric column
myArray.describe()["py-score"]

Statistics about the data

# returns the mean value of each numeric column
myArray.mean(numeric_only=True)  
# returns the maximum value for each column 
myArray.max() 
# returns the minimum value for each column
myArray.min() 
# returns the summation of each numeric column
myArray.sum(numeric_only=True) 

# returns the mean of a specific column 
myArray['py-score'].mean()
# returns the maximum value for that column 
myArray['py-score'].max() 
# returns the minimum value for that column
myArray['py-score'].min() 
# returns the summation of a specific column
myArray['py-score'].sum() 
# returns the row label (index) of the maximum value, skipping NaNs 
myArray['py-score'].idxmax() 

# in case we have duplicates in a particular column, this will 
# return each distinct value once and ignore the duplicates
myArray['py-score'].unique()  

# return the number of occurrences of each item in that column  
myArray['py-score'].value_counts()

# calculate how many null values are in that column
pd.isnull(myArray['py-score']).sum()

Grouping data in the dataframe

Prepare data

# prepare the data
data = { 
      'Company':['google','google','facebook','facebook','twitch','twitch'],
      'Person':['mohammad','ahmad','dana','anas','yazan','khader'],
      'Sales':[250,333,542,620,324,243] 
}

# create the DF
myData = pd.DataFrame(data)

Group the data

# combine rows that share the same value of (Company); 
# the column (Company) becomes the new index of the Dataframe, 
# and each of its unique values becomes a category (group)
myGroup = myData.groupby('Company') 

# group rows by two indexes, (Company) and (Person)
myGroup = myData.groupby(["Company", "Person"])

Investigate the group

# return the column (Sales) for the grouped Dataframe 
myGroup.Sales

# mean (average) of numeric items for each category
myGroup.mean(numeric_only=True)
# summation of numeric items for each category
myGroup.sum(numeric_only=True) 
# maximum value for each category
myGroup.max() 
# minimum value for each category
myGroup.min() 
# standard deviation for each category
myGroup.std(numeric_only=True) 
# counts of all values for each category
myGroup.count() 
# number of rows in each category
myGroup.size() 
# sort the aggregated rows by their category labels
myGroup.sum(numeric_only=True).sort_index() 

# run a bunch of functions (len, min, max) on each category, 
# so the column-labels will be (len, min, max) and each cell will hold 
# the result of applying the corresponding function to the column (Sales) 
stats = myGroup.Sales.agg([len, min, max]) 

# sort the categories in an ascending order by their (max) values 
stats.sort_values(by="max") 
# sort the categories in a descending order by two criteria (min, max) 
stats.sort_values(by=["min", "max"], ascending=False)

Revert the group to a DF

# aggregate, then convert the result back to a regular dataframe with the default index
myGroup.sum(numeric_only=True).reset_index()
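
With the single-key group from above, the result is a plain dataframe with a default 0-based index:

#     Company  Sales
# 0  facebook   1162
# 1    google    583
# 2    twitch    567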

Dealing with CSV and Excel files

# load the data from that path
data = pd.read_csv( r"assets\file.csv", index_col=0) 

# read a particular sheet from an Excel file
data = pd.read_excel("file.xlsx", sheet_name="Sheet1") 

# save this table (dataframe) as a CSV file with that name 
myDataFrame.to_csv("myData.csv")

Dealing with time

We can use time-based indexes with dataframes.

Prepare the data

The following data are temperatures measured once every hour for 6 hours:

temp_Cel = [8.0,  7.1,  6.8, 6.4,  6.0,  5.4]

Prepare the time data

Now, create a timeline that starts on the date (27th October 2019) at (00:00:00.0) o'clock and has 6 values (periods) separated by 1 hour (H) between each two:

temp_Time = pd.date_range( start='2019-10-27 00:00:00.0', periods=6, freq='H')

# we get :
DatetimeIndex([ 
'2019-10-27 00:00:00', 
'2019-10-27 01:00:00', 
'2019-10-27 02:00:00', 
'2019-10-27 03:00:00', 
'2019-10-27 04:00:00', 
'2019-10-27 05:00:00' ])

Create the dataframe

Create a dataframe that holds (temp_Cel) as its data and (temp_Time) as its index:

temp = pd.DataFrame(data={'temp_Cel': temp_Cel}, index=temp_Time)

# we get the following:

#   temp_Cel
#   2019-10-27 00:00:00     8.0
#   2019-10-27 01:00:00     7.1
#   2019-10-27 02:00:00     6.8
#   2019-10-27 03:00:00     6.4
#   2019-10-27 04:00:00     6.0
#   2019-10-27 05:00:00     5.4

Get info about the data

# return the rows from 01:00 to 04:00 (both borders inclusive):
temp['2019-10-27 01': '2019-10-27 04'] 
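
This returns the four rows stamped 01:00 through 04:00:

#                      temp_Cel
# 2019-10-27 01:00:00       7.1
# 2019-10-27 02:00:00       6.8
# 2019-10-27 03:00:00       6.4
# 2019-10-27 04:00:00       6.0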

# resample the dataframe and calculate the mean of each two rows (2h):
temp.resample(rule='2h').mean()
# for each row, calculate the mean of it and the two rows before it (a window of 3)
temp.rolling(window=3).mean()
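
For the six temperatures above, the resampling averages each pair of hours, while the rolling mean only appears once a full window of 3 rows exists:

temp.resample(rule='2h').mean()
#                      temp_Cel
# 2019-10-27 00:00:00      7.55
# 2019-10-27 02:00:00      6.60
# 2019-10-27 04:00:00      5.70

temp.rolling(window=3).mean()
#                      temp_Cel
# 2019-10-27 00:00:00       NaN
# 2019-10-27 01:00:00       NaN
# 2019-10-27 02:00:00  7.300000
# 2019-10-27 03:00:00  6.766667
# 2019-10-27 04:00:00  6.400000
# 2019-10-27 05:00:00  5.933333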