Introduction
Pandas is a powerful open-source library for data manipulation and analysis in Python. It provides easy-to-use, efficient data structures for working with labeled data, including tabular data, time series data, and more. With Pandas, you can load, clean, transform, analyze, and visualize data quickly. Key capabilities include:
Data manipulation: Pandas makes it easy to manipulate and transform data, such as filtering rows, selecting columns, grouping data, merging data from multiple sources, and more.
Data cleaning: Data often needs to be cleaned before it can be analyzed, and Pandas provides powerful tools for cleaning and preprocessing data, such as handling missing values, removing duplicates, and handling data types.
Data analysis: Pandas provides a wide range of functions for analyzing data, such as computing summary statistics, calculating correlations, and performing time series analysis.
Data visualization: Pandas integrates with other popular visualization libraries in Python, such as Matplotlib and Seaborn, to help you create insightful and visually appealing charts and graphs.
Overall, learning how to work with Pandas is essential for anyone who works with data in Python, whether you're a data analyst, data scientist, or developer. It can save you time and effort in cleaning, transforming, and analyzing data, allowing you to focus on generating insights and value from your data.
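As a quick taste of several of these capabilities before we dive in, here is a minimal sketch that merges two tiny, made-up tables and aggregates the result (all names and amounts are invented for illustration):

```python
import pandas as pd

# Two small, made-up tables
customers = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Carol']})
orders = pd.DataFrame({'customer_id': [1, 1, 3], 'amount': [50, 75, 20]})

# Merge them on the customer identifier (an inner join by default)
merged = pd.merge(customers, orders, left_on='id', right_on='customer_id')
print(merged)
#    id   name  customer_id  amount
# 0   1  Alice            1      50
# 1   1  Alice            1      75
# 2   3  Carol            3      20

# Group and aggregate the merged result
totals = merged.groupby('name')['amount'].sum()
print(totals)
# name
# Alice    125
# Carol     20
# Name: amount, dtype: int64
```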
Descriptive Core Concepts
Pandas provides two primary data structures for working with labeled data: Series and DataFrame.
Series
A Series is a one-dimensional labeled array that can hold data of any type (e.g., integers, floats, strings, etc.). It's similar to a column in a spreadsheet or a database table, and can be thought of as a single column of data.
DataFrame
A DataFrame, on the other hand, is a two-dimensional labeled data structure that can hold data of different types (e.g., a mix of integers, floats, and strings). It's similar to a spreadsheet or a database table, and can be thought of as a collection of Series that share the same index.
In a DataFrame, rows represent observations or records, while columns represent variables or features. Each column is a Series, and can be accessed and manipulated individually or collectively.
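For instance, selecting a single column of a DataFrame with the [] operator yields a Series (the column names below are purely illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Lisa'], 'Age': [23, 32]})

# Selecting one column returns a Series that shares the DataFrame's index
ages = df['Age']
print(type(ages))  # <class 'pandas.core.series.Series'>
print(ages)
# 0    23
# 1    32
# Name: Age, dtype: int64
```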
DataFrames are the primary data structure used in Pandas, and provide a powerful and flexible way to manipulate and analyze labeled data. They can be loaded from a variety of data sources, such as CSV files, Excel spreadsheets, SQL databases, and more. Once loaded, they can be filtered, transformed, merged, and visualized using a wide range of Pandas functions and methods.
Short Practical Core Concepts
Initialization
import pandas as pd
# pd.Series
# ===============
# initialize a series from a list
s1 = pd.Series([1, 2, 3, 4, 5])
# Indices are automatically generated starting from 0.
print(s1)
# 0 1
# 1 2
# 2 3
# 3 4
# 4 5
# dtype: int64
# initialize a series from a dictionary
s2 = pd.Series({'a': 1, 'b': 2, 'c': 3})
# Indices are set equal to the keys of the dictionary.
print(s2)
# a 1
# b 2
# c 3
# dtype: int64
# another way to set custom indices is by using the following syntax:
s3 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
print(s3)
# a 1
# b 2
# c 3
# dtype: int64
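With custom indices in place, elements can be read either by label or by integer position; a short sketch:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# access by label
print(s['b'])      # 2
# access by integer position
print(s.iloc[0])   # 1
# label-based access also works through .loc
print(s.loc['c'])  # 3
```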
# initialize a series with a custom name
s4 = pd.Series([1, 2, 3], name='my_series')
print(s4)
# 0    1
# 1    2
# 2    3
# Name: my_series, dtype: int64
# ===============
# pd.DataFrame
# ===============
# initialize a dataframe from a list of lists
data = [['John', 23, 'Male'], ['Lisa', 32, 'Female'], ['David', 45, 'Male']]
df1 = pd.DataFrame(data, columns=['Name', 'Age', 'Gender'])
print(df1)
# Name Age Gender
# 0 John 23 Male
# 1 Lisa 32 Female
# 2 David 45 Male
# initialize a dataframe from a dictionary of lists
data = {'Name': ['John', 'Lisa', 'David'], 'Age': [23, 32, 45], 'Gender': ['Male', 'Female', 'Male']}
df2 = pd.DataFrame(data)
print(df2)
# Name Age Gender
# 0 John 23 Male
# 1 Lisa 32 Female
# 2 David 45 Male
# initialize a dataframe with custom index and columns
data = {'Name': ['John', 'Lisa', 'David'], 'Age': [23, 32, 45], 'Gender': ['Male', 'Female', 'Male']}
df3 = pd.DataFrame(data, index=['a', 'b', 'c'], columns=['Name', 'Age', 'Gender'])
print(df3)
# Name Age Gender
# a John 23 Male
# b Lisa 32 Female
# c David 45 Male
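Once a DataFrame has a custom index, whole rows can be retrieved by label with .loc or by position with .iloc; a minimal sketch (the data mirrors the example above):

```python
import pandas as pd

data = {'Name': ['John', 'Lisa', 'David'], 'Age': [23, 32, 45]}
df = pd.DataFrame(data, index=['a', 'b', 'c'])

# row by label
print(df.loc['b'])
# Name    Lisa
# Age       32
# Name: b, dtype: object

# row by integer position, then a single field
print(df.iloc[0]['Name'])  # John
```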
# initialize a dataframe from a CSV file
df4 = pd.read_csv('my_data.csv')
# ===============
Quick Statistical Overview
# Shows the top n rows (default=5).
print(df1.head(n=2))
# Name Age Gender
# 0 John 23 Male
# 1 Lisa 32 Female
# Shows the bottom n rows (default=5). Keeps the order.
print(df1.tail(n=2))
# Name Age Gender
# 1 Lisa 32 Female
# 2 David 45 Male
# Gives a concise technical summary of the DataFrame: index, column dtypes, non-null counts, and memory usage. Note that `.info()` prints directly and returns None.
print(df1.info())
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 3 entries, 0 to 2
# Data columns (total 3 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 Name 3 non-null object
# 1 Age 3 non-null int64
# 2 Gender 3 non-null object
# dtypes: int64(1), object(2)
# memory usage: 200.0+ bytes
# None
# Generates descriptive statistics. For a DataFrame with mixed dtypes, pass include='all' to cover the non-numeric columns as well:
print(df1.describe(include='all'))
# Name Age Gender
# count 3 3.000000 3
# unique 3 NaN 2 => For Categorical Variables Only
# top John NaN Male => Mode. For Categorical Variables Only
# freq 1 NaN 2 => For Categorical Variables Only
# mean NaN 33.333333 NaN
# std NaN 11.060440 NaN
# min NaN 23.000000 NaN
# 25% NaN 27.500000 NaN
# 50% NaN 32.000000 NaN => Median
# 75% NaN 38.500000 NaN
# max NaN 45.000000 NaN
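Individual statistics are also available as methods on a single column; a short sketch on the same kind of data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Lisa', 'David'], 'Age': [23, 32, 45]})

print(df['Age'].mean())     # mean of the Age column
print(df['Age'].median())   # 32.0
print(df['Age'].max())      # 45
# number of distinct values in a categorical column
print(df['Name'].nunique())  # 3
```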
Aggregation
import pandas as pd
sales_data = {
    'date': ['2022-01-01', '2022-01-01', '2022-01-02', '2022-01-02', '2022-01-02'],
    'customer': ['Alice', 'Bob', 'Alice', 'Bob', 'Charlie'],
    'amount': [100, 50, 75, 200, 125]
}
df = pd.DataFrame(sales_data)
# group the data by date and calculate the total amount for each date
grouped = df.groupby('date').agg({'amount': 'sum'})
# print the resulting DataFrame
print(grouped)
# amount
# date
# 2022-01-01 150
# 2022-01-02 400
# group the data by date and customer, and calculate the total and average amount for each group
grouped = df.groupby(['date', 'customer']).agg({'amount': ['sum', 'mean']})
# print the resulting DataFrame
print(grouped)
# amount
# sum mean
# date customer
# 2022-01-01 Alice 100 100.0
# Bob 50 50.0
# 2022-01-02 Alice 75 75.0
# Bob 200 200.0
# Charlie 125 125.0
# This last DataFrame has a MultiIndex with 2 levels: ['date', 'customer']. A MultiIndex can have an arbitrary number of levels.
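When you would rather keep the group keys as regular columns instead of a MultiIndex, reset_index flattens the result back; a sketch on a smaller slice of the same sales data:

```python
import pandas as pd

sales_data = {
    'date': ['2022-01-01', '2022-01-01', '2022-01-02'],
    'customer': ['Alice', 'Bob', 'Alice'],
    'amount': [100, 50, 75]
}
df = pd.DataFrame(sales_data)

# group, aggregate, then turn the group keys back into ordinary columns
grouped = df.groupby(['date', 'customer'])['amount'].sum().reset_index()
print(grouped)
#          date customer  amount
# 0  2022-01-01    Alice     100
# 1  2022-01-01      Bob      50
# 2  2022-01-02    Alice      75
```

Passing as_index=False to groupby achieves the same effect in one step.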
Data Analysis Exercise
Consider some files containing the most frequently used names in a country between the years 2000 and 2010. The data has 3 columns: name, sex, and count. Our objective is to find all unisex names.
import pandas as pd
# 1. Loading the data
# ===============
df = pd.read_csv('names/2000.csv')
print(df.head())
# Sophia F 21842
# 0 Isabella F 19910
# 1 Emma F 18803
# 2 Olivia F 17322
# 3 Ava F 15503
# 4 Emily F 14258
# So the csv file has no header row, and Pandas is inferring the first row to be the header. To avoid that, we need to explicitly define the column names with the `names` parameter:
df = pd.read_csv('names/2000.csv', names=["name", "sex", "count"])
print(df.head())
# name sex count
# 0 Sophia F 21842
# 1 Isabella F 19910
# 2 Emma F 18803
# 3 Olivia F 17322
# 4 Ava F 15503
# Now let's concatenate all data files and distinguish them with a new column called `year_of_birth`:
all_years = pd.concat(
    pd.read_csv(f'names/{year}.csv', names=["name", "sex", "count"])
      .assign(year_of_birth=year)  # => Create new column with data
    for year in range(2000, 2011)
)
print(all_years.info())
# <class 'pandas.core.frame.DataFrame'>
# Int64Index: 359302 entries, 0 to 34072
# Data columns (total 4 columns):
# #   Column         Non-Null Count   Dtype
# --- ------         --------------   -----
# 0   name           359302 non-null  object
# 1   sex            359302 non-null  object
# 2   count          359302 non-null  int64
# 3   year_of_birth  359302 non-null  int64
# dtypes: int64(2), object(2)
# memory usage: 13.7+ MB
# None
# ===============
# 2. Cleaning
# ===============
# There are some names that are only assigned to a specific gender, so we should remove them from our dataset.
# Firstly, we'll set the identifiers as the correct multi-index:
all_years.set_index(["name", "sex"], inplace=True)
print(all_years.head())
# count year
# name sex
# Emily F 25956 2000
# Hannah F 23082 2000
# Madison F 19968 2000
# Ashley F 17997 2000
# Sarah F 17702 2000
# Now, in order to filter the rows of a DataFrame, we can pass a boolean mask to the DataFrame's [] operator.
# Consider the following command:
(all_years.index              # => Get the Index: [(<name>, <sex>), ...]
 .get_level_values(level=1))  # => Get the Index values at level 1 => `sex`
# Now we will filter the DataFrames by gender in the following manner:
male_names = all_years[all_years.index.get_level_values(level=1) == 'M']
female_names = all_years[all_years.index.get_level_values(level=1) == 'F']
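An alternative way to select all rows matching one value of an index level is DataFrame.xs, which takes a cross-section; a small self-contained sketch (the data here is made up, and note that by default xs drops the selected level from the result):

```python
import pandas as pd

df = pd.DataFrame(
    {'count': [10, 20, 30]},
    index=pd.MultiIndex.from_tuples(
        [('Alex', 'M'), ('Alex', 'F'), ('Sam', 'M')], names=['name', 'sex']
    )
)

# all rows whose `sex` level equals 'M'; the level is dropped from the result
males = df.xs('M', level='sex')
print(males)
#       count
# name
# Alex     10
# Sam      30
```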
unisex_candidates = (
    male_names.index.get_level_values(level=0)  # => Get the `name` Index
    .intersection(                              # => Intersect the two indices
        female_names.index.get_level_values(level=0)
    )
)
print(unisex_candidates)
# Index(['Jacob', 'Michael', 'Matthew', 'Joshua', 'Christopher', 'Nicholas',
# 'Andrew', 'Joseph', 'Daniel', 'Tyler',
# ...
# 'Roma', 'Rynn', 'Say', 'Shevy', 'Sparrow', 'Spirit', 'Tarryn', 'Violet',
# 'Wriley', 'Zeriah'],
# dtype='object', name='name', length=5125)
# Now let's filter the original dataframe for the candidates:
unisex_candidates_df = all_years[
    all_years.index.get_level_values(level=0)
    .isin(unisex_candidates)
]
print(unisex_candidates_df.head())
# count year
# name sex
# Emily F 25956 2000
# Hannah F 23082 2000
# Madison F 19968 2000
# Ashley F 17997 2000
# Sarah F 17702 2000
# ===============
# 3. Calculate Total Count
# ===============
# With a clean dataset, let's calculate the total name counts across all years for each name and gender. We need to perform an aggregation:
# Firstly, we'll group them:
each_name_n_sex = unisex_candidates_df.groupby(["name", "sex"])
# Next, we'll calculate them:
grouped_count_total_sum = each_name_n_sex["count"].sum()
print(grouped_count_total_sum.head())
# name sex
# Aaden F 5
# M 2981
# Aadi F 5
# M 444
# Aadyn F 16
# Name: count, dtype: int64
# The result is a pd.Series, but its name is still `count`; let's rename it to something more descriptive:
grouped_count_total_sum.rename("total_count", inplace=True)
print(grouped_count_total_sum.head())
# name sex
# Aaden F 5
# M 2981
# Aadi F 5
# M 444
# Aadyn F 16
# Name: total_count, dtype: int64
# As our final calculation, we need the male-to-female ratio for each name. We can use tuple indexing: take all names (:) and one specific sex ('M' or 'F'), then divide:
unisex_index = grouped_count_total_sum[:,'M'] / grouped_count_total_sum[:, 'F']
unisex_index.rename("unisex_index", inplace=True)
print(unisex_index.head())
# name
# Aaden 596.200000
# Aadi 88.800000
# Aadyn 15.187500
# Aalijah 1.709091
# Aaliyah 0.001195
# Name: unisex_index, dtype: float64
# Finally, we filter with a boolean mask that combines both bounds (note that chaining two separate [] filters would misalign the second mask):
unisex_names = unisex_index[(unisex_index > 0.5) & (unisex_index < 2)].index
print(unisex_names)
# Index(['Aalijah', 'Aamari', 'Aarian', 'Aaris', 'Aarya', 'Aaryn', 'Aba',
# 'Abey', 'Abie', 'Abrar',
# ...
# 'Zamarie', 'Zarin', 'Zaryn', 'Zekiah', 'Zenith', 'Zi', 'Zian',
# 'Ziel', 'Ziyan', 'Zyian'],
# dtype='object', name='name', length=947)
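As a side note, Series.between expresses the same kind of range filter in a single call (by default both endpoints are inclusive); a sketch with a few made-up ratios:

```python
import pandas as pd

ratios = pd.Series({'Aalijah': 1.7, 'Aaden': 596.2, 'Aaliyah': 0.0012})

# keep names whose male-to-female ratio lies between 0.5 and 2
unisex = ratios[ratios.between(0.5, 2)].index
print(list(unisex))  # ['Aalijah']
```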
Final Word
This exercise only addressed gender as male or female. We acknowledge that gender is a complex and multifaceted identity and that non-binary individuals may not identify within the gender binary. We recognize that our analysis is limited by this binary categorization and do not intend to exclude or invalidate the experiences of non-binary individuals.
Also, feel free to contact me if you have more questions. I hope you learned something from this article.