Data Analysis Exercise with Python and Pandas


Introduction

Pandas is a powerful open-source library for data manipulation and analysis in Python. It provides easy-to-use and efficient data structures for working with labeled data, including tabular data, time series data, and more. With Pandas, you can load, clean, transform, analyze, and visualize data quickly and easily. Here are a few highlights:

  1. Data manipulation: Pandas makes it easy to manipulate and transform data, such as filtering rows, selecting columns, grouping data, merging data from multiple sources, and more (see the short example at the end of this introduction).

  2. Data cleaning: Data often needs to be cleaned before it can be analyzed, and Pandas provides powerful tools for cleaning and preprocessing data, such as handling missing values, removing duplicates, and handling data types.

  3. Data analysis: Pandas provides a wide range of functions for analyzing data, such as computing summary statistics, calculating correlations, and performing time series analysis.

  4. Data visualization: Pandas integrates with other popular visualization libraries in Python, such as Matplotlib and Seaborn, to help you create insightful and visually appealing charts and graphs.

Overall, learning how to work with Pandas is essential for anyone who works with data in Python, whether you're a data analyst, data scientist, or developer. It can save you time and effort in cleaning, transforming, and analyzing data, allowing you to focus on generating insights and value from your data.
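
As a minimal sketch of the manipulation and cleaning points above (the column names and values here are invented purely for illustration):

import pandas as pd

# a tiny made-up dataset
df = pd.DataFrame({
    'city': ['Berlin', 'Paris', 'Berlin', None],
    'temperature': [21.0, 25.0, 21.0, 19.5],
})

# manipulation: filter rows and select columns
warm = df[df['temperature'] > 20][['city', 'temperature']]

# cleaning: drop duplicate rows and rows with missing values
clean = df.drop_duplicates().dropna(subset=['city'])

# analysis: group and summarize
print(clean.groupby('city')['temperature'].mean())
# city
# Berlin    21.0
# Paris     25.0
# Name: temperature, dtype: float64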

Descriptive Core Concepts

Pandas provides two primary data structures for working with labeled data: Series and DataFrame.

Series

A Series is a one-dimensional labeled array that can hold data of any type (e.g., integers, floats, strings, etc.). It's similar to a column in a spreadsheet or a database table, and can be thought of as a single column of data.

DataFrame

A DataFrame, on the other hand, is a two-dimensional labeled data structure that can hold data of different types (e.g., a mix of integers, floats, and strings). It's similar to a spreadsheet or a database table, and can be thought of as a collection of Series that share the same index.

In a DataFrame, rows represent observations or records, while columns represent variables or features. Each column is a Series, and can be accessed and manipulated individually or collectively.
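
For example (with a small hypothetical DataFrame), a single column comes back as a Series and can be worked with on its own:

import pandas as pd

df = pd.DataFrame({'Name': ['Ann', 'Bob'], 'Age': [30, 40]})
ages = df['Age']     # each column of a DataFrame is a Series
print(type(ages))    # <class 'pandas.core.series.Series'>
print(ages.mean())   # 35.0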

DataFrames are the primary data structure used in Pandas, and provide a powerful and flexible way to manipulate and analyze labeled data. They can be loaded from a variety of data sources, such as CSV files, Excel spreadsheets, SQL databases, and more. Once loaded, they can be filtered, transformed, merged, and visualized using a wide range of Pandas functions and methods.

Short Practical Core Concepts

Initialization

import pandas as pd

# pd.Series
# ===============
# initialize a series from a list
s1 = pd.Series([1, 2, 3, 4, 5])
# Indices are automatically generated starting from 0.
print(s1)
#  0    1
#  1    2
#  2    3
#  3    4
#  4    5
#  dtype: int64

# initialize a series from a dictionary
s2 = pd.Series({'a': 1, 'b': 2, 'c': 3})
# Indices are set equal to the keys of the dictionary.
print(s2)
#  a    1
#  b    2
#  c    3
#  dtype: int64

# another way to set custom indices is by using the following syntax:
s3 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
print(s3)
#  a    1
#  b    2
#  c    3
#  dtype: int64

# initialize a series with custom name
s4 = pd.Series([1, 2, 3], name='my_series')
print(s4)
# 0    1
# 1    2
# 2    3
# Name: my_series, dtype: int64
# ===============
# pd.DataFrame
# ===============
# initialize a dataframe from a list of lists
data = [['John', 23, 'Male'], ['Lisa', 32, 'Female'], ['David', 45, 'Male']]
df1 = pd.DataFrame(data, columns=['Name', 'Age', 'Gender'])
print(df1)
#     Name  Age  Gender
# 0   John   23    Male
# 1   Lisa   32  Female
# 2  David   45    Male

# initialize a dataframe from a dictionary of lists
data = {'Name': ['John', 'Lisa', 'David'], 'Age': [23, 32, 45], 'Gender': ['Male', 'Female', 'Male']}
df2 = pd.DataFrame(data)
print(df2)
#     Name  Age  Gender
# 0   John   23    Male
# 1   Lisa   32  Female
# 2  David   45    Male

# initialize a dataframe with custom index and columns
data = {'Name': ['John', 'Lisa', 'David'], 'Age': [23, 32, 45], 'Gender': ['Male', 'Female', 'Male']}
df3 = pd.DataFrame(data, index=['a', 'b', 'c'], columns=['Name', 'Age', 'Gender'])
print(df3)
#     Name  Age  Gender
# a   John   23    Male
# b   Lisa   32  Female
# c  David   45    Male


# initialize a dataframe from a CSV file
df4 = pd.read_csv('my_data.csv')
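
# DataFrames can also be loaded from other sources. A quick sketch (the file
# names are placeholders, and pd.read_excel needs an engine such as openpyxl installed):
df5 = pd.read_excel('my_data.xlsx')
df6 = pd.read_json('my_data.json')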
# ===============

Immediate Data Statistical Analysis

# Shows the top n rows (default=5).
print(df1.head(n=2))
#    Name  Age  Gender
# 0  John   23    Male
# 1  Lisa   32  Female

# Shows the bottom n rows (default=5). Keeps the order.
print(df1.tail(n=2))
#    Name  Age  Gender
# 1   Lisa   32  Female
# 2  David   45    Male

# The most basic overview method: shows column dtypes, non-null counts, and memory usage.
print(df1.info())
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 3 entries, 0 to 2
# Data columns (total 3 columns):
#  #   Column  Non-Null Count  Dtype 
# ---  ------  --------------  ----- 
#  0   Name    3 non-null      object
#  1   Age     3 non-null      int64 
#  2   Gender  3 non-null      object
# dtypes: int64(1), object(2)
# memory usage: 200.0+ bytes
# None

# Summary statistics for every column. By default only numeric columns are included; pass include='all' to also summarize the categorical columns, as shown here.
print(df1.describe(include='all'))
#         Name        Age Gender
# count      3   3.000000      3
# unique     3        NaN      2 => For Categorical Variables Only
# top     John        NaN   Male => Mode. For Categorical Variables Only
# freq       1        NaN      2 => For Categorical Variables Only
# mean     NaN  33.333333    NaN
# std      NaN  11.060440    NaN
# min      NaN  23.000000    NaN
# 25%      NaN  27.500000    NaN
# 50%      NaN  32.000000    NaN => Median
# 75%      NaN  38.500000    NaN
# max      NaN  45.000000    NaN
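
# A quick way to summarize a single categorical column (a small addition beyond
# the original examples) is value_counts():
print(df1['Gender'].value_counts())
# Male      2
# Female    1
# (the exact footer line depends on your pandas version)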

Aggregation

import pandas as pd

sales_data = {
    'date': ['2022-01-01', '2022-01-01', '2022-01-02', '2022-01-02', '2022-01-02'],
    'customer': ['Alice', 'Bob', 'Alice', 'Bob', 'Charlie'],
    'amount': [100, 50, 75, 200, 125]
}

df = pd.DataFrame(sales_data)

# group the data by date and calculate the total amount for each date
grouped = df.groupby('date').agg({'amount': 'sum'})

# print the resulting DataFrame
print(grouped)
#             amount
# date              
# 2022-01-01     150
# 2022-01-02     400

# group the data by date and customer, and calculate the total and average amount for each group
grouped = df.groupby(['date', 'customer']).agg({'amount': ['sum', 'mean']})

# print the resulting DataFrame
print(grouped)
#                  amount       
#                     sum   mean
# date       customer           
# 2022-01-01 Alice     100  100.0
#            Bob        50   50.0
# 2022-01-02 Alice      75   75.0
#            Bob       200  200.0
#            Charlie   125  125.0
# This last DataFrame has a MultiIndex with two levels: ['date', 'customer']. A MultiIndex is not limited to two levels; you can group by as many keys as you need.
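
# As a sketch (adding a hypothetical `region` column), grouping by three keys
# produces a three-level index:
grouped3 = df.assign(region='EU').groupby(['region', 'date', 'customer']).agg({'amount': 'sum'})
print(grouped3.index.nlevels)
# 3

# reset_index() turns the index levels back into ordinary columns whenever a
# flat table is more convenient:
flat = grouped.reset_index()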

Data Analysis Exercise

Consider some files containing the most frequently used names in a country between the years 2000 and 2010. The data has 3 columns: name, sex, and count. Our objective is to find all unisex names.

import pandas as pd

# 1. Loading the data
# ===============
df = pd.read_csv('names/2000.csv')
print(df.head())
#      Sophia  F  21842
# 0  Isabella  F  19910
# 1      Emma  F  18803
# 2    Olivia  F  17322
# 3       Ava  F  15503
# 4     Emily  F  14258

# So the CSV file has no header, and Pandas is treating the first data row as the header. To avoid that, we explicitly pass the column names via the `names` argument:
df = pd.read_csv('names/2000.csv', names=["name", "sex", "count"])
print(df.head())
#        name sex  count
# 0    Sophia   F  21842
# 1  Isabella   F  19910
# 2      Emma   F  18803
# 3    Olivia   F  17322
# 4       Ava   F  15503

# Now let's concatenate all the yearly files and distinguish them with a new column called `year`:
all_years = pd.concat(
        pd.read_csv(f'names/{year}.csv', names=["name", "sex", "count"])
            .assign(year=year) # => Create a new column holding the year of birth
        for year in range(2000, 2011)
     )
print(all_years.info())
# <class 'pandas.core.frame.DataFrame'>
# Int64Index: 359302 entries, 0 to 34072
# Data columns (total 4 columns):
#  #   Column  Non-Null Count   Dtype 
# ---  ------  --------------   ----- 
#  0   name    359302 non-null  object
#  1   sex     359302 non-null  object
#  2   count   359302 non-null  int64 
#  3   year    359302 non-null  int64 
# dtypes: int64(2), object(2)
# memory usage: 13.7+ MB
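
# A quick (optional) sanity check that all eleven yearly files were loaded:
print(all_years['year'].nunique())
# 11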

# ===============
# 2. Cleaning
# ===============
# Some names are only ever assigned to one sex; since we're looking for unisex names, we should remove them from our dataset.
# Firstly, we'll set the identifiers as the correct multi-index:
all_years.set_index(["name", "sex"], inplace=True)
print(all_years.head())
#              count  year
# name    sex             
# Emily   F    25956  2000
# Hannah  F    23082  2000
# Madison F    19968  2000
# Ashley  F    17997  2000
# Sarah   F    17702  2000

# Now, in order to filter the rows of a pd.DataFrame, we can pass a boolean condition to its [] indexer.
# Consider the following expression, which pulls one level out of the MultiIndex:
all_years.index.get_level_values(level=1)  # => The index is [(<name>, <sex>), ...]; level 1 is `sex`

# Now we will filter the DataFrames by gender in the following manner:
male_names = all_years[all_years.index.get_level_values(level=1) == 'M']
female_names = all_years[all_years.index.get_level_values(level=1) == 'F']

unisex_candidates = (
    male_names.index.get_level_values(level=0)   # => Get the `name` index
              .intersection(                     # => Keep only the names present in both
                  female_names.index.get_level_values(level=0)
              )
)
print(unisex_candidates)
# Index(['Jacob', 'Michael', 'Matthew', 'Joshua', 'Christopher', 'Nicholas',
#        'Andrew', 'Joseph', 'Daniel', 'Tyler',
#        ...
#        'Roma', 'Rynn', 'Say', 'Shevy', 'Sparrow', 'Spirit', 'Tarryn', 'Violet',
#        'Wriley', 'Zeriah'],
#       dtype='object', name='name', length=5125)

# Now let's filter the original dataframe for the candidates:

unisex_candidates_df = all_years[
    all_years.index.get_level_values(level=0)
             .isin(unisex_candidates)
]
print(unisex_candidates_df.head())
#              count  year
# name    sex             
# Emily   F    25956  2000
# Hannah  F    23082  2000
# Madison F    19968  2000
# Ashley  F    17997  2000
# Sarah   F    17702  2000

# ===============
# 3. Calculate Total Count
# ===============
# With a clean dataset, let's now calculate the total count across all years for each name and sex. We need to perform an aggregation.
# Firstly, we'll group them:
each_name_n_sex = unisex_candidates_df.groupby(["name", "sex"])

# Next, we'll calculate them:
grouped_count_total_sum = each_name_n_sex["count"].sum()
print(grouped_count_total_sum.head())
# name   sex
# Aaden  F         5
#        M      2981
# Aadi   F         5
#        M       444
# Aadyn  F        16
# Name: count, dtype: int64

# The result is a pd.Series, but its name is still `count`. Let's rename it to something more descriptive:
grouped_count_total_sum.rename("total_count", inplace=True)
print(grouped_count_total_sum.head())
# name   sex
# Aaden  F               5
#        M            2981
# Aadi   F               5
#        M             444
# Aadyn  F              16
# Name: total_count, dtype: int64

# As our final calculation, we need the male-to-female ratio for each name. We can use partial (cross-section) indexing on the MultiIndex: take all names (:) with a specific sex ('M' or 'F') and divide:
unisex_index = grouped_count_total_sum[:,'M'] / grouped_count_total_sum[:, 'F']
unisex_index.rename("unisex_index", inplace=True)
print(unisex_index.head())
# name
# Aaden      596.200000
# Aadi        88.800000
# Aadyn       15.187500
# Aalijah      1.709091
# Aaliyah      0.001195
# Name: unisex_index, dtype: float64

# Finally, we keep only the names whose ratio lies between 0.5 and 2:
unisex_names = unisex_index[(unisex_index > 0.5) & (unisex_index < 2)].index
print(unisex_names)
# Index(['Aalijah', 'Aamari', 'Aarian', 'Aaris', 'Aarya', 'Aaryn', 'Aba', 
#        'Abey', 'Abie', 'Abrar',
#        ...
#        'Zamarie', 'Zarin', 'Zaryn', 'Zekiah', 'Zenith', 'Zi', 'Zian', 
#        'Ziel', 'Ziyan', 'Zyian'],
#       dtype='object', name='name', length=947)
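
# As a quick check (using 'Zian', one of the names in the list above, purely as
# an example), we can look at the per-sex totals behind one of these unisex names:
print(grouped_count_total_sum.loc['Zian'])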

Final Word

This exercise only addressed gender as male or female. We acknowledge that gender is a complex and multifaceted identity and that non-binary individuals may not identify within the gender binary. We recognize that our analysis is limited by this binary categorization and do not intend to exclude or invalidate the experiences of non-binary individuals.

Also, feel free to contact me if you have more questions. I hope you learned something from this article.
