Introduction to Categorical Data
Based on DataCamp
Table of contents
- Introduction
- Categorical data in pandas
- dtype: categorical
- Creating a Categorical Series
- Why do we use categorical: memory
- Specifying dtypes when reading data
- Setting dtypes and saving memory
- Creating a categorical pandas Series
- Setting dtypes when reading data
- The basics of .groupby(): splitting data
- Setting up .groupby() statement
- Using pandas functions effectively
Introduction
What does it mean to be "categorical"?
Categorical
Finite number of groups (or categories)
These categories are usually fixed or known (gender, hair color, etc.)
Numerical
Known as quantitative data (categorical data, by contrast, is qualitative)
Expressed using a numerical value
Usually a measurement (height, weight, IQ, etc.)
Ordinal vs nominal variables
Ordinal
- Categorical variables that have a natural order. e.g. Strongly Disagree (1), Disagree (2), Neutral (3), Agree (4), Strongly Agree (5).
Nominal
- Categorical variables that can't be placed into a natural order. e.g. Male, Female.
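The ordinal/nominal distinction maps onto the `ordered` flag of a pandas categorical. The following is a minimal sketch under stated assumptions: the survey responses and the names `responses`, `likert_levels`, `ordinal`, and `nominal` are made up for illustration and do not come from the adult dataset.

```python
import pandas as pd

# Hypothetical survey answers (illustrative data only)
responses = ["Agree", "Neutral", "Strongly Agree", "Disagree", "Agree"]
likert_levels = ["Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"]

# Ordinal: the categories are listed from lowest to highest and ordered=True
ordinal = pd.Series(pd.Categorical(responses, categories=likert_levels, ordered=True))

# Nominal: a plain categorical with no declared order
nominal = pd.Series(["Male", "Female", "Female"], dtype="category")

# Order-aware operations are only meaningful for the ordinal Series
print(ordinal.min(), "|", ordinal.max())
print((ordinal > "Neutral").tolist())
print(nominal.cat.ordered)
```

Comparisons such as `ordinal > "Neutral"` rely on the declared category order, not on alphabetical order.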
Our first dataset
The object dtype is how pandas stores strings by default, and it is a good indicator that a variable might be categorical.
We can explore further using the following method:
describe()
adult['Marital Status'].describe()
Out:
count 32561
unique 7
top married-civ-spouse
freq 14976
Name: Marital Status, dtype: object
value_counts()
This method returns a frequency table as a pandas Series.
adult['Marital Status'].value_counts()
# With normalize=True, return relative frequencies instead of counts
adult['Marital Status'].value_counts(normalize=True)
Exploring a target variable
# Explore the Above/Below 50k variable
print(adult["Above/Below 50k"].describe())
# Print a frequency table of "Above/Below 50k"
print(adult["Above/Below 50k"].value_counts())
# Print relative frequency values
print(adult['Above/Below 50k'].value_counts(normalize=True))
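To try the same three calls without the adult CSV, here is a small self-contained sketch. The Series `status` and its values are invented for illustration; only `describe()` and `value_counts()` come from the notes above.

```python
import pandas as pd

# Hypothetical stand-in for a string column such as "Marital Status"
status = pd.Series(
    ["Married", "Single", "Married", "Divorced", "Married", "Single"],
    name="Marital Status",
)

# Summary for a string column: count, unique, top (most frequent value), freq
summary = status.describe()

# Absolute frequencies, sorted from most to least common
counts = status.value_counts()

# Relative frequencies; they sum to 1
shares = status.value_counts(normalize=True)

print(summary)
print(counts)
print(shares)
```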
Categorical data in pandas
By default, columns containing strings are not stored using the pandas category dtype, because not every string column is categorical (free text or unique identifiers, for example, gain nothing from it).
dtype: categorical
df['gender'] = df['gender'].astype('category')
df['gender'].dtype # CategoricalDtype(categories=['Female', 'Male'], ordered=False); inferred categories are sorted
Creating a Categorical Series
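import pandas as pd
data = ['A', 'A', 'C', 'B', 'C', 'A']
series = pd.Series(data, dtype='category')
# pd.Categorical returns a Categorical, so wrap it in pd.Series to get an ordered categorical Series
ordered_series = pd.Series(pd.Categorical(data, categories=['C', 'B', 'A'], ordered=True))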
Why do we use categorical: memory
adult = pd.read_csv("data/adult.csv")
adult['Marital Status'].nbytes # 260488
# Casting to category type
adult['Marital Status'] = adult['Marital Status'].astype('category')
adult['Marital Status'].nbytes # 32617
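For an object column, `nbytes` counts only the 8-byte pointers to the strings (32561 × 8 = 260488 above), not the strings themselves, so the real saving is usually larger than those two numbers suggest. The sketch below uses `memory_usage(deep=True)`, which also counts the string contents. The data is invented for illustration, and the exact byte counts depend on the pandas version, so none are shown.

```python
import pandas as pd

# Hypothetical column: three distinct labels repeated many times
labels = ["Married", "Single", "Divorced"]
as_object = pd.Series(labels * 10_000)
as_category = as_object.astype("category")

# deep=True includes the memory held by the string values themselves
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print("string column:  ", object_bytes)
print("category column:", category_bytes)

# A categorical stores one small integer code per row plus the category labels once
print(as_category.cat.codes.dtype)
print(list(as_category.cat.categories))
```

With only a handful of distinct values, each row needs just a one-byte code, which is where the saving comes from.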
Specifying dtypes when reading data
dtype_dict = {'Marital Status': 'category'}
adult = pd.read_csv('data/adult.csv', dtype=dtype_dict)
Setting dtypes and saving memory
# Create a Series, default dtype (list_of_occupations is assumed to be already defined)
series1 = pd.Series(list_of_occupations)
# Print out the data type and number of bytes for series1
print("series1 data type:", series1.dtype)
print("series1 number of bytes:", series1.nbytes)
# Create a Series, "category" dtype
series2 = pd.Series(list_of_occupations, dtype="category")
# Print out the data type and number of bytes for series2
print("series2 data type:", series2.dtype)
print("series2 number of bytes:", series2.nbytes)
Creating a categorical pandas Series
# Create a categorical and specify the categories in order (medals_won is assumed to be already defined)
medals = pd.Categorical(medals_won, categories=['Bronze', 'Silver', 'Gold'], ordered=True)
print(medals)
Setting dtypes when reading data
# Check the dtypes
print(adult.dtypes)
# Create a dictionary with column names as keys and "category" as values
adult_dtypes = {
"Workclass": "category",
"Education": "category",
"Relationship": "category",
"Above/Below 50k": "category"
}
# Read in the CSV using the dtype parameter
adult2 = pd.read_csv(
"adult.csv",
dtype=adult_dtypes
)
print(adult2.dtypes)
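The same pattern can be tried without the file on disk. This sketch reads a tiny in-memory CSV through `io.StringIO`. The three rows and the names `csv_text`, `default_df`, and `typed_df` are made up for illustration, while the `dtype` argument is used as in the notes above.

```python
import io
import pandas as pd

# Hypothetical three-row extract standing in for adult.csv
csv_text = """Age,Workclass,Education
39,State-gov,Bachelors
50,Self-emp,Bachelors
38,Private,HS-grad
"""

# Without dtype: string columns get the default string representation
default_df = pd.read_csv(io.StringIO(csv_text))

# With dtype: the listed columns are parsed directly as categoricals
typed_df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"Workclass": "category", "Education": "category"},
)

print(default_df.dtypes)
print(typed_df.dtypes)
```

Columns not named in the dictionary, such as `Age` here, keep their inferred dtype.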
The basics of .groupby(): splitting data
groupby_object = adult.groupby(by=["Above/Below 50k"])
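# Apply a function (numeric_only=True skips string columns, which pandas 2.0+ refuses to average)
groupby_object.mean(numeric_only=True)
# Specify columns (use a list inside the brackets; tuple-style selection was removed in pandas 2.0)
groupby_object[['Age', 'Education Num']].sum() # Option 1 (preferred): select columns first, then aggregate
groupby_object.sum()[['Age', 'Education Num']] # Option 2: aggregate every column, then select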
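The two options give the same result and differ only in how much work pandas does. Here is a minimal sketch on a made-up four-row frame (`people` and its values are hypothetical), assuming a pandas version that accepts `numeric_only` in groupby aggregations.

```python
import pandas as pd

# Hypothetical miniature of the adult dataset
people = pd.DataFrame({
    "Above/Below 50k": ["<=50K", ">50K", "<=50K", ">50K"],
    "Age": [25, 45, 35, 55],
    "Education Num": [9, 13, 10, 14],
    "Sex": ["Male", "Female", "Female", "Male"],
})

grouped = people.groupby(by="Above/Below 50k")

# Option 1: select the columns first, so only two columns are summed
option_1 = grouped[["Age", "Education Num"]].sum()

# Option 2: sum every numeric column, then discard what is not needed
option_2 = grouped.sum(numeric_only=True)[["Age", "Education Num"]]

# Mean of the numeric columns per group
means = grouped.mean(numeric_only=True)

print(option_1)
print(option_1.equals(option_2))
print(means)
```

On a wide DataFrame, Option 1 avoids aggregating columns that are thrown away immediately afterwards.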
Setting up .groupby() statement
# Group the adult dataset by "Sex" and "Above/Below 50k"
gb = adult.groupby(by=['Sex', 'Above/Below 50k'])
# Print out how many rows are in each created group
print(gb.size())
# Print out the mean of each group for all numeric columns
print(gb.mean(numeric_only=True))
Using pandas functions effectively
# Create a list of user-selected variables
user_list = ['Education', 'Above/Below 50k']
# Create a GroupBy object using this list
gb = adult.groupby(by=user_list)
# Find the mean for the variable "Hours/Week" for each group - Be efficient!
print(gb['Hours/Week'].mean())
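Grouping by a list of columns produces one group per combination of values, and the result is indexed by those combinations. This sketch illustrates that on invented data (`workers` is hypothetical) using the same `size()` and column-then-`mean()` calls as above.

```python
import pandas as pd

# Hypothetical miniature of the adult dataset
workers = pd.DataFrame({
    "Sex": ["Male", "Female", "Male", "Female", "Male"],
    "Above/Below 50k": ["<=50K", "<=50K", ">50K", ">50K", "<=50K"],
    "Hours/Week": [40, 35, 50, 45, 30],
})

# One group per (Sex, Above/Below 50k) combination present in the data
gb = workers.groupby(by=["Sex", "Above/Below 50k"])

# Number of rows in each group
group_sizes = gb.size()

# Select the single column of interest before aggregating
mean_hours = gb["Hours/Week"].mean()

print(group_sizes)
print(mean_hours)

# Individual groups are looked up with a tuple of key values
print(mean_hours[("Male", "<=50K")])
```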