Introduction to Categorical Data

Introduction

What does it mean to be "categorical"?

Categorical

Finite number of groups (or categories)
These categories are usually fixed or known (gender, hair color, etc.)

Numerical

Known as quantitative data
Expressed using a numerical value
Usually a measurement (height, weight, IQ, etc.)

Ordinal vs nominal variables

Ordinal

Categorical variables that have a natural order, e.g. Strongly Disagree (1), Disagree (2), Neutral (3), Agree (4), Strongly Agree (5).

Nominal

Categorical variables that can't be placed into a natural order, e.g. Male, Female.
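The distinction above maps onto the `ordered` flag of a pandas categorical. The following is a minimal sketch; the survey responses and variable names are made up for illustration and are not part of the dataset used in these notes.

```python
import pandas as pd

# Hypothetical survey responses, used only for illustration
responses = ["Agree", "Neutral", "Strongly Agree", "Disagree", "Agree"]
likert_levels = ["Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"]

# Ordinal: the categories carry an order, so min/max and comparisons are meaningful
ordinal = pd.Series(pd.Categorical(responses, categories=likert_levels, ordered=True))
print(ordinal.min(), "to", ordinal.max())
print((ordinal >= "Agree").sum())  # number of responses at "Agree" or above

# Nominal: no order is declared, so pandas treats the categories as plain labels
nominal = pd.Series(["Male", "Female", "Female"], dtype="category")
print(nominal.cat.ordered)
```

With an unordered categorical, comparisons such as `>=` raise an error, which is a useful guard against treating nominal labels as if they were ranked.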

Our first dataset

The object dtype is how pandas stores strings, and it is a good indicator that a variable might be categorical.

We can explore further using the following method:

describe()
```
  adult['Marital Status'].describe()
```
Out:

count                  32561
unique                     7
top       married-civ-spouse
freq                   14976
Name: Marital Status, dtype: object

value_counts()

This method returns a frequency table as a pandas Series.

  adult['Marital Status'].value_counts()
  # With normalize=True, return relative frequencies (proportions) instead of counts
  adult['Marital Status'].value_counts(normalize=True)
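To make the difference between the two calls concrete, here is a small sketch on a hand-made Series. The values are hypothetical and stand in for a column such as 'Marital Status'.

```python
import pandas as pd

# Hypothetical values standing in for a categorical column
status = pd.Series(
    ["Married", "Single", "Married", "Divorced", "Married", "Single"],
    name="Marital Status",
)

# Absolute counts per category, sorted from most to least frequent
counts = status.value_counts()
print(counts)

# Relative frequencies: each count divided by the number of non-missing rows
shares = status.value_counts(normalize=True)
print(shares)
```

The relative frequencies sum to 1, so they are convenient for comparing the balance of a variable across datasets of different sizes.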

Exploring a target variable

# Explore the Above/Below 50k variable
print(adult["Above/Below 50k"].describe())

# Print a frequency table of "Above/Below 50k"
print(adult["Above/Below 50k"].value_counts())

# Print relative frequency values
print(adult['Above/Below 50k'].value_counts(normalize=True))

Categorical data in pandas

By default, columns containing strings are not stored using the pandas category dtype, because not every column containing strings needs to be categorical.

dtype: category

df['gender'] = df['gender'].astype('category')

df['gender'].dtype # CategoricalDtype with categories ['Female', 'Male'] (inferred categories are sorted), ordered=False

Creating a Categorical Series

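import pandas as pd

data = ['A', 'A', 'C', 'B', 'C', 'A']
series = pd.Series(data, dtype='category')
# pd.Categorical returns a Categorical, so wrap it in pd.Series to get a Series
ordered_series = pd.Series(pd.Categorical(data, categories=['C', 'B', 'A'], ordered=True))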
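As a rough illustration of what the two constructions store, the sketch below inspects the categories and integer codes behind each Series. The names `unordered` and `ordered` are chosen here for clarity and do not come from the notes.

```python
import pandas as pd

data = ['A', 'A', 'C', 'B', 'C', 'A']

# Categories are inferred from the data and sorted
unordered = pd.Series(data, dtype='category')
print(unordered.cat.categories.tolist())

# Categories and their order are given explicitly: C < B < A
ordered = pd.Series(pd.Categorical(data, categories=['C', 'B', 'A'], ordered=True))
print(ordered.cat.categories.tolist())

# Each value is stored as an integer code pointing into the categories
print(ordered.cat.codes.tolist())

# Sorting follows the declared category order, not alphabetical order
print(ordered.sort_values().tolist())
```

Because the order is declared as C < B < A, sorting puts the 'C' values first even though 'A' comes first alphabetically.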

Why do we use categorical: memory

adult = pd.read_csv("data/adult.csv")
adult['Marital Status'].nbytes # 260488

# Casting to category type
adult['Marital Status'] = adult['Marital Status'].astype('category')
adult['Marital Status'].nbytes # 32617
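A note on reading these numbers: for an object column, `nbytes` appears to count only the 8-byte references to the strings (32561 × 8 = 260488), not the strings themselves, while the category figure is consistent with one 1-byte code per row plus the references to the 7 categories (32561 + 56 = 32617). A fuller comparison can use `memory_usage(deep=True)`. The sketch below uses a synthetic column, not the adult dataset, so the exact byte counts will differ by platform and pandas version.

```python
import pandas as pd

# Hypothetical stand-in for a string column with few distinct values
values = ["Married-civ-spouse", "Never-married", "Divorced"] * 10_000
as_strings = pd.Series(values)
as_category = as_strings.astype("category")

# memory_usage(deep=True) also accounts for the string payloads
string_bytes = as_strings.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print("strings :", string_bytes)
print("category:", category_bytes)

# The categorical stores one small integer code per row
print(as_category.cat.codes.dtype)
```

The saving grows with the number of rows and shrinks as the number of distinct values approaches the number of rows, so casting is most useful for columns with many repeats.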

Specifying dtypes when reading data

dtype_dict = {'Marital Status': 'category'}

adult = pd.read_csv('data/adult.csv', dtype=dtype_dict)
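The same idea can be tried without the file on disk. This sketch reads a few made-up rows from an in-memory string; the rows are illustrative and only the column name 'Marital Status' is taken from the notes.

```python
import io
import pandas as pd

# A few made-up rows standing in for data/adult.csv
csv_text = """Age,Marital Status
39,Never-married
50,Married-civ-spouse
38,Divorced
53,Married-civ-spouse
"""

# Columns named in the dict are parsed straight into the category dtype;
# all other columns keep their inferred dtypes
dtype_dict = {"Marital Status": "category"}
df = pd.read_csv(io.StringIO(csv_text), dtype=dtype_dict)
print(df.dtypes)
```

Setting the dtype at read time avoids first loading the column as strings and converting it afterwards.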

Setting dtypes and saving memory

# Create a Series, default dtype
series1 = pd.Series(list_of_occupations)

# Print out the data type and number of bytes for series1
print("series1 data type:", series1.dtype)
print("series1 number of bytes:", series1.nbytes)

# Create a Series, "category" dtype
series2 = pd.Series(list_of_occupations, dtype="category")

# Print out the data type and number of bytes for series2
print("series2 data type:", series2.dtype)
print("series2 number of bytes:", series2.nbytes)

Creating a categorical pandas Series

# Create an ordered Categorical and specify the categories (this tells pandas the order matters)
medals = pd.Categorical(medals_won, categories=['Bronze', 'Silver', 'Gold'], ordered=True)
print(medals)

Setting dtypes when reading data

# Check the dtypes
print(adult.dtypes)

# Create a dictionary with column names as keys and "category" as values
adult_dtypes = {
   "Workclass": "category",
   "Education": "category",
   "Relationship": "category",
   "Above/Below 50k": "category" 
}

# Read in the CSV using the dtypes parameter
adult2 = pd.read_csv(
  "adult.csv",
  dtype=adult_dtypes
)
print(adult2.dtypes)

The basics of .groupby(): splitting data

groupby_object = adult.groupby(by=["Above/Below 50k"])

# Apply a function (numeric_only=True skips string columns, which pandas 2.0+ refuses to average)
groupby_object.mean(numeric_only=True)

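# Specify columns (use a list inside the brackets; tuple-style selection was removed in pandas 2.0)
groupby_object[['Age', 'Education Num']].sum() # Option 1 (preferred): select columns, then aggregate
groupby_object.sum()[['Age', 'Education Num']] # Option 2: aggregate every column, then select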

Setting up .groupby() statement

# Group the adult dataset by "Sex" and "Above/Below 50k"
gb = adult.groupby(by=['Sex', 'Above/Below 50k'])

# Print out how many rows are in each created group
print(gb.size())

# Print out the mean of each group for all numeric columns
print(gb.mean(numeric_only=True))

Using pandas functions effectively

# Create a list of user-selected variables
user_list = ['Education', 'Above/Below 50k']

# Create a GroupBy object using this list
gb = adult.groupby(by=user_list)

# Find the mean for the variable "Hours/Week" for each group - Be efficient!
print(gb['Hours/Week'].mean())
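Grouping by more than one column returns a result indexed by every combination of the keys. The sketch below shows this on a few hypothetical rows and how an individual group can be looked up afterwards.

```python
import pandas as pd

# Hypothetical rows; only the column names echo the adult dataset
toy = pd.DataFrame({
    "Education": ["Bachelors", "Bachelors", "HS-grad", "HS-grad", "HS-grad"],
    "Above/Below 50k": [">50K", "<=50K", "<=50K", "<=50K", ">50K"],
    "Hours/Week": [45, 40, 38, 42, 50],
})

gb = toy.groupby(by=["Education", "Above/Below 50k"])

# Number of rows in each (Education, Above/Below 50k) group
sizes = gb.size()
print(sizes)

# Select the single column before aggregating, then take the mean per group
mean_hours = gb["Hours/Week"].mean()

# The result has a two-level index; a tuple picks out one group
print(mean_hours.loc[("HS-grad", "<=50K")])

# unstack() pivots the inner level into columns for easier reading
print(mean_hours.unstack())
```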