Introduction to Categorical Data

Based on DataCamp

Introduction

What does it mean to be "categorical"?

Categorical

  • Finite number of groups (or categories)

  • These categories are usually fixed or known (gender, hair color, etc.)

Numerical

  • Known as quantitative data

  • Expressed using a numerical value

  • Usually a measurement (height, weight, IQ, etc.)

Ordinal vs nominal variables

Ordinal

  • Categorical variables that have a natural order. e.g. Strongly Disagree (1), Disagree (2), Neutral (3), Agree (4), Strongly Agree (5).

Nominal

  • Categorical variables that can't be placed into a natural order. e.g. Male, Female.
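The distinction can be sketched in pandas as follows. This is a minimal illustration on made-up survey responses (the `responses` list and variable names are assumptions, not part of the original dataset): an ordinal variable is declared with `ordered=True`, which enables comparisons and `min`/`max`, while a nominal variable has no such order.

```python
import pandas as pd

# Hypothetical survey answers, for illustration only
responses = ["Agree", "Neutral", "Strongly Agree", "Disagree", "Agree"]
scale = ["Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"]

# Ordinal: the category order is declared explicitly
ordinal = pd.Series(pd.Categorical(responses, categories=scale, ordered=True))

# Nominal: categories exist, but no order is declared
nominal = pd.Series(["Male", "Female", "Female"], dtype="category")

lowest = ordinal.min()                       # smallest response on the scale
highest = ordinal.max()                      # largest response on the scale
at_least_agree = (ordinal >= "Agree").sum()  # comparisons respect the declared order

print(lowest, highest, at_least_agree)
print(nominal.cat.ordered)  # False: no order was declared
```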

Our first dataset

The object dtype is how pandas stores strings, and it is a good indicator that a variable might be categorical.

We can explore further using the following methods:

  • describe()

      adult['Marital Status'].describe()
    

    Out:

    count                  32561
    unique                     7
    top       married-civ-spouse
    freq                   14976
    Name: Marital Status, dtype: object

  • value_counts()

    This method returns a frequency table as a pandas Series.

      adult['Marital Status'].value_counts()
      # With normalize=True, return relative frequencies (proportions) instead of counts
      adult['Marital Status'].value_counts(normalize=True)
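To try both methods without the adult dataset, here is a small self-contained sketch. The `status` Series and its values are invented for illustration; only `describe()` and `value_counts()` come from the text above.

```python
import pandas as pd

# Hypothetical stand-in for adult['Marital Status']
status = pd.Series(
    ["Married", "Single", "Married", "Divorced", "Married", "Single"],
    name="Marital Status",
)

summary = status.describe()                   # count, unique, top, freq
counts = status.value_counts()                # absolute frequencies
shares = status.value_counts(normalize=True)  # relative frequencies

print(summary)
print(counts)
print(shares)
```

Here "Married" appears in 3 of the 6 rows, so it is both the `top` value in `describe()` and has a relative frequency of 0.5.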
    

Exploring a target variable

# Explore the Above/Below 50k variable
print(adult["Above/Below 50k"].describe())

# Print a frequency table of "Above/Below 50k"
print(adult["Above/Below 50k"].value_counts())

# Print relative frequency values
print(adult['Above/Below 50k'].value_counts(normalize=True))

Categorical data in pandas

By default, columns containing strings are not stored using pandas category dtype, as not every column containing strings needs to be categorical.

dtype: category

df['gender'] = df['gender'].astype('category')

df['gender'].dtype # CategoricalDtype(categories=['Female', 'Male'], ordered=False); inferred categories are sorted
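As a rough sketch of what the conversion does internally, the example below uses a tiny invented DataFrame (the four `gender` values are assumptions) and inspects the result through the `.cat` accessor: the distinct labels are stored once, and each row holds a small integer code pointing at one of them.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({"gender": ["Male", "Female", "Female", "Male"]})
df["gender"] = df["gender"].astype("category")

categories = df["gender"].cat.categories.tolist()  # distinct labels, sorted
codes = df["gender"].cat.codes.tolist()            # one integer code per row

print(categories)  # ['Female', 'Male']
print(codes)       # [1, 0, 0, 1]
```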

Creating a Categorical Series

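import pandas as pd

data = ['A', 'A', 'C', 'B', 'C', 'A']
series = pd.Series(data, dtype='category')
# pd.Categorical returns a Categorical, not a Series, so wrap it in pd.Series
ordered_series = pd.Series(pd.Categorical(data, categories=['C', 'B', 'A'], ordered=True))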
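One visible effect of a custom category order is sorting. The sketch below reuses the same `data` list; the names `plain` and `custom` are chosen here for illustration. A categorical with inferred categories sorts alphabetically, whereas one declared with `categories=['C', 'B', 'A']` sorts in that declared order.

```python
import pandas as pd

data = ['A', 'A', 'C', 'B', 'C', 'A']

# Categories inferred from the data (alphabetical)
plain = pd.Series(data, dtype='category')

# Categories and their order declared explicitly: C < B < A
custom = pd.Series(pd.Categorical(data, categories=['C', 'B', 'A'], ordered=True))

plain_sorted = plain.sort_values().tolist()
custom_sorted = custom.sort_values().tolist()

print(plain_sorted)   # ['A', 'A', 'A', 'B', 'C', 'C']
print(custom_sorted)  # ['C', 'C', 'B', 'A', 'A', 'A']
```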

Why do we use categorical: memory

adult = pd.read_csv("data/adult.csv")
adult['Marital Status'].nbytes # 260488

# Casting to category type
adult['Marital Status'] = adult['Marital Status'].astype('category')
adult['Marital Status'].nbytes # 32617
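The two figures are consistent with how `nbytes` is computed on a 64-bit system: an object column holds one 8-byte pointer per row (32561 × 8 = 260488), while a category column with seven categories holds one 1-byte code per row plus seven 8-byte category pointers (32561 + 56 = 32617). Note that `nbytes` does not count the string contents themselves; `memory_usage(deep=True)` does. The sketch below repeats the comparison on invented data (the `values` list is an assumption) so it runs without the CSV file.

```python
import pandas as pd

# Hypothetical column: three distinct labels repeated many times
values = ["Married", "Single", "Divorced"] * 10_000

as_object = pd.Series(values, dtype=object)  # one pointer per row
as_category = as_object.astype("category")   # small integer codes + 3 labels

# Shallow size: pointers or codes only
print(as_object.nbytes, as_category.nbytes)

# Deep size: also counts the Python string objects
print(as_object.memory_usage(deep=True), as_category.memory_usage(deep=True))
```

The saving grows with the number of rows and shrinks as the number of distinct values approaches the number of rows.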

Specifying dtypes when reading data

dtype_dict = {'Marital Status': 'category'}

adult = pd.read_csv('data/adult.csv', dtype=dtype_dict)
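The same idea can be tested without a file on disk. In this minimal sketch the CSV text is invented and read through `io.StringIO`; only the `dtype` argument of `pd.read_csv` mirrors the snippet above.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for data/adult.csv
csv_text = (
    "Age,Marital Status\n"
    "39,Never-married\n"
    "50,Married-civ-spouse\n"
    "38,Divorced\n"
)

dtype_dict = {"Marital Status": "category"}
df = pd.read_csv(io.StringIO(csv_text), dtype=dtype_dict)

# 'Marital Status' is category; 'Age' keeps its inferred integer dtype
print(df.dtypes)
```

Columns not listed in the dictionary are still inferred as usual, so only the columns you name are affected.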

Setting dtypes and saving memory

# Create a Series, default dtype
series1 = pd.Series(list_of_occupations)

# Print out the data type and number of bytes for series1
print("series1 data type:", series1.dtype)
print("series1 number of bytes:", series1.nbytes)

# Create a Series, "category" dtype
series2 = pd.Series(list_of_occupations, dtype="category")

# Print out the data type and number of bytes for series2
print("series2 data type:", series2.dtype)
print("series2 number of bytes:", series2.nbytes)

Creating a categorical pandas Series

# Create a categorical Series and specify the categories (let pandas know the order matters!)
medals = pd.Categorical(medals_won, categories=['Bronze', 'Silver', 'Gold'], ordered=True)
print(medals)

Setting dtypes when reading data

# Check the dtypes
print(adult.dtypes)

# Create a dictionary with column names as keys and "category" as values
adult_dtypes = {
   "Workclass": "category",
   "Education": "category",
   "Relationship": "category",
   "Above/Below 50k": "category" 
}

# Read in the CSV using the dtype parameter
adult2 = pd.read_csv(
  "adult.csv",
  dtype=adult_dtypes
)
print(adult2.dtypes)

The basics of .groupby(): splitting data

groupby_object = adult.groupby(by=["Above/Below 50k"])

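# Apply a function (numeric_only=True skips string columns, which recent pandas versions refuse to average)
groupby_object.mean(numeric_only=True)

# Specify columns
groupby_object[['Age', 'Education Num']].sum() # Option 1 (preferred): select the columns first, then aggregate
groupby_object.sum()[['Age', 'Education Num']] # Option 2: aggregates every column, then discards most of them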
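To see that both ways of specifying columns give the same answer, here is a small sketch on an invented four-row DataFrame (the `toy` data is an assumption; the column names follow the adult dataset). The difference is only in how much work pandas does: selecting first aggregates two columns, while aggregating first also sums columns that are then thrown away.

```python
import pandas as pd

# Hypothetical miniature of the adult dataset
toy = pd.DataFrame({
    "Above/Below 50k": ["<=50K", ">50K", "<=50K", ">50K"],
    "Age": [25, 45, 35, 55],
    "Education Num": [9, 13, 10, 14],
    "Hours/Week": [40, 50, 38, 45],
})

grouped = toy.groupby(by="Above/Below 50k")

# Select the columns, then aggregate
option1 = grouped[["Age", "Education Num"]].sum()

# Aggregate everything, then select the columns
option2 = grouped.sum()[["Age", "Education Num"]]

print(option1)
print(option1.equals(option2))  # True: same result, different cost
```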

Setting up .groupby() statement

# Group the adult dataset by "Sex" and "Above/Below 50k"
gb = adult.groupby(by=['Sex', 'Above/Below 50k'])

# Print out how many rows are in each created group
print(gb.size())

# Print out the mean of each group for all numeric columns
print(gb.mean(numeric_only=True))

Using pandas functions effectively

# Create a list of user-selected variables
user_list = ['Education', 'Above/Below 50k']

# Create a GroupBy object using this list
gb = adult.groupby(by=user_list)

# Find the mean of "Hours/Week" for each group; selecting the column first avoids aggregating every column
print(gb['Hours/Week'].mean())
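Grouping by a list of columns produces a result indexed by every combination of the grouping values that occurs in the data. The sketch below uses an invented five-row DataFrame (the `toy` values are assumptions) to show the shape of that result and how a single group can be looked up with a tuple.

```python
import pandas as pd

# Hypothetical miniature of the adult dataset
toy = pd.DataFrame({
    "Education": ["Bachelors", "Bachelors", "HS-grad", "HS-grad", "Bachelors"],
    "Above/Below 50k": [">50K", "<=50K", "<=50K", "<=50K", ">50K"],
    "Hours/Week": [50, 40, 38, 42, 46],
})

user_list = ["Education", "Above/Below 50k"]

# Select the column before aggregating, as in the snippet above
mean_hours = toy.groupby(by=user_list)["Hours/Week"].mean()

print(mean_hours)

# The result has a MultiIndex; look up one group with a tuple
print(mean_hours.loc[("Bachelors", ">50K")])  # 48.0
```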