Categorical pandas Series

Based on DataCamp

Setting category variables

The .cat accessor object

Series.cat.method_name

Common parameters:

  • new_categories: a list of categories

  • inplace: Boolean - whether or not the update should overwrite the Series (deprecated in pandas 1.3 and removed in 2.0; on current versions, assign the result back instead)

  • ordered: Boolean - whether or not the categorical is treated as an ordered categorical

Setting Series categories

dogs['coat'] = dogs['coat'].cat.set_categories(
    new_categories=['short', 'medium', 'long'],
    ordered=True  # if omitted, the current ordered setting is kept
)
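
A minimal sketch of what set_categories does to values that are not in the new list. The small DataFrame is a made-up stand-in for the dogs dataset, not the course data.

```python
import pandas as pd

# Hypothetical stand-in for the dogs dataset used in these notes.
dogs = pd.DataFrame({"coat": ["short", "long", "medium", "wirehaired", "short"]})
dogs["coat"] = dogs["coat"].astype("category")

# Keep only three categories and mark the categorical as ordered.
dogs["coat"] = dogs["coat"].cat.set_categories(
    new_categories=["short", "medium", "long"],
    ordered=True,
)

print(dogs["coat"].cat.categories)  # the three categories, in the given order
print(dogs["coat"].cat.ordered)     # whether the categorical is ordered
print(dogs["coat"].isna().sum())    # "wirehaired" is no longer a category, so that row is NaN
```

Values whose category is dropped become NaN, so it is worth checking the NaN count after calling set_categories.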

Adding categories

dogs['coat'] = dogs['coat'].cat.add_categories(
    new_categories=['unknown']
)

Removing categories

dogs['coat'] = dogs['coat'].cat.remove_categories(
    removals=['unknown']
)
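
A short sketch of the effect of adding and removing categories, using a made-up Series rather than the real coat column.

```python
import pandas as pd

# Made-up sample values; categories start out as ["long", "short"].
coat = pd.Series(["short", "long", "short"], dtype="category")

# Adding a category does not change any values; the new level just has a count of 0.
coat = coat.cat.add_categories(new_categories=["unknown"])
counts = coat.value_counts()
print(counts)

# Removing a category that is in use turns the matching values into NaN.
coat = coat.cat.remove_categories(removals=["long"])
n_missing = coat.isna().sum()
print(n_missing)
```

Because removal silently produces NaN, it is usually safer to reassign the affected rows to another category first.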

Exercise - Adding categories

# Check frequency counts while also printing the NaN count
print(dogs["keep_in"].value_counts(dropna=False))

# Switch to a categorical variable
dogs["keep_in"] = dogs["keep_in"].astype("category")

# Add new categories
new_categories = ["Unknown History", "Open Yard (Countryside)"]
dogs["keep_in"] = dogs["keep_in"].cat.add_categories(new_categories)

# Check frequency counts one more time
print(dogs["keep_in"].value_counts(dropna=False))

Exercise - Removing categories

# Set "maybe" to be "no"
dogs.loc[dogs["likes_children"] == "maybe", "likes_children"] = "no"

# Print out categories
print(dogs["likes_children"].cat.categories)

# Print the frequency table
print(dogs["likes_children"].value_counts())

# Remove the "maybe" category
dogs["likes_children"] = dogs["likes_children"].cat.remove_categories(["maybe"])
print(dogs["likes_children"].value_counts())

# Print the categories one more time
print(dogs["likes_children"].cat.categories)

Updating categories

Renaming categories

changes_dict = {"Unknown Mix": "Unknown"}
dogs["breed"] = dogs["breed"].cat.rename_categories(new_categories=changes_dict)

# Rename with lambda
dogs["Sex"] = dogs["Sex"].cat.rename_categories(lambda c: c.title())
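
A minimal sketch of both renaming styles on a made-up Series. The "beagle" value is an assumption added for illustration.

```python
import pandas as pd

# Made-up sample values for illustration.
breed = pd.Series(["Unknown Mix", "beagle", "Unknown Mix"], dtype="category")

# Dictionary form: only the listed categories are renamed.
breed = breed.cat.rename_categories({"Unknown Mix": "Unknown"})

# Callable form: the function is applied to every category label.
breed = breed.cat.rename_categories(lambda c: c.title())

print(breed.cat.categories)
print(breed.tolist())
```

Renamed categories must stay unique. Mapping two existing categories to the same name raises a ValueError, which is why collapsing categories takes a different route.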

Collapsing Categories

replacements = {
    "black and brown": "black",
    "black and tan": "black",
    "black and white": "black"
}
dogs["main_color"] = dogs["color"].replace(replacements)
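
One way to collapse categories so that the result is categorical, sketched on made-up colour values. It goes through plain strings first, because how .replace() treats categorical input has changed between pandas versions. This is a workaround chosen for the sketch, not the course's exact code.

```python
import pandas as pd

# Made-up sample values for illustration.
color = pd.Series(
    ["black and tan", "black", "white", "black and white"], dtype="category"
)

replacements = {
    "black and brown": "black",
    "black and tan": "black",
    "black and white": "black",
}

# Convert to plain objects, replace, then convert back to categorical.
main_color = color.astype("object").replace(replacements).astype("category")

print(main_color.cat.categories)
print(main_color.value_counts())
```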

Exercise - Renaming Categories

# Create the my_changes dictionary
my_changes = {
    "Maybe?": "Maybe"
}

# Rename the categories listed in the my_changes dictionary
dogs["likes_children"] = dogs["likes_children"].cat.rename_categories(my_changes)

# Use a lambda function to convert all categories to uppercase using upper()
dogs["likes_children"] = dogs["likes_children"].cat.rename_categories(lambda c: c.upper())

# Print the list of categories
print(dogs["likes_children"].cat.categories)

Exercise - Collapsing categories

# Create the update_coats dictionary
update_coats = {"wirehaired": "medium", "medium-long": "medium"}

# Create a new column, coat_collapsed
dogs["coat_collapsed"] = dogs["coat"].replace(update_coats)

# Convert the column to categorical
dogs["coat_collapsed"] = dogs["coat_collapsed"].astype("category")

# Print the frequency table
print(dogs["coat_collapsed"].value_counts())

Reordering categories

Why would you reorder?

  1. Creating an ordinal variable

  2. Setting the order in which categories are displayed in an analysis

  3. Memory savings

Reordering in pandas

dogs["coat"] = dogs["coat"].cat.reorder_categories(
    new_categories=["short", "medium", "wirehaired", "long"],
    ordered=True
)

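# In-place variant: works only on pandas < 2.0. The inplace parameter was
# deprecated in pandas 1.3 and removed in 2.0, so prefer the assignment above.
dogs["coat"].cat.reorder_categories(
    new_categories=["short", "medium", "wirehaired", "long"],
    ordered=True,
    inplace=True
)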
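
A small sketch of what an ordered categorical makes possible, using made-up coat values.

```python
import pandas as pd

# Made-up sample values for illustration.
coat = pd.Series(["long", "short", "wirehaired", "medium"], dtype="category")
coat = coat.cat.reorder_categories(
    new_categories=["short", "medium", "wirehaired", "long"],
    ordered=True,
)

# Sorting follows the category order, not alphabetical order.
sorted_coats = coat.sort_values().tolist()
print(sorted_coats)

# Ordered categoricals support min/max and comparisons against a category.
print(coat.min(), coat.max())
longer_than_medium = (coat > "medium").tolist()
print(longer_than_medium)
```

Note that reorder_categories expects exactly the existing categories, only in a different order. To add or drop categories at the same time, set_categories is the method to reach for.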

Reordering categories in a Series

# Print out the current categories of the size variable
print(dogs["size"].cat.categories)

# Reorder the categories, specifying the Series is ordinal, and overwrite the original Series
# (assigning back works on all pandas versions; inplace=True was removed in pandas 2.0)
dogs["size"] = dogs["size"].cat.reorder_categories(
  new_categories=["small", "medium", "large"],
  ordered=True
)

Using .groupby after reordering

# Previous code (assigning back, since inplace=True was removed in pandas 2.0)
dogs["size"] = dogs["size"].cat.reorder_categories(
  new_categories=["small", "medium", "large"],
  ordered=True
)

# How many Male/Female dogs are available of each size?
print(dogs.groupby("size")["sex"].value_counts())

# Do larger dogs need more room to roam?
print(dogs.groupby("size")["keep_in"].value_counts())
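
A sketch showing that groupby output follows the category order after reordering. The data is made up, and observed=True is passed explicitly as a precaution, since its default has changed between pandas versions.

```python
import pandas as pd

# Hypothetical stand-in for the dogs dataset.
dogs = pd.DataFrame({
    "size": ["large", "small", "medium", "small", "large"],
    "sex": ["male", "female", "male", "male", "female"],
})
dogs["size"] = dogs["size"].astype("category").cat.reorder_categories(
    new_categories=["small", "medium", "large"],
    ordered=True,
)

# Groups come back in category order (small, medium, large), not alphabetical order.
group_sizes = dogs.groupby("size", observed=True)["sex"].count()
print(group_sizes)
```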

Cleaning and accessing data

Possible issues with categorical data

  1. Inconsistent values: "Ham", "ham", " Ham"

  2. Misspelled values: "Ham", "Hma"

  3. Wrong data types: df["column_name"].dtype is not categorical

Identifying issues

  • Series.cat.categories

  • Series.value_counts()
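
A quick sketch of how these two checks expose the issues listed above, using the "Ham" example values.

```python
import pandas as pd

# Sample values taken from the issue list above.
food = pd.Series(["Ham", "ham", " Ham", "Hma", "Ham"], dtype="category")

# Each spelling variant shows up as its own category.
categories = food.cat.categories.tolist()
print(categories)

# The frequency table makes rare, suspicious values easy to spot.
counts = food.value_counts()
print(counts)
```

Four categories for what should be a single value is the signal that cleaning is needed.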

Fixing issues

  • Whitespace

      dogs["get_along_cats"] = dogs["get_along_cats"].str.strip()
    
  • Capitalization: .title(), .upper(), .lower()

      dogs["get_along_cats"] = dogs["get_along_cats"].str.title()
    
  • Misspelled words: .replace()

      replace_dict = {"Noo": "No"}
      dogs["get_along_cats"] = dogs["get_along_cats"].replace(replace_dict)
    
  • Wrong data types

      dogs["get_along_cats"] = dogs["get_along_cats"].astype("category")
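
The four fixes above can be chained into one pass. A minimal sketch on made-up yes/no answers, not the real get_along_cats column:

```python
import pandas as pd

# Made-up raw answers with whitespace, capitalization and spelling problems.
raw = pd.Series([" yes", "Yes", "no ", "Noo", "NO"])

cleaned = (
    raw.str.strip()              # remove leading and trailing whitespace
       .str.title()              # normalize capitalization, e.g. "NO" becomes "No"
       .replace({"Noo": "No"})   # fix the known misspelling
       .astype("category")       # convert last, once the labels are consistent
)

print(cleaned.cat.categories.tolist())
print(cleaned.value_counts())
```

Converting to category as the final step avoids carrying the messy variants along as categories.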
    

Exercise - Cleaning variables

# Fix the misspelled word
replace_map = {"Malez": "male"}

# Update the sex column using the created map
dogs["sex"] = dogs["sex"].replace(replace_map)

# Strip away leading and trailing whitespace
dogs["sex"] = dogs["sex"].str.strip()

# Make all responses lowercase
dogs["sex"] = dogs["sex"].str.lower()

# Convert to a categorical Series
dogs["sex"] = dogs["sex"].astype("category")

print(dogs["sex"].value_counts())

Exercise - Accessing and filtering data

#1
# Print the category of the coat for ID 23807
print(dogs.loc[23807, "coat"])

#2
# Find the count of male and female dogs who have a "long" coat
print(dogs.loc[dogs["coat"] == "long", "sex"].value_counts())

#3
# Print the mean age of dogs with a breed of "English Cocker Spaniel"
print(dogs.loc[dogs["breed"] == "English Cocker Spaniel", "age"].mean())

#4
# Count the number of dogs that have "English" in their breed name
print(dogs[dogs["breed"].str.contains("English", regex=False)].shape[0])
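
A self-contained sketch of the four lookups above on a tiny made-up table. The rows, the ages and the IDs 23808 and 23809 are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the dogs dataset, indexed by dog ID.
dogs = pd.DataFrame(
    {
        "coat": ["long", "short", "long"],
        "sex": ["male", "female", "female"],
        "breed": ["English Cocker Spaniel", "Beagle", "English Setter"],
        "age": [2.0, 5.0, 4.0],
    },
    index=[23807, 23808, 23809],
)
for col in ["coat", "sex", "breed"]:
    dogs[col] = dogs[col].astype("category")

# 1. Single value by row label and column name.
coat_23807 = dogs.loc[23807, "coat"]

# 2. Boolean mask on one categorical column, counts of another.
long_by_sex = dogs.loc[dogs["coat"] == "long", "sex"].value_counts()

# 3. Boolean mask combined with a numeric aggregate.
spaniel_age = dogs.loc[dogs["breed"] == "English Cocker Spaniel", "age"].mean()

# 4. The .str accessor also works on categorical columns of strings.
n_english = int(dogs["breed"].str.contains("English", regex=False).sum())

print(coat_23807, spaniel_age, n_english)
print(long_by_sex)
```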