Setting category variables
The .cat accessor object
Series.cat.method_name
Common parameters:
new_categories: a list of categories
inplace: Boolean - whether or not the update should overwrite the Series (deprecated in pandas 1.3 and removed in pandas 2.0; assign the result back instead)
ordered: Boolean - whether or not the categorical is treated as an ordered categorical
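To make the accessor concrete, here is a minimal sketch of what .cat exposes on a categorical Series. The tiny coat Series is a made-up stand-in for the dogs data, not the real dataset.

```python
import pandas as pd

# Hypothetical miniature stand-in for the dogs["coat"] column.
coat = pd.Series(["short", "long", "medium", "short"], dtype="category")

# The distinct categories (inferred in sorted order for strings).
print(coat.cat.categories)

# Whether the category order is treated as meaningful.
print(coat.cat.ordered)

# The integer code stored for each row (position of its category).
print(coat.cat.codes)
```

The methods in the rest of these notes (set_categories, add_categories, and so on) are called through the same accessor and return a new Series.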
Setting Series categories
dogs['coat'] = dogs['coat'].cat.set_categories(
    new_categories=['short', 'medium', 'long'],
    ordered=True  # default is False
)
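One consequence of set_categories worth seeing in a sketch: any value that is not in the new list becomes NaN. The four-row frame below is invented for illustration.

```python
import pandas as pd

# Hypothetical sample; "wirehaired" is deliberately left out of the new list.
dogs = pd.DataFrame({"coat": ["short", "long", "wirehaired", "medium"]})
dogs["coat"] = dogs["coat"].astype("category")

dogs["coat"] = dogs["coat"].cat.set_categories(
    new_categories=["short", "medium", "long"],
    ordered=True,
)

# The row that held "wirehaired" is now missing.
n_missing = dogs["coat"].isna().sum()
print(n_missing)
print(dogs["coat"].cat.categories)
```

Checking value_counts(dropna=False) before and after the call is a cheap way to notice values lost this way.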
Adding categories
dogs['coat'] = dogs['coat'].cat.add_categories(
    new_categories=['unknown']
)
Removing categories
dogs['coat'] = dogs['coat'].cat.remove_categories(
    removals=['unknown']
)
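A small round trip can show how adding and removing interact with the data. This is a sketch on an invented Series; using fillna to assign the new category is an assumption about one typical use, not something the notes prescribe.

```python
import pandas as pd

# Hypothetical sample with one missing value.
coat = pd.Series(["short", "long", None], dtype="category")

# A value can only be assigned once its category exists.
coat = coat.cat.add_categories(["unknown"])
coat = coat.fillna("unknown")
print(coat.value_counts())

# Removing a category turns the rows that used it back into NaN.
coat = coat.cat.remove_categories(["unknown"])
print(coat.value_counts(dropna=False))
```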
Exercise - Adding categories
# Check frequency counts while also printing the NaN count
print(dogs["keep_in"].value_counts(dropna=False))
# Switch to a categorical variable
dogs["keep_in"] = dogs["keep_in"].astype("category")
# Add new categories
new_categories = ["Unknown History", "Open Yard (Countryside)"]
dogs["keep_in"] = dogs["keep_in"].cat.add_categories(new_categories)
# Check frequency counts one more time
print(dogs["keep_in"].value_counts(dropna=False))
Exercise - Removing categories
# Set "maybe" to be "no"
dogs.loc[dogs["likes_children"] == "maybe", "likes_children"] = "no"
# Print out categories
print(dogs["likes_children"].cat.categories)
# Print the frequency table
print(dogs["likes_children"].value_counts())
# Remove the "maybe" category
dogs["likes_children"] = dogs["likes_children"].cat.remove_categories(["maybe"])
print(dogs["likes_children"].value_counts())
# Print the categories one more time
print(dogs["likes_children"].cat.categories)
Updating categories
Renaming categories
changes_dict = {"Unknown Mix": "Unknown"}
dogs["breed"] = dogs["breed"].cat.rename_categories(new_categories=changes_dict)
# Rename with lambda
dogs["sex"] = dogs["sex"].cat.rename_categories(lambda c: c.title())
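The two renaming styles can be tried on a toy Series. The breed values below are invented, and the last step is a sketch of why renaming cannot merge two categories into one: the new names must stay unique.

```python
import pandas as pd

# Hypothetical sample values.
breed = pd.Series(["Unknown Mix", "beagle", "Unknown Mix"], dtype="category")

# Rename with a dictionary: only the listed categories change.
breed = breed.cat.rename_categories({"Unknown Mix": "Unknown"})

# Rename with a function: applied to every category.
breed = breed.cat.rename_categories(lambda c: c.title())
print(breed.cat.categories)

# Mapping one category onto an existing name would create duplicates,
# which pandas rejects with a ValueError.
try:
    breed.cat.rename_categories({"Beagle": "Unknown"})
    collapsed = True
except ValueError:
    collapsed = False
print(collapsed)
```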
Collapsing Categories
replacements = {
    "black and brown": "black",
    "black and tan": "black",
    "black and white": "black"
}
dogs["main_color"] = dogs["color"].replace(replacements)
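As a sketch of the collapsing step on invented data: the colors start as plain strings, are replaced, and are then converted to a categorical. Converting explicitly at the end is an assumption made here so the result has the same dtype regardless of pandas version.

```python
import pandas as pd

# Hypothetical sample of the color column, stored as plain strings.
dogs = pd.DataFrame(
    {"color": ["black and tan", "black", "white", "black and white"]}
)

replacements = {
    "black and brown": "black",
    "black and tan": "black",
    "black and white": "black",
}

# Collapse several values into one, then store the result as a categorical.
dogs["main_color"] = dogs["color"].replace(replacements).astype("category")
print(dogs["main_color"].value_counts())
```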
Exercise - Renaming Categories
# Create the my_changes dictionary
my_changes = {
    "Maybe?": "Maybe"
}
# Rename the categories listed in the my_changes dictionary
dogs["likes_children"] = dogs["likes_children"].cat.rename_categories(my_changes)
# Use a lambda function to convert all categories to uppercase using upper()
dogs["likes_children"] = dogs["likes_children"].cat.rename_categories(lambda c: c.upper())
# Print the list of categories
print(dogs["likes_children"].cat.categories)
# Create the update_coats dictionary
update_coats = {"wirehaired": "medium", "medium-long": "medium"}
# Create a new column, coat_collapsed
dogs["coat_collapsed"] = dogs["coat"].replace(update_coats)
# Convert the column to categorical
dogs["coat_collapsed"] = dogs["coat_collapsed"].astype("category")
# Print the frequency table
print(dogs["coat_collapsed"].value_counts())
Reordering categories
Why would you reorder?
Creating an ordinal variable
To set the order that variables are displayed in the analysis
Memory savings
Reordering in pandas
dogs["coat"] = dogs["coat"].cat.reorder_categories(
    new_categories=["short", "medium", "wirehaired", "long"],
    ordered=True
)
# inplace (pandas < 2.0 only; the inplace parameter was removed in pandas 2.0)
dogs["coat"].cat.reorder_categories(
    new_categories=["short", "medium", "wirehaired", "long"],
    ordered=True,
    inplace=True
)
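What an ordered categorical buys you can be sketched on a toy Series (invented values): min, max, sorting, and comparisons all follow the category order instead of alphabetical order.

```python
import pandas as pd

# Hypothetical sample containing every coat type once.
coat = pd.Series(["long", "short", "wirehaired", "medium"], dtype="category")

coat = coat.cat.reorder_categories(
    new_categories=["short", "medium", "wirehaired", "long"],
    ordered=True,
)

# min/max and sorting now respect short < medium < wirehaired < long.
print(coat.min(), coat.max())
sorted_coats = coat.sort_values().tolist()
print(sorted_coats)

# Comparisons against a category are allowed for ordered categoricals.
longer_than_short = (coat > "short").tolist()
print(longer_than_short)
```

Note that reorder_categories expects exactly the existing categories in a new order; adding or dropping categories is what set_categories is for.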
Reordering categories in a Series
# Print out the current categories of the size variable
print(dogs["size"].cat.categories)
# Reorder the categories, specifying the Series is ordinal, and overwrite the original Series
# (assign the result back; inplace=True was removed in pandas 2.0)
dogs["size"] = dogs["size"].cat.reorder_categories(
    new_categories=["small", "medium", "large"],
    ordered=True
)
Using .groupby after reordering
# Previous code (assign the result back; inplace=True was removed in pandas 2.0)
dogs["size"] = dogs["size"].cat.reorder_categories(
    new_categories=["small", "medium", "large"],
    ordered=True
)
# How many Male/Female dogs are available of each size?
print(dogs.groupby("size")["sex"].value_counts())
# Do larger dogs need more room to roam?
print(dogs.groupby("size")["keep_in"].value_counts())
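The effect of reordering on grouped output can be sketched with a small invented frame: groups come back in category order (small, medium, large) rather than alphabetically. Passing observed=True is a choice made here to keep only the categories that actually occur.

```python
import pandas as pd

# Hypothetical sample rows.
dogs = pd.DataFrame({
    "size": ["large", "small", "medium", "small", "large"],
    "sex": ["male", "female", "male", "male", "female"],
})

dogs["size"] = dogs["size"].astype("category").cat.reorder_categories(
    ["small", "medium", "large"], ordered=True
)

# Group sizes appear in the order of the categories.
size_counts = dogs.groupby("size", observed=True).size()
print(size_counts)

# Same breakdown as in the exercise, split by sex within each size.
print(dogs.groupby("size", observed=True)["sex"].value_counts())
```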
Cleaning and accessing data
Possible issues with categorical data
Inconsistent values: "Ham", "ham", " Ham"
Misspelled values: "Ham", "Hma"
Wrong data types: df["column_name"].dtype is not categorical
Identifying issues
Series.cat.categories
Series.value_counts()
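A quick sketch of how these two checks surface problems, using the "Ham" variants listed earlier as made-up data: every spelling shows up as its own entry.

```python
import pandas as pd

# Hypothetical messy column built from the "Ham" examples.
food = pd.Series(["Ham", "ham", " Ham", "Hma", "Ham"])

# Each inconsistent or misspelled variant gets its own row in the counts.
counts = food.value_counts()
print(counts)

# The same variants appear as separate categories after conversion.
print(food.astype("category").cat.categories)

# The dtype check reveals that the original column is not categorical yet.
print(food.dtype)
```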
Fixing issues
Whitespace: .strip()
dogs["get_along_cats"] = dogs["get_along_cats"].str.strip()
Capitalization: .title(), .upper(), .lower()
dogs["get_along_cats"] = dogs["get_along_cats"].str.title()
Misspelled words: .replace()
replace_dict = {"Noo": "No"}
dogs["get_along_cats"] = dogs["get_along_cats"].replace(replace_dict)
Wrong data types
dogs["get_along_cats"] = dogs["get_along_cats"].astype("category")
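The four fixes can be chained into one pass. The sketch below runs them on an invented column; the specific messy values are assumptions chosen to trigger each fix once.

```python
import pandas as pd

# Hypothetical messy responses: stray spaces, mixed case, one misspelling.
dogs = pd.DataFrame({"get_along_cats": [" yes", "no ", "Noo", "YES", "no"]})

cleaned = (
    dogs["get_along_cats"]
    .str.strip()              # remove leading/trailing whitespace
    .str.title()              # normalize capitalization
    .replace({"Noo": "No"})   # fix the misspelled value
)

# Convert last, so the categories are built from the cleaned values.
dogs["get_along_cats"] = cleaned.astype("category")
print(dogs["get_along_cats"].value_counts())
```

Converting to a categorical at the end means the cleaned column has only the intended categories.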
Exercise - Cleaning variables
# Fix the misspelled word
replace_map = {"Malez": "male"}
# Update the sex column using the created map
dogs["sex"] = dogs["sex"].replace(replace_map)
# Strip away leading and trailing whitespace
dogs["sex"] = dogs["sex"].str.strip()
# Make all responses lowercase
dogs["sex"] = dogs["sex"].str.lower()
# Convert to a categorical Series
dogs["sex"] = dogs["sex"].astype("category")
print(dogs["sex"].value_counts())
Exercise - Accessing and filtering data
#1
# Print the category of the coat for ID 23807
print(dogs.loc[23807, "coat"])
#2
# Find the count of male and female dogs who have a "long" coat
print(dogs.loc[dogs["coat"] == "long", "sex"].value_counts())
#3
# Print the mean age of dogs with a breed of "English Cocker Spaniel"
print(dogs.loc[dogs["breed"] == "English Cocker Spaniel", "age"].mean())
#4
# Count the number of dogs that have "English" in their breed name
print(dogs[dogs["breed"].str.contains("English", regex=False)].shape[0])
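The four access patterns from this exercise can be run end to end on a small invented frame. The IDs, breeds, and ages below are hypothetical; only the indexing patterns mirror the exercise.

```python
import pandas as pd

# Hypothetical rows indexed by dog ID.
dogs = pd.DataFrame(
    {
        "coat": ["long", "short", "long"],
        "sex": ["male", "female", "female"],
        "breed": ["English Cocker Spaniel", "Beagle", "English Setter"],
        "age": [2.0, 5.0, 4.0],
    },
    index=[23807, 23808, 23809],
)
dogs = dogs.astype({"coat": "category", "sex": "category", "breed": "category"})

# 1: a single value by row label and column name
coat_of_dog = dogs.loc[23807, "coat"]

# 2: filter rows with a boolean mask, then count another column
long_coat_counts = dogs.loc[dogs["coat"] == "long", "sex"].value_counts()

# 3: filter on one column, aggregate a numeric one
spaniel_mean_age = dogs.loc[dogs["breed"] == "English Cocker Spaniel", "age"].mean()

# 4: the .str accessor also works on a categorical of strings
n_english = dogs["breed"].str.contains("English", regex=False).sum()

print(coat_of_dog, spaniel_mean_age, n_english)
print(long_coat_counts)
```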