Categorical pitfalls
Using categories can be frustrating
- Using the .str accessor object to manipulate data converts the Series to the object dtype.
- The .apply() method outputs a new Series with the object dtype.
- The common methods of adding, removing, replacing, or setting categories don't all handle missing categories the same way.
- NumPy functions generally don't work with categorical Series.
Overcoming pitfalls: string issues
# Print the frequency table of body_type and include NaN values
print(used_cars["body_type"].value_counts(dropna=False))
# Update NaN values
used_cars.loc[used_cars["body_type"].isna(), "body_type"] = "other"
# Convert body_type to title case
used_cars["body_type"] = used_cars["body_type"].str.title()
# Check the dtype
print(used_cars["body_type"].dtype)
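The dtype check above is there because string methods return a Series that is no longer categorical. A minimal sketch of converting back, assuming a small made-up Series in place of the used_cars column:

```python
import pandas as pd

# Hypothetical stand-in for used_cars["body_type"]
body_type = pd.Series(["sedan", "van", "sedan", "suv"], dtype="category")

# String methods operate on the values but return a non-categorical Series
titled = body_type.str.title()
print(titled.dtype)

# Convert back to category to regain the .cat accessor and memory savings
titled_cat = titled.astype("category")
print(titled_cat.dtype)
print(list(titled_cat.cat.categories))
```

Converting back is a single extra step, but it is easy to forget after a chain of string operations.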
Overcoming pitfalls: using NumPy arrays
# Print the frequency table of Sale Rating
print(used_cars["Sale Rating"].value_counts())
# Find the average score
average_score = used_cars["Sale Rating"].astype(int).mean()
# Print the average
print(average_score)
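The cast to int matters because numeric reductions are not defined for categorical data. A small sketch of the failure and the workaround, assuming integer-valued categories in a made-up Series:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for used_cars["Sale Rating"]
ratings = pd.Series([1, 2, 3, 3], dtype="category")

# Taking the mean of a categorical Series raises a TypeError
try:
    np.mean(ratings)
    mean_failed = False
except TypeError:
    mean_failed = True

# Casting to int first gives an ordinary numeric Series
average_score = ratings.astype(int).mean()
print(mean_failed, average_score)
```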
Label encoding
What is label encoding?
The Basics:
- Codes each category as an integer from 0 through n - 1, where n is the number of categories.
- A -1 code is reserved for any missing values.
- Can save on memory
- Often used in surveys
The Drawback:
- Is not the best encoding for machine learning
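The coding rules above can be checked on a tiny example. The Series below is made up for illustration; by default the categories are sorted, so the codes follow alphabetical order.

```python
import pandas as pd

# Hypothetical example data; None marks a missing value
colors = pd.Series(["red", "blue", None, "red"], dtype="category")

# Two categories give codes 0 and 1; the missing value gets -1
codes = colors.cat.codes
print(list(colors.cat.categories))  # ['blue', 'red']
print(codes.tolist())  # [1, 0, -1, 1]
```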
Creating codes
used_cars["manufacturer_codes"] = used_cars["manufacturer_name"].cat.codes
codes = used_cars["manufacturer_name"].cat.codes
categories = used_cars["manufacturer_name"]
name_map = dict(zip(codes, categories))
Reverting codes to category name
codes = used_cars["manufacturer_name"].cat.codes
categories = used_cars["manufacturer_name"]
name_map = dict(zip(codes, categories))
used_cars["manufacturer_codes"].map(name_map)
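The zip-based map above pairs each row's code with its value. An alternative sketch builds the same lookup from .cat.categories, where position i holds the category with code i. The data here is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for used_cars["manufacturer_name"]
names = pd.Series(["Ford", "Audi", "Ford", "Kia"], dtype="category")
codes = names.cat.codes

# Position i in .cat.categories is the category with code i
name_map = dict(enumerate(names.cat.categories))
print(name_map)  # {0: 'Audi', 1: 'Ford', 2: 'Kia'}

# Map the codes back to the category names
restored = codes.map(name_map)
print(restored.tolist())  # ['Ford', 'Audi', 'Ford', 'Kia']
```

Because this map has no entry for -1, rows that were missing before encoding come back as missing.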
Creating Boolean coding
used_cars["van_code"] = np.where(
    used_cars["body_type"].str.contains("van", regex=False), 1, 0
)
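If the column has missing values, the result of str.contains may not be a clean True/False mask for those rows. A hedged sketch that uses the na parameter to count missing rows as non-matches, on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for used_cars["body_type"], with one missing value
body_type = pd.Series(["van", "sedan", None, "minivan"])

# na=False makes missing values count as "does not contain"
is_van = body_type.str.contains("van", regex=False, na=False)
van_code = np.where(is_van, 1, 0)
print(van_code.tolist())  # [1, 0, 0, 1]
```

Whether missing rows should be coded 0 or kept as missing depends on the analysis; na=False is only one option.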
Create a label encoding and map
# Convert to categorical and print the frequency table
used_cars["color"] = used_cars["color"].astype("category")
print(used_cars["color"].value_counts())
# Create a label encoding
used_cars["color_code"] = used_cars["color"].cat.codes
# Create codes and categories objects
codes = used_cars["color"].cat.codes
categories = used_cars["color"]
color_map = dict(zip(codes, categories))
# Print the map
print(color_map)
Using saved mappings
# Update the color column using the color_map
used_cars_updated["color"] = used_cars_updated["color"].map(color_map)
# Update the engine fuel column using the fuel_map
used_cars_updated["engine_fuel"] = used_cars_updated["engine_fuel"].map(fuel_map)
# Update the transmission column using the transmission_map
used_cars_updated["transmission"] = used_cars_updated["transmission"].map(transmission_map)
# Print the info statement
print(used_cars_updated.info())
Creating a Boolean encoding
# Print the manufacturer name frequency table
print(used_cars["manufacturer_name"].value_counts())
# Create a Boolean column based on whether the manufacturer name contains Volkswagen
used_cars["is_volkswagen"] = np.where(
    used_cars["manufacturer_name"].str.contains("Volkswagen", regex=False), True, False
)
# Create the same column based on whether the manufacturer name contains Volkswagen, using 0s and 1s
used_cars["is_volkswagen"] = np.where(
    used_cars["manufacturer_name"].str.contains("Volkswagen", regex=False), 1, 0
)
# Check the final frequency table
print(used_cars["is_volkswagen"].value_counts())
One-hot encoding
One-hot encoding is the process of creating dummy variables.
One-hot encoding with pandas
pd.get_dummies()
- data: a pandas DataFrame
- columns: a list-like object of column names
- prefix: a string to add to the beginning of each category
One-hot encoding on a DataFrame
used_cars_onehot = pd.get_dummies(used_cars[["odometer_value", "color"]], prefix="c")
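To see what this call produces, here is a sketch on a made-up three-row DataFrame with the same two column names. Only the string column is expanded into dummy columns; the numeric column passes through unchanged.

```python
import pandas as pd

# Hypothetical miniature of the two used_cars columns used above
cars = pd.DataFrame({
    "odometer_value": [100, 200, 300],
    "color": ["red", "blue", "red"],
})

# One new column per color, each named with the "c" prefix
onehot = pd.get_dummies(cars[["odometer_value", "color"]], prefix="c")
print(list(onehot.columns))  # ['odometer_value', 'c_blue', 'c_red']
```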
One-hot encoding specific column
# Create one-hot encoding for just two columns
used_cars_simple = pd.get_dummies(
    used_cars,
    # Specify the columns from the instructions
    columns=["manufacturer_name", "transmission"],
    # Set the prefix
    prefix="dummy"
)
)
# Print the shape of the new dataset
print(used_cars_simple.shape)
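With a single string prefix, the dummy columns from both source columns share the same "dummy" prefix. As a possible variation, prefix also accepts a dict keyed by column name. The sketch below uses made-up data, and the price_usd column and the prefix names are hypothetical.

```python
import pandas as pd

# Hypothetical miniature dataset; price_usd is an illustrative extra column
cars = pd.DataFrame({
    "manufacturer_name": ["Ford", "Audi", "Ford"],
    "transmission": ["manual", "automatic", "manual"],
    "price_usd": [5000, 9000, 4500],
})

# A dict gives each encoded column its own prefix
simple = pd.get_dummies(
    cars,
    columns=["manufacturer_name", "transmission"],
    prefix={"manufacturer_name": "make", "transmission": "trans"},
)
print(list(simple.columns))
print(simple.shape)  # (3, 5)
```

The column count grows by the number of distinct categories minus the encoded columns themselves, which is why checking the shape after encoding is a useful sanity check.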