Pitfalls and Encoding

Based on DataCamp

Categorical pitfalls

Using categories can be frustrating

  • Using the .str accessor to manipulate data returns a Series with the object dtype, not a categorical one.

  • The .apply() method can output a new Series with the object dtype.

  • The common methods of adding, removing, replacing, or setting categories don't all handle missing categories the same way.

  • NumPy functions generally don't work with categorical series.
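Two of the pitfalls above can be reproduced on a small, made-up Series. This is only a sketch: the data and the variable names (body_type, title_case, trimmed) are illustrative assumptions, not part of the course dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for a categorical column
body_type = pd.Series(["sedan", "suv", "sedan", "van"], dtype="category")

# Pitfall: the .str accessor returns a Series that is no longer categorical
title_case = body_type.str.title()
print(title_case.dtype)

# Casting back restores the categorical dtype
title_case = title_case.astype("category")
print(title_case.dtype)  # category

# Pitfall: set_categories silently turns values of dropped categories into NaN
trimmed = body_type.cat.set_categories(["sedan", "suv"])
print(trimmed.isna().sum())  # 1

# ...whereas remove_categories raises when asked to remove an unknown category
try:
    body_type.cat.remove_categories(["truck"])
except ValueError as err:
    print("remove_categories failed:", err)
```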

Overcoming pitfalls: string issues

# Print the frequency table of body_type and include NaN values
print(used_cars["body_type"].value_counts(dropna=False))

# Update NaN values
used_cars.loc[used_cars["body_type"].isna(), "body_type"] = "other"

# Convert body_type to title case
used_cars["body_type"] = used_cars["body_type"].str.title()

# Check the dtype
print(used_cars["body_type"].dtype)
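If body_type should stay categorical through these steps, one possible variant is sketched below. It assumes a tiny, made-up used_cars frame; the idea is to register "other" as a category before filling, then cast back after the string operation.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the used_cars data
used_cars = pd.DataFrame({"body_type": ["sedan", np.nan, "suv", "sedan"]})
used_cars["body_type"] = used_cars["body_type"].astype("category")

# Register the new label first, then fill the missing values with it
used_cars["body_type"] = (
    used_cars["body_type"].cat.add_categories(["other"]).fillna("other")
)

# .str.title() drops the categorical dtype, so cast back afterwards
used_cars["body_type"] = used_cars["body_type"].str.title().astype("category")

print(used_cars["body_type"].cat.categories.tolist())  # ['Other', 'Sedan', 'Suv']
```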

Overcoming pitfalls: using NumPy arrays

# Print the frequency table of Sale Rating
print(used_cars["Sale Rating"].value_counts())

# Find the average score
average_score = used_cars["Sale Rating"].astype(int).mean()

# Print the average
print(average_score)
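The reason for the astype(int) step can be seen on a toy Series. The ratings below are invented for illustration; the sketch shows a NumPy reduction failing on categorical data and succeeding once the values are cast to integers.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings stored as a categorical Series
ratings = pd.Series([4, 5, 3, 5], dtype="category")

# NumPy reductions are not defined for categorical data
try:
    np.mean(ratings)
except TypeError as err:
    print("NumPy reduction failed:", err)

# Casting to int first makes the numeric operation available
average_score = ratings.astype(int).mean()
print(average_score)  # 4.25
```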

Label encoding

What is label encoding?

The Basics:

  • Codes each category as an integer from 0 through n - 1, where n is the number of categories.

  • A -1 code is reserved for any missing values.

  • Can save on memory

  • Often used in surveys

The Drawback:

  • Is not the best encoding for machine learning, because the integer codes suggest an order and magnitude that the categories may not have.
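A minimal sketch of the basics above, using a made-up colors Series: the codes run from 0 to n - 1 in category order, and the missing value receives -1.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
colors = pd.Series(["red", "blue", np.nan, "red", "green"], dtype="category")

# Categories are sorted, and each one gets a code from 0 to n - 1
print(colors.cat.categories.tolist())  # ['blue', 'green', 'red']

# The missing value is coded as -1
codes = colors.cat.codes
print(codes.tolist())  # [2, 0, -1, 2, 1]
```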

Creating codes

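# Store the integer code of each manufacturer in a new column
used_cars["manufacturer_codes"] = used_cars["manufacturer_name"].cat.codes

# Build a dictionary that maps each code to its category name
codes = used_cars["manufacturer_name"].cat.codes
categories = used_cars["manufacturer_name"]
name_map = dict(zip(codes, categories))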

Reverting codes to category name

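# Build a dictionary that maps each code to its category name
codes = used_cars["manufacturer_name"].cat.codes
categories = used_cars["manufacturer_name"]
name_map = dict(zip(codes, categories))

# Translate the codes back into manufacturer names
used_cars["manufacturer_codes"].map(name_map)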
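The round trip from names to codes and back can be checked on a small example. The frame below is invented, and building the map with enumerate over .cat.categories is an alternative assumed here, not the approach shown in the notes; it gives the same code-to-name pairs for non-missing values.

```python
import pandas as pd

# Hypothetical stand-in for the used_cars data
used_cars = pd.DataFrame(
    {"manufacturer_name": ["Ford", "Audi", "Ford", "Volkswagen"]}
)
used_cars["manufacturer_name"] = used_cars["manufacturer_name"].astype("category")
used_cars["manufacturer_codes"] = used_cars["manufacturer_name"].cat.codes

# Position i in .cat.categories is the category with code i
name_map = dict(enumerate(used_cars["manufacturer_name"].cat.categories))
print(name_map)  # {0: 'Audi', 1: 'Ford', 2: 'Volkswagen'}

# Map the codes back to the original names
restored = used_cars["manufacturer_codes"].map(name_map)
print(restored.tolist())  # ['Ford', 'Audi', 'Ford', 'Volkswagen']
```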

Creating Boolean coding

import numpy as np

# 1 if body_type contains "van", otherwise 0
used_cars["van_code"] = np.where(
    used_cars["body_type"].str.contains("van", regex=False), 1, 0
)
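A self-contained sketch of the same idea on invented data follows. The na=False argument is an addition assumed here: it makes missing values count as non-matches rather than passing NaN into np.where.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the used_cars data, including a missing value
used_cars = pd.DataFrame({"body_type": ["minivan", "sedan", "van", None]})

# Substring match: "minivan" also contains "van"; na=False treats missing as no match
used_cars["van_code"] = np.where(
    used_cars["body_type"].str.contains("van", regex=False, na=False), 1, 0
)

print(used_cars["van_code"].tolist())  # [1, 0, 1, 0]
```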

Create a label encoding and map

# Convert to categorical and print the frequency table
used_cars["color"] = used_cars["color"].astype("category")
print(used_cars["color"].value_counts())

# Create a label encoding
used_cars["color_code"] = used_cars["color"].cat.codes

# Create codes and categories objects
codes = used_cars["color"].cat.codes
categories = used_cars["color"]
color_map = dict(zip(codes, categories))

# Print the map
print(color_map)

Using saved mappings

# Update the color column using the color_map
used_cars_updated["color"] = used_cars_updated["color"].map(color_map)
# Update the engine fuel column using the fuel_map
used_cars_updated["engine_fuel"] = used_cars_updated["engine_fuel"].map(fuel_map)
# Update the transmission column using the transmission_map
used_cars_updated["transmission"] = used_cars_updated["transmission"].map(transmission_map)

# Print the info statement
print(used_cars_updated.info())
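What applying a saved mapping looks like can be sketched with a made-up color_map and a coded column. Both are assumptions for illustration; the real maps come from the original dataset. Codes that are absent from the map come back as NaN.

```python
import pandas as pd

# Hypothetical saved mapping from code to color name
color_map = {0: "black", 1: "blue", 2: "red"}

# Hypothetical coded data; 7 has no entry in the map
used_cars_updated = pd.DataFrame({"color": [2, 0, 1, 7]})

# .map() looks up each code; unknown codes become NaN
used_cars_updated["color"] = used_cars_updated["color"].map(color_map)

print(used_cars_updated["color"].tolist()[:3])  # ['red', 'black', 'blue']
print(used_cars_updated["color"].isna().sum())  # 1
```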

Creating a Boolean encoding

# Print the manufacturer name frequency table
print(used_cars["manufacturer_name"].value_counts())

# Create a Boolean column indicating whether the manufacturer name contains Volkswagen
used_cars["is_volkswagen"] = np.where(
  used_cars["manufacturer_name"].str.contains("Volkswagen", regex=False), True, False
)

# Create the same column using 0s and 1s instead of True and False
used_cars["is_volkswagen"] = np.where(
  used_cars["manufacturer_name"].str.contains("Volkswagen", regex=False), 1, 0
)

# Check the final frequency table
print(used_cars["is_volkswagen"].value_counts())

One-hot encoding

One-hot encoding is the process of creating dummy variables.

One-hot encoding with pandas

pd.get_dummies()

  • data: a pandas DataFrame

  • columns: a list-like object of column names

  • prefix: a string to add to the beginning of each new column name.
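The three parameters above can be seen together in a short sketch on invented data. The default separator between the prefix and the category is an underscore.

```python
import pandas as pd

# Hypothetical stand-in for the used_cars data
used_cars = pd.DataFrame(
    {"odometer_value": [1000, 52000, 7500], "color": ["red", "blue", "red"]}
)

# Encode only the color column; each new column name starts with "c_"
onehot = pd.get_dummies(used_cars, columns=["color"], prefix="c")

print(onehot.columns.tolist())  # ['odometer_value', 'c_blue', 'c_red']

# Each row has a true value in exactly one of the dummy columns
print(onehot["c_red"].astype(int).tolist())  # [1, 0, 1]
```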

One-hot encoding on a DataFrame

used_cars_onehot = pd.get_dummies(used_cars[["odometer_value", "color"]], prefix="c")

One-hot encoding specific column

# Create one-hot encoding for just two columns
used_cars_simple = pd.get_dummies(
  used_cars,
  # Encode only these two columns
  columns=["manufacturer_name", "transmission"],
  # Set the prefix
  prefix="dummy"
)

# Print the shape of the new dataset
print(used_cars_simple.shape)
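The resulting shape can be predicted: each encoded column is replaced by one column per unique value. The sketch below checks this on a made-up frame; the price_usd column and all values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the used_cars data
used_cars = pd.DataFrame(
    {
        "manufacturer_name": ["Ford", "Audi", "Ford", "Volkswagen"],
        "transmission": ["automatic", "mechanical", "mechanical", "automatic"],
        "price_usd": [9000, 15000, 7000, 12000],
    }
)

encoded = ["manufacturer_name", "transmission"]
used_cars_simple = pd.get_dummies(used_cars, columns=encoded, prefix="dummy")

# Original columns, minus the encoded ones, plus one column per unique value
expected_cols = (
    used_cars.shape[1]
    - len(encoded)
    + sum(used_cars[col].nunique() for col in encoded)
)

print(used_cars_simple.shape)  # (4, 6)
print(expected_cols)           # 6
```

Because a single string prefix is shared by both encoded columns, the new names no longer show which original column a dummy came from; passing a list or dict of prefixes is one way to keep them distinct.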