Data Cleaning and Imputation

Based on DataCamp

Addressing missing data

Why is missing data a problem?

  • Affects distributions

  • Less representative of the population

  • Can result in drawing incorrect conclusions (illustrated in the sketch below)
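
As a quick illustration (a minimal sketch with made-up salary values), summary statistics computed from incomplete data can drift away from the true values:

import pandas as pd
import numpy as np

# Hypothetical salaries; the two missing entries happen to be the highest ones
salaries = pd.Series([40000, 45000, 50000, np.nan, np.nan])
print(salaries.mean())  # 45000.0, computed only from the observed values
# If the missing entries were actually 90000 and 95000, the true mean would be 64000,
# so conclusions drawn from the incomplete data under-represent the population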

Checking missing values

df.isna()
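
df.isna() returns a Boolean DataFrame, so in practice it is usually combined with an aggregation, for example:

df.isna().sum()        # number of missing values per column
df.isna().sum().sum()  # total number of missing values in the DataFrame
df.isna().any()        # whether each column contains at least one missing value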

Strategies for addressing missing data

  • Drop missing values – if they make up 5% or less of the total values

  • Replace with the mean, median, or mode – depends on the distribution and context (see the sketch after this list)

  • Impute by sub-group – e.g., different experience levels have different median salaries
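
A minimal sketch of the second strategy, assuming a hypothetical numeric Salary column: inspect the distribution first, then choose the statistic.

import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(data=df, x='Salary')  # check the shape of the distribution
plt.show()

# Roughly symmetric -> mean; skewed -> median; categorical column -> mode
df['Salary'] = df['Salary'].fillna(df['Salary'].median())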

Dropping missing values

threshold = len(df) * 0.05
cols_to_drop = df.columns[df.isna().sum() <= threshold]
df.dropna(subset=cols_to_drop, inplace=True)  # inplace=True updates df directly

Imputing a summary statistic

cols_with_missing_values = df.columns[df.isna().sum() > 0]
for col in cols_with_missing_values[:-1]:  # leave the last column for sub-group imputation
    df[col] = df[col].fillna(df[col].mode()[0])  # assign back so the imputation is kept

Imputing by sub-group

df_dict = df.groupby('category')['count'].median().to_dict()
df['count'] = df['count'].fillna(df['category'].map(df_dict))

Dealing with missing data

# Count the number of missing values in each column
print(planes.isna().sum())

# Find the five percent threshold
threshold = len(planes) * 0.05

# Create a filter
cols_to_drop = planes.columns[planes.isna().sum() <= threshold]

# Drop missing values for columns below the threshold
planes.dropna(subset=cols_to_drop, inplace=True)

print(planes.isna().sum())

Strategies for remaining missing data

# Check the values of the Additional_Info column
print(planes["Additional_Info"].value_counts())

# Create a box plot of Price by Airline
sns.boxplot(data=planes, x='Airline', y='Price')

plt.show()

Imputing missing plane prices

# Calculate median plane ticket prices by Airline
airline_prices = planes.groupby("Airline")["Price"].median()

print(airline_prices)

# Convert to a dictionary
prices_dict = airline_prices.to_dict()

# Map the dictionary to missing values of Price by Airline
planes["Price"] = planes["Price"].fillna(planes["Airline"].map(prices_dict))

# Check for missing values
print(planes.isna().sum())

Converting and analyzing categorical data

Previewing categorical data

df.select_dtypes("object").head()

Extracting value from categories

  • The current format limits our ability to generate insights

  • pandas.Series.str.contains() searches a column for a specific string, or for several strings separated by |

df['category'].str.contains('Scientist')
df['category'].str.contains('Machine Learning|AI') # Multiple phrases
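
The Boolean Series returned by str.contains() can also be used directly to filter rows. A small sketch reusing the hypothetical 'category' column; na=False treats missing values as non-matches:

scientists = df[df['category'].str.contains('Scientist', na=False)]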

Creating the categorical column

job_categories = [
    'Data Science',
    'Data Analytics',
    'Data Engineering',
    'Machine Learning',
    'Managerial',
    'Consultant'
]
conditions = [
    (df['Designation'].str.contains('Data Scientist|NLP')),
    (df['Designation'].str.contains('Analyst|Analytics')),
    (df['Designation'].str.contains('Data Engineer|ETL|Architect|Infrastructure')),
    (df['Designation'].str.contains('Machine')),
    (df['Designation'].str.contains('Manager|Head|Director|Lead|Principal|Staff')),
    (df['Designation'].str.contains('Consultant|Freelance'))
]
df['Job_Category'] = np.select(conditions, job_categories, default='Other')

display(df[['Designation', 'Job_Category']].head())
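
As a quick sanity check on the mapping (a brief sketch, assuming seaborn and matplotlib are imported as sns and plt as in the other snippets), the counts per new category can be plotted:

sns.countplot(data=df, x='Job_Category')
plt.xticks(rotation=45)
plt.show()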

Finding the number of unique values

# Filter the DataFrame for object columns
non_numeric = planes.select_dtypes("object")

# Loop through columns
for col in non_numeric.columns:

  # Print the number of unique values
  print(f"Number of unique values in {col} column: ", non_numeric[col].nunique())

Flight duration category

# Create a list of categories
flight_categories = ["Short-haul", "Medium", "Long-haul"]

# Create short-haul values
short_flights = "0h|1h|2h|3h|4h"

# Create medium-haul values
medium_flights = "5h|6h|7h|8h|9h"

# Create long-haul values
long_flights = "10h|11h|12h|13h|14h|15h|16h"

Adding duration categories

# Create conditions for values in flight_categories to be created
conditions = [
    (planes["Duration"].str.contains(short_flights)),
    (planes["Duration"].str.contains(medium_flights)),
    (planes["Duration"].str.contains(long_flights))
]

# Apply the conditions list to the flight_categories
planes["Duration_Category"] = np.select(conditions, 
                                        flight_categories,
                                        default="Extreme duration")

# Plot the counts of each category
sns.countplot(data=planes, x="Duration_Category")
plt.show()

Working with numeric data

Flight duration

# Preview the column
print(planes["Duration"].head())

# Remove the string character
planes["Duration"] = planes["Duration"].str.replace("h", "")

# Convert to float data type
planes["Duration"] = planes["Duration"].astype(float)

# Plot a histogram
sns.histplot(data=planes,x='Duration')
plt.show()

Adding descriptive statistics

#1
# Price standard deviation by Airline
planes["airline_price_st_dev"] = planes.groupby("Airline")["Price"].transform(lambda x: x.std())

print(planes[["Airline", "airline_price_st_dev"]].value_counts())

#2
# Median Duration by Airline
planes["airline_median_duration"] = planes.groupby("Airline")["Duration"].transform(lambda x : x.median())

print(planes[["Airline","airline_median_duration"]].value_counts())

#3
# Mean Price by Destination
planes["price_destination_mean"] = planes.groupby("Destination")["Price"].transform(lambda x : x.mean())

print(planes[["Destination","price_destination_mean"]].value_counts())

Handling outliers

What is an outlier?

An outlier is an observation that lies far away from the other data points.

IQR

  • IQR = 75th percentile - 25th percentile

  • Upper outliers: values > 75th percentile + (1.5 * IQR)

  • Lower outliers: values < 25th percentile - (1.5 * IQR)

seventy_fifth = df['Salary'].quantile(0.75)
twenty_fifth = df['Salary'].quantile(0.25)
iqr = seventy_fifth - twenty_fifth
upper_threshold = seventy_fifth + (1.5 * iqr)
lower_threshold = twenty_fifth - (1.5 * iqr)
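
Before deciding what to do, it can help to count how many observations actually fall outside the thresholds (a small sketch continuing the Salary example above):

outlier_mask = (df['Salary'] < lower_threshold) | (df['Salary'] > upper_threshold)
print(outlier_mask.sum())  # number of salaries flagged as outliers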

Why look for outliers?

  • Outliers are extreme values and may not accurately represent the data

  • They can distort the mean and the standard deviation (see the sketch after this list)

  • Some statistical tests and machine learning models assume normally distributed data, so extreme values can skew the results
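
A minimal sketch with made-up values shows how a single extreme value pulls both statistics:

import pandas as pd

salaries = pd.Series([40, 42, 45, 47, 50])       # in thousands
with_outlier = pd.Series([40, 42, 45, 47, 500])  # one extreme value

print(salaries.mean(), salaries.std())           # ~44.8 and ~4.0
print(with_outlier.mean(), with_outlier.std())   # ~134.8 and ~204.2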

What to do about outliers?

Questions to ask:

  • Why do these outliers exist?

  • Is the data accurate?

Identifying outliers

# Plot a histogram of flight prices
sns.histplot(data=planes, x="Price")
plt.show()

# Display descriptive statistics for flight duration
print(planes['Duration'].describe())

Removing outliers

# Find the 75th and 25th percentiles
price_seventy_fifth = planes["Price"].quantile(0.75)
price_twenty_fifth = planes["Price"].quantile(0.25)

# Calculate iqr
prices_iqr = price_seventy_fifth - price_twenty_fifth

# Calculate the thresholds
upper = price_seventy_fifth + (1.5 * prices_iqr)
lower = price_twenty_fifth - (1.5 * prices_iqr)

# Subset the data
planes = planes[(planes["Price"] > lower) & (planes["Price"] < upper)]

print(planes["Price"].describe())