Data Cleaning and Imputation

Based on DataCamp

Addressing missing data

Why is missing data a problem?

  • Affects distributions

  • Less representative of the population

  • Can result in drawing incorrect conclusions (illustrated in the sketch below)
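
As a quick illustration (a minimal sketch with made-up salary values), summary statistics computed from incomplete data can drift away from the true values:

import pandas as pd
import numpy as np

# Hypothetical salaries; the two missing entries happen to be the highest ones
salaries = pd.Series([40000, 45000, 50000, np.nan, np.nan])
print(salaries.mean())  # 45000.0, computed only from the observed values
# If the missing entries were actually 90000 and 95000, the true mean would be 64000,
# so conclusions drawn from the incomplete data under-represent the population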

Checking missing values

df.isna()
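
df.isna() returns a Boolean DataFrame, so in practice it is usually combined with an aggregation, for example:

df.isna().sum()        # number of missing values per column
df.isna().sum().sum()  # total number of missing values in the DataFrame
df.isna().any()        # whether each column contains at least one missing value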

Strategies for addressing missing data

  • Drop missing values – if they make up 5% or less of the total values

  • Replace with the mean, median, or mode – depends on the distribution and context (see the sketch after this list)

  • Impute by sub-group – e.g., different experience levels have different median salaries
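
A minimal sketch of the second strategy, assuming a hypothetical numeric Salary column: inspect the distribution first, then choose the statistic.

import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(data=df, x='Salary')  # check the shape of the distribution
plt.show()

# Roughly symmetric -> mean; skewed -> median; categorical column -> mode
df['Salary'] = df['Salary'].fillna(df['Salary'].median())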

Dropping missing values

threshold = len(df) * 0.05
cols_to_drop = df.columns[df.isna().sum() <= threshold]
df.dropna(subset=cols_to_drop, inplace=True)  # inplace=True updates df directly

Imputing a summary statistic

cols_with_missing_values = df.columns[df.isna().sum() > 0]
for col in cols_with_missing_values[:-1]:  # leave the last column for sub-group imputation
    df[col] = df[col].fillna(df[col].mode()[0])  # assign back so the imputation is kept

Imputing by sub-group

df_dict = df.groupby('category')['count'].median().to_dict()
df['count'] = df['count'].fillna(df['category'].map(df_dict))

Dealing with missing data

# Count the number of missing values in each column
print(planes.isna().sum())

# Find the five percent threshold
threshold = len(planes) * 0.05

# Create a filter
cols_to_drop = planes.columns[planes.isna().sum() <= threshold]

# Drop missing values for columns below the threshold
planes.dropna(subset=cols_to_drop, inplace=True)

print(planes.isna().sum())

Strategies for remaining missing data

# Check the values of the Additional_Info column
print(planes["Additional_Info"].value_counts())

# Create a box plot of Price by Airline
sns.boxplot(data=planes, x='Airline', y='Price')

plt.show()

Imputing missing plane prices

# Calculate median plane ticket prices by Airline
airline_prices = planes.groupby("Airline")["Price"].median()

print(airline_prices)

# Convert to a dictionary
prices_dict = airline_prices.to_dict()

# Map the dictionary to missing values of Price by Airline
planes["Price"] = planes["Price"].fillna(planes["Airline"].map(prices_dict))

# Check for missing values
print(planes.isna().sum())

Converting and analyzing categorical data

Previewing categorical data

df.select_dtypes("object").head()

Extracting value from categories

  • The current format limits our ability to generate insights

  • pandas.Series.str.contains() searches a column for a specific string, or for several strings separated by |

df['category'].str.contains('Scientist')
df['category'].str.contains('Machine Learning|AI') # Multiple phrases
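
The Boolean Series returned by str.contains() can also be used directly to filter rows. A small sketch reusing the hypothetical 'category' column; na=False treats missing values as non-matches:

scientists = df[df['category'].str.contains('Scientist', na=False)]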

Creating the categorical column

job_categories = [
    'Data Science',
    'Data Analytics',
    'Data Engineering',
    'Machine Learning',
    'Managerial',
    'Consultant'
]
conditions = [
    (df['Designation'].str.contains('Data Scientist|NLP')),
    (df['Designation'].str.contains('Analyst|Analytics')),
    (df['Designation'].str.contains('Data Engineer|ETL|Architect|Infrastructure')),
    (df['Designation'].str.contains('Machine')),
    (df['Designation'].str.contains('Manager|Head|Director|Lead|Principal|Staff')),
    (df['Designation'].str.contains('Consultant|Freelance'))
]
df['Job_Category'] = np.select(conditions, job_categories, default='Other')

display(df[['Designation', 'Job_Category']].head())
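
As a quick sanity check on the mapping (a brief sketch, assuming seaborn and matplotlib are imported as sns and plt as in the other snippets), the counts per new category can be plotted:

sns.countplot(data=df, x='Job_Category')
plt.xticks(rotation=45)
plt.show()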

Finding the number of unique values

# Filter the DataFrame for object columns
non_numeric = planes.select_dtypes("object")

# Loop through columns
for col in non_numeric.columns:

  # Print the number of unique values
  print(f"Number of unique values in {col} column: ", non_numeric[col].nunique())

Flight duration category

# Create a list of categories
flight_categories = ["Short-haul", "Medium", "Long-haul"]

# Create short-haul values
short_flights = "0h|1h|2h|3h|4h"

# Create medium-haul values
medium_flights = "5h|6h|7h|8h|9h"

# Create long-haul values
long_flights = "10h|11h|12h|13h|14h|15h|16h"

Adding duration categories

# Create conditions for values in flight_categories to be created
conditions = [
    (planes["Duration"].str.contains(short_flights)),
    (planes["Duration"].str.contains(medium_flights)),
    (planes["Duration"].str.contains(long_flights))
]

# Apply the conditions list to the flight_categories
planes["Duration_Category"] = np.select(conditions, 
                                        flight_categories,
                                        default="Extreme duration")

# Plot the counts of each category
sns.countplot(data=planes, x="Duration_Category")
plt.show()

Working with numeric data

Flight duration

# Preview the column
print(planes["Duration"].head())

# Remove the string character
planes["Duration"] = planes["Duration"].str.replace("h", "")

# Convert to float data type
planes["Duration"] = planes["Duration"].astype(float)

# Plot a histogram
sns.histplot(data=planes,x='Duration')
plt.show()

Adding descriptive statistics

#1
# Price standard deviation by Airline
planes["airline_price_st_dev"] = planes.groupby("Airline")["Price"].transform(lambda x: x.std())

print(planes[["Airline", "airline_price_st_dev"]].value_counts())

#2
# Median Duration by Airline
planes["airline_median_duration"] = planes.groupby("Airline")["Duration"].transform(lambda x : x.median())

print(planes[["Airline","airline_median_duration"]].value_counts())

#3
# Mean Price by Destination
planes["price_destination_mean"] = planes.groupby("Destination")["Price"].transform(lambda x : x.mean())

print(planes[["Destination","price_destination_mean"]].value_counts())

Handling outliers

What is an outlier?

An outlier is an observation that lies far away from the other data points.

IQR

  • IQR = 75th percentile - 25th percentile

  • Upper outliers: values > 75th percentile + (1.5 * IQR)

  • Lower outliers: values < 25th percentile - (1.5 * IQR)

seventy_fifth = df['Salary'].quantile(0.75)
twenty_fifth = df['Salary'].quantile(0.25)
iqr = seventy_fifth - twenty_fifth
upper_threshold = seventy_fifth + (1.5 * iqr)
lower_threshold = twenty_fifth - (1.5 * iqr)
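
Before deciding what to do, it can help to count how many observations actually fall outside the thresholds (a small sketch continuing the Salary example above):

outlier_mask = (df['Salary'] < lower_threshold) | (df['Salary'] > upper_threshold)
print(outlier_mask.sum())  # number of salaries flagged as outliers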

Why look for outliers?

  • Outliers are extreme values and may not accurately represent the data

  • They can distort the mean and the standard deviation (see the sketch after this list)

  • Some statistical tests and machine learning models assume normally distributed data, so extreme values can skew the results
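
A minimal sketch with made-up values shows how a single extreme value pulls both statistics:

import pandas as pd

salaries = pd.Series([40, 42, 45, 47, 50])       # in thousands
with_outlier = pd.Series([40, 42, 45, 47, 500])  # one extreme value

print(salaries.mean(), salaries.std())           # ~44.8 and ~4.0
print(with_outlier.mean(), with_outlier.std())   # ~134.8 and ~204.2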

What to do about outliers?

Questions to ask:

  • Why do these outliers exist?

  • Is the data accurate?

Identifying outliers

# Plot a histogram of flight prices
sns.histplot(data=planes, x="Price")
plt.show()

# Display descriptive statistics for flight duration
print(planes['Duration'].describe())

Removing outliers

# Find the 75th and 25th percentiles
price_seventy_fifth = planes["Price"].quantile(0.75)
price_twenty_fifth = planes["Price"].quantile(0.25)

# Calculate iqr
prices_iqr = price_seventy_fifth - price_twenty_fifth

# Calculate the thresholds
upper = price_seventy_fifth + (1.5 * prices_iqr)
lower = price_twenty_fifth - (1.5 * prices_iqr)

# Subset the data
planes = planes[(planes["Price"] > lower) & (planes["Price"] < upper)]

print(planes["Price"].describe())