Addressing missing data
Why missing data a problem?
Affects distributions
Less representative of the population
This can result in drawing incorrect conclusions
Checking missing values
df.isna()
Strategies for addressing missing data
Drop missing values – if missing values is 5% or less of total values
Replace with mean, median, mode – Depends on distribution and context
Impute by sub-group – Different experience levels have different median salary
Dropping missing values
threshold = len(df) * 0.05
cols_to_drop = df.columns[df.isna().sum() <= threhold]
df.dropna(subset = cols_to_drop, inplace=True) # inplace = True, df should be updated
Imputing a summary statistics
cols_with_missing_values = df.columns[df.isna().sum() > 0]
for col in cols_with_missing_values[:-1]:
df[col].fillna(df[col].mode()[0])
Imputing by sub-group
df_dict = df.groupby('category')['count'].median().to_dict()
df['count'] = df['count'].fillna(df['category'].map(df_dict))
Dealing with missing data
# Count the number of missing values in each column
print(planes.isna().sum())
# Find the five percent threshold
threshold = len(planes) * 0.05
# Create a filter
cols_to_drop = planes.columns[planes.isna().sum() <= threshold]
# Drop missing values for columns below the threshold
planes.dropna(subset=cols_to_drop, inplace=True)
print(planes.isna().sum())
Strategies for remaining missing data
# Check the values of the Additional_Info column
print(planes["Additional_Info"].value_counts())
# Create a box plot of Price by Airline
sns.boxplot(data=planes, x='Airline', y='Price')
plt.show()
Imputing missing plane prices
# Calculate median plane ticket prices by Airline
airline_prices = planes.groupby("Airline")["Price"].median()
print(airline_prices)
# Convert to a dictionary
prices_dict = airline_prices.to_dict()
# Map the dictionary to missing values of Price by Airline
planes["Price"] = planes["Price"].fillna(planes["Airline"].map(prices_dict))
# Check for missing values
print(planes.isna().sum())
Converting and analyzing categorical data
Previewing categorical data
df.select_dtypes("object").head()
Extracting value from categories
The current format limits our ability to generate insights
pandas.Series.str.contains
, search column for a specific string or multiple strings
df['category'].str.contains('Scientist')
df['category'].str.contains('Machine Learning|AI') # Multiple phrases
Creating the categorical column
job_categories = [
'Data Science',
'Data Analytics',
'Data Engineering',
'Machine Learning',
'Managerial',
'Consultant'
]
conditions = [
(df['Designation'].str.contains('Data Scientist|NLP')),
(df['Designation'].str.contains('Analyst|Analytics')),
(df['Designation'].str.contains('Data Engineer|ETL|Architect|Infrastructure')),
(df['Designation'].str.contains('Machine')),
(df['Designation'].str.contains('Manager|Head|Director|Lead|Principal|Staff')),
(df['Designation'].str.contains('Consultant|Freelencer'))
]
df['Job_Category'] = np.select(conditions, job_categories,default='Other')
display(df[['Designation', 'Job_Category']].head())
Finding the number of unique values
# Filter the DataFrame for object columns
non_numeric = planes.select_dtypes("object")
# Loop through columns
for col in non_numeric.columns:
# Print the number of unique values
print(f"Number of unique values in {col} column: ", non_numeric[col].nunique())
Flight duration category
# Create a list of categories
flight_categories = ["Short-haul", "Medium", "Long-haul"]
# Create short-haul values
short_flights = "0h|1h|2h|3h|4h"
# Create medium-haul values
medium_flights = "5h|6h|7h|8h|9h"
# Create long-haul values
long_flights = "10h|11h|12h|13h|14h|15h|16h"
Adding duration categories
# Create conditions for values in flight_categories to be created
conditions = [
(planes["Duration"].str.contains(short_flights)),
(planes["Duration"].str.contains(medium_flights)),
(planes["Duration"].str.contains(long_flights))
]
# Apply the conditions list to the flight_categories
planes["Duration_Category"] = np.select(conditions,
flight_categories,
default="Extreme duration")
# Plot the counts of each category
sns.countplot(data=planes, x="Duration_Category")
plt.show()
Working with numeric data
Flight duration
# Preview the column
print(planes["Duration"].head())
# Remove the string character
planes["Duration"] = planes["Duration"].str.replace("h", "")
# Convert to float data type
planes["Duration"] = planes["Duration"].astype(float)
# Plot a histogram
sns.histplot(data=planes,x='Duration')
plt.show()
Adding descriptive statistics
#1
# Price standard deviation by Airline
planes["airline_price_st_dev"] = planes.groupby("Airline")["Price"].transform(lambda x: x.std())
print(planes[["Airline", "airline_price_st_dev"]].value_counts())
#2
# Median Duration by Airline
planes["airline_median_duration"] = planes.groupby("Airline")["Duration"].transform(lambda x : x.median())
print(planes[["Airline","airline_median_duration"]].value_counts())
#3
# Mean Price by Destination
planes["price_destination_mean"] = planes.groupby("Destination")["Price"].transform(lambda x : x.mean())
print(planes[["Destination","price_destination_mean"]].value_counts())
Handling outliers
What is an outlier?
Outlier is an observation far away from other data points
IQR
IQR = 75th - 25th percentile
Upper Outlier > 75th percentile + (1.5 * IQR)
Lower Outlier < 25th percentile - (1.5 * IQR)
seventy_fifth = df['Salary'].quantile(0.75)
twenty_fifth = df['Salary'].quantile(0.25)
iqr = seventy_fifth - twenty_fith
upper_threshold = seventy_fifth + (1.5 * iqr)
lower_threshold = twenty_fifth - (1.5 * iqr)
Why look for outliers?
Outliers are extreme values, and may not accurately represent our data
Can change the mean and the standard deviation
Statistical tests and machine learning models need normally distributed data
What to do about outliers?
Questions to ask:
Why do these outliers exist?
Is the data accurate
Identifying outliers
# Plot a histogram of flight prices
sns.histplot(data=planes, x="Price")
plt.show()
# Display descriptive statistics for flight duration
print(planes['Duration'].describe())
Removing outliers
# Find the 75th and 25th percentiles
price_seventy_fifth = planes["Price"].quantile(0.75)
price_twenty_fifth = planes["Price"].quantile(0.25)
# Calculate iqr
prices_iqr = price_seventy_fifth - price_twenty_fifth
# Calculate the thresholds
upper = price_seventy_fifth + (1.5 * prices_iqr)
lower = price_twenty_fifth - (1.5 * prices_iqr)
# Subset the data
planes = planes[(planes["Price"] > lower) & (planes["Price"] < upper)]
print(planes["Price"].describe())