Considerations for categorical data
Why perform EDA?
Detecting patterns and relationships
Generating questions or hypotheses
Preparing data for a machine learning model
Representative data
- Sample represents the population
Categorical classes
- Classes = labels
Cross-tabulation
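import pandas as pd
# Count the rows for each Source/Destination pair
pd.crosstab(planes['Source'], planes['Destination'])
# Aggregate values with pd.crosstab: median Price for each Source/Destination pair
pd.crosstab(planes['Source'], planes['Destination'], values=planes['Price'], aggfunc="median")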
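A minimal, self-contained sketch of the two crosstab calls above. The planes DataFrame here is a made-up stand-in (the cities and prices are assumptions, not the real dataset), kept small enough that the counts and medians can be checked by hand.

```python
import pandas as pd

# Hypothetical flight data; values are invented for illustration only
planes = pd.DataFrame({
    "Source": ["Delhi", "Delhi", "Kolkata", "Delhi", "Kolkata"],
    "Destination": ["Cochin", "Cochin", "Mumbai", "Mumbai", "Mumbai"],
    "Price": [100, 300, 200, 400, 600],
})

# Frequency table: number of rows for each Source/Destination pair
counts = pd.crosstab(planes["Source"], planes["Destination"])

# Same layout, but each cell holds the median Price for that pair;
# pairs that never occur (Kolkata to Cochin here) show up as NaN
median_price = pd.crosstab(
    planes["Source"], planes["Destination"],
    values=planes["Price"], aggfunc="median",
)

print(counts)
print(median_price)
```

Note that values and aggfunc must be passed together; without them, pd.crosstab only counts occurrences.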
Checking for class imbalance
# Print the relative frequency of Job_Category
print(salaries['Job_Category'].value_counts(normalize=True))
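To make the imbalance check concrete, here is a small sketch on invented data (the job categories and their counts are assumptions). With normalize=True, value_counts returns proportions that sum to 1, sorted from the most to the least frequent class.

```python
import pandas as pd

# Hypothetical survey responses; categories are invented for illustration
salaries = pd.DataFrame({
    "Job_Category": [
        "Data Science", "Data Science", "Data Science",
        "Data Engineering", "Analytics",
    ],
})

# Relative frequency of each class (proportions rather than raw counts)
rel_freq = salaries["Job_Category"].value_counts(normalize=True)
print(rel_freq)
```

A class that dominates the proportions (Data Science at 3 of 5 rows in this toy data) is a sign that the sample may not represent the other classes well.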
Cross-tabulation
#1
# Cross-tabulate Company_Size and Experience
print(pd.crosstab(salaries["Company_Size"], salaries["Experience"]))
#2
# Cross-tabulate Job_Category and Company_Size
print(pd.crosstab(salaries["Job_Category"], salaries["Company_Size"]))
#3
# Cross-tabulate Job_Category and Company_Size, showing the mean Salary_USD in each cell
print(pd.crosstab(salaries["Job_Category"], salaries["Company_Size"],
                  values=salaries["Salary_USD"], aggfunc="mean"))
Generating new features
Extracting features for correlation
# Get the month of the response
salaries["month"] = salaries["date_of_response"].dt.month
# Extract the weekday of the response
salaries["weekday"] = salaries['date_of_response'].dt.weekday
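import seaborn as sns
import matplotlib.pyplot as plt
# Create a heatmap of the correlations between numeric columns
# (numeric_only=True skips text and date columns, which raise an error in pandas 2.0+)
sns.heatmap(salaries.corr(numeric_only=True), annot=True)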
plt.show()
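A sketch of the feature extraction above on a tiny invented table (the dates, salaries, and company sizes are assumptions). It shows what the .dt accessors return and which columns end up in the correlation matrix; the plotting step is left out so the snippet needs only pandas.

```python
import pandas as pd

# Hypothetical responses; values are invented for illustration only
salaries = pd.DataFrame({
    "date_of_response": pd.to_datetime(
        ["2023-01-02", "2023-02-14", "2023-03-18", "2023-03-19"]
    ),
    "Salary_USD": [50000, 60000, 75000, 90000],
    "Company_Size": ["S", "M", "L", "L"],
})

# Month as an integer from 1 to 12
salaries["month"] = salaries["date_of_response"].dt.month

# Weekday as an integer, where Monday is 0 and Sunday is 6
salaries["weekday"] = salaries["date_of_response"].dt.weekday

# Correlate numeric columns only; the datetime and text columns are skipped
corr = salaries.corr(numeric_only=True)
print(corr)
```

The numeric_only argument of DataFrame.corr is available from pandas 1.5 onward; this matrix is what sns.heatmap would then draw.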
Calculating salary percentiles
# Find the 25th percentile
twenty_fifth = salaries["Salary_USD"].quantile(0.25)
# Save the median
salaries_median = salaries["Salary_USD"].median()
# Gather the 75th percentile
seventy_fifth = salaries['Salary_USD'].quantile(0.75)
print(twenty_fifth, salaries_median, seventy_fifth)
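As a worked example of the percentile calls above, the sketch below uses five invented salaries (an assumption, chosen so the results are easy to verify by hand). It also shows that quantile accepts a list, which returns all three cut points in one call.

```python
import pandas as pd

# Hypothetical salaries; values are invented for illustration only
salaries = pd.DataFrame({"Salary_USD": [40000, 60000, 80000, 100000, 120000]})

# One percentile at a time, as in the notes
twenty_fifth = salaries["Salary_USD"].quantile(0.25)
salaries_median = salaries["Salary_USD"].median()
seventy_fifth = salaries["Salary_USD"].quantile(0.75)

# Or all three at once: the result is a Series indexed by the quantile
quartiles = salaries["Salary_USD"].quantile([0.25, 0.5, 0.75])

print(twenty_fifth, salaries_median, seventy_fifth)
print(quartiles)
```

With pandas' default linear interpolation, the quartiles of these five evenly spaced values fall exactly on the 2nd, 3rd, and 4th observations.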
Categorizing salaries
# Create salary labels
salary_labels = ["entry", "mid", "senior", "exec"]
# Create the salary ranges list
salary_ranges = [0, twenty_fifth, salaries_median, seventy_fifth, salaries["Salary_USD"].max()]
# Create salary_level
salaries["salary_level"] = pd.cut(salaries["Salary_USD"],
                                  bins=salary_ranges,
                                  labels=salary_labels)
# Plot the count of salary levels at companies of different sizes
sns.countplot(data=salaries, x="Company_Size", hue="salary_level")
plt.show()
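The binning step can be sketched on invented data (the salaries and company sizes below are assumptions). It illustrates one detail of pd.cut worth knowing: by default each bin includes its right edge but not its left edge, so a salary exactly equal to the 25th percentile lands in "entry", and a salary of exactly 0 would fall outside every bin.

```python
import pandas as pd

# Hypothetical data; values are invented for illustration only
salaries = pd.DataFrame({
    "Salary_USD": [40000, 60000, 80000, 100000, 120000],
    "Company_Size": ["S", "S", "M", "L", "L"],
})

# Bin edges: 0, the three quartiles, and the maximum salary
salary_ranges = [
    0,
    salaries["Salary_USD"].quantile(0.25),
    salaries["Salary_USD"].median(),
    salaries["Salary_USD"].quantile(0.75),
    salaries["Salary_USD"].max(),
]
salary_labels = ["entry", "mid", "senior", "exec"]

# Bins are right-closed: (0, q25], (q25, median], (median, q75], (q75, max]
salaries["salary_level"] = pd.cut(
    salaries["Salary_USD"], bins=salary_ranges, labels=salary_labels
)

# Tabulate the counts that a countplot with hue="salary_level" would draw
level_counts = pd.crosstab(salaries["Company_Size"], salaries["salary_level"])

print(salaries)
print(level_counts)
```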
Generating hypotheses
Comparing salaries
# Filter for employees in the US or GB
usa_and_gb = salaries[salaries["Employee_Location"].isin(["US", "GB"])]
# Create a barplot of salaries by location
sns.barplot(data=usa_and_gb, x="Employee_Location", y="Salary_USD")
plt.show()
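By default, sns.barplot draws the mean of y for each category, with error bars showing a 95% confidence interval. The sketch below computes those bar heights directly with groupby, on invented data (the locations and salaries are assumptions), which can help when turning the plot into a testable hypothesis.

```python
import pandas as pd

# Hypothetical data; values are invented for illustration only
salaries = pd.DataFrame({
    "Employee_Location": ["US", "US", "GB", "GB", "DE"],
    "Salary_USD": [100000, 140000, 80000, 90000, 70000],
})

# Keep only employees located in the US or GB
usa_and_gb = salaries[salaries["Employee_Location"].isin(["US", "GB"])]

# Mean salary per location: the height of each bar in the barplot
mean_by_location = usa_and_gb.groupby("Employee_Location")["Salary_USD"].mean()
print(mean_by_location)
```

A gap between the two means is only a starting point; whether it is statistically significant is a question for a hypothesis test, not for the plot alone.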
Choosing hypotheses
# Create a bar plot of salary versus company size, factoring in employment status
sns.barplot(data=salaries, x="Company_Size", y="Salary_USD", hue="Employment_Status")
plt.show()
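When choosing between candidate hypotheses, it can help to look at how many observations sit behind each bar, since a mean based on one or two rows is weak evidence. A hedged sketch on invented data (the sizes, statuses, and salaries are assumptions):

```python
import pandas as pd

# Hypothetical data; values are invented for illustration only
salaries = pd.DataFrame({
    "Company_Size": ["S", "S", "S", "L", "L"],
    "Employment_Status": ["FT", "FT", "PT", "FT", "FT"],
    "Salary_USD": [50000, 70000, 30000, 100000, 120000],
})

# Mean salary and group size for each Company_Size/Employment_Status pair
summary = (
    salaries
    .groupby(["Company_Size", "Employment_Status"])["Salary_USD"]
    .agg(["mean", "count"])
)
print(summary)
```

In this toy table the S/PT group has a single row, so any hypothesis about part-time salaries at small companies would need more data first.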