Turning Exploratory into Action

Consideration for categorical data

Why perform EDA?

Detecting patterns and relationships
Generating questions or hypotheses
Prepare data for machine learning model

Representative data

Sample represents the population

Categorical classes

Classes = labels

Cross-tabulation

pd.crosstab(planes['Source'], planes['Destination'])

# Aggregate value with pd.crosstab
pd.crosstab(planes['Source'], planes['Destination'], values=planes['Price'], aggfunc="median")

Checking for class imbalance

# Print the relative frequency of Job_Category
print(salaries['Job_Category'].value_counts(normalize=True))

Cross-tabulation

#1
# Cross-tabulate Company_Size and Experience
print(pd.crosstab(salaries["Company_Size"], salaries["Experience"]))

#2
# Cross-tabulate Job_Category and Company_Size
print(pd.crosstab(salaries["Job_Category"], salaries["Company_Size"]))

#3
# Cross-tabulate Job_Category and Company_Size
print(pd.crosstab(salaries["Job_Category"], salaries["Company_Size"],
            values=salaries["Salary_USD"], aggfunc="mean"))

Generating new features

Extracting correlation

# Get the month of the response
salaries["month"] = salaries["date_of_response"].dt.month

# Extract the weekday of the response
salaries["weekday"] = salaries['date_of_response'].dt.weekday

# Create a heatmap
sns.heatmap(salaries.corr(), annot=True)
plt.show()

Calculating salaries percentiles

# Find the 25th percentile
twenty_fifth = salaries["Salary_USD"].quantile(0.25)

# Save the median
salaries_median = salaries["Salary_USD"].median()

# Gather the 75th percentile
seventy_fifth = salaries['Salary_USD'].quantile(0.75)
print(twenty_fifth, salaries_median, seventy_fifth)

Categorizing salaries

# Create salary labels
salary_labels = ["entry", "mid", "senior", "exec"]

# Create the salary ranges list
salary_ranges = [0, twenty_fifth, salaries_median, seventy_fifth, salaries["Salary_USD"].max()]

# Create salary_level
salaries["salary_level"] = pd.cut(salaries["Salary_USD"],
                                  bins=salary_ranges,
                                  labels=salary_labels)

# Plot the count of salary levels at companies of different sizes
sns.countplot(data=salaries, x="Company_Size", hue="salary_level")
plt.show()

Generating hypotheses

Comparing salaries

# Filter for employees in the US or GB
usa_and_gb = salaries[salaries["Employee_Location"].isin(["US", "GB"])]

# Create a barplot of salaries by location
sns.barplot(data=usa_and_gb, x="Employee_Location", y="Salary_USD")
plt.show()

Choosing hypotheses

# Create a bar plot of salary versus company size, factoring in employment status
sns.barplot(data=salaries, x="Company_Size", y="Salary_USD", hue="Employment_Status")
plt.show()

Turning Exploratory into Action

Base on DataCamp

Table of contents

Consideration for categorical data

Why perform EDA?

Representative data

Categorical classes

Cross-tabulation

Checking for class imbalance

Cross-tabulation

Generating new features

Extracting correlation

Calculating salaries percentiles

Categorizing salaries

Generating hypotheses

Comparing salaries

Choosing hypotheses