Initial exploration
Exploratory Data Analysis (EDA)
The process of cleaning and reviewing data to...
derive insights, such as descriptive statistics and correlation
generate hypotheses for experiments
Results
Inform the next step for the dataset
Pandas methods for initial exploration
head
We can use the head method to look at the first rows of the DataFrame and see which columns the data contains.
df.head()
info
The info method is a quick way to summarize the number of non-missing values in each column (and so how many are missing), the data type of each column, and memory usage.
df.info()
value_counts
A common question about categorical data is how many data points we have in each category. We can use the value_counts method to answer it.
df.value_counts('category')
describe
We can use the describe method to get summary statistics for the numeric columns of our dataset.
df.describe()
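The four methods above can be tried together on a tiny DataFrame. This is only a minimal sketch: the column names `category` and `price` and all values are made up for illustration and are not part of the unemployment dataset used in the exercises.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({
    "category": ["a", "b", "a", "c", "a"],
    "price": [1.0, 2.5, 3.0, None, 4.5],
})

first_rows = df.head(3)                 # first three rows (head() alone returns five)
df.info()                               # prints dtypes, non-null counts and memory usage
counts = df["category"].value_counts()  # number of rows per category, most frequent first
summary = df.describe()                 # count, mean, std, min, quartiles, max of numeric columns

print(first_rows)
print(counts)
print(summary)
```

Note that info prints its report directly, while the other three return objects that can be stored and inspected further.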
Functions for initial exploration
#1
# Print the first five rows of unemployment
print(unemployment.head())
#2
# Print a summary of non-missing values and data types in the unemployment DataFrame
# info() prints its summary itself and returns None, so no print() is needed
unemployment.info()
#3
# Print summary statistics for numerical columns in unemployment
print(unemployment.describe())
Counting categorical values
# Count the values associated with each continent in unemployment
print(unemployment['continent'].value_counts())
Global unemployment in 2021
# Import the required visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
# Create a histogram of 2021 unemployment; show a full percent in each bin
sns.histplot(data=unemployment, x='2021', binwidth=1)
plt.show()
Data validation
Data validation is an important early step in EDA: we need to check whether data types and ranges are as expected.
Validating data types
We can validate data types by looking at the data type of each column with info or dtypes.
df.info()
# Data types only
df.dtypes
Updating data types
df['year'] = df['year'].astype(int)
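As a small sketch of the conversion above, assuming a made-up `year` column that was read in as text: `astype(int)` converts every value or fails outright. The `pd.to_numeric(..., errors="coerce")` step is an addition for the case where some entries may be malformed; it is not part of the original notes.

```python
import pandas as pd

# Hypothetical data: years stored as text, as often happens after reading a CSV
df = pd.DataFrame({"year": ["2019", "2020", "2021"]})

# astype(int) converts every value and raises a ValueError
# if any value is not a valid integer
df["year"] = df["year"].astype(int)
print(df.dtypes)

# Assumption: some entries may be malformed. pd.to_numeric with
# errors="coerce" turns them into NaN instead of raising.
messy = pd.Series(["2019", "n/a", "2021"])
cleaned = pd.to_numeric(messy, errors="coerce")
print(cleaned)
```

Checking `dtypes` again after the conversion confirms that the update took effect.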
Validating categorical data
We can validate categorical data by comparing the values in a column to a list of expected values using the isin method.
df['gender'].isin(['Male', 'Female']) # is in ['Male', 'Female']
~df['gender'].isin(['Male', 'Female']) # is not in ['Male', 'Female']
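The boolean Series returned by `isin` can also be used as a filter to pull out the rows that fail validation. The sketch below assumes a made-up `gender` column containing one inconsistently capitalized value and one missing value.

```python
import pandas as pd

# Hypothetical data with a capitalization slip and a missing value
df = pd.DataFrame({"gender": ["Male", "Female", "female", "Male", None]})

expected = ["Male", "Female"]

# True where the value is one of the expected categories
is_valid = df["gender"].isin(expected)

# ~ inverts the mask, keeping only rows with unexpected values
unexpected_rows = df[~is_valid]
print(unexpected_rows)
```

The comparison is exact, so `"female"` and the missing value are both flagged here.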
Validating numeric data
df.select_dtypes('number') # keep only the numeric columns
df['year'].min() # Min year
df['year'].max() # Max year
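A range check can be built on top of min and max. In this sketch the data and the bounds 1900 and 2025 are hypothetical, and `between` is used as one possible way to flag out-of-range rows; it is not a method named in the notes.

```python
import pandas as pd

# Hypothetical data: one publication year lies in the future
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "year": [1998, 2015, 2031],
    "rating": [4.1, 3.8, 4.6],
})

numeric = df.select_dtypes("number")   # drops the text column "title"
year_min = df["year"].min()
year_max = df["year"].max()
print(year_min, year_max)

# between() is inclusive on both ends by default;
# the bounds are assumptions chosen for this example
out_of_range = df[~df["year"].between(1900, 2025)]
print(out_of_range)
```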
Detecting data types
# Update the data type of the 2019 column to a float
unemployment["2019"] = unemployment['2019'].astype(float)
# Print the dtypes to check your work
print(unemployment.dtypes)
Validating continents
#1
# Define a Series describing whether each continent is outside of Oceania
not_oceania = unemployment['continent'] != 'Oceania'
#2
# Print unemployment without records related to countries in Oceania
print(unemployment[not_oceania])
Validating range
# Print the minimum and maximum unemployment rates during 2021
print(unemployment['2021'].min(), unemployment['2021'].max())
# Create a boxplot of 2021 unemployment rates, broken down by continent
sns.boxplot(x='2021',y='continent',data=unemployment)
plt.show()
Data summarization
Exploring groups of data
.groupby()
groups data by category.
An aggregating function indicates how to summarize the grouped data.
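A minimal sketch of grouping followed by aggregation, assuming a made-up `books` DataFrame with `genre` and `rating` columns (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical data
books = pd.DataFrame({
    "genre": ["Fiction", "Fiction", "Non Fiction", "Non Fiction"],
    "rating": [4.0, 5.0, 3.0, 4.0],
})

# Group rows by genre, then summarize the rating of each group with its mean
mean_rating = books.groupby("genre")["rating"].mean()
print(mean_rating)

# Several aggregating functions can be applied to the same groups at once
rating_stats = books.groupby("genre")["rating"].agg(["min", "max", "count"])
print(rating_stats)
```

Selecting the `rating` column before aggregating keeps the summary restricted to the column of interest.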
Aggregating functions
.sum()
.count()
.min()
.max()
.var()
.std()
Aggregating ungrouped data
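# Restrict to numeric columns: pandas 2.0+ raises a TypeError for the mean of text columns
books.select_dtypes('number').agg(['mean', 'std'])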
Specifying aggregations for columns
books.agg({'rating': ['mean', 'std'], 'year': ['median']})
Named summary columns
books.groupby('genre').agg(
mean_rating=('rating', 'mean'),
std_rating=('rating', 'std'),
median_year=('year', 'median')
)
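To see what named aggregation returns, here is a runnable sketch on made-up data. Each keyword becomes a column name in the result, and each tuple gives the source column and the aggregating function. The `books` values below are hypothetical.

```python
import pandas as pd

# Hypothetical data
books = pd.DataFrame({
    "genre": ["Fiction", "Fiction", "Fiction", "Childrens"],
    "rating": [4.0, 5.0, 4.5, 3.0],
    "year": [2010, 2012, 2017, 2015],
})

summary = books.groupby("genre").agg(
    mean_rating=("rating", "mean"),   # new column name = (source column, function)
    median_year=("year", "median"),
)
print(summary)
```

Compared with passing a dictionary to agg, this form yields flat, descriptive column names rather than the source column names.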
Visualizing categorical summaries
sns.barplot(x='genre', y='rating', data = books)
plt.show()
Summaries with .groupby() and .agg()
#1
# Print the mean and standard deviation of rates by year (numeric columns only, since pandas 2.0+ raises on text columns)
print(unemployment.select_dtypes('number').agg(['mean', 'std']))
#2
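# Print yearly mean and standard deviation grouped by continent (numeric columns only, since pandas 2.0+ raises on text columns)
print(unemployment.select_dtypes('number').groupby(unemployment['continent']).agg(['mean', 'std']))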
Named aggregations
continent_summary = unemployment.groupby("continent").agg(
# Create the mean_rate_2021 column
mean_rate_2021 = ('2021', 'mean'),
# Create the std_rate_2021 column
std_rate_2021 = ('2021', 'std'),
)
print(continent_summary)
Visualizing categorical summaries
# Create a bar plot of continents and their average unemployment
sns.barplot(x='continent', y='2021', data=unemployment)
plt.show()