Getting to Know a Dataset

Initial exploration

Exploratory Data Analysis (EDA)

The process of cleaning and reviewing data to...

derive insights, such as descriptive statistics and correlation
generate hypotheses for experiments

Results

Inform the next step for the dataset

Pandas methods for initial exploration

head

We can use the head method to look at the first rows of the DataFrame and see which columns the data contains.
```
  df.head()
```
info

The info method is a quick way to summarize the number of non-missing values in each column, the data type of each column, and memory usage.
```
  df.info()
```
value_counts

A common question about categorical data is how many data points we have in each category. We can use value_counts to answer this question.
```
  df.value_counts('category')
```
describe

We can use describe to get summary statistics for the numeric columns of our dataset.
```
  df.describe()
```
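
The four methods above can be tried together on a small DataFrame. This is a minimal sketch: the column names and values are invented for illustration and are not from the course data.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
df = pd.DataFrame({
    "category": ["a", "b", "a", "c", "a", "b"],
    "value": [1.0, 2.5, 3.0, None, 4.5, 6.0],
})

first_rows = df.head(3)                  # first three rows
counts = df["category"].value_counts()   # rows per category, most frequent first
summary = df.describe()                  # statistics for numeric columns only

print(first_rows)
df.info()                                # prints non-null counts, dtypes, memory usage
print(counts)
print(summary)
```

Note that describe skips the text column, and its count row reports non-missing values only.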

Functions for initial exploration

#1
# Print the first five rows of unemployment
print(unemployment.head())

#2
# Print a summary of non-missing values and data types in the unemployment DataFrame
# (info() prints its report itself and returns None, so no print() is needed)
unemployment.info()

#3
# Print summary statistics for numerical columns in unemployment
print(unemployment.describe())

Counting categorical values

# Count the values associated with each continent in unemployment
print(unemployment['continent'].value_counts())

Global unemployment in 2021

# Import the required visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Create a histogram of 2021 unemployment; show a full percent in each bin
sns.histplot(data=unemployment, x='2021', binwidth=1)
plt.show()

Data validation

Data validation is an important early step in EDA: we need to check whether data types and ranges are as expected.

Validating data types

We can check the data type of each column using info or dtypes.

df.info()

# Data types only
df.dtypes

Updating data types

df['year'] = df['year'].astype(int)
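
A common reason to update a type is a numeric column that was read in as text. The sketch below uses a made-up DataFrame (the values are assumptions for illustration) to show the check before and after the conversion.

```python
import pandas as pd

# Hypothetical data: the years were read in as text
df = pd.DataFrame({"year": ["2019", "2020", "2021"], "rating": [4.1, 3.8, 4.5]})

was_integer = pd.api.types.is_integer_dtype(df["year"])  # False: text column

# Convert the text values to integers
df["year"] = df["year"].astype(int)

is_integer = pd.api.types.is_integer_dtype(df["year"])   # True after conversion
print(df.dtypes)
```

Once the column is an integer type, numeric operations such as min and max behave as expected.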

Validating categorical data

We can validate categorical data by comparing the values in a column to a list of expected values using the isin method.

df['gender'].isin(['Male', 'Female']) # is in ['Male', 'Female']

~df['gender'].isin(['Male', 'Female']) # is not in ['Male', 'Female']
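
isin returns a Boolean Series, so it can be used as a mask to pull out the rows that fail validation. A minimal sketch with invented values:

```python
import pandas as pd

# Hypothetical data containing two unexpected labels
df = pd.DataFrame({"gender": ["Male", "Female", "female", "Male", "Unknown"]})

expected = ["Male", "Female"]

# True where the value is one of the expected labels (the comparison is case-sensitive)
is_expected = df["gender"].isin(expected)

# Keep only the rows whose value is NOT in the expected list
unexpected_rows = df[~is_expected]
print(unexpected_rows)
```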

Validating numeric data

df.select_dtypes('number') # keep only the numeric columns

df['year'].min() # Min year
df['year'].max() # Max year
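
Beyond looking at the minimum and maximum, the out-of-range rows themselves can be listed. The sketch below assumes a plausible range of 1900 to 2023 purely for illustration; the data and the bounds are not from the course.

```python
import pandas as pd

# Hypothetical data with two implausible years
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "year": [1999, 2015, 2031, 1875],
})

numeric = df.select_dtypes("number")   # only the numeric columns
min_year = df["year"].min()
max_year = df["year"].max()

# between() includes both endpoints by default
in_range = df["year"].between(1900, 2023)
out_of_range = df[~in_range]
print(min_year, max_year)
print(out_of_range)
```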

Detecting data types

# Update the data type of the 2019 column to a float
unemployment["2019"] = unemployment['2019'].astype(float)
# Print the dtypes to check your work
print(unemployment.dtypes)

Validating continents

#1
# Define a Series describing whether each continent is outside of Oceania
not_oceania = unemployment['continent'] != 'Oceania'

#2
# Print unemployment without records related to countries in Oceania
print(unemployment[not_oceania])

Validating range

# Print the minimum and maximum unemployment rates during 2021
print(unemployment['2021'].min(), unemployment['2021'].max())

# Create a boxplot of 2021 unemployment rates, broken down by continent
sns.boxplot(x='2021', y='continent', data=unemployment)
plt.show()

Data summarization

Exploring groups of data

.groupby() groups data by category
An aggregating function indicates how to summarize the grouped data

Aggregating functions

.sum()
.count()
.min()
.max()
.var()
.std()
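
Grouping and aggregating are chained: .groupby() picks the category, then an aggregating function summarizes each group. A minimal sketch on an invented books table (the genres, ratings and years are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data, invented for illustration only
books = pd.DataFrame({
    "genre": ["Fiction", "Fiction", "Non Fiction", "Non Fiction", "Childrens"],
    "rating": [4.0, 4.4, 3.8, 4.2, 4.6],
    "year": [2010, 2012, 2015, 2011, 2018],
})

# Mean rating per genre
mean_by_genre = books.groupby("genre")["rating"].mean()

# Number of non-missing ratings per genre
count_by_genre = books.groupby("genre")["rating"].count()

print(mean_by_genre)
print(count_by_genre)
```

Selecting the rating column before aggregating keeps the result to one column and avoids aggregating columns where the function makes no sense.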

Aggregating ungrouped data

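# Numeric columns only: recent pandas versions raise a TypeError when averaging text columns
books.select_dtypes('number').agg(['mean', 'std'])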

Specifying aggregations for columns

books.agg({'rating': ['mean', 'std'], 'year': ['median']})
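
When a dictionary is passed to .agg(), the result has one row per aggregation name and one column per key, with NaN wherever an aggregation was not requested for that column. A minimal sketch on invented data:

```python
import pandas as pd

# Hypothetical data, invented for illustration only
books = pd.DataFrame({
    "genre": ["Fiction", "Fiction", "Non Fiction", "Non Fiction", "Childrens"],
    "rating": [4.0, 4.4, 3.8, 4.2, 4.6],
    "year": [2010, 2012, 2015, 2011, 2018],
})

# Different aggregations for different columns
result = books.agg({"rating": ["mean", "std"], "year": ["median"]})
print(result)

rating_mean = result.loc["mean", "rating"]
year_median = result.loc["median", "year"]
```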

Named summary columns

books.groupby('genre').agg(
    mean_rating=('rating', 'mean'),
    std_rating=('rating', 'std'),
    median_year=('year', 'median')
)
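
To see what the named summary columns look like, the same call can be run on a small invented table. The data below is an assumption for illustration; the keyword names become the output column names.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
books = pd.DataFrame({
    "genre": ["Fiction", "Fiction", "Non Fiction", "Non Fiction", "Childrens"],
    "rating": [4.0, 4.4, 3.8, 4.2, 4.6],
    "year": [2010, 2012, 2015, 2011, 2018],
})

# Each keyword is new_column_name=(source_column, aggregation)
summary = books.groupby("genre").agg(
    mean_rating=("rating", "mean"),
    std_rating=("rating", "std"),
    median_year=("year", "median"),
)
print(summary)
```

A genre with a single row gets NaN for the standard deviation, since the sample standard deviation needs at least two values.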

Visualizing categorical summaries

sns.barplot(x='genre', y='rating', data=books)
plt.show()

Summaries with .groupby() and .agg()

#1
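# Print the mean and standard deviation of rates by year
# (numeric columns only: recent pandas versions raise a TypeError when averaging text columns)
print(unemployment.select_dtypes('number').agg(['mean', 'std']))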

#2
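# Print yearly mean and standard deviation grouped by continent
# (restricted to the numeric columns: recent pandas versions raise a TypeError when averaging text columns)
numeric_columns = list(unemployment.select_dtypes('number').columns)
print(unemployment.groupby('continent')[numeric_columns].agg(['mean', 'std']))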

Named aggregations

continent_summary = unemployment.groupby("continent").agg(
    # Create the mean_rate_2021 column
    mean_rate_2021 = ('2021', 'mean'),
    # Create the std_rate_2021 column
    std_rate_2021 = ('2021', 'std'),
)
print(continent_summary)

Visualizing categorical summaries

# Create a bar plot of continents and their average unemployment
sns.barplot(x='continent', y='2021', data=unemployment)
plt.show()

Based on DataCamp
