Getting to Know a Dataset

Based on DataCamp

Initial exploration

Exploratory Data Analysis (EDA)

The process of cleaning and reviewing data to...

  • derive insights, such as descriptive statistics and correlation

  • generate hypotheses for experiments

Results

Inform the next step for the dataset

Pandas methods for initial exploration

  • head

    We can use the head method to look at the first rows of the DataFrame and see which columns the data contains.

      df.head()
    
  • info

    The info method is a quick way to summarize the number of non-null values in each column (and therefore spot missing values), the data type of each column, and memory usage.

      df.info()
    
  • value_counts

    A common question about categorical data is how many data points we have in each category. We can use value_counts to answer that question.

      df.value_counts('category')
    
  • describe

    We can use describe to get summary statistics for the numerical columns of our dataset.

      df.describe()
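
As a minimal sketch of the four methods above, the snippet below builds a small hypothetical DataFrame (the column names and values are invented for illustration and are not taken from the course data) and calls each method once.

```python
import pandas as pd

# Hypothetical toy data; names and values are made up for illustration
df = pd.DataFrame({
    "category": ["a", "b", "a", "c", "a"],
    "value": [1.0, 2.5, None, 4.0, 5.5],
})

# First rows (head returns 5 rows by default; pass n to change that)
first_rows = df.head(3)
print(first_rows)

# Data types, non-null counts and memory usage; info() prints directly
df.info()

# Number of rows in each category, most frequent first
counts = df["category"].value_counts()
print(counts)

# Summary statistics; by default only numeric columns are included
summary = df.describe()
print(summary)
```

Note that the count row of describe counts non-missing values only, so comparing it with the number of rows is another quick way to spot missing data.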
    

Functions for initial exploration

#1
# Print the first five rows of unemployment
print(unemployment.head())

#2
# Print a summary of non-missing values and data types in the unemployment DataFrame
# info() prints its summary directly and returns None, so print() is not needed
unemployment.info()

#3
# Print summary statistics for numerical columns in unemployment
print(unemployment.describe())

Counting categorical values

# Count the values associated with each continent in unemployment
print(unemployment['continent'].value_counts())

Global unemployment in 2021

# Import the required visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Create a histogram of 2021 unemployment; show a full percent in each bin
sns.histplot(data=unemployment, x='2021', binwidth=1)
plt.show()

Data validation

Data validation is an important early step in EDA: we need to check whether the data types and value ranges are what we expect.

Validating data types

We can take a look at the data type of each column using info or dtypes to validate data types.

df.info()

# Data types only
df.dtypes

Updating data types

df['year'] = df['year'].astype(int)
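
A small sketch of the conversion above, assuming a hypothetical year column that was read in as text. The data is invented for illustration.

```python
import pandas as pd

# Hypothetical data: years stored as strings
df = pd.DataFrame({"year": ["2019", "2020", "2021"]})
print(df.dtypes)

# Convert the column to integers and confirm the change
df["year"] = df["year"].astype(int)
print(df.dtypes)
```

astype(int) raises an error if a value cannot be converted (for example a missing value or a non-numeric string), so a failed conversion is itself a useful validation signal.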

Validating categorical data

We can validate categorical data by comparing the values in a column to a list of expected values using the isin method, which returns a boolean Series.

df['gender'].isin(['Male', 'Female']) # is in ['Male', 'Female']

~df['gender'].isin(['Male', 'Female']) # is not in ['Male', 'Female']
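
The boolean Series returned by isin can be used as a filter. The sketch below, on invented data with one deliberately inconsistent entry, shows one way to pull out the rows whose value is not in the expected list.

```python
import pandas as pd

# Hypothetical data with one unexpected entry ("female" in lower case)
df = pd.DataFrame({"gender": ["Male", "Female", "female", "Male"]})
expected = ["Male", "Female"]

# True where the value is NOT one of the expected categories
unexpected = ~df["gender"].isin(expected)

# Use the boolean Series as a row filter to inspect the offending rows
print(df[unexpected])
print(unexpected.sum(), "unexpected value(s)")
```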

Validating numeric data

df.select_dtypes('number') # keep only the numeric columns

df['year'].min() # Min year
df['year'].max() # Max year
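
Beyond looking at the minimum and maximum, a range check can flag the individual rows that fall outside the expected limits. This is a sketch on invented data; the bounds 1900 and 2025 are assumptions chosen only for the example.

```python
import pandas as pd

# Hypothetical data; 2201 stands in for a data-entry error
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "year": [1999, 2201, 2015],
})

# Only the numeric columns survive the selection
numeric = df.select_dtypes("number")
print(numeric.columns.tolist())

# Assumed valid range, chosen only for this example
lower, upper = 1900, 2025

# between() is inclusive on both ends by default; ~ negates the result
out_of_range = ~df["year"].between(lower, upper)
print(df[out_of_range])
```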

Detecting data types

# Update the data type of the 2019 column to a float
unemployment["2019"] = unemployment['2019'].astype(float)
# Print the dtypes to check your work
print(unemployment.dtypes)

Validating continents

#1
# Define a Series describing whether each continent is outside of Oceania
not_oceania = unemployment['continent'] != 'Oceania'

#2
# Print unemployment without records related to countries in Oceania
print(unemployment[not_oceania])

Validating range

# Print the minimum and maximum unemployment rates during 2021
print(unemployment['2021'].min(), unemployment['2021'].max())

# Create a boxplot of 2021 unemployment rates, broken down by continent
sns.boxplot(x='2021', y='continent', data=unemployment)
plt.show()

Data summarization

Exploring groups of data

  • .groupby() groups data by category

  • An aggregating function indicates how to summarize the grouped data

Aggregating functions

  • .sum()

  • .count()

  • .min()

  • .max()

  • .var()

  • .std()
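
To make the combination of .groupby() and an aggregating function concrete, here is a minimal sketch on a small invented books DataFrame (the rows are hypothetical; only the column names follow the notes).

```python
import pandas as pd

# Hypothetical data for illustration
books = pd.DataFrame({
    "genre": ["Fiction", "Fiction", "Non Fiction", "Non Fiction"],
    "rating": [4.0, 5.0, 3.0, 4.0],
    "year": [2010, 2012, 2011, 2015],
})

# Group by genre, pick the numeric column, then apply one aggregating function
mean_rating = books.groupby("genre")["rating"].mean()
print(mean_rating)

# Several aggregating functions can be applied at once with .agg()
rating_stats = books.groupby("genre")["rating"].agg(["min", "max", "count"])
print(rating_stats)
```

Selecting the column before aggregating keeps text columns out of the calculation.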

Aggregating ungrouped data

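# Keep numeric columns only; pandas 2.x raises a TypeError on text columns here
books.select_dtypes('number').agg(['mean', 'std'])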

Specifying aggregations for columns

books.agg({'rating': ['mean', 'std'], 'year': ['median']})
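
When a dictionary is passed to .agg(), each column gets only the aggregations listed for it, and the remaining cells of the result are filled with NaN. A short sketch on invented data:

```python
import pandas as pd

# Hypothetical data for illustration
books = pd.DataFrame({
    "genre": ["Fiction", "Fiction", "Non Fiction", "Non Fiction"],
    "rating": [4.0, 5.0, 3.0, 4.0],
    "year": [2010, 2012, 2011, 2015],
})

# mean and std for rating, median for year
result = books.agg({"rating": ["mean", "std"], "year": ["median"]})
print(result)

# Combinations that were not requested are left as NaN
print(pd.isna(result.loc["median", "rating"]))
```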

Named summary columns

books.groupby('genre').agg(
    mean_rating=('rating', 'mean'),
    std_rating=('rating', 'std'),
    median_year=('year', 'median')
)
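
Named aggregations take the form new_column=(source_column, function). The sketch below applies the same pattern to a small invented books DataFrame to show the resulting column names.

```python
import pandas as pd

# Hypothetical data for illustration
books = pd.DataFrame({
    "genre": ["Fiction", "Fiction", "Non Fiction", "Non Fiction"],
    "rating": [4.0, 5.0, 3.0, 4.0],
    "year": [2010, 2012, 2011, 2015],
})

# Each keyword becomes a column name in the result
summary = books.groupby("genre").agg(
    mean_rating=("rating", "mean"),
    median_year=("year", "median"),
)
print(summary.columns.tolist())
print(summary)
```

Compared with passing a list of functions, this gives a flat set of descriptive column names instead of a two-level column index.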

Visualizing categorical summaries

sns.barplot(x='genre', y='rating', data=books)
plt.show()

Summaries with .groupby() and .agg()

#1
# Print the mean and standard deviation of rates by year
# (numeric columns only; pandas 2.x raises a TypeError on text columns)
print(unemployment.select_dtypes('number').agg(['mean', 'std']))

#2
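# Print yearly mean and standard deviation grouped by continent
# (select the numeric year columns first; pandas 2.x no longer drops text columns silently)
year_cols = unemployment.select_dtypes('number').columns.tolist()
print(unemployment.groupby('continent')[year_cols].agg(['mean', 'std']))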

Named aggregations

continent_summary = unemployment.groupby("continent").agg(
    # Create the mean_rate_2021 column
    mean_rate_2021=('2021', 'mean'),
    # Create the std_rate_2021 column
    std_rate_2021=('2021', 'std'),
)
print(continent_summary)

Visualizing categorical summaries

# Create a bar plot of continents and their average unemployment
sns.barplot(x='continent', y='2021', data=unemployment)
plt.show()