Patterns over time
When data includes dates or times, check whether the values show patterns over time.
Importing DateTime data
import pandas as pd
df = pd.read_csv('datasets.csv', parse_dates=['birth_date'])
Converting to DateTime data
df['birth_date'] = pd.to_datetime(df['birth_date'])
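The two routes above (parsing during import, or converting afterwards) should give the same result. A minimal sketch, assuming a small in-memory CSV in place of a real file; the names `csv_text`, `raw`, and `parsed` are illustrative only:

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = "name,birth_date\nAna,1990-05-17\nBen,1985-11-02\n"

# Without parse_dates, birth_date is read as text, not as dates
raw = pd.read_csv(io.StringIO(csv_text))

# Option 1: parse the column while importing
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["birth_date"])

# Option 2: convert the column after importing
raw["birth_date"] = pd.to_datetime(raw["birth_date"])

# Both columns should now report a datetime dtype
print(parsed.dtypes)
print(raw.dtypes)
```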
Creating DateTime data
# Columns must be named month, day, and year
df['birth_date'] = pd.to_datetime(df[['month', 'day', 'year']])
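When a date is built from separate columns, `pd.to_datetime` expects a DataFrame whose columns are named `year`, `month`, and `day`; the column order does not matter. A short sketch with made-up values (the `parts` frame is hypothetical):

```python
import pandas as pd

# Hypothetical frame holding the date components in separate columns
parts = pd.DataFrame({"month": [5, 11], "day": [17, 2], "year": [1990, 1985]})

# Pass the sub-DataFrame of components; pandas matches them by column name
parts["birth_date"] = pd.to_datetime(parts[["month", "day", "year"]])

print(parts)
```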
Extracting DateTime data
df['month'] = df['birth_date'].dt.month
df['date'] = df['birth_date'].dt.day
df['year'] = df['birth_date'].dt.year
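The `.dt` accessor is only available on datetime columns; on a column of date strings it raises an `AttributeError`, so convert first. A small sketch on invented data (the `people` frame and the extra `weekday` column are illustrative):

```python
import pandas as pd

# Hypothetical frame with dates stored as text
people = pd.DataFrame({"birth_date": ["1990-05-17", "1985-11-02"]})

# Convert to datetime so that the .dt accessor becomes available
people["birth_date"] = pd.to_datetime(people["birth_date"])

# Pull out the individual components
people["month"] = people["birth_date"].dt.month
people["day"] = people["birth_date"].dt.day
people["year"] = people["birth_date"].dt.year

# .dt also offers derived values such as the weekday name
people["weekday"] = people["birth_date"].dt.day_name()

print(people)
```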
Visualizing patterns over time
import matplotlib.pyplot as plt
import seaborn as sns
sns.lineplot(x='marriage_month', y='marriage_duration', data=df)
plt.show()
Importing DateTime data
# Import divorce.csv, parsing the appropriate columns as dates in the import
divorce = pd.read_csv('divorce.csv', parse_dates=['divorce_date', 'marriage_date', 'dob_man', 'dob_woman'])
print(divorce.dtypes)
Updating data type to DateTime
# Convert the marriage_date column to DateTime values
divorce["marriage_date"] = pd.to_datetime(divorce['marriage_date'])
Visualizing relationships over time
# Define the marriage_year column
divorce["marriage_year"] = divorce["marriage_date"].dt.year
# Create a line plot showing the average number of kids by year
sns.lineplot(x='marriage_year', y='num_kids', data=divorce)
plt.show()
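By default, `sns.lineplot` aggregates rows that share an x value to their mean and shades a 95% confidence interval around it. The line itself can be checked with a plain pandas `groupby`. A sketch on a made-up miniature of the data (the `toy` frame and its values are invented, not taken from divorce.csv):

```python
import pandas as pd

# Hypothetical miniature of the divorce data
toy = pd.DataFrame({
    "marriage_date": pd.to_datetime(
        ["2000-03-01", "2000-09-15", "2001-06-30", "2001-12-24"]
    ),
    "num_kids": [1, 3, 2, 4],
})

# Same step as above: derive the year from the datetime column
toy["marriage_year"] = toy["marriage_date"].dt.year

# Mean number of kids per marriage year, i.e. the values the line passes through
avg_kids = toy.groupby("marriage_year")["num_kids"].mean()

print(avg_kids)
```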
Correlation
Describes the direction and strength of the relationship between two variables
Helps us use variables to predict future outcomes
.corr() calculates the Pearson correlation coefficient, which measures linear relationships
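The Pearson coefficient is the covariance of two variables divided by the product of their standard deviations, so it always falls between -1 and 1. A minimal sketch comparing a hand computation with pandas, on invented numbers (the `toy` frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data with a roughly increasing relationship
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0],
                    "y": [2.0, 4.0, 5.0, 4.0, 5.0]})

# Deviations of each variable from its own mean
x_dev = toy["x"] - toy["x"].mean()
y_dev = toy["y"] - toy["y"].mean()

# Pearson r: sum of co-deviations over the root of the product of squared deviations
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# The same quantity from pandas (Pearson is the default method)
r_pandas = toy["x"].corr(toy["y"])

print(r_manual, r_pandas)
```

A value near 0 only rules out a linear relationship; a strong curved relationship can still be present.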
Visualizing correlation
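# numeric_only=True skips text columns, which .corr() cannot process in recent pandas versions
sns.heatmap(df.corr(numeric_only=True), annot=True)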
plt.show()
sns.pairplot(data=divorce)
plt.show()
# Limit variables
sns.pairplot(data=divorce, vars=['income_man', 'income_woman', 'marriage_duration'])
plt.show()
Visualizing variable relationships
# Create the scatterplot
sns.scatterplot(x='marriage_duration', y='num_kids', data=divorce)
plt.show()
Visualizing multiple variable relationships
# Create a pairplot for income_woman and marriage_duration
sns.pairplot(data=divorce, vars=['income_woman', 'marriage_duration'])
plt.show()
Factor relationships and distributions
Categorical data in scatter plots
# Create the scatter plot
sns.scatterplot(x='woman_age_marriage', y='income_woman', data=divorce, hue='education_woman')
plt.show()
Exploring with KDE plots
# Update the KDE plot to show a cumulative distribution function
sns.kdeplot(data=divorce, x="marriage_duration", hue="num_kids", cut=0, cumulative=True)
plt.show()