Relationships in Data

Patterns over time

When data includes date or time values, it is worth checking whether any variables show patterns over time.

Importing DateTime data

df = pd.read_csv('datasets.csv', parse_dates=['birth_date'])
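
To see what `parse_dates` changes, here is a minimal sketch that reads the same text with and without it. The in-memory CSV stands in for `datasets.csv`, and its column names and values are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for datasets.csv
csv_text = "name,birth_date\nAda,1990-05-17\nBen,1985-11-02\n"

# Without parse_dates, the date column is read as plain strings
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, the column is read as a datetime64 column
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["birth_date"])

print(raw["birth_date"].dtype)
print(parsed["birth_date"].dtype)
```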

Converting to DateTime data

df['birth_date'] = pd.to_datetime(df['birth_date'])
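
Real columns often contain values that cannot be parsed as dates. The sketch below, on invented data, shows one way to handle that: `errors="coerce"` turns unparseable entries into `NaT` rather than raising an error.

```python
import pandas as pd

# Hypothetical data: dates stored as strings, one of them unparseable
df = pd.DataFrame({"birth_date": ["1990-05-17", "1985-11-02", "not a date"]})

# errors="coerce" replaces values that cannot be parsed with NaT
df["birth_date"] = pd.to_datetime(df["birth_date"], errors="coerce")

print(df["birth_date"].dtype)
print(df["birth_date"].isna().sum())  # number of values that failed to parse
```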

Creating DateTime data

# The component columns must be named year, month, and day
df['birth_date'] = pd.to_datetime(df[['year', 'month', 'day']])
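
As a small runnable illustration, the sketch below builds a date column from separate component columns. The data is invented. `pd.to_datetime` expects a DataFrame whose columns are named `year`, `month`, and `day`.

```python
import pandas as pd

# Hypothetical component columns; the names year, month, day are required
df = pd.DataFrame({"year": [1990, 1985], "month": [5, 11], "day": [17, 2]})

# Passing a DataFrame of components assembles one datetime per row
df["birth_date"] = pd.to_datetime(df[["year", "month", "day"]])

print(df)
```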

Extracting DateTime data

df['month'] = df['birth_date'].dt.month
df['date'] = df['birth_date'].dt.day
df['year'] = df['birth_date'].dt.year
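
The `.dt` accessor only works on datetime columns. Here is a self-contained sketch on invented data. The `weekday` column is an extra illustration, not part of the original notes.

```python
import pandas as pd

# Hypothetical datetime column
df = pd.DataFrame({"birth_date": pd.to_datetime(["1990-05-17", "1985-11-02"])})

# Each attribute returns one component per row
df["month"] = df["birth_date"].dt.month
df["day"] = df["birth_date"].dt.day
df["year"] = df["birth_date"].dt.year

# Extra illustration: the name of the weekday for each date
df["weekday"] = df["birth_date"].dt.day_name()

print(df)
```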

Visualizing patterns over time

sns.lineplot(x='marriage_month', y='marriage_duration', data=df)
plt.show()

Importing DateTime data

# Import divorce.csv, parsing the appropriate columns as dates in the import
divorce = pd.read_csv('divorce.csv', parse_dates=['divorce_date', 'marriage_date', 'dob_man', 'dob_woman'])
print(divorce.dtypes)

Updating data type to DateTime

# Convert the marriage_date column to DateTime values
divorce["marriage_date"] = pd.to_datetime(divorce['marriage_date'])

Visualizing relationships over time

# Define the marriage_year column
divorce["marriage_year"] = divorce["marriage_date"].dt.year

# Create a line plot showing the average number of kids by year
sns.lineplot(x='marriage_year', y='num_kids', data=divorce)
plt.show()
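
When several rows share the same x value, `sns.lineplot` by default draws the mean of y at each x, with a shaded confidence interval around it. The values behind the line can be approximated with a `groupby`, as in this sketch on invented data.

```python
import pandas as pd

# Hypothetical stand-in for the divorce dataset
divorce = pd.DataFrame({
    "marriage_date": pd.to_datetime(
        ["2000-03-01", "2000-09-15", "2001-06-30", "2001-12-24", "2002-02-14"]
    ),
    "num_kids": [2, 4, 1, 3, 0],
})

divorce["marriage_year"] = divorce["marriage_date"].dt.year

# Mean number of kids per marriage year: the quantity the line traces
kids_by_year = divorce.groupby("marriage_year")["num_kids"].mean()

print(kids_by_year)
```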

Correlation

Correlation describes the direction and strength of the relationship between two variables.
It helps us use variables to predict future outcomes.
.corr() calculates the Pearson correlation coefficient, which measures the strength of a linear relationship and ranges from -1 to 1.
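
To make the coefficient concrete, the sketch below computes Pearson's r by hand and compares it with pandas. The coefficient is the sum of the products of the deviations from the mean, divided by the square root of the product of the summed squared deviations. The data is invented.

```python
import numpy as np
import pandas as pd

# Hypothetical paired observations
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0], "y": [2.0, 4.0, 5.0, 4.0, 5.0]})

# Deviations of each value from its column mean
x_dev = df["x"] - df["x"].mean()
y_dev = df["y"] - df["y"].mean()

# Pearson's r computed from the definition
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# The same quantity from pandas
r_pandas = df["x"].corr(df["y"])

print(round(r_manual, 4), round(r_pandas, 4))
```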

Visualizing correlation

# numeric_only=True skips text columns, which raise an error in pandas 2.0+
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()
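
The heatmap is a colored view of the correlation matrix, so it helps to look at the matrix itself. In this sketch the column names and values are invented. The text column is dropped by `numeric_only=True`, the diagonal is 1 because each variable correlates perfectly with itself, and the matrix is symmetric.

```python
import pandas as pd

# Hypothetical data with two numeric columns and one text column
df = pd.DataFrame({
    "income_man": [3000.0, 4500.0, 5200.0, 6100.0],
    "income_woman": [2800.0, 4000.0, 5000.0, 5900.0],
    "education_man": ["Secondary", "Professional", "Professional", "Other"],
})

# Only the numeric columns appear in the result
corr_matrix = df.corr(numeric_only=True)

print(corr_matrix.round(2))
```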

sns.pairplot(data=divorce)
plt.show()

# Limit variables
sns.pairplot(data=divorce, vars=['income_man', 'income_woman', 'marriage_duration'])
plt.show()

Visualizing variable relationships

# Create the scatterplot
sns.scatterplot(x='marriage_duration', y='num_kids', data=divorce)
plt.show()

Visualizing multiple variable relationships

# Create a pairplot for income_woman and marriage_duration
sns.pairplot(data=divorce, vars=['income_woman', 'marriage_duration'])
plt.show()

Factor relationships and distributions

Categorical data in scatter plots

# Create the scatter plot
sns.scatterplot(x='woman_age_marriage', y='income_woman', data=divorce, hue='education_woman')
plt.show()

Exploring with KDE plots

# Update the KDE plot to show a cumulative distribution function
sns.kdeplot(data=divorce, x="marriage_duration", hue="num_kids", cut=0, cumulative=True)
plt.show()
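
With `cumulative=True`, each curve estimates the share of marriages in a group that lasted at most a given duration, and `cut=0` stops the curve at the observed minimum and maximum. A rough numeric counterpart is the empirical share per group, sketched here on invented data with an arbitrary 10-year threshold.

```python
import pandas as pd

# Hypothetical stand-in for the divorce dataset
divorce = pd.DataFrame({
    "marriage_duration": [2, 5, 9, 14, 3, 8, 12, 20],
    "num_kids": [0, 0, 0, 0, 2, 2, 2, 2],
})

# Share of marriages lasting at most 10 years, per number of kids.
# The mean of a boolean column is the fraction of True values.
lasted_at_most_10 = divorce["marriage_duration"] <= 10
share_by_kids = lasted_at_most_10.groupby(divorce["num_kids"]).mean()

print(share_by_kids)
```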

Based on DataCamp
