Relationships in Data

Based on DataCamp

Patterns over time

When data includes dates or times, it is worth checking whether there are patterns over time.

Importing DateTime data

import pandas as pd

df = pd.read_csv('datasets.csv', parse_dates=['birth_date'])

Converting to DateTime data

df['birth_date'] = pd.to_datetime(df['birth_date'])
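Both routes should end with the same datetime column. A minimal sketch, assuming a small made-up CSV held in memory in place of 'datasets.csv' (the names and dates are hypothetical):

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for 'datasets.csv'
csv_text = "name,birth_date\nAda,1990-05-17\nGrace,1985-12-09\n"

# Route 1: parse the column as dates while reading the file
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["birth_date"])

# Route 2: read the column as plain text, then convert it afterwards
raw = pd.read_csv(io.StringIO(csv_text))
dtype_before = raw["birth_date"].dtype
raw["birth_date"] = pd.to_datetime(raw["birth_date"])

print(parsed.dtypes)
print(dtype_before, "->", raw["birth_date"].dtype)
```

Without parse_dates the column arrives as text, so the .dt accessor is not available until the conversion has been done.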

Creating DateTime data

# Source columns must be named month, day, and year
df['birth_date'] = pd.to_datetime(df[['month', 'day', 'year']])

Extracting DateTime data

df['month'] = df['birth_date'].dt.month
df['date'] = df['birth_date'].dt.day
df['year'] = df['birth_date'].dt.year
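Extraction and creation are inverse steps. The sketch below, on a made-up two-row frame, pulls the parts out with .dt and then reassembles them. It names the day-of-month column 'day' rather than 'date', because pd.to_datetime looks for columns called year, month, and day when it assembles a date from a DataFrame.

```python
import pandas as pd

# Hypothetical frame; the birth_date values are made up
df = pd.DataFrame({"birth_date": pd.to_datetime(["1990-05-17", "1985-12-09"])})

# Pull the components out with the .dt accessor
df["month"] = df["birth_date"].dt.month
df["day"] = df["birth_date"].dt.day
df["year"] = df["birth_date"].dt.year

# Reassemble the dates from the three component columns
rebuilt = pd.to_datetime(df[["year", "month", "day"]])

print(df[["month", "day", "year"]])
print((rebuilt == df["birth_date"]).all())
```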

Visualizing patterns over time

import seaborn as sns
import matplotlib.pyplot as plt

sns.lineplot(x='marriage_month', y='marriage_duration', data=df)
plt.show()

Importing DateTime data

# Import divorce.csv, parsing the appropriate columns as dates in the import
divorce = pd.read_csv('divorce.csv', parse_dates=['divorce_date', 'marriage_date', 'dob_man', 'dob_woman'])
print(divorce.dtypes)

Updating data type to DateTime

# Convert the marriage_date column to DateTime values
divorce["marriage_date"] = pd.to_datetime(divorce['marriage_date'])

Visualizing relationships over time

# Define the marriage_year column
divorce["marriage_year"] = divorce["marriage_date"].dt.year

# Create a line plot showing the average number of kids by year
sns.lineplot(x='marriage_year', y='num_kids', data=divorce)
plt.show()
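By default, sns.lineplot collapses rows that share an x value to their mean and draws a confidence band around it, which is why the plot shows an average per year. The same aggregation can be sketched in plain pandas. The four rows below are made up and only stand in for divorce.csv.

```python
import pandas as pd

# Hypothetical stand-in for the divorce data
divorce = pd.DataFrame({
    "marriage_date": pd.to_datetime(
        ["2000-03-01", "2000-08-15", "2001-01-20", "2001-06-30"]
    ),
    "num_kids": [2, 4, 1, 3],
})

# Same derived column as in the exercise
divorce["marriage_year"] = divorce["marriage_date"].dt.year

# Mean number of kids per marriage year: the line that lineplot draws
avg_kids = divorce.groupby("marriage_year")["num_kids"].mean()
print(avg_kids)
```

Inspecting the grouped means is a quick way to check what the line in the plot represents.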

Correlation

  • Describes the direction and strength of the relationship between two variables

  • Helps us use variables to predict future outcomes

  • .corr() calculates the Pearson correlation coefficient, measuring linear relationships
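To make the "linear" qualifier concrete, here is a small sketch that computes the Pearson coefficient by hand, compares it with .corr(), and then shows a perfectly curved relationship scoring zero. All numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a roughly linear, positive relationship
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "y": [2.0, 4.0, 5.0, 4.0, 5.0]})

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
x_dev = df["x"] - df["x"].mean()
y_dev = df["y"] - df["y"].mean()
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# pandas computes the same quantity
r_pandas = df["x"].corr(df["y"])

# y = x**2 on a symmetric range: strongly related, but not linearly
curve_x = pd.Series([-2.0, -1.0, 0.0, 1.0, 2.0])
r_curve = curve_x.corr(curve_x ** 2)

print(r_manual, r_pandas, r_curve)
```

A coefficient near zero therefore rules out a linear relationship only, which is one reason to look at scatter plots alongside the numbers.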

Visualizing correlation

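# numeric_only=True skips text columns, which recent pandas versions otherwise reject
sns.heatmap(df.corr(numeric_only=True), annot=True)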
plt.show()
sns.pairplot(data=divorce)
plt.show()

# Limit variables
sns.pairplot(data=divorce, vars=['income_man', 'income_woman', 'marriage_duration'])
plt.show()
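The heatmap is only a picture of the table that .corr() returns, so it can help to look at that table directly. A minimal sketch on a made-up frame with one text column, assuming pandas 1.5 or later for the numeric_only argument (the column values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the divorce data, including one text column
toy = pd.DataFrame({
    "income_man": [2000.0, 3500.0, 5000.0, 6500.0],
    "marriage_duration": [3.0, 8.0, 6.0, 12.0],
    "education_man": ["Secondary", "Professional", "Secondary", "Professional"],
})

# numeric_only=True leaves the text column out of the calculation
corr_matrix = toy.corr(numeric_only=True)
print(corr_matrix.round(2))
```

The result is a square, symmetric table with 1.0 on the diagonal, since every variable correlates perfectly with itself.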

Visualizing variable relationships

# Create the scatterplot
sns.scatterplot(x='marriage_duration', y='num_kids', data=divorce)
plt.show()

Visualizing multiple variable relationships

# Create a pairplot for income_woman and marriage_duration
sns.pairplot(data=divorce, vars=['income_woman', 'marriage_duration'])
plt.show()

Factor relationships and distributions

Categorical data in scatter plots

# Create the scatter plot
sns.scatterplot(x='woman_age_marriage', y='income_woman', data=divorce, hue='education_woman')
plt.show()

Exploring with KDE plots

# Update the KDE plot to show a cumulative distribution function
sns.kdeplot(data=divorce, x="marriage_duration", hue="num_kids", cut=0, cumulative=True)
plt.show()