Correlation and Experimental Design

Based on DataCamp

Correlation

Relationship between two variables

[Figure: scatter diagram (scatter plot)]

x = explanatory/independent variable

y = response/dependent variable

Correlation coefficient

  • Quantifies the linear relationship between two variables

  • Number between -1 and 1

  • Magnitude corresponds to strength of relationship

  • Sign (+ or -) corresponds to direction of relationship

  • The closer the value is to zero, the weaker the linear relationship
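The properties above can be checked with a small hand-rolled calculation. This is a minimal sketch of the Pearson correlation coefficient, assuming that is the coefficient meant here; the helper name pearson_r and the sample values are made up for illustration and are not from the msleep dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r: sum of co-deviations divided by the product of the deviation norms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Made-up illustrative values
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.1, 3.9, 6.2, 8.1, 9.8]    # rises with x, so r should be close to +1
y_down = [9.8, 8.1, 6.2, 3.9, 2.1]  # falls with x, so r should be close to -1

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
print(r_up, r_down)
```

The sign follows the direction of the trend and the magnitude stays within [-1, 1]; np.corrcoef(x, y_up)[0, 1] should give the same number as the helper.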

Visualize relationships

import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plot of REM sleep against total sleep
sns.scatterplot(x="sleep_total", y="sleep_rem", data=msleep)
plt.show()

Adding a trendline

A trendline makes it easier to see whether two variables are linearly related

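import matplotlib.pyplot as plt
import seaborn as sns

# lmplot draws the scatter plot plus a fitted regression line;
# ci=None hides the confidence band around the line
sns.lmplot(x='sleep_total', y='sleep_rem', data=msleep, ci=None)
plt.show()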

Calculating correlation

msleep['sleep_total'].corr(msleep['sleep_rem']) # Out: 0.751755

msleep['sleep_rem'].corr(msleep['sleep_total']) # Out: 0.751755

The correlation between x and y equals the correlation between y and x

Relationship between variables

#1.
# Create a scatterplot of happiness_score vs. life_exp
sns.scatterplot(x='life_exp', y='happiness_score', data=world_happiness)

# Show plot
plt.show()

#2.
# Create scatterplot of happiness_score vs life_exp with trendline
sns.lmplot(x='life_exp', y='happiness_score', data=world_happiness, ci=None)

# Show plot
plt.show()

#4
# Correlation between life_exp and happiness_score
cor = world_happiness['life_exp'].corr(world_happiness['happiness_score'])

print(cor)

Correlation caveats

  • Correlations only account for a linear relationship

  • Always visualize data when possible

  • Correlation does not imply causation: x being correlated with y doesn't mean x causes y

  • When data is highly skewed, a transformation such as the log (np.log) can make the relationship more linear

  • Common transformations:

    • Log transformation (log(x))

    • Square root transformation (sqrt(x))

    • Reciprocal transformation (1/x)

    • Combination of these, e.g.:

      • log(x) and sqrt(y)

      • sqrt(x) and 1/y

      • 1/x and log(y)
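The effect of a transformation can be sketched on synthetic data. The example below assumes a made-up relationship in which y depends linearly on log(x) and x is heavily right-skewed; the variable names and parameters are illustrative only and do not come from the world_happiness dataset.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical skewed explanatory variable (log-normal, long right tail)
x = rng.lognormal(mean=0.0, sigma=1.5, size=500)

# Hypothetical response: linear in log(x), plus noise
y = 2.0 * np.log(x) + rng.normal(0.0, 0.5, size=500)

# Correlation of y with x under each candidate transformation of x
r_raw = np.corrcoef(x, y)[0, 1]
r_sqrt = np.corrcoef(np.sqrt(x), y)[0, 1]
r_log = np.corrcoef(np.log(x), y)[0, 1]

print(f"raw x:   {r_raw:.3f}")
print(f"sqrt(x): {r_sqrt:.3f}")
print(f"log(x):  {r_log:.3f}")
```

Because the simulated relationship is linear in log(x), the log transformation gives the strongest correlation here. With real data the right transformation is not known in advance, so plotting each candidate is still advisable.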

Why use a transformation?

Certain statistical methods rely on variables having a linear relationship

  • Correlation coefficient

  • Linear regression

What can't correlation measure?

#1
# Scatterplot of gdp_per_cap and life_exp
sns.scatterplot(x='gdp_per_cap', y='life_exp', data=world_happiness)

# Show plot
plt.show()

#2
# Correlation between gdp_per_cap and life_exp
cor = world_happiness['gdp_per_cap'].corr(world_happiness['life_exp'])

print(cor)
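Why a single coefficient can mislead is easiest to see with an extreme synthetic case. In this minimal sketch (made-up data, not the world_happiness dataset), y is completely determined by x, yet the relationship is not linear, so the correlation coefficient comes out at essentially zero.

```python
import numpy as np

# x is symmetric around zero; y is an exact, but nonlinear, function of x
x = np.linspace(-3.0, 3.0, 101)
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(f"correlation between x and x**2: {r:.3f}")
```

A scatter plot of these points would show a clear U shape, which is why visualizing the data matters before trusting the number.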

Transforming variables

#1
# Scatterplot of happiness_score vs. gdp_per_cap
sns.scatterplot(x='gdp_per_cap', y='happiness_score', data=world_happiness)
plt.show()

# Calculate correlation
cor = world_happiness['happiness_score'].corr(world_happiness['gdp_per_cap'])
print(cor) # Out: 0.727973301222298

#2
# Create log_gdp_per_cap column
world_happiness['log_gdp_per_cap'] = np.log(world_happiness['gdp_per_cap'])

# Scatterplot of happiness_score vs. log_gdp_per_cap
sns.scatterplot(x='log_gdp_per_cap', y='happiness_score', data=world_happiness)
plt.show()

# Calculate correlation
cor = world_happiness['log_gdp_per_cap'].corr(world_happiness['happiness_score'])
print(cor) # Out: 0.8043146004918288

Does sugar improve happiness?

#1
# Scatterplot of grams_sugar_per_day and happiness_score
sns.scatterplot(x='grams_sugar_per_day', y='happiness_score', data=world_happiness)
plt.show()

# Correlation between grams_sugar_per_day and happiness_score
cor = world_happiness['grams_sugar_per_day'].corr(world_happiness['happiness_score'])
print(cor)
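A correlation like this one does not show that sugar causes happiness. The simulation below is a sketch under loud assumptions: a hypothetical third variable, wealth, drives both sugar consumption and happiness, and sugar has no direct effect at all. All numbers are invented, and the residual-based adjustment is an extra illustration that is not part of these notes.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 2000

# Hypothetical confounder: national wealth in standardized units
wealth = rng.normal(0.0, 1.0, n)

# In this simulation sugar has NO direct effect on happiness;
# both depend on wealth plus independent noise
sugar = 50.0 + 10.0 * wealth + rng.normal(0.0, 5.0, n)
happiness = 5.0 + 0.8 * wealth + rng.normal(0.0, 0.5, n)

r_raw = np.corrcoef(sugar, happiness)[0, 1]

def residuals(values, predictor):
    """What is left of `values` after subtracting a straight-line fit on `predictor`."""
    slope, intercept = np.polyfit(predictor, values, deg=1)
    return values - (slope * predictor + intercept)

# Correlation once the linear effect of wealth is removed from both variables
r_adjusted = np.corrcoef(residuals(sugar, wealth),
                         residuals(happiness, wealth))[0, 1]

print(f"sugar vs happiness:            {r_raw:.3f}")
print(f"after accounting for wealth:   {r_adjusted:.3f}")
```

The raw correlation is strong even though sugar does nothing in the simulation, and it largely disappears once wealth is accounted for. Whether anything similar holds in the real data cannot be concluded from a correlation alone.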

Design of experiments

Controlled experiments

  • Participants are assigned by researchers to either a treatment group or a control group, e.g. A/B Testing

    • Treatment group sees an advertisement

    • Control group doesn't

  • Groups should be comparable so that causation can be inferred; if they are not, confounding (bias) can result

Best practices

Good experimental design eliminates as much bias as possible.

  • The less bias in an experiment, the more reliable its conclusions

Tools

  • Randomized controlled trials

    • Participants are assigned to treatment/control randomly, not based on any other characteristics

    • Choosing randomly helps ensure that groups are comparable

  • Placebo

    • Resembles the treatment, but has no effect

    • Participants will not know which group they're in

  • Double-blind trials

    • Person administering the treatment/running the study doesn't know whether the treatment is real or a placebo

    • Prevents bias in the responses and/or the analysis of results
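Random assignment, the first tool above, can be sketched in a few lines. The participant IDs, group sizes, and seed below are hypothetical; the point is only that group membership comes from a shuffle rather than from any characteristic of the participants.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# 20 hypothetical participants, identified by number
participant_ids = np.arange(1, 21)

# Shuffle the IDs, then split the shuffled order in half
shuffled = rng.permutation(participant_ids)
half = len(shuffled) // 2
treatment = np.sort(shuffled[:half])
control = np.sort(shuffled[half:])

print("treatment:", treatment)
print("control:  ", control)
```

Every participant lands in exactly one group, and the groups are equal in size. In a double-blind setup, the mapping from ID to group would additionally be kept hidden from both participants and the people running the study.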

Observational study

  • Participants are not assigned randomly to groups

    • Participants assign themselves, usually based on pre-existing characteristics

  • Many research questions are not conducive to a controlled experiment

  • Establishes association, not causation

Longitudinal vs cross-sectional studies

Longitudinal studies

  • Participants are followed over a period of time to examine the effect of a treatment on the response

  • Effect of age on height is not confounded by generation

  • More expensive and takes longer

Cross-sectional studies

  • Data on participants is collected from a single snapshot in time

  • Effect of age on height is confounded by generation

  • More affordable and takes less time