Correlation and Experimental Design

Correlation

Relationship between two variables

x = explanatory/independent variable

y = response/dependent variable

Correlation coefficient

Quantifies the linear relationship between two variables
Number between -1 and 1
Magnitude corresponds to strength of relationship
Sign (+ or -) corresponds to direction of relationship
The closer the value is to zero, the weaker the linear relationship
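
The properties listed above can be checked by computing the coefficient by hand. The following is a minimal sketch, assuming the Pearson correlation coefficient (the default that pandas' .corr() uses); the helper name pearson_r and the toy data are hypothetical and not taken from msleep.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the means (numerator)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (denominator terms)
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical toy data
hours = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 5]     # tends to increase with hours
falling = [10, 8, 6, 4, 2]   # decreases exactly linearly with hours

r_rising = pearson_r(hours, rising)
r_falling = pearson_r(hours, falling)
print(round(r_rising, 3))   # positive, below 1: moderately strong increasing relationship
print(round(r_falling, 3))  # -1.0: perfect decreasing linear relationship
```

The sign gives the direction and the magnitude gives the strength; swapping the two arguments leaves the result unchanged.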

Visualize relationships

import matplotlib.pyplot as plt
import seaborn as sns
sns.scatterplot(x="sleep_total", y="sleep_rem", data=msleep)
plt.show()

Adding Trendline

Adding a trendline makes it easier to see whether two variables are linearly related

import matplotlib.pyplot as plt
import seaborn as sns
sns.lmplot(x='sleep_total', y='sleep_rem', data=msleep, ci=None)
plt.show()

Calculating correlation

msleep['sleep_total'].corr(msleep['sleep_rem']) # Out: 0.751755

msleep['sleep_rem'].corr(msleep['sleep_total']) # Out: 0.751755

Correlation between x and y == correlation between y and x

Relationship between variables

#1.
# Create a scatterplot of happiness_score vs. life_exp
sns.scatterplot(x='life_exp', y='happiness_score', data=world_happiness)

# Show plot
plt.show()

#2.
# Create scatterplot of happiness_score vs life_exp with trendline
sns.lmplot(x='life_exp', y='happiness_score', data=world_happiness, ci=None)

# Show plot
plt.show()

#3.
# Correlation between life_exp and happiness_score
cor = world_happiness['life_exp'].corr(world_happiness['happiness_score'])

print(cor)

Correlation caveats

Correlations only account for a linear relationship
Always visualize data when possible
Correlation does not imply causation: x being correlated with y doesn't mean x causes y
When data is highly skewed, apply a transformation such as the log (np.log)
Common transformations:
- Log transformation (log(x))
- Square root transformation (sqrt(x))
- Reciprocal transformation (1/x)
- Combination of these, e.g.:
  - log(x) and sqrt(y)
  - sqrt(x) and 1/y
  - 1/x and log(y)

Why use transformation?

Certain statistical methods rely on variables having a linear relationship:
- Correlation coefficient
- Linear regression

What can't correlation measure?

#1
# Scatterplot of gdp_per_cap and life_exp
sns.scatterplot(x='gdp_per_cap', y='life_exp', data=world_happiness)

# Show plot
plt.show()

#2
# Correlation between gdp_per_cap and life_exp
cor = world_happiness['gdp_per_cap'].corr(world_happiness['life_exp'])

print(cor)

Transforming variable

#1
# Scatterplot of happiness_score vs. gdp_per_cap
sns.scatterplot(x='gdp_per_cap', y='happiness_score', data=world_happiness)
plt.show()

# Calculate correlation
cor = world_happiness['happiness_score'].corr(world_happiness['gdp_per_cap'])
print(cor) # Out: 0.727973301222298

#2
# Create log_gdp_per_cap column (requires: import numpy as np)
world_happiness['log_gdp_per_cap'] = np.log(world_happiness['gdp_per_cap'])

# Scatterplot of happiness_score vs. log_gdp_per_cap
sns.scatterplot(x='log_gdp_per_cap', y='happiness_score', data=world_happiness)
plt.show()

# Calculate correlation
cor = world_happiness['log_gdp_per_cap'].corr(world_happiness['happiness_score'])
print(cor) # Out: 0.8043146004918288

Does sugar improve happiness?

#1
# Scatterplot of grams_sugar_per_day and happiness_score
sns.scatterplot(x='grams_sugar_per_day', y='happiness_score', data=world_happiness)
plt.show()

# Correlation between grams_sugar_per_day and happiness_score
cor = world_happiness['grams_sugar_per_day'].corr(world_happiness['happiness_score'])
print(cor)

Design of experiments

Controlled experiments

Participants are assigned by researchers to either a treatment group or a control group, e.g. A/B Testing
- Treatment group sees an advertisement
- Control group doesn't
The groups should be comparable so that causation can be inferred; if they are not, the result may be confounded (biased)

Best practices

Well-designed experiments eliminate as much bias as possible.

The less bias in a study, the more reliable its conclusions

Tools

Randomized controlled trials
- Participants are assigned to treatment/control randomly, not based on any other characteristics
- Choosing randomly helps ensure that groups are comparable
Placebo
- Resembles the treatment, but has no effect
- Participants will not know which group they're in
Double-blind trials
- Person administering the treatment/running the study doesn't know whether the treatment is real or a placebo
- Prevents bias in the response and/or the analysis of results
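
Random assignment, the first tool above, can be sketched in a few lines. The participant IDs and the even split are assumptions for the example; a real trial would follow its own protocol.

```python
import random

random.seed(42)  # fixed seed so the assignment is reproducible

# Hypothetical participant IDs
participants = [f"p{i:02d}" for i in range(1, 21)]

# Shuffle a copy, then split it in half: which group a participant lands in
# depends only on chance, not on any characteristic of the participant
shuffled = participants[:]
random.shuffle(shuffled)
half = len(shuffled) // 2
treatment = sorted(shuffled[:half])
control = sorted(shuffled[half:])

print("treatment:", treatment)
print("control:  ", control)
```

Because the shuffle ignores every participant characteristic, any difference between the groups before the treatment is due to chance alone.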

Observational study

Participants are not assigned randomly to groups
- Participants assign themselves, usually based on pre-existing characteristics
Many research questions are not conducive to a controlled experiment
Can establish association, but not causation

Longitudinal vs cross-sectional studies

Longitudinal studies

Participants are followed over a period of time to examine the effect of the treatment on the response
The effect of age on height is not confounded by generation
More expensive and take longer

Cross-sectional studies

Data on participants is collected from a single snapshot in time
The effect of age on height is confounded by generation
Cheaper and quicker