20|REGRESSION

Overview

Purpose

Correlation

  • Pearson’s \(r\), bivariate correlation coefficient
  • Quantifies the strength of linear relationship between two variables

Correlation example

Participant   Sleep duration   Test score
A             4                5
B             5                8
C             7                8
D             8                10
E             11               9

\(r = \dfrac{SP}{\sqrt{SS_X SS_Y}} = \dfrac{15}{\sqrt{30 \times 14}} = 0.73\)
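The arithmetic behind \(r\) can be checked with a short Python sketch. The helper names (`mean`, `sum_of_products`, `sum_of_squares`) are illustrative choices, not part of the slides; the data are the five participants from the table.

```python
import math

# Sleep duration (X) and test score (Y) for participants A-E.
sleep = [4, 5, 7, 8, 11]
score = [5, 8, 8, 10, 9]

def mean(values):
    return sum(values) / len(values)

def sum_of_products(xs, ys):
    """SP: sum of (X - M_X) * (Y - M_Y)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys))

def sum_of_squares(values):
    """SS: sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

sp = sum_of_products(sleep, score)
ss_x = sum_of_squares(sleep)
ss_y = sum_of_squares(score)
r = sp / math.sqrt(ss_x * ss_y)

print(sp, ss_x, ss_y, round(r, 2))  # 15.0 30.0 14.0 0.73
```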

Regression

  • Regression defines the line of best fit
    • Makes relationship easier to see
    • Shows “central tendency” of the relationship
    • Emphasizes prediction
    • Line of best fit is the line that minimizes prediction error

Equations

Straight line equation

  • \(Y = bX + a\)
    • \(X\) and \(Y\) are variables
    • \(a\) (the intercept) and \(b\) (the slope) are constants

Celsius   Fahrenheit
0         32
10        50
20        68
30        86
40        104
50        122

\(Y = 1.8 X + 32\)
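As a quick illustration, the slope and intercept can be recovered from the table itself. The sketch below takes the slope from the first two rows and then checks that every row falls on the resulting line; the variable names are illustrative.

```python
# Celsius/Fahrenheit pairs from the table.
pairs = [(0, 32), (10, 50), (20, 68), (30, 86), (40, 104), (50, 122)]

# Slope b: change in Y for a one-unit change in X, using the first two rows.
(x0, y0), (x1, y1) = pairs[0], pairs[1]
b = (y1 - y0) / (x1 - x0)

# Intercept a: the value of Y when X = 0.
a = y0 - b * x0

def line(x):
    return b * x + a

# Largest gap between the line and any row; zero up to floating-point rounding.
max_gap = max(abs(line(c) - f) for c, f in pairs)
print(b, a, max_gap)
```

Because temperature conversion is an exact linear relationship, no point deviates from the line. Real data, like the sleep example, will scatter around it.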

Regression

  • Regression line equation
    • \(\hat{Y} = bX + a\)
    • \(\hat{Y}\): value of \(Y\) predicted by the regression equation for each value of \(X\)
    • \((Y - \hat{Y})\): residual (deviation of each data point from the regression line)
    • Regression defines line that minimizes the sum of squared residuals
    • \(SS_{residual} = \Sigma(Y - \hat{Y})^2\)
    • “Least-squared-error solution”

Regression: solving \(b\)

  • Regression line equation: \(\hat{Y} = bX + a\)
    • The slope of the line, \(b\):

\[\begin{align} b &= \dfrac{SP}{SS_X} \\ &= \dfrac{15}{30} \\ &= 0.5 \end{align}\]
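The same slope calculation, sketched in Python from the raw sleep and test score data (variable names are illustrative):

```python
sleep = [4, 5, 7, 8, 11]   # X
score = [5, 8, 8, 10, 9]   # Y

m_x = sum(sleep) / len(sleep)
m_y = sum(score) / len(score)

# SP: sum of products of deviations; SS_X: sum of squared deviations of X.
sp = sum((x - m_x) * (y - m_y) for x, y in zip(sleep, score))
ss_x = sum((x - m_x) ** 2 for x in sleep)

b = sp / ss_x
print(b)  # 0.5
```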

Regression: solving \(a\)

  • The intercept of the line, \(a\)
    • The value of \(Y\) when \(X = 0\)
    • The line goes through \((M_X, M_Y)\) therefore:

\[\begin{align} a &= M_Y - b \times M_X \\ &= 8 - 0.5 \times 7 \\ &= 4.5 \end{align}\]
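A minimal sketch of the intercept calculation, which also confirms that the resulting line passes through \((M_X, M_Y)\). The `predict` helper is an illustrative name, not something defined in the slides.

```python
m_x, m_y = 7, 8   # means of sleep duration and test score
b = 0.5           # slope found above

a = m_y - b * m_x

def predict(x):
    """Predicted test score (Y-hat) for a given sleep duration."""
    return b * x + a

print(a, predict(m_x), predict(0))  # 4.5 8.0 4.5
```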

\(SS_{residual}\)

Sleep   Test score   \(\hat{Y}\)   \(Y - \hat{Y}\)   \((Y - \hat{Y})^2\)
4       5            6.5           -1.5              2.25
5       8            7.0           1.0               1.00
7       8            8.0           0.0               0.00
8       10           8.5           1.5               2.25
11      9            10.0          -1.0              1.00

\[\begin{align} SS_{residual} &= \Sigma(Y - \hat{Y})^2 \\ &= 2.25 + 1 + 0 + 2.25 + 1 \\ &= 6.5 \end{align}\]
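To see what "minimizes the sum of squared residuals" means in practice, the sketch below computes \(SS_{residual}\) for the regression line and for a few nearby lines. The comparison lines are hypothetical, chosen only for illustration.

```python
sleep = [4, 5, 7, 8, 11]
score = [5, 8, 8, 10, 9]

def ss_residual(b, a):
    """Sum of squared residuals for the line Y-hat = b * X + a."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(sleep, score))

# Regression line from the slides: b = 0.5, a = 4.5.
best = ss_residual(0.5, 4.5)

# Hypothetical nearby lines, for comparison only.
others = {
    "steeper slope": ss_residual(0.6, 4.5),
    "flatter slope": ss_residual(0.4, 4.5),
    "higher intercept": ss_residual(0.5, 5.0),
    "lower intercept": ss_residual(0.5, 4.0),
}

print(best)  # 6.5
for label, value in others.items():
    print(label, round(value, 2), value > best)
```

Each alternative line produces a larger sum of squared residuals than the regression line.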

Standard error of the estimate

  • \(s_{error}\)
    • Quantifies precision of regression estimate
    • Standard (roughly average) distance of data points from the regression line
    • Remember… \(s = \sqrt{\dfrac{SS}{df}}\)

\(s_{error} = \sqrt{\dfrac{SS_{residual}}{df}}=\sqrt{\dfrac{6.5}{5-2}} = 1.47\)
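The standard error of the estimate follows the same pattern as any standard deviation. A short sketch, with illustrative variable names:

```python
import math

ss_residual = 6.5   # sum of squared residuals from the worked example
n = 5               # number of participants

# Two values (slope and intercept) are estimated from the data, so df = n - 2.
df_residual = n - 2
s_error = math.sqrt(ss_residual / df_residual)

print(round(s_error, 2))  # 1.47
```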

Hypothesis test

Analysis of regression

  • Partitioning variance (like ANOVA)

\(SS_{Y} = SS_{regression} + SS_{residual}\)

\(df_{Y} = df_{regression} + df_{residual}\)

\(SS_Y = \Sigma(Y - M_Y)^2\)

\(SS_{residual} = \Sigma(Y - \hat{Y})^2\)

\(SS_{regression} = SS_Y - SS_{residual}\)

\(df_Y = n - 1\)

\(df_{residual} = n - 2\)

\(df_{regression} = 1\)
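The partition can be sketched for the sleep data as follows. The direct calculation of \(SS_{regression}\) from the predicted values is an added cross-check, not a formula shown on the slides; it agrees with the subtraction because the line is the least-squares line.

```python
sleep = [4, 5, 7, 8, 11]
score = [5, 8, 8, 10, 9]
n = len(score)

b, a = 0.5, 4.5                 # regression line from the worked example
m_y = sum(score) / n
predicted = [b * x + a for x in sleep]

ss_y = sum((y - m_y) ** 2 for y in score)
ss_residual = sum((y - y_hat) ** 2 for y, y_hat in zip(score, predicted))
ss_regression = ss_y - ss_residual

# Cross-check: variability of the predicted values around the mean of Y.
ss_regression_direct = sum((y_hat - m_y) ** 2 for y_hat in predicted)

df_y = n - 1
df_residual = n - 2
df_regression = 1

print(ss_y, ss_residual, ss_regression, ss_regression_direct)  # 14.0 6.5 7.5 7.5
print(df_y, df_residual, df_regression)                        # 4 3 1
```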

\(MS_{regression}=\dfrac{SS_{regression}}{df_{regression}} \qquad MS_{residual}=\dfrac{SS_{residual}}{df_{residual}}\)

\(F = \dfrac{MS_{regression}}{MS_{residual}}\)

Step 1. Hypotheses

  • \(H_0\): the slope of the regression line \(\beta = 0\)
    • i.e., there is no association between variables
    • Knowing \(X\) does not help to predict \(Y\)
  • \(H_1\): \(\beta \ne 0\)

Step 2. Critical region

  • Numerator: \(df_{regression} = 1\)
  • Denominator: \(df_{residual} = n-2\)

Critical values of \(F\) at \(\alpha = .05\) (columns: \(df_{numerator}\); rows: \(df_{denominator}\))

\(df_{denominator}\)   1        2        3        4        5        6        7        8        9        10
1                      161.45   199.50   215.71   224.58   230.16   233.99   236.77   238.88   240.54   241.88
2                      18.51    19.00    19.16    19.25    19.30    19.33    19.35    19.37    19.39    19.40
3                      10.13    9.55     9.28     9.12     9.01     8.94     8.89     8.85     8.81     8.79
4                      7.71     6.94     6.59     6.39     6.26     6.16     6.09     6.04     6.00     5.96
5                      6.61     5.79     5.41     5.19     5.05     4.95     4.88     4.82     4.77     4.74
6                      5.99     5.14     4.76     4.53     4.39     4.28     4.21     4.15     4.10     4.06
7                      5.59     4.74     4.35     4.12     3.97     3.87     3.79     3.73     3.68     3.64
8                      5.32     4.46     4.07     3.84     3.69     3.58     3.50     3.44     3.39     3.35
9                      5.12     4.26     3.86     3.63     3.48     3.37     3.29     3.23     3.18     3.14
10                     4.96     4.10     3.71     3.48     3.33     3.22     3.13     3.07     3.02     2.98
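Because \(df_{regression}\) is always 1 here, only the first column of the table is needed. A minimal lookup sketch, with the values copied from that column (the `critical_f` helper is a hypothetical name):

```python
# Critical F values at alpha = .05 for df_numerator = 1,
# keyed by df_denominator (first column of the table).
F_CRITICAL_DF1 = {
    1: 161.45, 2: 18.51, 3: 10.13, 4: 7.71, 5: 6.61,
    6: 5.99, 7: 5.59, 8: 5.32, 9: 5.12, 10: 4.96,
}

def critical_f(n):
    """Critical F for a regression with one predictor and n (X, Y) pairs."""
    df_residual = n - 2
    return F_CRITICAL_DF1[df_residual]

print(critical_f(5))  # 10.13
```

With five participants, the test uses \(df = (1, 3)\), so the observed \(F\) must exceed 10.13 to fall in the critical region.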

Step 3. Calculate

Participant   Amount of sleep (\(X\))   Test score (\(Y\))
A             4                         5
B             5                         8
C             7                         8
D             8                         10
E             11                        9

\(SS_{Y} = 14\)

\(SS_{residual} = 6.5\)

\(SS_{regression} = 7.5\)

\(df_Y = 4\)

\(df_{residual} = 3\)

\(df_{regression} = 1\)

\(MS_{regression} = \dfrac{SS_{regression}}{df_{regression}} = \dfrac{7.5}{1} = 7.5\)

\(MS_{residual} = \dfrac{SS_{residual}}{df_{residual}} = \dfrac{6.5}{3} = 2.17\)

\(F = \dfrac{MS_{regression}}{MS_{residual}} = \dfrac{7.5}{2.17} = 3.46\)
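The mean squares and the \(F\) ratio can be reproduced in a few lines. This is a sketch using the values from the worked example; the variable names are illustrative.

```python
ss_regression, df_regression = 7.5, 1
ss_residual, df_residual = 6.5, 3

# Each mean square is a sum of squares divided by its degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

f_ratio = ms_regression / ms_residual

print(round(ms_regression, 2), round(ms_residual, 2), round(f_ratio, 2))  # 7.5 2.17 3.46
```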

Step 4. Make decision

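  • \(F > F_{critical}\)?
    • If yes, reject \(H_0\) (no relationship in the population); otherwise, fail to reject \(H_0\)
    • Here \(F = 3.46 < F_{critical}(1, 3) = 10.13\), so we fail to reject \(H_0\)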
  • Step 4b: Effect size
    • \(r^2\): Coefficient of determination
    • Proportion of variance explained by the regression

\(r^2 = \dfrac{SS_{regression}}{SS_Y} = \dfrac{7.5}{14} = 0.54\)
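As a consistency check, the coefficient of determination computed from the sums of squares should match the square of Pearson's \(r\) from the correlation example. A short sketch, with illustrative variable names:

```python
sp, ss_x, ss_y = 15, 30, 14   # from the correlation example
ss_regression = 7.5           # from the analysis of regression

# Proportion of variance in Y explained by the regression.
r_squared = ss_regression / ss_y

# Pearson's r, squared, for comparison.
r = sp / (ss_x * ss_y) ** 0.5

print(round(r_squared, 2), round(r ** 2, 2))  # 0.54 0.54
```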

Step 5. Report

Longer sleep duration was associated with an increase in test performance, \(b = 0.5\). However, the association was nonsignificant, \(F(1, 3) = 3.46\), \(p > .05\).

Learning checks

Which concept is each regression term most closely related to?

Regression terms:

  • Residuals
  • Sum of squared residuals
  • \(s_{error}\)
  • \(MS_{regression}\)

Related concepts:

  • \(SS\)
  • \(SP\)
  • \(s^2\)
  • \(s\)
  • \((X-M)\)
