Linear Correlation. Suppose that an experiment is conducted, and the resulting observations are recorded in two data vectors
x = \begin{pmatrix}x_1\\x_2\\\vdots\\x_n\end{pmatrix}, y = \begin{pmatrix}y_1\\y_2\\\vdots\\y_n\end{pmatrix}, and let e = \begin{pmatrix}1\\1\\\vdots\\1\end{pmatrix}.
Problem: Determine to what extent the y_i ’s are linearly related to the x_i ’s. That is, measure how close y is to being a linear combination β_0e + β_1x.
The cosine as defined in (5.4.1) does the job.
cos θ = \frac{x^T y}{||x|| \, ||y||}. (5.4.1)
To understand how, let μ_x and σ_x be the mean and standard deviation of the data in x. That is,
μ_x = \frac{\sum_i x_i}{n} = \frac{e^T x}{n} and σ_x = \sqrt{\frac{\sum_i (x_i − μ_x)^2}{n}} = \frac{||x − μ_x e||_2}{\sqrt{n}}.
The mean is a measure of central tendency, and the standard deviation measures the extent to which the data is spread. Frequently, raw data from different sources is difficult to compare because the units of measure are different—e.g., one researcher may use the metric system while another uses American units. To compensate, data is almost always first “standardized” into unitless quantities. The standardization of a vector x for which σ_x ≠ 0 is defined to be
z_x = \frac{x − μ_x e}{σ_x}.
Entries in z_x are often referred to as standard scores or z-scores. Every standardized vector z_x has the properties ||z_x|| = \sqrt{n}, μ_{z_x} = 0, and σ_{z_x} = 1. Furthermore, it is not difficult to verify that for vectors x and y with σ_x ≠ 0 and σ_y ≠ 0,
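These three properties can be checked numerically. The following is a minimal sketch in Python with NumPy; the data vector is hypothetical, and note that NumPy's default standard deviation divides by n, matching the definition of σ_x above.

```python
import numpy as np

# Hypothetical data vector (any vector with sigma_x != 0 works).
x = np.array([2.0, 4.0, 6.0, 8.0])
n = len(x)

mu_x = x.mean()            # mean: (sum_i x_i) / n
sigma_x = x.std()          # population std: ||x - mu_x e||_2 / sqrt(n)

z_x = (x - mu_x) / sigma_x # the standardization of x

# The three properties of every standardized vector:
print(np.linalg.norm(z_x)) # equals sqrt(n)
print(z_x.mean())          # equals 0
print(z_x.std())           # equals 1
```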
z_x = z_y \Longleftrightarrow ∃ constants β_0, β_1 such that y = β_0e + β_1x, where β_1 > 0,
z_x = −z_y \Longleftrightarrow ∃ constants β_0, β_1 such that y = β_0e + β_1x, where β_1 < 0.
• In other words, y = β_0e+β_1x for some β_0 and β_1 if and only if z_x = ±z_y, in which case we say y is perfectly linearly correlated with x.
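The equivalence for β_1 < 0 can be illustrated with a small numerical example (a Python/NumPy sketch; the particular constants β_0 = 10 and β_1 = −4 are arbitrary):

```python
import numpy as np

def standardize(v):
    # Population std (divide by n), matching sigma_x as defined above.
    return (v - v.mean()) / v.std()

x = np.array([1.0, 3.0, 5.0, 7.0])
y = 10.0 - 4.0 * x   # y = beta_0 e + beta_1 x with beta_1 = -4 < 0

# Since beta_1 < 0, the standardizations satisfy z_y = -z_x.
print(np.allclose(standardize(y), -standardize(x)))
```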
Since z_x varies continuously with x, the existence of a “near” linear relationship between x and y is equivalent to z_x being “close” to ±z_y in some sense. The fact that ||z_x|| = ||±z_y|| = \sqrt{n} means z_x and ±z_y differ only in orientation, so a natural measure of how close z_x is to ±z_y is cos θ, where θ is the angle between z_x and z_y. The number
ρ_{xy} = cos θ = \frac{z_x^T z_y}{||z_x|| ||z_y||} = \frac{z_x^T z_y}{n} = \frac{(x − μ_xe)^T (y − μ_ye)}{||x − μ_xe|| ||y − μ_ye||}
is called the coefficient of linear correlation, and the following facts are now immediate.
• ρ_{xy} = 0 if and only if z_x and z_y are orthogonal, in which case we say that x and y are completely uncorrelated.
• |ρ_{xy}| = 1 if and only if y is perfectly correlated with x. That is, |ρ_{xy}| = 1 if and only if there exists a linear relationship y = β_0e + β_1x with β_1 ≠ 0.
\vartriangleright When β_1 > 0, we say that y is positively correlated with x.
\vartriangleright When β_1 < 0, we say that y is negatively correlated with x.
• |ρ_{xy}| measures the degree to which y is linearly related to x. In other words, |ρ_{xy}| ≈ 1 if and only if y ≈ β_0e + β_1x for some β_0 and β_1.
\vartriangleright Positive correlation is measured by the degree to which ρ_{xy} ≈ 1.
\vartriangleright Negative correlation is measured by the degree to which ρ_{xy} ≈ −1.
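The facts above can be verified directly from the last expression for ρ_{xy}, which needs only the centered vectors x − μ_xe and y − μ_ye. A minimal Python/NumPy sketch with hypothetical data:

```python
import numpy as np

def rho(x, y):
    # Coefficient of linear correlation: the cosine of the angle
    # between the centered vectors x - mu_x e and y - mu_y e.
    xc = x - x.mean()
    yc = y - y.mean()
    return (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

x = np.array([1.0, 2.0, 3.0, 4.0])

y_pos = 3.0 + 2.0 * x   # y = beta_0 e + beta_1 x with beta_1 > 0
y_neg = 3.0 - 2.0 * x   # same form with beta_1 < 0

print(rho(x, y_pos))    # 1: perfect positive correlation
print(rho(x, y_neg))    # -1: perfect negative correlation
```

The same value is returned by NumPy's built-in `np.corrcoef(x, y)[0, 1]`, which computes the identical quantity.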
If the data in x and y are plotted in ℝ^2 as points (x_i, y_i), then, as depicted in Figure 5.4.1, ρ_{xy} ≈ 1 means that the points lie near a straight line with positive slope, ρ_{xy} ≈ −1 means that the points lie near a line with negative slope, and ρ_{xy} ≈ 0 means that the points do not lie near any straight line.
If |ρ_{xy}| ≈ 1, then the theory of least squares as presented in §4.6 can be used to determine a “best-fitting” straight line.
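Such a best-fitting line can be computed as the least squares solution of the overdetermined system [e | x]β ≈ y. The sketch below uses NumPy's least squares routine on made-up, roughly linear data:

```python
import numpy as np

# Hypothetical data that is approximately linear (rho_xy near 1).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Build A = [e | x] and solve the least squares problem A beta ~ y,
# i.e., minimize ||y - (beta_0 e + beta_1 x)||_2.
A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
beta0, beta1 = beta

print(beta0, beta1)   # intercept and slope of the best-fitting line
```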