24 THE MATHEMATICAL TREATMENT OF DATA 



This problem has been solved by using the square root of the relationship 

 index with the proper sign. That is, 



r = \pm\sqrt{R},



where r is what is technically defined as the correlation coefficient. It 

 can be shown algebraically that this definition of r is entirely equivalent 

 to the one used in standard texts: 



r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}}



Here the numerator is to be understood as showing the extent to which x 

 and y vary together. For a large number of pairs of values of x and y, if 

 they were unrelated, then for each particular positive value of (y − ȳ), there

 would be both a positive and a negative value of (x − x̄), so that in the

 sum these terms would cancel out. If there were actually a correlation, 

 then when the x difference was positive the y difference would also tend 

 to be positive and there would be no cancellation. The purpose of the 

 denominator is to make the coefficient dimensionless and to give it the 

 values zero and unity for, respectively, unrelated and perfectly related 

 variables. To prove the equality of the two definitions, use must be 

 made of the least-squares line relating y and x. 
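The equivalence described above can be checked numerically. The sketch below uses made-up sample data (the values and variable names are illustrative, not from the text): it computes r from the standard sum formula, then computes the relationship index R as the fraction of the variation in y accounted for by the least-squares line, and confirms that r equals the square root of R taken with the sign of the slope.

```python
import math

# Hypothetical sample data: y increases roughly linearly with x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Standard-text definition: sum of cross-products over the
# square root of the product of the two sums of squares.
num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
den = math.sqrt(sum((x - x_bar) ** 2 for x in xs)
                * sum((y - y_bar) ** 2 for y in ys))
r = num / den

# Least-squares line y = a + b*x, then R = the fraction of the
# variation in y accounted for by the line.
b = num / sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar
ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - y_bar) ** 2 for y in ys)
R = 1 - ss_res / ss_tot

# r should equal sqrt(R), carrying the sign of the slope.
r_from_R = math.copysign(math.sqrt(R), b)
print(round(r, 6), round(r_from_R, 6))
```

For these data the two values agree to machine precision, and a negative slope would simply flip the sign, as the ± in the definition requires.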



By using our definition based on clustering, those of us with mathematical facility may use any type of theoretical expression for y and

 can see how to compute the correlation even for curved lines. The rest 

 of us have to take the expressions for r as given. 



A point which is very much understressed in many standard treatments is that the correlation index, the square of the correlation coefficient, is the direct measure of the extent to which variables are correlated. Thus, a correlation coefficient of 0.6 looks respectably large, but

 we note that its square is 0.36. This tells us that only about one-third of the

 variation in y is correlated with variations in x, the remainder being 

 associated with factors which do not affect x. From this point of view 

 we may say that one-third of the factors determining x and y are shared, 

 the others being different for the two variables. 
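The arithmetic behind the 0.6 example above is a one-liner; this small sketch just makes the split between shared and unshared variation explicit:

```python
r = 0.6                        # a respectable-looking correlation coefficient
R = r * r                      # the correlation index: square of r
shared = round(R, 2)           # fraction of y's variation correlated with x
unshared = round(1 - R, 2)     # fraction associated with factors not affecting x
print(shared, unshared)        # 0.36 0.64
```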



