Karl Pearson 
$m_p$ is nevertheless subject to the probable error of the random sampling. I have
shown that the probable error of $m_p$, if the regression be linear, is*
$$.67449\,\frac{\sigma_x\sqrt{1-r_{xy}^2}}{\sqrt{N}}\left(1+\frac{(y_p-\bar{y})^2}{\sigma_y^2}\right)^{1/2}.$$
But this of course is not the probable error of $\bar{m}_p$, but of $m_p$, as found from the
regression line, and will generally be small as compared with the probable error
of $\bar{m}_p$ found from a single array†.
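
The comparison just drawn can be illustrated numerically. The following Python sketch is not taken from the paper. It assumes the usual large-sample standard error of a value read off a fitted straight line, namely the residual standard deviation $\sigma_x\sqrt{1-r^2}$ divided by $\sqrt{N}$ and multiplied by $\sqrt{1+(y_p-\bar{y})^2/\sigma_y^2}$, with the factor .67449 converting a standard deviation into a probable error. The single-array formula further assumes a homoscedastic surface with linear regression. The function names and all numerical values are hypothetical.

```python
import math

# Ratio of probable error to standard deviation for a normal law.
PROBABLE_ERROR_FACTOR = 0.67449


def pe_regression_mean(sigma_x, r, N, y_p, y_bar, sigma_y):
    """Probable error of the array mean read off a fitted linear regression line.

    Assumes the standard large-sample result for a fitted value on a straight
    line; sigma_y is the standard deviation of the y-variate.
    """
    residual_sd = sigma_x * math.sqrt(1.0 - r * r)
    # Grows as the array lies further from the mean of the y's.
    distance_term = 1.0 + ((y_p - y_bar) / sigma_y) ** 2
    return PROBABLE_ERROR_FACTOR * residual_sd / math.sqrt(N) * math.sqrt(distance_term)


def pe_single_array_mean(sigma_x, r, n_p):
    """Probable error of the mean of one array holding n_p observations.

    Assumes homoscedastic arrays and linear regression, so that the array
    standard deviation is sigma_x * sqrt(1 - r^2).
    """
    residual_sd = sigma_x * math.sqrt(1.0 - r * r)
    return PROBABLE_ERROR_FACTOR * residual_sd / math.sqrt(n_p)


# Hypothetical figures: 1000 observations in all, 50 of them in the array
# considered, which lies one standard deviation from the mean of the y's.
pe_line = pe_regression_mean(sigma_x=2.0, r=0.6, N=1000, y_p=1.0, y_bar=0.0, sigma_y=1.0)
pe_array = pe_single_array_mean(sigma_x=2.0, r=0.6, n_p=50)
print(pe_line, pe_array, pe_line / pe_array)
```

With these illustrative figures the probable error of the mean taken from the line is roughly a third of that of the mean of the single array, and the ratio falls further as the array frequency becomes a smaller fraction of the whole.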
Just as we have found $m_p$, however, from the whole system of observations,
so it appears to me we ought to determine $\sigma_{n_p}$ from the whole system and
not from a single array. The centre of $\sigma_{n_p}$ is $m_p$ and not $\bar{m}_p$, and to calculate a
value for $\sigma_{n_p}$ with $\bar{m}_p$ as centre seems to be an erroneous step, especially when,
owing possibly to $\bar{n}_p$ differing considerably from $n_p$, $\bar{m}_p$ is much displaced from
$m_p$. I hold that for satisfactory results it is just as needful to find $n_p$ and $\sigma_{n_p}$
from the whole range of data, not from the individual array, as it is to find $m_p$.
This means that we must have some knowledge of the form of the frequency 
surface, and until we have this we cannot apply the test for goodness of fit to 
the regression line. There are accordingly two separate factors remaining for 
solution after we have determined the regression line : 
(i) We need to determine $n_p$. This is the total frequency of the $p$th array
of $x$'s. It can clearly be determined, when we know the frequency of the $y$'s
in this group. That is to say, we must determine a theoretical distribution for
the marginal distribution of the $y$-variate. In some cases it will be sufficient
to assume it Gaussian and then the table of the probability integral will suffice.
In other cases it will be advisable to determine a skew frequency curve. But 
as a rule it will certainly be needful to graduate the array frequencies by some 
process, and not to assume them given by the observed marginal frequencies‡.
The bringing of the determination of $n_p$ into line with that of $m_p$ does not seem
therefore to present great difficulties. 
(ii) We need to determine $\sigma_{n_p}$. If the frequency surface be homoscedastic,
then $\sigma_{n_p}^2 = \sigma_x^2(1-\eta_{x.y}^2)$, if $\eta_{x.y}$ be the correlation ratio of $x$ on $y$, and the
regression be skew. But if the regression be both homoscedastic and linear, then
$\sigma_{n_p}^2 = \sigma_x^2(1-r_{xy}^2)$, where $r_{xy}$ is the correlation coefficient. In these two cases
we may write respectively
K'- V x.yl 
* Biometrika, Vol. ix. p. 10, with the necessary changes in notation to fit the notation of this paper.
† Extreme arrays here again form an exception.
‡ A precisely similar difficulty arises in working the ordinary expression for mean square contingency,
i.e. $1+\phi^2 = S\{n_{pq}^2/(n_{p.}\,n_{.q})\}$, where $n_{p.}$ and $n_{.q}$ are the marginal frequencies (reduced of course in
the proportion of size of sample to population) in the sampled population, not in the sample, although 
we ultimately use their sample values. There is more justification, however, in this use, for contingency 
is usually applied to broad categories and in such cases we have, perhaps, 3 to 7 marginal groups only; 
there is thus relatively less fear of big irregularities in $n_{p.}$ or $n_{.q}$ such as arise with the small arrays of
regression lines. 
