250 Miscellanea 
VIII. On the Probability that two Independent Distributions of 
Frequency are really Samples from the same Population. 
By KARL PEARSON, F.R.S. 
(1) In a memoir contributed to the Phil. Mag. 1900 (Vol. 50, p. 157) I have dealt with the 
problem of the probability that a given distribution of frequency was a sample from a known 
population. That investigation was the basis of my treatment of the "goodness of fit" of theory
and frequency samples. The present problem is of a somewhat different kind, but is essentially
as important in character. We have two samples, and a priori they may be of the
same population or of different populations; we desire to find out what is the probability that
they are random samples of the same population. This population is one, however, of which we 
have no a priori experience. It is quite easy to state innumerable problems in which such 
knowledge is desirable. We have two records of the number of rooms in houses where (i) a case 
of cancer has occurred, (ii) a case of tuberculosis has occurred; the number of cases of each
disease may be quite different, and we may not be acquainted with the frequency distribution
of the number of rooms in the given district. What is the chance that there is a significant
difference in the tuberculosis and the cancer houses? Or again, we have a frequency distribution
of the interval in days between bite and onset of rabies in two populations of bitten persons
(i) who have been and (ii) who have not been inoculated in the interval. What is the probability
that the inoculation has modified the interval? Many other illustrations will occur
to those who are dealing with statistics, but the above will suffice to indicate the nature of 
the problems I have in view. 
(2) Let the population from which the two samples, if undifferentiated, are supposed to be 
drawn be given by the class-frequencies
$$\mu_1,\ \mu_2,\ \mu_3,\ \ldots,\ \mu_p,\ \ldots,\ \mu_q,\ \ldots,\ \mu_s,$$
the total population being M. 
Let the samples be given by the frequencies in the same classes : 
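$$
\begin{array}{lcccccccccc}
 & & & & & & & & & & \text{Total} \\
\text{1st Sample} & f_1 & f_2 & f_3 & \ldots & f_p & \ldots & f_q & \ldots & f_s & N \\
\text{2nd Sample} & f_1' & f_2' & f_3' & \ldots & f_p' & \ldots & f_q' & \ldots & f_s' & N'
\end{array}
$$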
where the totals $N$ and $N'$ differ as widely or as little as we please. Let $\Sigma_1, \Sigma_2, \Sigma_3, \ldots, \Sigma_p, \ldots, \Sigma_q, \ldots, \Sigma_s$
be the standard deviations of the frequencies of the first sample, $\Sigma_1', \Sigma_2', \Sigma_3', \ldots, \Sigma_p', \ldots, \Sigma_q', \ldots, \Sigma_s'$ be the
standard deviations of the frequencies of the second sample, and $R_{pq}$, $R_{pq}'$ be the correlations
of the pth and qth frequencies of the two samples. Now the two samples are supposed to be 
absolutely independent. Hence there will be no correlation between any deviation in any 
frequency of the first row and the deviation of any frequency in the second row. Further the 
two frequency distributions being by hypothesis random samples of the population M, we have : 
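$$\Sigma_p^2 = N\,\frac{\mu_p}{M}\left(1-\frac{\mu_p}{M}\right), \qquad \Sigma_p \Sigma_q R_{pq} = -N\,\frac{\mu_p \mu_q}{M^2},$$
$${\Sigma_p'}^2 = N'\,\frac{\mu_p}{M}\left(1-\frac{\mu_p}{M}\right), \qquad \Sigma_p' \Sigma_q' R_{pq}' = -N'\,\frac{\mu_p \mu_q}{M^2}.$$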
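As an illustration only, the usual random-sampling results for the class frequencies (variance N(μ_p/M)(1 − μ_p/M) for class p, and product-moment −N μ_p μ_q / M² for classes p and q) may be tabulated numerically. The sketch below assumes sampling with replacement, or a population very large compared with the sample; the function name `sampling_moments` and the population figures are hypothetical and are not taken from the memoir.

```python
from fractions import Fraction


def sampling_moments(mu, N):
    """Variance/covariance table of the class frequencies of a random sample.

    mu : class-frequencies of the parent population (hypothetical values)
    N  : size of the sample
    Assumes sampling with replacement (multinomial sampling).
    """
    M = sum(mu)  # total population
    s = len(mu)  # number of classes
    table = [[Fraction(0)] * s for _ in range(s)]
    for p in range(s):
        for q in range(s):
            if p == q:
                # Sigma_p^2 = N (mu_p / M) (1 - mu_p / M)
                table[p][q] = Fraction(N * mu[p], M) * (1 - Fraction(mu[p], M))
            else:
                # Sigma_p Sigma_q R_pq = -N mu_p mu_q / M^2
                table[p][q] = -Fraction(N * mu[p] * mu[q], M * M)
    return table


# Hypothetical population of three classes and a sample of 200.
mu = [50, 30, 20]
moments = sampling_moments(mu, N=200)
for row in moments:
    print([str(x) for x in row])
```

Each row of the table sums to zero, as it must when the sample total N is fixed: an excess in one class is necessarily balanced by defects in the others.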
Now consider the system of variables obtained by reducing the frequencies of the two samples 
N, N' to a common standard total n (e.g. to per milles or per cents.), and subtracting the
differences of each class-frequency. 
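The reduction just described may be sketched as follows. This is a minimal illustration, assuming the variable for each class is taken as the reduced frequency of the first sample less that of the second; the sign convention, the function name `reduced_differences`, and the sample figures are assumptions made here for the example and are not data from the memoir.

```python
def reduced_differences(first, second, n=1000):
    """Reduce two samples to a common total n and difference them class by class.

    first, second : class frequencies of the two samples (same classes)
    n             : common standard total (1000 gives per milles, 100 per cents)
    Returns n*f_p/N - n*f_p'/N' for each class p (first minus second, by assumption).
    """
    if len(first) != len(second):
        raise ValueError("both samples must use the same classes")
    N1, N2 = sum(first), sum(second)  # the sample totals N and N'
    return [n * a / N1 - n * b / N2 for a, b in zip(first, second)]


# Hypothetical samples of unequal size in four classes.
first = [10, 20, 30, 40]    # N  = 100
second = [30, 50, 60, 60]   # N' = 200
d = reduced_differences(first, second)
print(d)
```

Since both samples are brought to the same total n, the differences sum to zero over the classes, so that only s − 1 of them are free to vary.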
