STATISTICS IN CLASSIFYING RACES OF SHAD 






Table 6. — Frequency distributions of the discriminant function



vertebrae, has been tabulated for the Hudson River samples of 1939-40 and the Connecticut River sample of 1945 in table 6. The means of these distributions are: Hudson River (1939), 149.962 (n=104); Hudson River (1940), 150.362 (n=105); and Connecticut River (1945), 146.363 (n=91). The variances are 5.816, 6.465, and 6.400, respectively. The pooled average for the Hudson River is 150.163. (A t-test shows that there is a highly significant difference between the two rivers.) If one were to use such a function for discrimination, he would classify everything above 148.2, approximately midway between the two river means, as coming from the Hudson River and everything below as coming from the Connecticut River. However, since the counts are discrete, it would be necessary to use either 148 or 149 as the dividing line. Table 7 gives the percentage of wrong classifications for these two values. This simple function, of the type $Z = \sum X_i$, provides a method of classifying about 78 percent of the individuals correctly. Without the use of this function, it would appear impossible to distinguish Connecticut River shad from Hudson River shad.
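The figures quoted in this paragraph can be rechecked from the summary values alone. The following is a minimal sketch, not the author's computation: the function names are hypothetical, the t statistic assumes an equal-variance two-sample test between the Hudson River (1939) and Connecticut River (1945) samples (the text does not say which form of test was used), and the rule that totals strictly above the dividing line go to the Hudson River is likewise an assumption.

```python
import math

# Summary statistics quoted in the text: (mean, variance, n)
hudson_1939 = (149.962, 5.816, 104)
hudson_1940 = (150.362, 6.465, 105)
connecticut_1945 = (146.363, 6.400, 91)


def pooled_mean(*samples):
    """Sample-size-weighted mean of samples given as (mean, variance, n)."""
    total_n = sum(n for _, _, n in samples)
    return sum(mean * n for mean, _, n in samples) / total_n


def pooled_t(sample_a, sample_b):
    """Equal-variance two-sample t statistic (an assumed form of the test)."""
    mean_a, var_a, n_a = sample_a
    mean_b, var_b, n_b = sample_b
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean_a - mean_b) / std_err


def classify(z, cutoff):
    """Assign a fish to a river from its total count Z.

    Assumption: totals strictly above the cutoff go to the Hudson River.
    """
    return "Hudson" if z > cutoff else "Connecticut"


hudson_mean = pooled_mean(hudson_1939, hudson_1940)
midpoint = (hudson_mean + connecticut_1945[0]) / 2.0
t_stat = pooled_t(hudson_1939, connecticut_1945)

print(f"pooled Hudson mean: {hudson_mean:.3f}")
print(f"midpoint of river means: {midpoint:.3f}")
print(f"t statistic (1939 vs. 1945): {t_stat:.2f}")
print(classify(150, 148), classify(147, 148))
```

Running the sketch with a cutoff of 149 instead of 148 shows how the choice of dividing line shifts borderline fish from one river to the other.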



If such good results were obtained by totaling the number of scutes, vertebrae, and rays for each specimen, perhaps some other combination might be more efficient. Considering only linear forms of the type $F = \sum a_i X_i$, that function which is best for discriminating between the two populations can be determined. It can be shown (Rao 1952) that the best linear discriminant function for two multivariate normal populations is:



$$D = l_1 X_1 + l_2 X_2 + l_3 X_3 + l_4 X_4 + l_5 X_5 + l_6 X_6$$



where the $l_i$'s are obtained by solving the following set of equations:



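$$
\begin{aligned}
l_1 w_{11} + l_2 w_{12} + l_3 w_{13} + l_4 w_{14} + l_5 w_{15} + l_6 w_{16} &= d_1 \\
l_1 w_{21} + l_2 w_{22} + \cdots &= d_2 \\
&\;\;\vdots \\
l_1 w_{61} + l_2 w_{62} + \cdots + l_6 w_{66} &= d_6
\end{aligned}
$$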



$w_{ij}$ is an estimate of the covariance (assumed to be equal in the two populations) between the ith and jth characters, and $d_i$ is the estimated difference in mean values of the ith character in the two populations. The $w_{ij}$ are estimated from the following equations:



$$
(N_1 + N_2 - 2)\, w_{ij} = \sum_{k=1}^{N_1} (X_{i1k} - \bar{X}_{i1})(X_{j1k} - \bar{X}_{j1}) + \sum_{k=1}^{N_2} (X_{i2k} - \bar{X}_{i2})(X_{j2k} - \bar{X}_{j2}).
$$



$N_1$ and $N_2$ are the number of specimens in the first and second sample, respectively, and $X_{i1k}$ is the count on the ith character for the kth fish from population 1. $\bar{X}_{i1}$ is the mean value of the ith character for population 1.
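The pooled covariance estimate defined above can be illustrated with a short sketch. It assumes each sample is a list of fish and each fish a list of counts; the function names and the two tiny samples are hypothetical and are not data from the paper.

```python
def column_means(sample):
    """Mean of each character over the fish in one sample."""
    n = len(sample)
    return [sum(fish[i] for fish in sample) / n for i in range(len(sample[0]))]


def cross_products(sample):
    """Sums of products of deviations from the sample means."""
    means = column_means(sample)
    p = len(means)
    return [
        [sum((fish[i] - means[i]) * (fish[j] - means[j]) for fish in sample)
         for j in range(p)]
        for i in range(p)
    ]


def pooled_covariance(sample_1, sample_2):
    """Pooled estimate w_ij on N1 + N2 - 2 degrees of freedom."""
    s1 = cross_products(sample_1)
    s2 = cross_products(sample_2)
    dof = len(sample_1) + len(sample_2) - 2
    p = len(s1)
    return [[(s1[i][j] + s2[i][j]) / dof for j in range(p)] for i in range(p)]


# Hypothetical counts on two characters (illustration only).
sample_1 = [[1, 2], [3, 4], [5, 9]]
sample_2 = [[2, 1], [4, 3]]
w = pooled_covariance(sample_1, sample_2)
print(w)
```

Each sample is centered on its own mean before the cross products are added, so a difference in means between the two populations does not inflate the covariance estimate.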



Using data from the Hudson River sample of 1939 and the Connecticut River sample of 1945,² the following set of equations is obtained:



$$
\begin{aligned}
0.38197\,l_1 + 0.03742\,l_2 + 0.06242\,l_3 - 0.01515\,l_4 + 0.02467\,l_5 + 0.15184\,l_6 &= 0.41484 \\
0.03742\,l_1 + 0.71332\,l_2 - 0.02032\,l_3 - 0.01196\,l_4 + 0.02301\,l_5 + 0.18071\,l_6 &= 0.46016 \\
0.06242\,l_1 - 0.02032\,l_2 + 0.65354\,l_3 + 0.21084\,l_4 + 0.03309\,l_5 + 0.09918\,l_6 &= 0.71291 \\
-0.01515\,l_1 - 0.01196\,l_2 + 0.21084\,l_3 + 0.88481\,l_4 + 0.00717\,l_5 + 0.13073\,l_6 &= 0.38462 \\
0.02467\,l_1 + 0.02301\,l_2 + 0.03309\,l_3 + 0.00717\,l_4 + 0.58499\,l_5 - 0.01020\,l_6 &= 1.07555 \\
0.15184\,l_1 + 0.18071\,l_2 + 0.09918\,l_3 + 0.13073\,l_4 - 0.01020\,l_5 + 1.05154\,l_6 &= 0.55082
\end{aligned}
$$
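A system of this size is readily solved by elimination. The sketch below is one way to do it and is not presented as the author's procedure: the `solve` helper is a hypothetical name for ordinary Gaussian elimination with partial pivoting. The coefficients are transcribed from the equations above; where digits are garbled in the scan, they are read so that the covariance matrix is symmetric, which is an assumption.

```python
# Covariance estimates w_ij (rows) and mean differences d_i, as transcribed.
W = [
    [0.38197, 0.03742, 0.06242, -0.01515, 0.02467, 0.15184],
    [0.03742, 0.71332, -0.02032, -0.01196, 0.02301, 0.18071],
    [0.06242, -0.02032, 0.65354, 0.21084, 0.03309, 0.09918],
    [-0.01515, -0.01196, 0.21084, 0.88481, 0.00717, 0.13073],
    [0.02467, 0.02301, 0.03309, 0.00717, 0.58499, -0.01020],
    [0.15184, 0.18071, 0.09918, 0.13073, -0.01020, 1.05154],
]
d = [0.41484, 0.46016, 0.71291, 0.38462, 1.07555, 0.55082]


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are left unchanged.
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Bring the largest remaining entry in this column to the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


coeffs = solve(W, d)
residuals = [
    sum(W[i][j] * coeffs[j] for j in range(6)) - d[i] for i in range(6)
]
for i, value in enumerate(coeffs, start=1):
    print(f"l_{i} = {value:.5f}")
print("largest residual:", max(abs(r) for r in residuals))
```

Substituting the solution back into the equations, as the residual check does, guards against transcription slips in the coefficients.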



Table 7. — Percentage of wrong classifications using the function $Z = X_1 + X_2 + X_3 + X_4 + X_5 + X_6$



² Only fish with complete meristic data were used; first 91 fish in table 13 (appendix).



