The classification criterion used by the pattern recognition network is evaluated by comparing the resulting error curve with those obtained from likelihood ratio classification functions that result when different approximations to the joint distribution are used. It is also compared with some well-known intuitive classification procedures based on linear functions of the variables $x_i$. To simplify the exposition, only the case of two groups is treated here.
Let $p(x/i)$, $i = 1, 2$, denote respectively the probability distribution for $x$ under group 1 and group 2. Then the likelihood ratio regions for classification are defined in terms of

$$L(x) = \frac{p(x/1)}{p(x/2)}$$

by

$$R_1:\; L(x) \ge t, \qquad R_2:\; L(x) < t.$$
Thus if $L(x) \ge t$, the pattern is classified as belonging to group 1; otherwise the pattern is classified into group 2. The error curve can be obtained by computing the probabilities of misclassification for different choices of the threshold $t$. For a given threshold, the probability of incorrectly classifying a pattern from group 1 into group 2 is given by

$$\alpha_1 = \sum_{x:\,L(x) < t} p(x/1),$$

and the probability of incorrectly classifying a pattern from group 2 into group 1 is

$$\alpha_2 = \sum_{x:\,L(x) \ge t} p(x/2).$$
The classification functions and corresponding networks which result when various approximations to $p(x)$ are used in a likelihood ratio procedure can now be derived.
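To make the threshold sweep concrete, the following sketch computes $\alpha_1(t)$ and $\alpha_2(t)$ for a toy pair of distributions over binary patterns; the distributions, the number of variables, and the threshold grid are all invented for illustration.

```python
from itertools import product

def likelihood_ratio(x, p1, p2):
    # L(x) = p(x/1) / p(x/2); assumes p2[x] > 0 for every pattern considered.
    return p1[x] / p2[x]

def error_curve(p1, p2, thresholds):
    # For each threshold t: alpha1 = sum of p(x/1) over {x : L(x) < t},
    # alpha2 = sum of p(x/2) over {x : L(x) >= t}.
    curve = []
    for t in thresholds:
        a1 = sum(p for x, p in p1.items() if likelihood_ratio(x, p1, p2) < t)
        a2 = sum(p for x, p in p2.items() if likelihood_ratio(x, p1, p2) >= t)
        curve.append((t, a1, a2))
    return curve

# Toy joint distributions over N = 2 binary variables (chosen independent
# here purely to keep the example small).
m, n = [0.8, 0.6], [0.3, 0.4]   # P(x_i = 1) under group 1 and group 2
p1 = {x: (m[0] if x[0] else 1 - m[0]) * (m[1] if x[1] else 1 - m[1])
      for x in product([0, 1], repeat=2)}
p2 = {x: (n[0] if x[0] else 1 - n[0]) * (n[1] if x[1] else 1 - n[1])
      for x in product([0, 1], repeat=2)}

for t, a1, a2 in error_curve(p1, p2, [0.25, 0.5, 1.0, 2.0, 4.0]):
    print(f"t = {t:4.2f}   alpha1 = {a1:.3f}   alpha2 = {a2:.3f}")
```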
If a first order approximation is used, $p(x)$ is replaced by $p_1(x) = \prod_{i=1}^{N} p(x_i)$. This implies an assumption of independence of the $x_i$. Letting $m_i = E_{p(x/1)}(x_i)$ and $n_i = E_{p(x/2)}(x_i)$, the likelihood ratio is

$$L(x) = \frac{\prod_{i=1}^{N} m_i^{x_i} (1-m_i)^{1-x_i}}{\prod_{i=1}^{N} n_i^{x_i} (1-n_i)^{1-x_i}}\,;$$

taking the logarithm gives

$$\log L(x) = \sum_{i=1}^{N} (a_i x_i + c_i),$$

where

$$a_i = \log \frac{m_i (1-n_i)}{n_i (1-m_i)} \qquad \text{and} \qquad c_i = \log \frac{1-m_i}{1-n_i}.$$

The summation over the $c_i$ can be absorbed in the threshold, and a particular weighted sum is obtained for the classification function. If a priori probabilities are included, one gets the cognitive nets suggested by Minsky [5].
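A minimal sketch of this weighted-sum classifier follows, assuming hypothetical marginal means $m_i$ and $n_i$; a priori probabilities are omitted.

```python
import math

def first_order_weights(m, n):
    # a_i = log[m_i (1 - n_i) / (n_i (1 - m_i))], c_i = log[(1 - m_i) / (1 - n_i)]
    a = [math.log(mi * (1 - ni) / (ni * (1 - mi))) for mi, ni in zip(m, n)]
    c = [math.log((1 - mi) / (1 - ni)) for mi, ni in zip(m, n)]
    return a, c

def classify(x, a, c, log_t=0.0):
    # Group 1 when log L(x) = sum_i (a_i x_i + c_i) >= log t; the sum over
    # the c_i could equally be absorbed into the threshold, as noted above.
    log_L = sum(ai * xi + ci for ai, ci, xi in zip(a, c, x))
    return 1 if log_L >= log_t else 2

# Hypothetical marginal means m_i = E(x_i | group 1), n_i = E(x_i | group 2).
a, c = first_order_weights([0.8, 0.6], [0.3, 0.4])
print(classify((1, 1), a, c), classify((0, 0), a, c))   # -> 1 2
```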
The second order approximation $p_2(x)$ neglects all correlations except those of the second order, thus implying that the joint distribution for the $x_i$ is a multivariate normal distribution. If the further assumption is made that the groups have equal covariance matrices (no assumption of statistical independence of the $x_i$ is made), then it can be shown [6,7] that the likelihood ratio, which is now the ratio of two multivariate normal density functions differing only in their means, leads to a linear function of the $x_i$, called the discriminant function, given by

$$\sum_{i=1}^{N} a_i x_i, \qquad a_i = q_{i1} d_1 + q_{i2} d_2 + \cdots + q_{iN} d_N,$$

where the $q_{ij}$ are elements of $Q$, the inverse of the common covariance matrix, and $d_j = m_j - n_j$
$(j = 1, 2, \ldots, N)$. This function, which according to the likelihood ratio is optimum for the case of continuous variables with multivariate normal distributions and equal covariance matrices in the groups, is, for the case of arbitrary distributions, the linear function introduced by Fisher [3]. The sense in which Fisher's discriminant function provides maximum discrimination between the two groups is that the coefficients $a_i$ are chosen so as to maximize the ratio

$$\frac{\left(\sum_{i=1}^{N} a_i d_i\right)^{2}}{\sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j \bar{q}_{ij}},$$

where the $\bar{q}_{ij}$ are elements of the covariance matrix $Q^{-1}$. It is the linear function which maximizes the variance between samples relative to the variance within samples.
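The coefficients $a_i = \sum_j q_{ij} d_j$ can be estimated directly from samples; the sketch below does so with invented data, pooling the two sample covariance matrices to estimate the assumed common covariance.

```python
import numpy as np

def fisher_coefficients(samples1, samples2):
    # a = Q d, with Q the inverse of the pooled (common) covariance matrix
    # and d = m - n the difference of the group means; sample estimates.
    m = samples1.mean(axis=0)
    n = samples2.mean(axis=0)
    k1, k2 = len(samples1), len(samples2)
    pooled = ((k1 - 1) * np.cov(samples1, rowvar=False)
              + (k2 - 1) * np.cov(samples2, rowvar=False)) / (k1 + k2 - 2)
    return np.linalg.inv(pooled) @ (m - n)

rng = np.random.default_rng(0)
s1 = rng.normal(loc=[1.0, 0.5], size=(200, 2))   # group 1 samples (toy)
s2 = rng.normal(loc=[0.0, 0.0], size=(200, 2))   # group 2 samples (toy)
a = fisher_coefficients(s1, s2)
# Discriminant values sum_i a_i x_i separate the groups on average.
print(a, (s1 @ a).mean() > (s2 @ a).mean())
```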
Without the assumption of equal covariance matrices in the groups, even the second order approximation would result in a network involving multipliers. So would higher order approximations of the form $p_k(x)$, or other approximations which neglect all terms in the expansion of $f(x)$ except for the first term and a particular higher order correlation.
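To see where the multipliers enter, one can expand the log likelihood ratio of two normal densities with unequal covariance matrices: the quadratic term couples the variables in pairs $x_i x_j$. A sketch with invented means and covariances:

```python
import numpy as np

def quadratic_log_likelihood_ratio(x, m, n, S1, S2):
    # log L(x) for two multivariate normal densities with means m, n and
    # unequal covariances S1, S2. The term 0.5 * x^T (Q2 - Q1) x involves
    # products x_i x_j, i.e. the multipliers a network would need.
    Q1, Q2 = np.linalg.inv(S1), np.linalg.inv(S2)
    quad = 0.5 * x @ (Q2 - Q1) @ x
    lin = (Q1 @ m - Q2 @ n) @ x
    const = (0.5 * (n @ Q2 @ n - m @ Q1 @ m)
             + 0.5 * np.log(np.linalg.det(S2) / np.linalg.det(S1)))
    return quad + lin + const

m, n = np.array([1.0, 0.0]), np.array([0.0, 0.0])
S1 = np.array([[1.0, 0.3], [0.3, 1.0]])
S2 = np.array([[2.0, 0.0], [0.0, 0.5]])
print(quadratic_log_likelihood_ratio(np.array([1.0, 0.5]), m, n, S1, S2))
```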
If the total distributions $p(x/i)$ are used along with a priori probabilities and costs, the networks which result are those derived by Chow. The best error curve is of course obtained when the complete joint distributions are used.
