Pella and Masuda: Bayesian methods for analysis of stock mixtures from genetic markers 
prior for the allele RFs, their values must be assigned. Two 
approaches — empirical Bayes and pseudo-Bayes — are con- 
sidered in which the prior parameters are functions of al- 
lele counts in the baseline samples. The empirical Bayes 
method was previously developed for geneticists to esti- 
mate allele RFs (Lange, 1997). In this method, the values 
assigned to the $\beta$s are those which maximize the Bayes
prior predictive distribution (Gelman et al., 1995) for the
allele counts in the baseline samples. This distribution is
the marginal distribution of the allele counts, which results
from averaging their multinomial distribution, $\mathrm{Mult}(n_i, q_i)$,
weighted by the prior probability of $q_i$, $D(\beta_1, \beta_2, \ldots, \beta_T)$.
The prior predictive distribution is parameterized by the
$\beta$s alone, and the optimizing values can be computed from
the allele counts (Appndx. 1). Limited experience during 
this study indicates that, with large baseline samples of 
a few baseline stocks, or lesser baseline samples for large 
numbers of baseline stocks, the empirical Bayes method 
can provide values for the prior parameters, which result in 
sensible weighting of the observed sample and prior means. 
Commonly, baseline sampling is more limited, and pragma- 
tism requires a less-demanding alternative method. 
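The empirical Bayes step can be illustrated with a minimal sketch. For a single locus, the prior predictive distribution of the counts is Dirichlet-multinomial, and the code below maximizes its logarithm over the $\beta$s with a fixed-point update attributed to Minka. This update is a stand-in chosen for brevity; it is not claimed to be the computation given in Appendix 1. The function names and the example counts are hypothetical.

```python
import math

def log_prior_predictive(counts, beta):
    """Log Dirichlet-multinomial likelihood of the baseline allele counts,
    summed over stocks. The multinomial coefficient is dropped because it
    does not depend on beta."""
    beta_sum = sum(beta)
    total = 0.0
    for y in counts:  # y holds the allele counts for one stock
        n = sum(y)
        total += math.lgamma(beta_sum) - math.lgamma(beta_sum + n)
        total += sum(math.lgamma(b + y_t) - math.lgamma(b)
                     for b, y_t in zip(beta, y))
    return total

def digamma_diff(b, k):
    """psi(b + k) - psi(b) for a non-negative integer k, as a finite sum."""
    return sum(1.0 / (b + j) for j in range(k))

def empirical_bayes_beta(counts, beta_init=None, max_iter=2000, tol=1e-10):
    """Fixed-point iteration for the beta that maximizes the prior
    predictive distribution (assumed update rule, see lead-in)."""
    n_alleles = len(counts[0])
    beta = list(beta_init) if beta_init else [1.0] * n_alleles
    for _ in range(max_iter):
        beta_sum = sum(beta)
        denom = sum(digamma_diff(beta_sum, sum(y)) for y in counts)
        new_beta = [
            b * sum(digamma_diff(b, y[t]) for y in counts) / denom
            for t, b in enumerate(beta)
        ]
        converged = max(abs(nb - b) for nb, b in zip(new_beta, beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta

# Hypothetical allele counts: 3 stocks (rows) by 3 alleles (columns).
example_counts = [[40, 8, 2], [10, 30, 10], [5, 10, 35]]
beta_hat = empirical_bayes_beta(example_counts)
```

The sketch assumes every allele is observed in at least one stock and that allele RFs vary among stocks more than multinomial sampling alone would produce; otherwise the maximizing values may drift toward zero or grow without bound.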
The pseudo-Bayes method is based on several practical 
considerations to determine values for the baseline prior 
parameters. First, the baseline prior parameters, $\beta_1, \beta_2, \ldots, \beta_T$,
have no intrinsic value, other than as tuning parameters
by which to perform stock-mixture analysis. Nonetheless,
a sound rationale and simple computational formulas
for their values are desirable. Second, the prior mean
should reflect the similarity of the allele RFs among the 
baseline stocks. Third, the weights assigned to the prior 
and observed allele RFs should allow a realistic evalua- 
tion of the uncertainty in the genetic composition of the 
stock, yet not cause misleading bias in the estimated stock 
composition. Loci with large variation among stocks have 
more effect on estimated stock composition than those 
with small variation. Therefore, shrinkage from observed 
allele RFs toward prior means for loci with large varia- 
tion should be less than for loci with small variation. If 
the prior parameter sum, $\beta_\cdot$, is substantially smaller than
the baseline sample sizes, the bias will be limited. However,
with $\beta_\cdot = 0$, all weight goes to the observed RFs. Then,
when a baseline sample misses an allele that is present, 
sampling error will be underestimated (as it is with boot- 
strapping under the CML method). Fourth, and last, the 
weight assigned to the observed RFs for a stock should be 
positively related to its baseline sample size. 
The pseudo-Bayes method of this proposal is original to 
estimating allele RFs and appears in practice to satisfy 
the aforementioned criteria. The prior mean will be cen- 
tered within the observed allele RFs for the stocks of the 
baseline samples with 
\[ \beta_t = \beta_\cdot \, \bar{y}_t, \qquad t = 1, 2, \ldots, T, \]
where $\beta_\cdot$ is an estimate (Appndx. 2) of the value for $\beta_\cdot$
that minimizes the baseline risk, or expected squared-errors between
the posterior means at Equation 4 and the unknown allele RFs of
all baseline stocks, and
\[ \bar{y}_t = \frac{1}{c} \sum_{i=1}^{c} \frac{y_{it}}{n_i} \]
is the baseline center, or unweighted arithmetic mean, of the
observed RFs for the $t$th allele among stocks.
With this definition for the $\beta$s, the prior mean equals the
baseline center. The posterior mean for any stock is the 
weighted average of its observed allele RFs and the baseline
center as at Equation 4. Although the central allele
RFs for the entire set of baseline stocks anchor the estimation
of Q in this description, extensions to accommodate
regional or other groupings of stocks could be accomplished 
as simply by anchoring on regional or group centers. 
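The pseudo-Bayes calculation for one locus can be sketched as follows. The code computes the baseline center, sets each prior parameter to the prior parameter sum times the center, and forms the posterior mean for each stock, which for a Dirichlet prior with multinomial counts is $(\beta_t + y_{it}) / (\beta_\cdot + n_i)$. The value of the prior parameter sum is supplied by hand here as a placeholder; the estimator of Appendix 2 is not reproduced. Function names and counts are hypothetical.

```python
def baseline_center(counts):
    """Unweighted mean across stocks of the observed allele RFs y_it / n_i."""
    c = len(counts)
    n_alleles = len(counts[0])
    return [sum(y[t] / sum(y) for y in counts) / c for t in range(n_alleles)]

def pseudo_bayes_prior(counts, beta_sum):
    """Prior parameters: beta_t = beta_sum * (baseline center for allele t)."""
    return [beta_sum * y_bar for y_bar in baseline_center(counts)]

def posterior_means(counts, beta):
    """Posterior mean allele RFs per stock: (beta_t + y_it) / (beta_sum + n_i)."""
    beta_sum = sum(beta)
    return [[(b + y_t) / (beta_sum + sum(y)) for b, y_t in zip(beta, y)]
            for y in counts]

# Hypothetical counts: 2 stocks (rows) by 2 alleles, with unequal sample sizes.
example_counts = [[8, 2], [20, 80]]
beta = pseudo_bayes_prior(example_counts, beta_sum=4.0)  # placeholder beta sum
means = posterior_means(example_counts, beta)
```

With these inputs the baseline center is (0.5, 0.5). The first stock (n = 10) has its observed RF of 0.8 pulled to 10/14, while the second stock (n = 100) moves only from 0.2 to 22/104, showing that the weight on the observed RFs grows with baseline sample size.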
Complete analysis of the baseline requires repeated and 
separate application of the empirical Bayes or pseudo- 
Bayes methods to each locus. Suppose a total of $H$ loci compose
the stock-mixture multilocus genotypes. Let the $h$th
locus have $J_h$ alleles with prior parameters $\beta_h = (\beta_{h1}, \beta_{h2}, \ldots, \beta_{hJ_h})$
and allele RFs in the $i$th stock of $q_{ih} = (q_{ih1}, q_{ih2}, \ldots, q_{ihJ_h})$.
If $Q_i$ denotes the $i$th stock's combined arrays,
$q_{i1}, q_{i2}, \ldots, q_{iH}$, then the prior for the allele RFs of the complete
baseline, $Q = (Q_1, Q_2, \ldots, Q_c)$, will be
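\[ \pi(Q) = \prod_{i=1}^{c} \pi(Q_i) = \prod_{i=1}^{c} \prod_{h=1}^{H} \pi(q_{ih}) = \prod_{i=1}^{c} \prod_{h=1}^{H} D(\beta_{h1}, \beta_{h2}, \ldots, \beta_{hJ_h}), \]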
that is, prior draws for allele RFs are independent among 
stocks and loci. 
The baseline samples are drawn independently from the 
stocks. Denote by $Y_i = (y_{i1}, y_{i2}, \ldots, y_{iH})$ the $H$ arrays of allele
counts in the baseline sample for the $i$th stock, and by
$Y$, the entire baseline collection of $Y_1, Y_2, \ldots, Y_c$. Then the
Bayesian posterior density for the allele RFs of the entire 
baseline is the product of Dirichlet densities, 
\[ \pi(Q \mid Y) = \prod_{i=1}^{c} \pi(Q_i \mid Y_i) = \prod_{i=1}^{c} \prod_{h=1}^{H} \pi(q_{ih} \mid y_{ih}) = \prod_{i=1}^{c} \prod_{h=1}^{H} D(\beta_{h1} + y_{ih1}, \ldots, \beta_{hJ_h} + y_{ihJ_h}), \tag{5} \]
and each density in the product has a mean vector, for 
the stock and locus, equal to a weighted average of the 
observed allele RFs and corresponding prior means (as at 
Eq. 4). Although the statistical modeling of the baseline 
samples has been described with alleles and loci, it applies 
equally to any combination of independent components: 
alleles at loci, haplotypes at mtDNA, and genotypes at loci 
in Hardy-Weinberg disequilibrium.
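Because the baseline posterior is a product of independent Dirichlet densities, a draw of the full baseline can be built one stock and one locus at a time. The sketch below adds the baseline counts to the prior parameters and then samples each Dirichlet by normalizing independent gamma variates, a standard construction. All names, prior values, and counts are hypothetical, and the sketch is not taken from the authors' software.

```python
import random

def posterior_parameters(beta_by_locus, counts_by_stock):
    """Dirichlet posterior parameters beta_hj + y_ihj for each stock i, locus h."""
    return [[[b + y for b, y in zip(beta_h, y_ih)]
             for beta_h, y_ih in zip(beta_by_locus, stock_counts)]
            for stock_counts in counts_by_stock]

def draw_dirichlet(alpha, rng):
    """One Dirichlet draw via normalized independent gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

def draw_baseline(post_params, rng):
    """One draw of Q: independent Dirichlet draws across stocks and loci."""
    return [[draw_dirichlet(alpha, rng) for alpha in stock]
            for stock in post_params]

# Hypothetical inputs: 2 loci (with 2 and 3 alleles) and 2 stocks.
beta_by_locus = [[0.5, 0.5], [0.3, 0.3, 0.4]]
counts_by_stock = [
    [[12, 8], [5, 10, 5]],   # stock 1: counts at locus 1, then locus 2
    [[3, 17], [14, 4, 2]],   # stock 2
]
post = posterior_parameters(beta_by_locus, counts_by_stock)
q_draw = draw_baseline(post, random.Random(1))  # seeded for repeatability
```

The same routine draws from the prior if all counts are set to zero, since the posterior parameters then reduce to the prior parameters.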
Stock-mixture sample likelihood function for 
unknowns, $g(X \mid \theta)$
The stock-mixture sample likelihood function is propor- 
tional to the probability of drawing the observed stock- 
mixture genotypes as a function of the unknowns, p and 
