Pella and Masuda: Bayesian methods for analysis of stock mixtures from genetic markers 
157 
n 
n 
D 
ivi 
Ph 1 "h Vih 1 + ( Z mi X mh 1 )> • ■ ■ > Pli.J,, + V ihj ,, 
m=l 
M 
( 10 ) 
(cont.) 
i = X2,...,c. 
Notice that each of the updated Dirichlet parameters for 
the HAG RF in the ith stock equals the sum of the prior 
parameter, the HAG count in the baseline sample, and the 
HAG count for the mixture individuals identified to the 
stock. 
The data augmentation algorithm cycles the two steps 
and eventually outputs a sequence, or chain, of samples 
of stock proportions and baseline genetic parameters from 
the posterior Bayes distribution. However, the early sam- 
ples of a chain are influenced by the starting values of p 
and Q. To make valid inferences, early burn-in samples 
must be discarded and a sufficient number of subsequent 
samples must be generated to accurately describe the pos- 
terior. Statistics to determine the number of samples to 
generate per chain — the Raftery and Lewis ( 1996) conver- 
gence diagnostic — or to monitor convergence of multiple 
chains to the desired posterior density — the Gelrnan and 
Rubin (1992) shrink factor — are widely used in practice 
and are comparatively inexpensive to compute. Although 
these two statistics are used in the applications that fol- 
low, MCMC research is currently very active (e.g. Brooks 
and Roberts, 1999), and alternatives may prove to be su- 
perior for these purposes. Main interest is usually in the 
stock composition of a stock mixture and so in the later ap- 
plications the statistics have been applied only to samples 
of p even though they can be applied to samples of Q as 
well. However, convergence of samples forp without corre- 
sponding convergence for Q is not thought to be possible. 
The diagnostic outlined by Raftery and Lewis ( 1996) de- 
termines the number of samples required for estimating 
quantiles (q) of posterior distributions with a specified ac- 
curacy (?•) and probability (s). The Fortran implementa- 
tion of the diagnostic, called gibbsit , 1 is applied in later 
examples (with r/=0.975, /-=0.02, and s=0.95) to each of 
several chains of stock proportions generated from differ- 
ent starting values. The diagnostic first requires that an 
initial pilot sample be generated for each chain, which is 
used to compute its recommended number of samples. An 
additional number of samples are generated to satisfy the 
maximum recommended. The combined samples — original 
pilot samples and the additional samples — are used with 
gibbsit as pilot samples to compute recommended sample 
sizes again. Further samples are generated if the maxi- 
mum recommended sample size for any stock exceeds the 
number so far generated. This iterative scheme is applied 
to the first chain, beginning with a pilot sample size of 
235 (the initial number recommended from the chosen val- 
ues of q, r, and s). The other chains are run the length of 
1 Fortran program gibbsit (version 2.0) can be obtained without 
cost from the general archive at http: / / lib.stat.cmu.edu / . 
the first chain and then analyzed separately by gibbsit. If 
gibbsit suggests that any of the chains should be longer 
than the first chain, then all the chains are run for the 
largest number of samples recommended for all stocks and 
all chains. 
Gelrnan and Rubin (1992) recommended running a 
small number of independent chains with dispersed start- 
ing points to reduce the possibility that a chain is ac- 
cepted as representative of the posterior distribution be- 
fore convergence has occurred. To monitor convergence of 
the chains to the posterior density, a univariate statistic, 
called the shrink factor (Gelrnan and Rubin, 1992), is com- 
puted 2 for each of the stock proportions. One chain per 
stock is started, with most of the stock mixture initially 
contributed by that stock. Once chains of samples are gen- 
erated and length determined from the Raftery and Lew- 
is diagnostic, shrink factors for all stocks are computed 
to verify that the chains have converged. The shrink fac- 
tor is computed from the second halves of the chains and 
compares the variation within a single chain for a given 
parameter to the total variation among the chains. Esti- 
mates of shrink factors close to one indicate convergence, 
and acceptable values are less than 1.2 (see Kass et al., 
1998). Because the shrink factor is computed from the sec- 
ond halves of chains, the first halves of chains are discard- 
ed as burn-in samples. The purpose of discarding initial 
burn-in samples is to remove dependence on the starting 
values. Samples subsequent to the burn-in samples may 
be thought of as coming from the desired posterior dis- 
tribution. Once convergence of chains has been verified, 
the MCMC samples (after burn-in discard) of stock compo- 
sition estimates are combined and summarized with var- 
ious statistics (equivalent to parameters because of the 
large samples): means, standard deviations, and empirical 
percentiles (2.5, 50, and 97.5). Baseline haplotype, allele, 
or genotype RFs can be summarized similarly. 
Checking the fit of the stock-mixture model 
Current stock-mixture modeling presumes that a stock 
mixture composed of random genotypes from the contrib- 
uting stocks occurs and that a simple random sample of 
the mixture individuals has been drawn. Further, it is 
assumed that simple random samples of HAGs from all 
contributing stocks are available by which to estimate, 
with an appropriate and known genetic model, the stock- 
mixture genotype RFs in each of the separate stocks. 
Samples are considered small in relation to the popula- 
tions sampled, so that the multinomial distribution can 
be used to describe sampling variation in counts. These 
assumptions may be plausible in many applications, but 
violations can also occur. For example, baseline samples 
of juvenile salmon, drawn before their families have 
mixed, would have extra-multinomial variation (Waples 
2 Our current Fortran program for the diagnostic is a translation 
of Gelman’s S function itsim (free from the Statlib S archive at 
http : / / lib.stat.cmu.edu / ), and its modification (version 0.4) in 
CODA (Best et al., 1995) (free from the MRC Biostatistics Unit, 
University of Cambridge, at http://www.mrc-bsu.cam.ac.uk/ 
bugs/). 
