Corander et al.: A Bayesian method for identification of stock mixtures from molecular marker data






The optimization operations are repeated in a varying order until none of them improves the posterior probability $P(S \mid \mathrm{data})$ of the current partition. The allocation of sampling units to different clusters is based on the obtained partition, and the suitable number of clusters k is estimated from the partitions visited during the simulation.
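The stopping rule described above can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration: the two operators (`best_single_move`, `best_merge`), the random shuffling used to vary their order, and the toy score `toy_log_score`, which merely stands in for log P(S | data) and is not the marginal likelihood used by the method.

```python
import random
from itertools import combinations

def toy_log_score(partition, data, penalty=1.0):
    """Stand-in for log P(S | data): negative within-cluster sum of squares
    minus a per-cluster penalty. An assumption, not the paper's model."""
    total = 0.0
    for label in set(partition):
        members = [x for x, lab in zip(data, partition) if lab == label]
        mean = sum(members) / len(members)
        total += sum((x - mean) ** 2 for x in members)
    return -total - penalty * len(set(partition))

def best_single_move(partition, score):
    """Hypothetical operator: best partition reachable by moving one unit."""
    best, best_val = partition, score(partition)
    for i in range(len(partition)):
        for label in set(partition):
            if label == partition[i]:
                continue
            cand = partition[:i] + [label] + partition[i + 1:]
            val = score(cand)
            if val > best_val:
                best, best_val = cand, val
    return best

def best_merge(partition, score):
    """Hypothetical operator: best partition reachable by merging two clusters."""
    best, best_val = partition, score(partition)
    for a, b in combinations(sorted(set(partition)), 2):
        cand = [a if lab == b else lab for lab in partition]
        val = score(cand)
        if val > best_val:
            best, best_val = cand, val
    return best

def greedy_search(start, operators, score, rng):
    """Repeat the operators in a shuffled order until a full pass
    yields no strict improvement of the score."""
    current, current_val = list(start), score(start)
    improved = True
    while improved:
        improved = False
        order = list(operators)
        rng.shuffle(order)  # vary the order between passes (assumed: random)
        for op in order:
            cand = op(current, score)
            val = score(cand)
            if val > current_val + 1e-12:
                current, current_val = cand, val
                improved = True
    return current, current_val

# Toy data: two well-separated groups; start with every unit in its own cluster.
data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
score = lambda s: toy_log_score(s, data)
final, final_val = greedy_search(
    list(range(len(data))), [best_single_move, best_merge], score, random.Random(0)
)
print(final, round(final_val, 2))
```

Because only strict improvements are accepted and the number of partitions is finite, the loop always terminates, but only at a local optimum with respect to the operators supplied.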



Measurement of the strength of evidence for any particular value of the partition S, given the marker data, is an intricate process, in particular for large data sets from complex stock mixtures. Theoretically, the unknown largest posterior probability may be extremely small, even in situations where a particular model provides an adequate fit to the observations. An important factor explaining such a feature is the large number of possible allocations, which all have positive posterior probabilities by definition. This feature is of general concern in a Bayesian analysis that comprises vast model spaces; see, e.g., the discussion in Madigan and Raftery (1994). Because the actual estimated value of the posterior probability may be an intuitively misleading goodness-of-fit measure, we use an alternative strategy for characterization of the uncertainty in relation to the estimated allocation.



Bayes factors (e.g., Kass and Raftery, 1995) provide a computationally efficient approach to local assessment of how sharply the posterior distribution is peaked around an estimate of S. Let $S^{*}$ denote an alternative allocation obtained from an estimate $\hat{S}$ by moving any particular sampling unit to another putative stock. The strength of evidence in favor of placing that sampling unit in the original stock against placement in the new stock is provided by the Bayes factor

$$B_{\hat{S},S^{*}} = \frac{p(\mathrm{data} \mid \hat{S})\,P(\hat{S})}{p(\mathrm{data} \mid S^{*})\,P(S^{*})} \tag{4}$$



which measures how many times more plausible the allocation $\hat{S}$ is for the particular sampling unit. When the value of Equation 4 is small, say $B_{\hat{S},S^{*}} < 10$ (or $\log_e B_{\hat{S},S^{*}} < 2.3$; Kass and Raftery, 1995), the data do not strongly support a single origin for the particular sampling unit. Because calculation of these Bayes factors is computationally inexpensive, they can be easily provided for every possible sampling unit or stock combination.

In addition to Bayes factors, conditional posterior probabilities for the allocation of each individual over the range of different putative stocks identified through $\hat{S}$ can be used to characterize the uncertainties in the Bayesian estimate. The conditional posterior probability distribution is defined for each individual by

$$P(S_i \mid \mathrm{data}) = \frac{p(\mathrm{data} \mid S_i)\,P(S_i)}{\sum_{j=1}^{k} p(\mathrm{data} \mid S_j)\,P(S_j)} \tag{5}$$

where $S_i$ denotes that the particular individual is allocated to the ith class of $\hat{S}$ (over the k possible alternatives). When only a single stock has a high conditional posterior probability, the allocation is made on a firm basis. However, when at least two sources are identified with reasonably high posterior probabilities, the genetic evidence is not conclusive enough for a classification of the particular individual to a single source. The advantage of the conditional posterior probabilities over Bayes factors in characterization of the classification uncertainty for each individual is that the former compares simultaneously all putative sources, whereas the latter provides only a pairwise judgement.

The correct number of clusters needed to describe the data can be estimated from the partitions that were visited during the simulation. During the simulation the algorithm stores the marginal likelihoods and the sizes of the 30 best visited partitions, and the posterior probabilities for the different numbers of clusters can then be estimated analogously to those estimated by Corander et al. (2004). Usually, if there is a lot of molecular data available (e.g., hundreds of loci have been observed), only a few of the best partitions have influence on the computed posterior probabilities because the ratio of marginal likelihoods between different partitions can be up to about exp(1000). If the data are sparse (e.g., only about 10-20 loci have been observed) and only partial baseline information is available, the uncertainty related to the correct number of clusters can be considerable because many partitions with differing sizes and approximately equal marginal likelihood may be found. In these cases, to obtain a more reliable estimate of the correct number of clusters, the algorithm should be run multiple times with different upper bounds (K) in order to facilitate the identification of those partitions that have real influence in the posterior probabilities. In our implementation of the estimation algorithm, we have included the possibility to automatically process information from multiple runs.
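The quantities in Equations 4 and 5, together with the posterior probabilities for the number of clusters, can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the BAPS implementation: all input numbers are made up, the function names are hypothetical, and the cluster-count calculation assumes equal prior weight for the stored partitions and normalizes over the stored partitions only, because the exact procedure of Corander et al. (2004) is not given here. The work is done on the log scale because ratios as large as exp(1000) overflow ordinary floating-point numbers.

```python
import math
from collections import defaultdict

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-scale values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_bayes_factor(log_joint_est, log_joint_alt):
    """Log of Equation 4: log[p(data|S_hat)P(S_hat)] - log[p(data|S*)P(S*)]."""
    return log_joint_est - log_joint_alt

def conditional_posterior(log_joints):
    """Equation 5: normalize p(data|S_i)P(S_i) over the k alternatives."""
    norm = log_sum_exp(log_joints)
    return [math.exp(v - norm) for v in log_joints]

def cluster_count_posterior(log_marginals, sizes):
    """Posterior over the number of clusters from stored partitions.
    Assumption: equal prior weight per stored partition, normalized
    over the stored partitions only."""
    norm = log_sum_exp(log_marginals)
    post = defaultdict(float)
    for log_ml, k in zip(log_marginals, sizes):
        post[k] += math.exp(log_ml - norm)
    return dict(post)

# Made-up values of log[p(data|S_i)P(S_i)] for one individual, k = 3 stocks.
log_joints = [-1200.0, -1203.0, -1210.0]
probs = conditional_posterior(log_joints)

# Bayes factor for the best stock against the runner-up; B >= 10
# (log B >= about 2.3) is the threshold quoted from Kass and Raftery (1995).
log_bf = log_bayes_factor(log_joints[0], log_joints[1])
strong = log_bf >= math.log(10)

# Made-up log marginal likelihoods and sizes of four stored partitions.
counts = cluster_count_posterior(
    [-5000.0, -5001.0, -5000.5, -6000.0], [4, 5, 4, 3]
)
print(probs, log_bf, strong, counts)
```

In this toy input the partition that is 1000 log units worse contributes nothing measurable to the result, which mirrors the remark that only a few of the best partitions influence the computed posterior probabilities when many loci are available.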



Empirical illustration of the partition-based approach



The Bayesian estimation algorithm described in the previous section is implemented in BAPS software.¹ The examples considered here are produced by BAPS analyses of data simulated by using the real data from Koljonen et al. (2002), who assessed allele frequencies for nine microsatellite markers in Atlantic salmon within the Baltic Sea region. We have experimented with several simulation configurations to investigate how our method would be expected to perform under a variety of biological conditions.



The five wild stocks of Atlantic salmon considered in Koljonen et al. (2002) correspond to five different rivers draining into the Baltic Sea: Tornionjoki (TornW), Simojoki (Simo), Iijoki (Ii), Oulujoki (Oulu), and Neva. The pairwise genetic distances (Nei's $D_A$; Nei et al., 1983) between these stocks underlying our simulations are



¹ BAPS software is freely available at URL http://www.rni.helsinki.fi/~jic/bapspage.html. Results presented here were calculated with version 3.1 (release date 5 March 2005).



