Corander et al.: A Bayesian method for identification of stock mixtures from molecular marker data 551 



more feasible when the number of genetically diverged 

 sources contributing to the observed data is unknown. 

 A wide variety of applications of this approach can be 

 found in the literature (e.g., Heuertz et al., 2004; Seppä 

 et al., 2004; Mäki-Petäys et al., 2005). The approach 

 by Dawson and Belkhir (2001) is similar to that of 

 Corander et al. (2004) in spirit; however, it is subject to 

 two important limitations that prevent an efficient use 

 of this approach in the current context. First, there are 

 no readily available informative forms of the family of 

 prior distributions used by Dawson and Belkhir (2001), 

 which would be necessary for representing the baseline 

 information. Second, their model formulation does not 

 allow for missing alleles in the molecular marker data, 

 which are present in most real data sets. 



In our study, we extend the partition-based approach 

 to incorporate a priori baseline information, making 

 it suitable for identification of stock mixtures under 

 either complete or partial baseline sample information. 

 Our focus is on the identification of the putative 

 genetic mixture in the catch sample data, provided by 

 the maximum a posteriori estimate of the assignment 

 of the individuals into an unknown number of sources. 

 Given the estimate, the proportions of the stocks in 

 the population can be readily inferred by using the 

 standard multinomial-Dirichlet model (e.g., Pella and 

 Masuda, 2001) and generic Bayesian software, such 

 as BUGS (Spiegelhalter et al., 2003), which has been 

 widely used for fish population modeling (e.g., Meyer 

 and Millar, 1999; Mäntyniemi and Romakkaniemi, 

 2002; Mäntyniemi et al., 2005). 
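
As a rough illustration of this second step, the sketch below applies the conjugate multinomial-Dirichlet update to a fixed assignment of catch individuals. It is not the BUGS model referred to above; a closed-form Dirichlet update is swapped in for simplicity. The function name, the symmetric prior weight `prior_alpha`, and the integer stock labels are all assumptions made for the example.

```python
import numpy as np

def stock_proportion_posterior(assignment, n_stocks, prior_alpha=1.0,
                               n_draws=10000, seed=0):
    """Posterior of stock proportions given a fixed assignment.

    assignment: integer stock label (0 .. n_stocks-1) for each catch individual.
    prior_alpha: symmetric Dirichlet prior weight (an assumed default).
    Returns the posterior mean and Monte Carlo draws of the proportions.
    """
    # Number of catch individuals assigned to each stock.
    counts = np.bincount(np.asarray(assignment), minlength=n_stocks)
    # Conjugate update: Dirichlet(prior) + multinomial counts.
    post_alpha = counts + prior_alpha
    rng = np.random.default_rng(seed)
    draws = rng.dirichlet(post_alpha, size=n_draws)
    return post_alpha / post_alpha.sum(), draws

# Hypothetical assignment of five catch individuals to three stocks.
mean, draws = stock_proportion_posterior([0, 0, 1, 2, 0], n_stocks=3)
print(mean)
```

The draws can be summarized (for example, by percentiles) to obtain interval estimates for each stock proportion, conditional on the assignment being treated as known.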



Another novelty in our method is the possibility of 

 using available biologically relevant information to pre- 

 assign catch data into groups that can be considered 

 as sampling units in the model. For instance, when 

 the behavior of the investigated species is such that 

 individuals obtained simultaneously at a single catch 

 location are known to represent the same (yet unknown) 

 stock, they can be allocated as a single unit to an ori- 

 gin. Such use of auxiliary information enhances the 

 statistical power to detect the correct origin when the 

 number of molecular marker loci available is limited. 

 To illustrate our modeling approach, and to investigate 

 its performance under various biological settings, we 

 present results from several simulation experiments 

 based partly on real molecular data for the Baltic Sea 

 stock mixture of Atlantic salmon (Salmo salar). 



Methods 



Bayesian stock mixture model 



In stock mixture estimation, the samples typically 

 contain two types of genotyped individuals. 

 One type consists of individuals with known 

 origin (baseline data), and the other type represents a 

 catch sample, which may have been pooled from several 

 sources. Let m be the number of potential stocks such 

 that, for each stock $i = 1, \ldots, m$, there are $n_i$ baseline 



individuals available. Furthermore, there may be an 

 additional number of potential stocks contributing to the 

 catch population; however, these are not represented by 

 any baseline samples. We let K ($m \le K$) denote the total 

 number of potential stocks that may have contributed 

 to a catch sample of n individuals, whose origins are 

 unknown. Notice that K is typically determined from the 

 relevant biological information about the species under 

 consideration. The target for our estimation is to infer 

 the number of stocks, say k, having actually contributed 

 to the catch sample, from the multilocus genotypes of 

 both the baseline and catch individuals. 



We assume that the genetic information consists of 

 $N_L$ molecular marker loci, where at each locus 

 $j = 1, \ldots, N_L$ there are $N_{A(j)}$ different alleles distinguished 

 among all baseline and catch samples. Pella and 

 Masuda (2001) introduced a rather complicated empiri- 

 cal Bayes procedure to determine the prior distribution 

 for the allele frequencies in the potential stocks through 

 the observed genotypes of the baseline individuals (all 

 stocks were assumed to be represented by baseline sam- 

 ples). Here we consider a simpler approach, by suitably 

 modifying the standard Dirichlet prior used in Corander 

 et al. (2003, 2004). We assume that the allele frequen- 

 cies between marker loci are conditionally independent 

 given the stock origins and consider the potential stocks 

 to be in Hardy-Weinberg equilibrium (HWE). 
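
To make these two assumptions concrete, the following minimal sketch computes the log-probability of one diploid multilocus genotype when the allele frequencies of a stock are treated as known. Under HWE the two alleles at a locus are independent draws from the stock's allele frequencies, and independence between loci turns the product over loci into a sum of logs. The function name and the data layout (allele indices per locus) are assumptions for illustration, and missing alleles are not handled.

```python
import math

def genotype_log_prob(genotype, allele_freqs):
    """Log-probability of a diploid multilocus genotype under HWE.

    genotype: list of (a, b) allele-index pairs, one pair per locus.
    allele_freqs: allele_freqs[j][l] is the frequency of allele l at
        locus j in the stock considered (assumed known and positive).
    """
    log_p = 0.0
    for (a, b), freqs in zip(genotype, allele_freqs):
        p = freqs[a] * freqs[b]
        if a != b:
            p *= 2.0  # heterozygote: two orderings of the allele pair
        # Loci are independent given the stock, so log-probabilities add.
        log_p += math.log(p)
    return log_p

# Hypothetical two-locus example with two alleles per locus.
freqs = [[0.5, 0.5], [0.2, 0.8]]
print(math.exp(genotype_log_prob([(0, 1), (1, 1)], freqs)))
```

Comparing this quantity across candidate stocks is what makes an individual's genotype informative about its origin; the factor of two for heterozygotes is the same for every stock and so does not affect such comparisons.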



Let $p_{ijl}$ be the unknown frequency (or probability) 

 of allele $l$ in stock $i$ at locus $j$, given that $k$ ($k \le K$) 

 stocks are considered. Further, for each locus $j = 1, \ldots, N_L$, 

 let $\alpha_{ijl}$ be a hyperparameter for a Dirichlet prior 

 distribution of the allele frequencies of stock $i$ ($i = 1, \ldots, k$; 

 $l = 1, \ldots, N_{A(j)}$). Given the baseline information, we may 

 partially update our beliefs about the allele frequencies 

 using the posterior distribution derived from an initially 

 vague reference prior. For each of the m stocks for which 

 baseline samples are available, we set $\alpha_{ijl} = n_{ijl} + 1/N_{A(j)}$, 

 where $n_{ijl}$ is the observed number of copies of allele $l$ at 

 locus $j$ among individuals in the baseline sample of size 

 $n_i$. This hyperparameter updating procedure is standard 

 in Bayesian analysis with the multinomial-Dirichlet 

 model (Gelman et al., 2004). Correspondingly, for the 

 other potential stocks, not represented by any baseline 

 information, the count $n_{ijl}$ is zero, and the hyperparameter 

 is determined as $\alpha_{ijl} = 1/N_{A(j)}$. 
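
The hyperparameter rule just described (baseline allele count plus one over the number of alleles at the locus, with a zero count for stocks lacking baseline samples) can be sketched for a single locus as follows. The function name, argument names, and the convention that the stocks with baseline samples occupy the first rows are assumptions made for this example; for simplicity the array is built for all K potential stocks.

```python
import numpy as np

def dirichlet_hyperparameters(baseline_counts, n_alleles, n_stocks_total):
    """Dirichlet hyperparameters for one locus.

    baseline_counts: array of shape (m, n_alleles) holding the observed
        allele copy counts for the m stocks with baseline samples.
    n_alleles: number of distinct alleles at this locus.
    n_stocks_total: K, the total number of potential stocks (m <= K).
    Returns an array of shape (K, n_alleles).
    """
    counts = np.asarray(baseline_counts, dtype=float)
    m = counts.shape[0]
    if counts.shape[1] != n_alleles:
        raise ValueError("count columns must match the number of alleles")
    if m > n_stocks_total:
        raise ValueError("m must not exceed K")
    # Vague reference prior: every stock starts at 1 / (number of alleles).
    alpha = np.full((n_stocks_total, n_alleles), 1.0 / n_alleles)
    # Stocks with baseline samples add their observed allele counts.
    alpha[:m] += counts
    return alpha

# Hypothetical locus with 3 alleles, 2 baseline stocks, and K = 3.
alpha = dirichlet_hyperparameters([[6, 2, 0], [1, 4, 3]], 3, 3)
print(alpha)
print(alpha / alpha.sum(axis=1, keepdims=True))  # prior mean allele frequencies
```

In this example the third row stays at 1/3 for each allele, reflecting a stock with no baseline information, while the first two rows are dominated by their baseline counts.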



A putative assignment of the catch data to the potential 

 stocks is represented in our study by a partition-valued 

 parameter $S = (s_1, \ldots, s_k)$, which allocates 

 the n individuals into k non-empty clusters. A cluster 

 is labeled as either the corresponding baseline sample 

 or, alternatively, as a group of unknown geographical 

 origin. The prior distribution P(S) of the allocation 

 parameter is defined according to 



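$$
P\bigl(S = (s_1, \ldots, s_k)\bigr) =
\begin{cases}
\text{[constant; expression illegible in source]}, & \text{if } k \le K \\
0, & \text{otherwise}
\end{cases}
\tag{1}
$$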



which corresponds to a uniform distribution over the 

 possible allocations of the catch individuals under the 



