
restriction k ≤ K. This particular prior specification allows a convenient estimation algorithm to be constructed for identifying those allocations associated with high posterior probabilities, given the molecular marker data. Also, our prior is considerably more informative from the biological perspective than the uniform prior with K = n used in Corander et al. (2004), because it assigns zero probability to most allocations that are very unrealistic for a moderate or large sample size n.



The posterior probability of an allocation S over the class of all putative candidates is defined as

\[
p(S \mid \mathrm{data}) = \frac{p(\mathrm{data} \mid S)\, p(S)}{\sum_{S^{*}} p(\mathrm{data} \mid S^{*})\, p(S^{*})}, \qquad (2)
\]



where the sum in the denominator runs over all candidate allocations S*, and p(data | S) is the marginal likelihood, obtained by analytically integrating out the allele frequency parameters of each class s_i in S (formula A1 in Corander et al., 2003), according to



\[
p(\mathrm{data} \mid S) = \prod_{i=1}^{k} \prod_{j=1}^{N_L} \frac{\Gamma\!\left(\sum_{l=1}^{N_j} \lambda_{jl}\right)}{\Gamma\!\left(\sum_{l=1}^{N_j} \left(\lambda_{jl} + n_{ijl}\right)\right)} \prod_{l=1}^{N_j} \frac{\Gamma\!\left(\lambda_{jl} + n_{ijl}\right)}{\Gamma\!\left(\lambda_{jl}\right)}, \qquad (3)
\]



where n_{ijl} is now the observed number of copies of allele l at locus j among the catch individuals allocated to cluster i, N_L is the number of loci, N_j is the number of distinct alleles at locus j, and the λ_{jl} are the hyperparameters of the Dirichlet prior on the allele frequencies. The analytical integration approach was used earlier in a related genetic context by Balding and Nichols (1997), Rannala and Hartigan (1996), and Rannala and Mountain (1997).
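For readers who wish to experiment with these quantities, the following is a minimal Python sketch of Eqs. 2 and 3, assuming a symmetric Dirichlet(λ) prior on the allele frequencies of each cluster and a rectangular allele-count array; the function and variable names are ours and are not taken from the original software.

```python
# Minimal sketch of Eqs. 2-3 (illustrative names, not the original software).
import numpy as np
from scipy.special import gammaln


def log_marginal_likelihood(counts, lam=1.0):
    """Eq. 3 on a log scale. counts[i, j, l] is the number of copies of
    allele l at locus j among individuals allocated to cluster i; the array
    is assumed rectangular (every locus has the same number of alleles)."""
    lam = np.full(counts.shape[1:], lam)              # Dirichlet hyperparameters
    per_cluster_locus = (
        gammaln(lam.sum(axis=-1))
        - gammaln((lam + counts).sum(axis=-1))
        + (gammaln(lam + counts) - gammaln(lam)).sum(axis=-1)
    )                                                 # one term per (cluster, locus)
    return per_cluster_locus.sum()


def posterior_probabilities(candidate_counts, log_priors):
    """Eq. 2, normalized over the candidate allocations actually evaluated.
    candidate_counts[m] is the count array implied by candidate allocation m,
    and log_priors[m] its log prior probability."""
    logs = np.array([log_marginal_likelihood(c) + lp
                     for c, lp in zip(candidate_counts, log_priors)])
    logs -= logs.max()                                # guard against underflow
    weights = np.exp(logs)
    return weights / weights.sum()
```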



As can be seen from the above expression, the parameter that needs to be estimated in our stock mixture model is the allocation partition S. In situations where auxiliary biological information provides a pre-assignment of some catch individuals into a priori sampling units, each known to represent a single unknown origin, the above formula still applies. However, the prior distribution of the allocation parameter needs to be modified accordingly, to exclude the values of S that assign individuals in the same sampling unit to distinct origins.
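As a brief illustration, the following Python sketch (with names of our own choosing) encodes this constraint as a check that can be used to assign zero prior probability to allocations that split a sampling unit.

```python
# Illustrative check of the sampling-unit constraint described above.
def respects_sampling_units(allocation, sampling_unit):
    """allocation[i] and sampling_unit[i] are the cluster and the a priori
    sampling unit of catch individual i; return False if any unit is split
    across clusters (such an allocation S receives zero prior probability)."""
    unit_to_cluster = {}
    for unit, cluster in zip(sampling_unit, allocation):
        if unit_to_cluster.setdefault(unit, cluster) != cluster:
            return False
    return True


# Example: unit "B" is split between clusters 0 and 1, so the prior is zero.
print(respects_sampling_units([0, 0, 1], ["A", "B", "B"]))  # False
```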



Algorithm for allocation estimation 



In the sequel, we consider the general situation where 

 the catch data are represented by a number of a priori 

 sampling units, each containing potentially more than 

 a single individual (see the previous section). When no 

 auxiliary information for pre-assignment to sampling 

 units is available, each sampling unit will correspond 

 to a single individual in the catch data. 



The two main aspects of our stock mixture model 

 that need to be estimated from marker data are the 

 number of clusters (k) suitable for a particular data 

 set, and the allocation of sampling units to the clusters. 

 Corander et al. (2003, 2004) used various stochastic 

 search strategies for estimation of a comparable model 

without prior baseline samples. In our study, we develop an alternative method that enables the analysis of data sets ranging from moderate to challenging in size, because the MCMC algorithms introduced by Corander et al. (2003, 2004) are computationally time- and memory-intensive.

 The central idea is to use a "greedy" stochastic search 

 algorithm (Fletcher, 1987) to find a partition S with the 

 highest posterior probability (Eq. 2). Repeated runs of 

 the algorithm enable investigation of the stability of the 

 estimation procedure, in a way that is similar to the 

 parallel MCMC approach of Corander et al. (2004). 
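Such repeated runs can be compared programmatically. The Python sketch below is one possible way to do so, assuming a callable greedy_search() that performs one complete run of the stochastic search described below and returns the resulting allocation as a dictionary mapping sampling units to cluster labels; all names here are illustrative.

```python
# Illustrative stability check over repeated stochastic-search runs.
def canonical(partition):
    """Relabel clusters by order of first appearance so that two allocations
    can be compared regardless of arbitrary label choices."""
    relabel = {}
    return tuple(relabel.setdefault(partition[u], len(relabel))
                 for u in sorted(partition))


def stability(greedy_search, n_runs=10):
    """Run the search n_runs times and report the fraction of runs that
    agree with the most frequently found allocation."""
    results = [canonical(greedy_search()) for _ in range(n_runs)]
    modal = max(set(results), key=results.count)
    return results.count(modal) / n_runs
```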



An initial partition for the algorithm is determined by assigning sampling units one by one, each to the cluster for which the resulting partition has the highest possible posterior probability. After specifying the initial partition, the greedy algorithm proceeds by repeatedly performing the following operations on the current partition (a condensed code sketch of the search follows the list):



1 The algorithm moves sampling units from one cluster to another. In a stochastic order, for each sampling unit, it calculates the change imposed on the posterior probability p(S | data) by moving that sampling unit to any of the other clusters, including an empty cluster, unless that would lead to k > K. It moves the sampling unit to the cluster that increases the value of p(S | data) most (if no increase is possible, the sampling unit is not moved).



2 It joins clusters. For each pair of clusters, the algorithm calculates the change to p(S | data) imposed by joining them. It joins the two clusters that cause the maximal increase (if no increase is possible, no clusters are joined).



3 It splits clusters. Using the Kullback-Leibler divergence between sampling units (Corander et al., 2003), the algorithm splits each cluster into at most 20 subclusters and calculates the change to p(S | data) imposed by keeping one of the introduced subclusters as a separate new cluster, or by joining it to any of the previously existing clusters. It keeps the new configuration that improves the value of p(S | data) most (if no improvement is possible, the split is not made).



4 It splits clusters into exactly two maximally homogeneous subclusters with the Kullback-Leibler divergence, otherwise proceeding analogously to step 3.



5 It re-allocates several sampling units from a cluster. 

 In a stochastic order, for each cluster, the algorithm 

 orders the sampling units of the cluster such that the 

 first sampling unit is the one whose removal from the 

 cluster would improve the marginal likelihood of the 

 cluster most, and so on. A putative candidate for a 

 new partition is formed by moving sampling units one 

by one to some other cluster, such that the p(S | data) of the resulting partition is as high as possible (these moves are performed even if a single move results in a worse solution). If at some point the total change in p(S | data) is positive, the putative candidate is

 accepted and operation 5 is completed. When the total 

 change remains negative, even after re-allocation of 

 all the sampling units in a single cluster, the putative 

 candidate is rejected and another cluster is chosen as 

 a target until all clusters have been considered. 
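To make the above operations concrete, the following Python sketch outlines the initialization and operations 1 and 2, together with a simple driver loop. It assumes a user-supplied log_posterior(partition) that scores Eq. 2 up to its normalizing constant (for instance, built from the marginal-likelihood sketch given after Eq. 3 plus a log prior over partitions). The Kullback-Leibler-based splits and the batch re-allocation (operations 3-5) follow the same accept-only-if-improved pattern and are omitted; this is an illustrative outline, not the original implementation.

```python
# Condensed, illustrative sketch of the greedy stochastic search.
import random
from itertools import combinations


def initial_partition(units, log_posterior, K):
    """Assign sampling units one by one, each to whichever existing or new
    cluster (while k <= K) gives the highest-scoring partition so far."""
    partition = {}
    for u in units:
        k = len(set(partition.values()))
        candidates = range(min(k + 1, K))            # existing clusters + one new
        partition[u] = max(candidates,
                           key=lambda c: log_posterior({**partition, u: c}))
    return partition


def move_units(partition, log_posterior, K):
    """Operation 1: in a stochastic order, move each sampling unit to the
    cluster (possibly an empty one, unless k would exceed K) that increases
    the posterior most; keep it in place if no move helps."""
    units = list(partition)
    random.shuffle(units)                            # stochastic visiting order
    for u in units:
        labels = set(partition.values())
        if len(labels) < K:
            labels.add(max(labels) + 1)              # candidate empty cluster
        best_c, best_score = partition[u], log_posterior(partition)
        for c in labels - {partition[u]}:
            score = log_posterior({**partition, u: c})
            if score > best_score:
                best_c, best_score = c, score
        partition[u] = best_c
    return partition


def join_clusters(partition, log_posterior):
    """Operation 2: merge the pair of clusters whose union increases the
    posterior most; do nothing if no merge helps."""
    best, best_score = partition, log_posterior(partition)
    for a, b in combinations(sorted(set(partition.values())), 2):
        merged = {u: (a if c == b else c) for u, c in partition.items()}
        score = log_posterior(merged)
        if score > best_score:
            best, best_score = merged, score
    return best


def greedy_search(units, log_posterior, K, max_sweeps=50):
    """Repeat the operations until a full sweep no longer improves Eq. 2."""
    partition = initial_partition(units, log_posterior, K)
    score = log_posterior(partition)
    for _ in range(max_sweeps):
        partition = join_clusters(move_units(partition, log_posterior, K),
                                  log_posterior)
        new_score = log_posterior(partition)
        if new_score <= score:                       # no improvement in this sweep
            break
        score = new_score
    return partition
```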



