Candy et al Dividing population genetic distance data by partitioning optimization 
$$CF = \sum_{v=1}^{k} \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} M_{iv} M_{jv} D_{ij}}{\sum_{i=1}^{n} M_{iv}} \qquad (1)$$
For each population a binary assignment variable
indicates group membership, such that the membership
of each population (i) in each group (v) is recorded in an
(n × k) binary matrix (M), where
$$M_{iv} \in \{0, 1\} \qquad (2)$$
The optimal assignments of M are obtained by minimizing
the cost function (↓CF) while visiting all combinations
of group memberships. Unlike other cost functions
(Hofmann and Buhmann, 1997), there is no penalty
for increased numbers of partitions; thus adding
more partitions will always reduce the cost, where the
output is nonconvex and CF → 0 as k → n. A nonpenalizing
cost function was implemented so that the gap
statistic (discussed later) could be used for determining 
the optimal number of groupings. Adding partitions 
creates more and smaller groups while lowering mean 
intracluster distance (the sum of all pairwise distances 
divided by the number of populations). Meanwhile, add- 
ing more groups increases the sum of the mean intra- 
cluster distances. 
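
As a minimal sketch of how such a nonpenalizing cost could be evaluated, the snippet below assumes that the cost is the sum, over groups, of all within-group pairwise distances divided by the group size. The function name `cost_function` and the use of integer group labels in place of an explicit binary matrix are illustrative assumptions, not part of the original software.

```python
import numpy as np


def cost_function(D, labels):
    """Sum over groups of (within-group pairwise distance sum) / (group size).

    D      : (n, n) symmetric distance matrix with a zero diagonal.
    labels : length-n sequence of group labels, one per population.
    """
    D = np.asarray(D, dtype=float)
    labels = np.asarray(labels)
    groups = np.unique(labels)
    # Binary membership matrix M (n x k): M[i, v] = 1 if population i is in group v.
    M = (labels[:, None] == groups[None, :]).astype(float)
    # Numerator per group: sum_i sum_j M[i, v] * M[j, v] * D[i, j].
    within = np.einsum("iv,jv,ij->v", M, M, D)
    sizes = M.sum(axis=0)  # number of populations in each group
    return float((within / sizes).sum())
```

For four hypothetical populations at positions 0, 1, 10, and 11 on a line, this cost falls from 21 (one group) to 2 (groups {0, 1} and {10, 11}) to 0 (four singleton groups), which illustrates why the cost alone cannot be used to choose k.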
Implementation of the search algorithm 
Testing all group memberships at different cluster sizes 
can generate large numbers of combinations. Structure 
detection through partitioning is considered a com- 
binatorial optimization problem because visits to all 
combinations are computationally intensive. There is no 
guarantee of finding the optimal solution in a reason- 
able amount of time because the number of computa- 
tions grows rapidly with increasing data (Puzicha et 
al., 1999). We describe two search methods that have 
been used for these data: simple random search and 
complete search. 
Simple random search, a random set-partition as- 
signment of the binary matrix, is an obvious way to 
visit combinations of group memberships, where i = 1 
to n such that 
M(i,rand_v) = 1. (3) 
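
A simple random search of this kind might look like the following sketch. The cost used here (within-group pairwise distance sum divided by group size, summed over groups), the iteration count, the rule of skipping draws that leave a group empty, and the name `random_search` are all assumptions made for illustration.

```python
import numpy as np


def random_search(D, k, n_iter=2000, seed=0):
    """Draw random group assignments and keep the lowest-cost one."""
    rng = np.random.default_rng(seed)
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    best_cost, best_labels = np.inf, None
    for _ in range(n_iter):
        # Each population i is placed in one randomly chosen group v,
        # i.e. M(i, rand_v) = 1.
        labels = rng.integers(0, k, size=n)
        if len(np.unique(labels)) < k:
            continue  # assumption: skip draws that leave a group empty
        M = np.eye(k)[labels]  # n x k binary membership matrix
        within = np.einsum("iv,jv,ij->v", M, M, D)
        cost = float((within / M.sum(axis=0)).sum())
        if cost < best_cost:
            best_cost, best_labels = cost, labels.copy()
    return best_cost, best_labels
```

Because assignments are drawn independently, the same partition can be visited many times, and there is no guarantee that the best partition is ever drawn.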
Alternatively, all n × k combinations can be visited as a
complete list of set partitions where, for example, three
populations can be partitioned into the form
ABC   AB|C   AC|B   A|BC   A|B|C.
Set partitions are the union of nonempty disjoint subsets 
called blocks, where restricted growth strings (RGS) 
(strings of numbers used as a convenient way to repre- 
sent partitions) were used to generate all blocks (Knuth, 
2005). We called visits to all partitions while minimizing 
the cost function (Eq. 1), partitioning optimization using 
restricted growth strings (PORGS). The number of ways 
n populations can be partitioned into these nonempty 
sets is called the Bell number (Rota, 1964; Cameron, 
1994). The total number of set partitions is the nth Bell
number, and the number of set partitions for each k is 
determined by the Stirling number of the second kind 
(Cameron, 1994). 
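
To make the link between restricted growth strings and set partitions concrete, the sketch below enumerates every RGS of length n, where the first entry is 0 and each later entry is at most one greater than the largest entry before it. The function names and the optional `max_blocks` argument are illustrative assumptions; this is not the authors' code.

```python
def restricted_growth_strings(n, max_blocks=None):
    """Yield each RGS of length n as a tuple of block indices.

    If max_blocks is given, only strings using at most that many
    blocks are produced.
    """
    if max_blocks is None:
        max_blocks = n

    def extend(prefix, current_max):
        if len(prefix) == n:
            yield tuple(prefix)
            return
        # The next entry may reuse an existing block or open one new block.
        highest = min(current_max + 1, max_blocks - 1)
        for value in range(highest + 1):
            prefix.append(value)
            yield from extend(prefix, max(current_max, value))
            prefix.pop()

    if n > 0:
        yield from extend([0], 0)


def to_blocks(rgs, names):
    """Render an RGS as blocks of names, e.g. (0, 1, 0) -> 'AC|B'."""
    blocks = {}
    for name, block in zip(names, rgs):
        blocks.setdefault(block, []).append(name)
    return "|".join("".join(blocks[b]) for b in sorted(blocks))


partitions_of_three = [to_blocks(s, "ABC") for s in restricted_growth_strings(3)]
# -> ['ABC', 'AB|C', 'AC|B', 'A|BC', 'A|B|C']
```

Counting the strings gives the Bell number (5 for n = 3, 52 for n = 5), and counting only those with exactly k distinct entries gives the Stirling number of the second kind (7 for n = 4, k = 2).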
Set partitions determined by RGS were used to con- 
figure the binary matrix to assign group membership. 
Although RGS can visit all possible partitions, they 
can also be used to generate partitions with “at most” 
r blocks (Knuth, 2005). This reduced search space allows
bipartition (bi-PORGS) (r = 2) such that an optimum
split can be determined one partition at a time.
Information from prior group membership is used to
restrict future searches, where
M(i, v) = 1 for i = 1 to l, where v = 1 or 2. (4)
A nested search occurs when all subgroups are sorted in 
descending order, and block combinations are selected 
when the cost function is minimized. Computational 
search time is reduced with the bi-PORGS method, thus 
allowing partitioning of larger sets of data. 
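
One possible reading of the bipartition search is sketched below: every existing group is split exhaustively into two blocks, and the split that gives the lowest total cost is kept, until k groups exist. The selection rule, the cost (within-group pairwise distance sum divided by group size), and the names `group_cost`, `best_bipartition`, and `bi_porgs` are assumptions for illustration and may differ from the published implementation.

```python
from itertools import product

import numpy as np


def group_cost(D, members):
    """Within-group pairwise distance sum divided by group size."""
    idx = np.asarray(members)
    return float(D[np.ix_(idx, idx)].sum() / len(idx))


def best_bipartition(D, members):
    """Try every split of `members` into two nonempty blocks."""
    best = (np.inf, None, None)
    first, rest = members[0], members[1:]
    # With at most two blocks, the first member is fixed in block 0 and
    # each remaining member is in block 0 or block 1.
    for bits in product((0, 1), repeat=len(rest)):
        if not any(bits):
            continue  # all zeros is the unsplit group
        left = [first] + [m for m, b in zip(rest, bits) if b == 0]
        right = [m for m, b in zip(rest, bits) if b == 1]
        cost = group_cost(D, left) + group_cost(D, right)
        if cost < best[0]:
            best = (cost, left, right)
    return best


def bi_porgs(D, k):
    """Greedy nested bipartition until k groups exist (assumed rule)."""
    D = np.asarray(D, dtype=float)
    groups = [list(range(D.shape[0]))]
    while len(groups) < k:
        candidates = []
        for g, members in enumerate(groups):
            if len(members) < 2:
                continue  # singletons cannot be split
            split_cost, left, right = best_bipartition(D, members)
            others = sum(group_cost(D, m) for i, m in enumerate(groups) if i != g)
            candidates.append((others + split_cost, g, left, right))
        if not candidates:
            break
        _, g, left, right = min(candidates, key=lambda c: c[0])
        groups[g:g + 1] = [left, right]  # earlier splits are never revisited
    return sum(group_cost(D, m) for m in groups), groups
```

Each split of a group of size m examines 2^(m-1) - 1 candidates rather than all set partitions, which is where the reduction in search time comes from; the trade-off is that earlier splits constrain later ones, so the result is not guaranteed to match a complete search.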
The gap statistic 
The objective of this analysis was to find an optimum 
number of groups, as well as the optimum partition 
solution, for k groups. Although there is no one criterion 
for deciding how many groups should be chosen to best 
represent the data, one guiding principle is that the 
appropriate number occurs when additional groups do 
not substantially change within-cluster dispersion. The
gap statistic compares within-cluster dispersion with that
expected under an appropriate reference null distribution,
following the methods of Tibshirani et al. (2001), such that

$$\mathrm{Gap}_n(k) = E^{*}_{n}\{\log(\downarrow CF)\} - \log(\downarrow CF), \qquad (5)$$

where ↓CF = the observed values from the minimized
cost function for each k; and
$E^{*}_{n}\{\log(\downarrow CF)\}$ = the log of the expected values from the
reference distribution for each k.
The gap statistic is largest when the observed values 
fall the farthest below the reference curve. The esti- 
mate of the optimum number of groups will be the 
value where additional groups do not increase the gap 
statistic. The expected values for the reference dis- 
tribution are generated by taking the mean PORGS 
values from bootstrapping the proximity matrix. Essen- 
tially, the mean values from the bootstrapped matrices 
remove the stock structure component from the refer- 
ence data. 
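
The gap calculation itself is short once the minimized costs are available. The sketch below assumes that the observed minimized cost for each k and a set of reference costs (one row per bootstrapped matrix, one column per k) have already been computed; how the proximity matrix is bootstrapped is not reproduced here. It takes the mean of the logged reference costs, as in Tibshirani et al. (2001), and applies the stopping rule described above. Function and argument names are hypothetical.

```python
import numpy as np


def gap_statistic(observed_cf, reference_cf):
    """Gap(k) = mean over references of log(reference cost) - log(observed cost).

    observed_cf  : shape (K,), minimized cost for each candidate k.
    reference_cf : shape (B, K), minimized cost for each of B reference
                   matrices and each candidate k.
    All costs must be positive, so k = n (where the cost is 0) is excluded.
    """
    observed = np.asarray(observed_cf, dtype=float)
    reference = np.asarray(reference_cf, dtype=float)
    return np.log(reference).mean(axis=0) - np.log(observed)


def first_k_without_gain(gap, k_values):
    """Return the first k after which adding a group does not increase the gap."""
    for idx in range(len(gap) - 1):
        if gap[idx + 1] <= gap[idx]:
            return k_values[idx]
    return k_values[-1]
```

With made-up observed costs of 100, 10, 8, and 7 for k = 1 to 4 and reference costs of 100, 60, 40, and 30, the gap peaks at k = 2, so `first_k_without_gain` would return 2.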
Simulated data 
Simulated data were used to validate the PORGS method 
by comparing the known distribution of data points with 
