Fishery Bulletin 112(2-3) 
the individuals from each population in the mixture sample
with 2 different techniques (“cross-validation over gene
copies” [CV-GC] and K-fold cross-validation [K-fold], see
next paragraph); 3) calculating the maximum likelihood
estimator (MLE) of the mixture proportions for all the
populations from the simulated sample through use of the
baseline, which contains all training and holdout
individuals; and 4) estimating the mixing proportion of each
reporting unit by summing the mixing proportion estimates of
its constituent populations. For each of the 20 values of
the mixing proportion vectors, 20,000 replicates were
conducted with CV-GC, and 1000 replicates were conducted
with K-fold. For both methods, the 5% and 95% quantiles of
the distribution of the MLE of reporting-unit proportions
were calculated from the replicates for each mixing
proportion vector.
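
Step 4 and the quantile summary can be illustrated with a minimal sketch. The function names, the dictionary layout, and the linear-interpolation quantile rule are assumptions made for illustration; the source does not state which quantile definition was used.

```python
from collections import defaultdict


def reporting_unit_proportions(pop_props, pop_to_unit):
    """Sum population-level mixing proportions within each reporting unit."""
    totals = defaultdict(float)
    for pop, prop in pop_props.items():
        totals[pop_to_unit[pop]] += prop
    return dict(totals)


def quantile(values, q):
    """Empirical quantile by linear interpolation between order statistics
    (an assumed convention, not necessarily the one used in the study)."""
    xs = sorted(values)
    if not xs:
        raise ValueError("no replicate estimates supplied")
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def quantile_interval(replicate_estimates, lower=0.05, upper=0.95):
    """5% and 95% quantiles of the replicate MLEs for one reporting unit."""
    return (quantile(replicate_estimates, lower),
            quantile(replicate_estimates, upper))
```

Each replicate would contribute one reporting-unit estimate to the list passed to `quantile_interval`.
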
Simulations were undertaken in 2 different ways. With CV-GC,
genotypes were simulated by randomly sampling gene copies
from the holdout set (to avoid high-grading bias), and those
same gene copies were removed from the baseline when
calculating the likelihood of population origin for the
simulated individual (see Anderson et al., 2008). With
K-fold, genotypes were simulated by drawing entire
individuals without replacement (a technique commonly
referred to as “jackknifing”) from the holdout set to form
the mixture sample. Those sampled individuals were not
included in the baseline, but all unsampled individuals from
the holdout set were included in the baseline for estimation
of the mixing proportions.
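
The difference between the two sampling schemes can be sketched for a single population and a single locus. This is only an illustration under stated assumptions: the function names and data layout are hypothetical, and drawing the two gene copies without replacement is an assumption, because the text does not specify that detail of CV-GC.

```python
import random
from collections import Counter


def kfold_style_draw(holdout_ids, n_mixture, rng):
    """Draw whole individuals without replacement for the mixture sample;
    the unsampled holdout individuals are returned to the baseline."""
    mixture = rng.sample(holdout_ids, n_mixture)
    drawn = set(mixture)
    returned_to_baseline = [i for i in holdout_ids if i not in drawn]
    return mixture, returned_to_baseline


def cvgc_style_draw(holdout_copies, baseline_counts, rng):
    """Sample two gene copies at one locus from the holdout set and remove
    those same copies from the baseline allele counts (assumed layout:
    a list of allele labels and a mapping of allele label to count)."""
    genotype = rng.sample(holdout_copies, 2)
    reduced = Counter(baseline_counts)
    reduced.subtract(genotype)
    return tuple(genotype), reduced
```

In both cases the key point is the same: the data used to build a simulated fish do not also contribute to the baseline used to assign it.
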
Mixed fishery samples 
Samples from 2090 salmon landed in fisheries in 2010 were
collected by the California Department of Fish and Game (now
Wildlife) at California ports. Just over half of these fish
carried CWTs that identified their population of origin. All
samples were genotyped with our panel of 96 loci.
Individuals successfully genotyped at fewer than 60 loci
were removed from further analysis. Failed genotypes were
ones that either clustered with negative controls during
scoring or fell outside of defined heterozygote and
homozygote clusters, likely indicating sample contamination
(Smith et al., 2011; Larson et al., 2013). We also used an
individual heterozygosity (iHz; the proportion of
heterozygous loci for each fish) criterion of iHz >0.56 to
identify and exclude potentially contaminated samples.
Simulations of contaminated genotypes, determined by using
observed allele frequencies, indicated little overlap in the
distribution of iHz for contaminated and uncontaminated
samples (data not shown) and that uncontaminated samples
rarely had iHz >0.56.
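
The two screening criteria can be written as a short sketch. The genotype representation (a list of allele pairs, with `None` for a failed locus) and the function names are assumptions, as is the choice to compute iHz over successfully genotyped loci only; the text does not state the denominator.

```python
def individual_heterozygosity(genotype):
    """Proportion of heterozygous loci among the loci that were called.
    Returns None when no locus was successfully genotyped."""
    called = [g for g in genotype if g is not None]
    if not called:
        return None
    n_het = sum(1 for a, b in called if a != b)
    return n_het / len(called)


def passes_quality_filters(genotype, min_loci=60, max_ihz=0.56):
    """Keep a fish only if it has at least min_loci called loci and its
    iHz does not exceed max_ihz (thresholds taken from the text)."""
    n_called = sum(g is not None for g in genotype)
    if n_called < min_loci:
        return False
    return individual_heterozygosity(genotype) <= max_ihz
```
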
We used the maximum likelihood framework in gsi_sim to
estimate the mixing proportion of different populations
among the 2090 fish, and then used that MLE as the prior for
calculation of the posterior probability of population of
origin for each fish. Posterior probabilities of origination
from different reporting units were obtained through
summation of the population-specific probabilities over all
populations in a reporting unit. Individuals were then
assigned to the reporting unit with the highest posterior
probability.
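
The estimate-then-assign sequence above can be sketched as follows. This is not the gsi_sim implementation: the use of an EM iteration to obtain the MLE, the function names, and the plain (non-log) likelihood matrix are all assumptions for illustration. With 96 loci, real genotype likelihoods would be small enough that a practical version would work on the log scale.

```python
def em_mixing_proportions(likelihoods, n_iter=200):
    """EM estimate of mixing proportions. likelihoods[i][k] is the
    probability of fish i's genotype given origin in population k."""
    n_pops = len(likelihoods[0])
    pi = [1.0 / n_pops] * n_pops
    for _ in range(n_iter):
        totals = [0.0] * n_pops
        for row in likelihoods:
            weights = [p * lik for p, lik in zip(pi, row)]
            norm = sum(weights)
            for k, w in enumerate(weights):
                totals[k] += w / norm
        pi = [t / len(likelihoods) for t in totals]
    return pi


def posterior_by_population(row, prior):
    """Posterior probability of each population for one fish."""
    weights = [p * lik for p, lik in zip(prior, row)]
    norm = sum(weights)
    return [w / norm for w in weights]


def assign_reporting_unit(row, prior, pop_units):
    """Sum population posteriors within reporting units and return the
    unit with the highest total, along with all unit posteriors."""
    unit_post = {}
    for k, p in enumerate(posterior_by_population(row, prior)):
        unit_post[pop_units[k]] = unit_post.get(pop_units[k], 0.0) + p
    best = max(unit_post, key=unit_post.get)
    return best, unit_post
```

Here `pop_units[k]` names the reporting unit of population `k`, and the estimated proportions are passed as `prior` when each fish is assigned.
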
Because all fish would be assigned to a maximum a posteriori
(MAP) population regardless of true origin, we employed a
simulation method similar to that in Cornuet et al. (1999),
but which was modified to account for missing data, to
detect fish that might have originated from a population
that was not in the baseline or that had an otherwise
aberrant genotype. Briefly, for each fish from the fishery
assigned to a population, the allele frequencies from the
MAP population were used to simulate 10,000 genotypes with
an identical pattern of missing data (if any) to that of the
fish that was assigned.
The log-probability of each simulated genotype was computed,
given that it came from the population it was simulated
from, and then the distribution of those values was compared
with the log-probability, L_a, of the actual assigned fish’s
genotype, given the allele frequencies in the MAP
population, on the basis of a z-score (L_a minus the mean of
the simulated values, all divided by the standard deviation
of the simulated values). The z-score calculation was done
conditional on the exact pattern of missing data and was
implemented in the C programming language as part of the
gsi_sim software. A low-confidence assignment was defined to
be one that had a z-score <3.0 and had either a reporting
unit posterior probability <0.9 or had fewer than 90 loci
successfully genotyped. Fish with low-confidence assignments
were left in an “unassigned” category.
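
The z-score procedure can be sketched in a few lines. This is a simplified illustration, not the C code in gsi_sim: it assumes biallelic loci, Hardy-Weinberg genotype probabilities, genotypes coded as 0, 1, or 2 copies of a reference allele (`None` for a missing locus), and allele frequencies strictly between 0 and 1. All names are hypothetical.

```python
import math
import random
import statistics


def genotype_log_prob(genotype, freqs):
    """Log-probability of a genotype given allele frequencies, skipping
    missing loci. Assumes Hardy-Weinberg proportions at each locus."""
    total = 0.0
    for g, p in zip(genotype, freqs):
        if g is None:
            continue
        q = 1.0 - p
        total += math.log((q * q, 2.0 * p * q, p * p)[g])
    return total


def simulate_genotype(freqs, missing_mask, rng):
    """Simulate one genotype with the same missing-data pattern."""
    return [None if miss else (rng.random() < p) + (rng.random() < p)
            for p, miss in zip(freqs, missing_mask)]


def assignment_z_score(genotype, freqs, n_sims=10000, seed=0):
    """(L_a - mean of simulated values) / SD of simulated values."""
    rng = random.Random(seed)
    mask = [g is None for g in genotype]
    sims = [genotype_log_prob(simulate_genotype(freqs, mask, rng), freqs)
            for _ in range(n_sims)]
    observed = genotype_log_prob(genotype, freqs)
    return (observed - statistics.fmean(sims)) / statistics.stdev(sims)
```

A genotype that is improbable under the allele frequencies of its MAP population yields a strongly negative z-score, which is the signal used to flag fish from unsampled populations or with aberrant genotypes.
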
Results 
Genotyping and basic population genetics 
We successfully genotyped 8031 samples from 69 populations
for the baseline and submitted the data to the Dryad Digital
Repository (http://doi.org/10.5061/dryad.5745sv). All
individuals were retained in the baseline, regardless of
missing data, because we desired a realistic representation
of missing data patterns for subsequent power analyses. One
locus failed to amplify entirely in the Copper River
population, and 3 loci failed in the Coho Salmon sample.
Unbiased estimates of heterozygosity (Nei, 1978) ranged from
0.194 in the Rapid River Hatchery stock of the Snake River
reporting unit to 0.381 in the Smith River population. The
Coho Salmon in the baseline had very low heterozygosity
(0.094). Observed heterozygosity and mean number of alleles
generally were lower for populations from north of the
Columbia River (Table 1), likely due to an ascertainment
bias resulting from the selection of SNPs with high MAFs in
California and Oregon populations.
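
For reference, the unbiased estimator of Nei (1978) for a single locus is 2n(1 - Σx_i²)/(2n - 1), where 2n is the number of gene copies sampled and x_i is the frequency of allele i. The sketch below computes it from allele counts and averages over loci; the function names and the input layout are assumptions, and loci that failed entirely in a population would need to be excluded before averaging.

```python
def unbiased_heterozygosity(allele_counts):
    """Nei's (1978) unbiased expected heterozygosity at one locus,
    computed from the counts of each allele among sampled gene copies."""
    copies = sum(allele_counts)
    if copies < 2:
        raise ValueError("need at least 2 gene copies")
    homozygosity = sum((c / copies) ** 2 for c in allele_counts)
    return copies / (copies - 1) * (1.0 - homozygosity)


def mean_unbiased_heterozygosity(loci_counts):
    """Average of the per-locus estimates across loci."""
    per_locus = [unbiased_heterozygosity(c) for c in loci_counts]
    return sum(per_locus) / len(per_locus)
```
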
