Blick and Hagen: Use of agreement measures and latent class models to assess the reliability of classifying thermally marked otoliths 3



$$\kappa = \frac{P_o - P_e}{1 - P_e},$$

where $P_e$ = expected agreement = $(n_{H\cdot}\,n_{\cdot H} + n_{W\cdot}\,n_{\cdot W})/n^2$. The divisor, $1 - P_e$, constrains $\kappa$ to be less than or equal to one, and if all agreement is due to chance ($P_o = P_e$), then $\kappa$ equals zero. Note that with $\kappa$, independence between readers is assumed in order to calculate expected agreement.
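As a minimal sketch of the calculation just described, the function below computes the observed agreement, the expected agreement, and kappa from the four cells of a two-reader table. It assumes the usual definition of observed agreement (the proportion of otoliths on which the two readers give the same call); the function name and the counts in the usage line are hypothetical and are not taken from the study.

```python
def agreement_indices(n_hh, n_hw, n_wh, n_ww):
    """Return (P_o, P_e, kappa) for a 2x2 table of paired readings.

    The first subscript is reader 1's call and the second is reader 2's,
    so n_hw counts otoliths called H by reader 1 and W by reader 2.
    """
    n = n_hh + n_hw + n_wh + n_ww
    # Observed agreement: both readers give the same call.
    p_o = (n_hh + n_ww) / n
    # Marginal totals for each reader.
    r1_h, r1_w = n_hh + n_hw, n_wh + n_ww
    r2_h, r2_w = n_hh + n_wh, n_hw + n_ww
    # Expected agreement if the two readers classified independently.
    p_e = (r1_h * r2_h + r1_w * r2_w) / n**2
    # Undefined (division by zero) when P_e equals 1, e.g. when both
    # readers put every otolith in the same class.
    kappa = (p_o - p_e) / (1 - p_e)
    return p_o, p_e, kappa


# Hypothetical counts, for illustration only.
p_o, p_e, kappa = agreement_indices(40, 10, 5, 45)
print(p_o, p_e, kappa)
```

With these illustrative counts the observed agreement is 0.85, the expected agreement is 0.5, and kappa is 0.7.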



An example of how agreement indices can be used to monitor readings is shown in Figure 1, which displays $\kappa$ and its standard error for 2874 chum salmon otolith readings divided into 27 groups based on different reader pairs and capture locations. Included are the $P_o$ values for four of the groups. The results indicate that $\kappa$ levels were similar among the different groups, suggesting overall consistency in readings, although some of the groups had lower values, which in practice would invite further investigation.



The $P_o$ values in Figure 1 have a different rank order than the $\kappa$ values. This apparent discrepancy highlights a potential problem in interpretation when using agreement indices to draw conclusions. To help illustrate this point, consider the following examples (Table 2). Table 2A is generated as the expected counts, given $\pi_{H|H} = 0.9$ and $\pi_{W|W} = 1.0$ for both readers, and $p = 0.1$. In this case, $P_o = 0.98$ and $\kappa = 0.89$. On the other hand, Table 2B is generated under the same assumptions except that $\pi_{H|H} = 0.5$. In this case $P_o$ drops only slightly to 0.95, whereas $\kappa$ drops to 0.47. Because the hatchery stock is rare, the inability of the readers to detect the mark is not well reflected by $P_o$, whereas $\kappa$ reflects it better by correcting for the high level of chance agreement.



Now let $\pi_{H|H} = 0.9$ and $\pi_{W|W} = 0.9$ for both readers, and $p = 0.5$ (Table 2C). In this case, $P_o = 0.82$ and $\kappa = 0.64$. On the other hand, Table 2D is generated under the same assumptions except that $p = 0.05$. In this case, $P_o$ remains unchanged at 0.82, but $\kappa$ drops to 0.25.
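The four scenarios can be checked numerically. The sketch below builds the expected cell proportions for two readers with identical accuracies and then computes both indices. It assumes that the readers err independently of one another given the true class of the otolith; that assumption is ours, although it is consistent with the values quoted in the text. The function and variable names are hypothetical.

```python
def expected_table(pi_hh, pi_ww, p):
    """Expected cell proportions (HH, HW, WH, WW) for two readers who share
    the accuracies pi_hh = P(call H | true H) and pi_ww = P(call W | true W),
    with p the proportion of hatchery otoliths.

    Assumes the readers' errors are independent given the true class.
    """
    hh = p * pi_hh**2 + (1 - p) * (1 - pi_ww)**2
    ww = p * (1 - pi_hh)**2 + (1 - p) * pi_ww**2
    # With identical readers the two disagreement cells are equal.
    hw = p * pi_hh * (1 - pi_hh) + (1 - p) * (1 - pi_ww) * pi_ww
    return hh, hw, hw, ww


def po_kappa(hh, hw, wh, ww):
    """Observed agreement and kappa from cell proportions that sum to one."""
    p_o = hh + ww
    # Product of the readers' marginal proportions, summed over H and W.
    p_e = (hh + hw) * (hh + wh) + (wh + ww) * (hw + ww)
    return p_o, (p_o - p_e) / (1 - p_e)


# (pi_hh, pi_ww, p) for each panel of Table 2, as given in the text.
scenarios = {
    "2A": (0.9, 1.0, 0.1),
    "2B": (0.5, 1.0, 0.1),
    "2C": (0.9, 0.9, 0.5),
    "2D": (0.9, 0.9, 0.05),
}

results = {
    label: po_kappa(*expected_table(*params))
    for label, params in scenarios.items()
}
for label, (p_o, kappa) in results.items():
    print(f"Table {label}: P_o = {p_o:.2f}, kappa = {kappa:.2f}")
```

Rounded to two decimals, the results match the values of observed agreement and kappa reported for Tables 2A through 2D.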



In none of the above examples is the index "wrong." Rather, as is the case with most indices, interpretation is affected by the values of the underlying parameters. In the latter example (Table 2, C-D), even though $P_o$ is the same for C and D, the scale it is being compared with has changed, thus changing the value of $\kappa$. This increases the difficulty of comparing $\kappa$ across populations with different underlying proportions. Note also that Table 2D could have been derived from $\pi_{H|H} = 0.5$ and $\pi_{W|W} = 0.944$ for both readers, and $p = 0.19$. Thus, without additional information, it is impossible to draw reliable conclusions about reader accuracies or the proportion of hatchery marks.
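The ambiguity can be illustrated directly. The sketch below computes the expected table under the two parameter sets mentioned for Table 2D and compares them cell by cell. As before, it assumes identical readers whose errors are independent given the true class; the names used are hypothetical.

```python
def expected_table(pi_hh, pi_ww, p):
    """Expected cell proportions (HH, HW, WH, WW) for two identical readers,
    assuming independent errors given the true class."""
    hh = p * pi_hh**2 + (1 - p) * (1 - pi_ww)**2
    ww = p * (1 - pi_hh)**2 + (1 - p) * pi_ww**2
    hw = p * pi_hh * (1 - pi_hh) + (1 - p) * (1 - pi_ww) * pi_ww
    return hh, hw, hw, ww


# Two quite different descriptions of the readers and the stock mixture.
table_a = expected_table(pi_hh=0.9, pi_ww=0.9, p=0.05)
table_b = expected_table(pi_hh=0.5, pi_ww=0.944, p=0.19)

# Largest difference between corresponding cells of the two tables.
max_diff = max(abs(a - b) for a, b in zip(table_a, table_b))

for name, table in (("accurate readers, rare mark", table_a),
                    ("poor detection, commoner mark", table_b)):
    print(name, [round(cell, 3) for cell in table])
print("largest cell difference:", round(max_diff, 4))
```

The two tables differ by less than 0.001 in every cell, so a sample of realistic size could not distinguish between the two explanations from the agreement table alone.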



Although agreement measures can be ambiguously interpreted, in practice they can still serve a useful monitoring role during routine comparisons when the circumstances of the readings are fairly well characterized. The interpretive difficulties with indices such as $\kappa$ and $P_o$ become apparent when trying to translate agreement measures into statements about the accuracy of different readers and about the influence of reading error on the contribution estimates.



Latent class models  An alternative approach is to try to estimate $\pi_{H|H}$ and $\pi_{W|W}$ for each reader, along with $p$. Although at first thought this may seem impossible, it can



[Figure 1 plot: $\kappa$ (vertical axis) against group number (horizontal axis).]

Figure 1

The values of $\kappa$ (±1 SE) from 27 groups of paired readings of chum salmon otoliths (total = 2874). The groups are based on pairs of different readers examining otoliths collected at different times and locations. The proportion of agreement ($P_o$) is shown next to groups 4, 7, 9, and 12 for comparison with the value of $\kappa$.



be shown that either by setting a few constraints or by collecting additional information, estimation is indeed possible. This problem falls into the category of latent class modeling (e.g. Everitt, 1984; Bartholomew, 1987; McCutcheon, 1987; Clogg, 1995). Latent class models (LCMs) belong to a family of latent variable models that hypothesize the existence of unobservable "latent" variables, about which information can be obtained only through measurements on observable "manifest" variables. LCMs specifically restrict the latent and manifest variables to be categorical. In the present situation, the latent variable is the true class (H or W) to which the otolith belongs, whereas the manifest variables are the readers' classifications. Such models have been used for assessing reliability of diagnostic tests in the medical field over the last 20 years (see Walter and Irwig, 1988; Formann, 1996, for reviews).



Returning to the problem with two readers, neither of whom is a standard, there are five essential parameters to estimate: $\pi^{1}_{H|H}$, $\pi^{2}_{H|H}$, $\pi^{1}_{W|W}$, $\pi^{2}_{W|W}$, and $p$, with only 3 df (four pieces of data, $n_{HH}$, $n_{HW}$, $n_{WH}$, $n_{WW}$, minus one because the sample size, $n$, is fixed). Thus, the model is overparameterized, and either constraints on the parameters or more data are needed. Possible constraints include 1) considering that two of the parameters are known (e.g. $\pi^{1}_{W|W} = \pi^{2}_{W|W} = 1$, i.e. both readers always call a wild stock correctly, so there are no "false positives"), or 2) considering that two sets of parameters are equal (e.g. $\pi^{1}_{H|H} = \pi^{2}_{H|H}$ and $\pi^{1}_{W|W} = \pi^{2}_{W|W}$, i.e. the accuracy rates are the same for both readers).
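To make the first constraint concrete, the sketch below shows one way the three remaining parameters could be recovered once false positives are ruled out. It is our own illustration, not a method taken from the source. It additionally assumes that the readers err independently given the true class, so that every otolith called H by either reader is truly H. The function name is hypothetical, and the counts are the expected counts for the Table 2A scenario scaled to 1000 otoliths.

```python
def estimate_no_false_positives(n_hh, n_hw, n_wh, n_ww):
    """Moment-type estimates of (pi1_hh, pi2_hh, p) when both readers are
    assumed never to call a wild otolith H (pi_ww = 1 for both readers).

    The first subscript is reader 1's call, the second is reader 2's.
    Requires n_hh > 0.
    """
    n = n_hh + n_hw + n_wh + n_ww
    # Otoliths that reader 2 called H are all truly H under the constraint,
    # so the share of them that reader 1 also called H estimates pi1_hh.
    pi1_hh = n_hh / (n_hh + n_wh)
    # The same argument with the readers' roles swapped.
    pi2_hh = n_hh / (n_hh + n_hw)
    # P(reader 1 says H) = p * pi1_hh and P(reader 2 says H) = p * pi2_hh,
    # while P(both say H) = p * pi1_hh * pi2_hh; the ratio below isolates p.
    p = (n_hh + n_hw) * (n_hh + n_wh) / (n * n_hh)
    return pi1_hh, pi2_hh, p


# Expected counts for pi_hh = 0.9, pi_ww = 1.0, p = 0.1 and n = 1000.
pi1_hh, pi2_hh, p_hat = estimate_no_false_positives(81, 9, 9, 901)
print(pi1_hh, pi2_hh, p_hat)
```

With three free parameters and 3 df the constrained model is just identified, which is why the generating values (0.9, 0.9, and 0.1) are recovered from the expected counts. The form of the estimate of $p$ resembles a two-sample mark-recapture estimator, with the two readers playing the role of the two samples.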



Although there may be times when such constraints are 

 realistic, in general they will not be; therefore more infor- 



