Abstract
Current interrater reliability (IRR) coefficients ignore the nested structure of multilevel observational data, resulting in biased estimates of both subject- and cluster-level IRR. We used generalizability theory to provide a conceptualization and estimation method for the IRR of continuous multilevel observational data. We explain how generalizability theory decomposes the variance of multilevel observational data into subject-, cluster-, and rater-related components, which can be estimated using Markov chain Monte Carlo (MCMC) estimation. We explain how IRR coefficients for each level can be derived from these variance components, and how they can be estimated as intraclass correlation coefficients (ICCs). We assessed the quality of the MCMC point and interval estimates with a simulation study, and showed that small numbers of raters were the main source of bias and inefficiency in the ICCs. In a follow-up simulation, we showed that a planned missing data design can diminish most estimation difficulties under these conditions, yielding a useful approach to estimating multilevel interrater reliability for most social and behavioral research. We illustrated the method using data on student–teacher relationships. All software code and data used for this article are available on the Open Science Framework: https://osf.io/bwk5t/.

Translational Abstract
Observational studies in social and behavioral science often have a multilevel structure, with subjects nested within clusters. To inspect the quality of rating procedures, and to improve these where necessary, interrater reliability (IRR) should be defined separately for the subject level and the cluster level of the data. In this article, we propose a method to assess IRR for multilevel continuous ratings provided by two or more raters. We explain how generalizability theory can be used to decompose the variance of multilevel observational data into subject-, cluster-, and rater-related components, and how IRR coefficients for each level can be derived from these variance components and estimated as intraclass correlation coefficients (ICCs). We assessed the quality of the proposed estimation procedure with a simulation study, and showed that small numbers of raters were the main source of bias and inefficiency in the ICCs. In a follow-up simulation, we showed that a planned missing data design can diminish most estimation difficulties of the ICCs, yielding a useful approach to estimating multilevel interrater reliability for most social and behavioral research. We illustrated the use of the proposed ICCs with data on student–teacher relationships. All software code and data used for this article are available on the Open Science Framework: https://osf.io/bwk5t/.
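To make the decomposition concrete, the sketch below shows one common generalizability-theory form for a fully crossed two-level rater design. This is an illustrative assumption on our part; the article's model may include additional facets or interaction terms.

```latex
% Illustrative two-level model: subjects i nested in clusters j, all
% rated by the same raters r (an assumed form, not necessarily the
% article's exact specification):
%
%   Y_{ijr} = \mu + c_j + s_{ij} + r_r + (cr)_{jr} + (sr)_{ijr}
%
% c_j: cluster effect; s_{ij}: subject-within-cluster effect;
% r_r: rater main effect; the interaction terms capture rater
% disagreement at each level (with (sr) also absorbing residual
% error). Level-specific IRR coefficients are then intraclass
% correlations:
\[
\mathrm{ICC}_{\text{subject}} = \frac{\sigma^{2}_{s}}{\sigma^{2}_{s} + \sigma^{2}_{sr}},
\qquad
\mathrm{ICC}_{\text{cluster}} = \frac{\sigma^{2}_{c}}{\sigma^{2}_{c} + \sigma^{2}_{cr}}
\]
```

Because the variance components are estimated with MCMC, interval estimates for the ICCs follow directly from the posterior draws. A minimal sketch, assuming you already have posterior draws of the relevant components; the variable names and Gamma draws below are hypothetical placeholders, not values from the article's analysis:

```python
import numpy as np

# Hypothetical posterior draws of two variance components from an MCMC
# fit (placeholder Gamma draws stand in for real posterior samples).
rng = np.random.default_rng(1)
n_draws = 4000
var_subject = rng.gamma(shape=50, scale=0.010, size=n_draws)    # sigma^2_s
var_sub_rater = rng.gamma(shape=50, scale=0.005, size=n_draws)  # sigma^2_sr

# Subject-level ICC computed draw by draw, so the ICC posterior
# reflects the joint uncertainty of numerator and denominator.
icc_subject = var_subject / (var_subject + var_sub_rater)

point = np.median(icc_subject)                    # posterior point estimate
lo, hi = np.percentile(icc_subject, [2.5, 97.5])  # 95% credible interval
print(f"subject-level ICC: {point:.3f} [{lo:.3f}, {hi:.3f}]")
```

Forming the ratio per draw, rather than plugging in point estimates of the variance components, is what makes the resulting credible interval a genuine interval estimate for the ICC.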
| Original language | English |
| --- | --- |
| Pages (from-to) | 650-666 |
| Journal | Psychological Methods |
| Volume | 27 |
| Issue number | 4 |
| DOIs | |
| Publication status | Published - 2021 |
| Externally published | Yes |
Funding
This work was partly supported by the Dutch Research Council (NWO), Project 016.Veni.195.457, awarded to Terrence D. Jorgensen. We thank Debora Roorda and Marjolein Zee for sharing their multiple-rater data for our example analyses, and SURFsara (www.surfsara.nl) for their support in using the Lisa National Compute Cluster to conduct our Monte Carlo simulations.
| Funders | Funder number |
| --- | --- |
| Nederlandse Organisatie voor Wetenschappelijk Onderzoek | 016.Veni.195.457 |