Computing Inter-Rater Reliability And Its Variance In The Presence Of High Agreement

In other words, the paradox arises when the subjects under study tend to be classified into one of the possible outcome categories, either because of the nature of the outcome itself and its high prevalence, or because at least one of the raters tends to favour a given category. We conclude that it is still preferable to use the AC1 statistic, so as to avoid the risk of incurring the paradox and drawing false conclusions from the agreement analysis.

During a literature review [29], we asked a panel of three reviewers to judge the quality of 57 randomised controlled trials (RCTs), with each study scored using the Jadad scale [9]. This scale assigns a study a score from zero to five and assesses the presence and appropriateness of double blinding, the presence and appropriateness of randomisation, and the reporting of withdrawals and dropouts. An RCT is considered of good quality if it scores 3 or more. To investigate certain aspects of the design, the reviewers were also asked to classify each study according to the type of randomisation unit (individual or community), the type of design adopted (parallel, factorial or crossover) and the type of primary endpoint (binary, continuous, survival or other). The classifications of the three reviewers are presented in Table 2, where the Jadad score was dichotomised to distinguish studies of good quality (≥ 3) from those of lower quality (< 3).

In statistics, inter-rater reliability (also referred to as inter-rater agreement, inter-rater concordance or inter-observer reliability) is the degree of consistency among raters; it is an assessment of the homogeneity or consensus in the ratings given by different judges. The values of Cohen's kappa statistic would lead to the conclusion that agreement on the Unit, Design and Primary endpoint variables is completely unsatisfactory. However, a simple look at the corresponding values of the observed agreement is enough to reveal the presence of the paradox.
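To make the paradox concrete, the following minimal sketch computes the observed proportion of agreement and Cohen's kappa from a two-rater contingency table (the multi-rater case follows the same logic). The counts used here are hypothetical and chosen only to reproduce the kind of prevalence imbalance discussed above; they are not the counts of Table 2.

```python
import numpy as np

def observed_agreement(table):
    """Observed proportion of agreement: the mass on the diagonal of the table."""
    table = np.asarray(table, dtype=float)
    return np.trace(table) / table.sum()

def cohens_kappa(table):
    """Cohen's kappa for two raters from a q x q contingency table."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    p_o = np.trace(table) / n
    row = table.sum(axis=1) / n      # marginal proportions of rater 1
    col = table.sum(axis=0) / n      # marginal proportions of rater 2
    p_e = float(np.dot(row, col))    # chance agreement under independence
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical, heavily unbalanced 2 x 2 table (NOT the counts of Table 2):
# almost every study falls into the first category for both raters.
table = [[54, 1],
         [2,  0]]

print(f"observed agreement = {observed_agreement(table):.3f}")  # about 0.95
print(f"Cohen's kappa      = {cohens_kappa(table):.3f}")        # about -0.02
```

Despite roughly 95% observed agreement, kappa is essentially zero, which is exactly the paradoxical behaviour described above.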

The most likely explanation for the onset of the paradox lies in the high frequencies, shown in Table 2, of the “Individual”, “Parallel” and “Continuous” categories of the Unit, Design and Primary endpoint variables. These values produce a high chance-agreement probability and, therefore, paradoxical values of the kappa statistic. In contrast, the AC1 statistic yields plausible values that are consistent with the corresponding values of the observed agreement. Cohen's kappa statistic [16] is the measure most frequently used in the literature, yet it is not universally applicable, because it suffers from a paradox that is well known in the literature [17-19]. Under particular conditions [20, 21], even in the presence of high inter-rater or intra-rater agreement, the kappa statistic tends to take low values, which often leads to the conclusion that there is no agreement.
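For completeness, the sketch below implements Gwet's AC1 for two raters from the same kind of contingency table, using the definition in which the chance-agreement term is based on the average marginal proportion of each category. Again, the table is hypothetical and purely illustrative, not the data of Table 2.

```python
import numpy as np

def gwet_ac1(table):
    """Gwet's AC1 for two raters from a q x q contingency table."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    q = table.shape[0]
    p_o = np.trace(table) / n
    # Average of the two raters' marginal proportions for each category.
    pi = (table.sum(axis=1) + table.sum(axis=0)) / (2.0 * n)
    # Gwet's chance-agreement term: it stays small when the category
    # marginals are very unbalanced, unlike kappa's chance term.
    p_e = float(np.sum(pi * (1.0 - pi))) / (q - 1)
    return (p_o - p_e) / (1.0 - p_e)

# Same hypothetical, heavily unbalanced table as above (NOT Table 2):
table = [[54, 1],
         [2,  0]]

print(f"Gwet's AC1 = {gwet_ac1(table):.3f}")  # about 0.94
```

With this heavily unbalanced table, AC1 stays close to the observed agreement (about 0.94) where kappa collapses towards zero. Under the same assumptions, an approximate variance for either statistic can be obtained by resampling the subjects (for example, a nonparametric bootstrap over the cells of the table), although closed-form variance estimators for AC1 are also available.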