Cohen’s Weighted Kappa

- Introduced by Jacob Cohen in Educational and Psychological Measurement in 1960; the weighted variant followed in 1968 in Psychological Bulletin.

- Measures inter-rater reliability between two fixed raters assigning categorical ratings; the weighted form suits ordinal categories, where some disagreements are more severe than others.

Interpretation:

Kappa quantifies the degree of agreement between the raters' categorical ratings, corrected for the agreement expected by chance.
Landis and Koch (1977) suggest the following interpretation of kappa (a small R helper mapping values to these labels follows the list):
below 0.00: Poor
between 0.00 and 0.20: Slight
between 0.21 and 0.40: Fair
between 0.41 and 0.60: Moderate
between 0.61 and 0.80: Substantial
above 0.80: Almost perfect
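
A minimal helper sketch mapping a kappa value onto these labels (the function name kappa_label and the use of cut() are illustrative, not from the original):

kappa_label <- function(k) {
  # breakpoints follow the Landis and Koch list above; intervals are right-closed
  cut(k,
      breaks = c(-Inf, 0, 0.20, 0.40, 0.60, 0.80, 1),
      labels = c("Poor", "Slight", "Fair", "Moderate", "Substantial", "Almost perfect"))
}
kappa_label(0.66)  # Substantial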

Assumptions:

The outcomes are categorical.

Each object is rated by the same two raters.

There cannot be more than two raters (for more raters, see Fleiss' kappa in the references).

The same set of categories is used for every object.

Formula:

\(\kappa_w = 1-\frac{\sum_{ij} w_{ij}\, p_{ij}}{\sum_{ij} w_{ij}\, e_{ij}}\)

\(w_{ij}\) = the disagreement weights;

\(p_{ij}\) = the observed probabilities;

\(e_{ij}\) = the expected probabilities under chance agreement.
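
The expected probabilities come from the row and column marginals, and the weights are conventionally linear or quadratic in the category distance (the choice is not fixed above; these are the two standard options), with zero on the diagonal so that only disagreements are penalized:

\(e_{ij} = p_{i\cdot}\, p_{\cdot j}\)

\(w_{ij} = |i-j|\) (linear) or \(w_{ij} = (i-j)^2\) (quadratic), with \(w_{ii} = 0\)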

Example in R:

library("irr")
## Loading required package: lpSolve
data("diagnoses")
diagnoses[1:2]
##                     rater1                  rater2
## 1              4. Neurosis             4. Neurosis
## 2  2. Personality Disorder 2. Personality Disorder
## 3  2. Personality Disorder        3. Schizophrenia
## 4                 5. Other                5. Other
## 5  2. Personality Disorder 2. Personality Disorder
## 6            1. Depression           1. Depression
## 7         3. Schizophrenia        3. Schizophrenia
## 8            1. Depression           1. Depression
## 9            1. Depression           1. Depression
## 10                5. Other                5. Other
## 11           1. Depression             4. Neurosis
## 12           1. Depression 2. Personality Disorder
## 13 2. Personality Disorder 2. Personality Disorder
## 14           1. Depression             4. Neurosis
## 15 2. Personality Disorder 2. Personality Disorder
## 16        3. Schizophrenia        3. Schizophrenia
## 17           1. Depression           1. Depression
## 18           1. Depression           1. Depression
## 19 2. Personality Disorder 2. Personality Disorder
## 20           1. Depression        3. Schizophrenia
## 21                5. Other                5. Other
## 22 2. Personality Disorder             4. Neurosis
## 23 2. Personality Disorder 2. Personality Disorder
## 24           1. Depression           1. Depression
## 25           1. Depression             4. Neurosis
## 26 2. Personality Disorder 2. Personality Disorder
## 27           1. Depression           1. Depression
## 28 2. Personality Disorder 2. Personality Disorder
## 29           1. Depression        3. Schizophrenia
## 30                5. Other                5. Other

Convert the data to a contingency table

tdiag <- table(diagnoses[1:2])
tdiag
##                          rater2
## rater1                    1. Depression 2. Personality Disorder
##   1. Depression                       7                       1
##   2. Personality Disorder             0                       8
##   3. Schizophrenia                    0                       0
##   4. Neurosis                         0                       0
##   5. Other                            0                       0
##                          rater2
## rater1                    3. Schizophrenia 4. Neurosis 5. Other
##   1. Depression                          2           3        0
##   2. Personality Disorder                1           1        0
##   3. Schizophrenia                       2           0        0
##   4. Neurosis                            0           1        0
##   5. Other                               0           0        4
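
The diagonal of this table holds the cases where the two raters agree, so a quick check (not part of the original output) of the raw observed agreement is:

sum(diag(tdiag)) / sum(tdiag)  # (7 + 8 + 2 + 1 + 4) / 30 = 22/30, about 0.733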

Calculate weighted Kappa

library(psych)
wkappa(tdiag)
## $kappa
## [1] 0.6511628
## 
## $weighted.kappa
## [1] 0.6554622
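
To tie the output back to the formula, here is a from-scratch sketch assuming quadratic disagreement weights \(w_{ij} = (i-j)^2\) (a common choice); with this weighting it reproduces the $weighted.kappa value above:

p <- tdiag / sum(tdiag)             # observed probabilities p_ij
e <- outer(rowSums(p), colSums(p))  # expected probabilities e_ij = p_i. * p_.j
w <- (row(p) - col(p))^2            # quadratic disagreement weights, 0 on the diagonal
1 - sum(w * p) / sum(w * e)         # 0.6554622, matching $weighted.kappa

With 0/1 weights (0 on the diagonal, 1 off it), the same formula reduces to the unweighted kappa reported as $kappa.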

References:

Cohen, J. (1960). "A Coefficient of Agreement for Nominal Scales." Educational and Psychological Measurement 20(1): 37–46.
Cohen, J. (1968). "Weighted Kappa: Nominal Scale Agreement with Provision for Scaled Disagreement or Partial Credit." Psychological Bulletin 70(4): 213–220.
Landis, J. R., and Koch, G. G. (1977). "The Measurement of Observer Agreement for Categorical Data." Biometrics 33: 159–174.
Fleiss, J. L. (1971). "Measuring Nominal Scale Agreement Among Many Raters." Psychological Bulletin 76(5): 378–382.
Fleiss, J. L., Levin, B., and Paik, M. C. (2003). Statistical Methods for Rates and Proportions. 3rd ed. John Wiley & Sons.