

Concise report
Reliability, sensitivity to change and feasibility of three radiographic scoring methods for hand osteoarthritis
  1. J Bijsterbosch1,
  2. I K Haugen2,
  3. C Malines3,
  4. E Maheu3,
  5. F R Rosendaal4,
  6. I Watt5,
  7. F Berenbaum3,
  8. T K Kvien2,
  9. D M van der Heijde1,2,
  10. T W J Huizinga1,
  11. M Kloppenburg1
  1. Department of Rheumatology, Leiden University Medical Center, Leiden, The Netherlands
  2. Department of Rheumatology, Diakonhjemmet Hospital, Oslo, Norway
  3. Department of Rheumatology, Pierre & Marie Curie University, AP-HP Hospital Saint-Antoine, Paris, France
  4. Department of Clinical Epidemiology, Leiden University Medical Center, Leiden, The Netherlands
  5. Department of Radiology, Leiden University Medical Center, Leiden, The Netherlands
Correspondence to Jessica Bijsterbosch, Department of Rheumatology, Leiden University Medical Center, C1-R, PO Box 9600, 2300 RC Leiden, The Netherlands; J.Bijsterbosch{at}


Objective To compare the reliability, sensitivity to change and feasibility of three radiographic scoring methods for hand osteoarthritis (OA).

Methods Baseline, 2-year and 6-year hand radiographs of 90 patients with hand OA were read in triplicate in chronological order by three readers from different European centres using the OARSI atlas (OARSI), Kellgren–Lawrence grading scale (KL) and Verbruggen–Veys anatomical phase score (VV). Reliability was determined using intraclass correlation coefficients and smallest detectable change (SDC). Sensitivity to change was assessed by the proportion of progression above the SDC. Feasibility was reflected by the mean performance time.

Results Intra- and inter-reader reliability was similar across methods. Inter-reader SDCs (% maximum score) for KL, OARSI and VV were 2.9 (3.2), 4.1 (2.9) and 2.7 (1.8) over 2 years and 3.8 (4.1), 4.6 (3.3) and 4.0 (2.5) over 6 years, respectively. KL detected a slightly higher proportion of progression. There were differences between readers, despite methods to enhance consistency. The mean performance time (SD, minutes) for KL, OARSI and VV was 4.3 (2.5), 9.3 (6.0) and 2.8 (1.5), respectively.

Conclusion Methods had comparable reliability and sensitivity to change. Global methods were fastest to perform. For multicentre trials use of a central reading centre and multiple readers may minimise inter-reader variation.



Despite the high prevalence and health impact of hand osteoarthritis (OA), no structure-modifying treatments exist.1 2 The development of these treatments implies the need for reliable and sensitive outcome measures.3 Structural damage is considered a primary outcome, with serial radiographs as the recommended outcome measure. Various radiographic scoring methods exist to assess severity and progression of structural damage.4–10 They differ with respect to the number of hand joints scored, the use of a global score as opposed to grading of individual radiographic features, the radiographic features scored and the grading of features. There is no consensus on the preferred method, but owing to these differences the choice of method may depend on the study objective.

Only one previous study has compared scoring methods for hand OA, which was over a relatively short period of 1 year.11 In order to gain further insight into the clinimetric properties of available scoring methods, we assessed the reliability, sensitivity to change and feasibility of three radiographic scoring methods for the assessment of hand OA over a period of 2 and 6 years.

Patients and methods

Study design and patient population

Patients were participants in the Genetics ARthrosis and Progression (GARP) study, comprising 192 Caucasian sib pairs with symptomatic OA at multiple sites in the hand or in at least two of the following sites: hand, knee, hip or spine. Patients were evaluated at baseline and some of them after 2 and 6 years. Details on the recruitment and selection have been published elsewhere.12 The study was approved by the medical ethics committee.

Patients were eligible for this study if they had hand OA defined by the American College of Rheumatology criteria for clinical hand OA13 or if structural abnormalities were present and if baseline, 2-year and 6-year radiographs were available. From this group a sample of 90 patients was included to ensure variability in baseline and progression scores based on a previous study.14 See supplementary online appendix 1 for more information on inclusion and sampling.

Radiographs and scoring methods

Standardised hand radiographs (dorsal-volar) were obtained at baseline and follow-up by a single radiographer.

With the Kellgren–Lawrence grading scale (KL),6 10 a global score, the distal interphalangeal (DIP) joints, proximal interphalangeal (PIP) joints, interphalangeal thumb (IP-1) joints, metacarpophalangeal (MCP) joints and first carpometacarpal (CMC-1) joints were graded 0–4 as described in the atlas (0=no OA; 1=doubtful OA; 2=definite minimal OA; 3=moderate OA; 4=severe OA). Total scores range from 0 to 120.

Using the OARSI atlas (OARSI)4 individual radiographic features were graded. Osteophytes (0–3), joint space narrowing (JSN) (0–3), subchondral erosions (0–1), sclerosis (0–1) and malalignment (0–1) were assessed in the DIP, PIP, IP-1 and CMC-1 joints. Pseudowidening (0–1) was assessed in the DIP joints and cysts (0–1) were assessed in the PIP and CMC-1 joints. Total scores range from 0 to 198.

The Verbruggen–Veys anatomical phase score (VV)9 comprises five phases with a numerical value representing the evolution of hand OA: N=normal joint; S=stationary OA with osteophytes and JSN; J=complete loss of joint space in the whole or part of the joint; E=subchondral erosion; R=remodelling of subchondral plate. The DIP, PIP, IP-1 and MCP joints were assessed. This score ranges from 0 to 218.4.
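
The stated score ranges can be checked from the joint groups and per-joint maxima given above. The sketch below reproduces that arithmetic under one assumption that the text does not spell out: the usual joint counts for two hands (8 DIP, 8 PIP, 2 IP-1, 10 MCP, 2 CMC-1). The VV per-joint maximum is not given in the text either; here it is only inferred by dividing the stated total by the assumed number of joints.

```python
# Assumed joint counts for two hands (not stated explicitly in the text).
JOINTS = {"DIP": 8, "PIP": 8, "IP-1": 2, "MCP": 10, "CMC-1": 2}


def n_joints(groups):
    """Total number of joints in the listed joint groups."""
    return sum(JOINTS[g] for g in groups)


# KL: one global grade of 0-4 for every joint in all five groups.
kl_max = 4 * n_joints(["DIP", "PIP", "IP-1", "MCP", "CMC-1"])

# OARSI: osteophytes (3) + JSN (3) + erosions (1) + sclerosis (1)
# + malalignment (1) in DIP, PIP, IP-1 and CMC-1 joints, plus
# pseudowidening (1) in DIP joints and cysts (1) in PIP and CMC-1 joints.
core_per_joint = 3 + 3 + 1 + 1 + 1
oarsi_max = (
    core_per_joint * n_joints(["DIP", "PIP", "IP-1", "CMC-1"])
    + 1 * n_joints(["DIP"])
    + 1 * n_joints(["PIP", "CMC-1"])
)

# VV: the stated maximum of 218.4 over DIP, PIP, IP-1 and MCP joints
# implies this per-joint maximum (an inference, not a quoted value).
vv_joints = n_joints(["DIP", "PIP", "IP-1", "MCP"])
vv_per_joint_max = 218.4 / vv_joints

print(kl_max, oarsi_max, vv_joints, round(vv_per_joint_max, 1))
```

Under these assumed joint counts the KL and OARSI maxima agree with the ranges reported in the text.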

Reading procedures

Radiographs of all time points were read simultaneously in chronological order, blinded to patient characteristics, by three readers (JB, IKH, CM) from three European centres working independently. Readers attended a training session before starting the study. A standard set of radiographs with scores was available for individual practice.

For assessment of intrareader reliability a random sample of 40 sets of radiographs was rescored with each method.

To randomise patients as well as methods a random number was assigned to each possible patient–scoring method combination, resulting in 390 combinations ((90 sets + 40 sets for intrareader reliability)×3 methods). To avoid mistakes and confusion because of frequent switching between methods, we grouped scoring methods for each 10 sets of radiographs.
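
The allocation described above can be sketched as follows. This is only an illustration: the function name, the seed and the exact way blocks are formed are assumptions, since the text states only that each patient–method combination received a random number and that methods were grouped per 10 sets of radiographs.

```python
import random

METHODS = ["KL", "OARSI", "VV"]


def build_reading_order(n_sets=90, n_repeat=40, block_size=10, seed=0):
    """Return a list of blocks; each block holds up to `block_size`
    (patient, reading, method) combinations scored with one method."""
    rng = random.Random(seed)

    # 90 original sets plus 40 randomly chosen sets to be rescored.
    originals = [(pid, "first") for pid in range(1, n_sets + 1)]
    repeats = [(pid, "repeat")
               for pid in rng.sample(range(1, n_sets + 1), n_repeat)]

    # Every set x method combination: (90 + 40) x 3 = 390.
    combos = [(pid, reading, method)
              for pid, reading in originals + repeats
              for method in METHODS]

    # Assign a random number to each combination and sort on it.
    shuffled = sorted(combos, key=lambda _: rng.random())

    # Hypothetical grouping rule: chunk each method's combinations into
    # blocks of 10 sets, then put the blocks in random order.
    blocks = []
    for method in METHODS:
        own = [c for c in shuffled if c[2] == method]
        blocks += [own[i:i + block_size]
                   for i in range(0, len(own), block_size)]
    rng.shuffle(blocks)
    return blocks


blocks = build_reading_order()
print(len(blocks), sum(len(b) for b in blocks))
```

With these defaults a reader switches method at most once every 10 sets, while the order of patients and methods remains random.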

Statistical analysis

To evaluate intra- and inter-reader reliability for status scores, intraclass correlation coefficients (ICCs) were estimated. For change scores measurement error due to intrareader and inter-reader variability was assessed by estimating the smallest detectable change (SDC).15 Sensitivity to change was assessed by the percentage of progression above the SDC. This analysis was done for all joints together and for separate joint groups (DIP/PIP, MCP and CMC-1 joints, see online supplement). Feasibility was determined by the mean scoring time of three time points for all readers together. The relationship between radiographic scores and performance time was assessed using linear regression analysis.
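
As an illustration of the SDC and the proportion of progression above it, the sketch below uses one commonly cited formulation, SDC = 1.96 × SD of the paired differences in change scores / (√2 × √k), where k is the number of readings averaged. The text does not give the formula used, so this formulation is an assumption, and the change scores are hypothetical.

```python
from math import sqrt
from statistics import stdev


def smallest_detectable_change(change_a, change_b, k=1):
    """SDC from two sets of change scores for the same patients.

    Assumed formulation: 1.96 * SD(differences) / (sqrt(2) * sqrt(k)),
    with k the number of readings whose mean is used in the analysis.
    """
    diffs = [a - b for a, b in zip(change_a, change_b)]
    return 1.96 * stdev(diffs) / (sqrt(2) * sqrt(k))


def proportion_progressed(change_scores, sdc):
    """Fraction of patients whose change score exceeds the SDC."""
    return sum(c > sdc for c in change_scores) / len(change_scores)


# Hypothetical change scores for eight patients from two readers.
reader_1 = [0, 2, 5, 1, 8, 0, 3, 6]
reader_2 = [1, 2, 3, 1, 9, 0, 2, 4]

sdc = smallest_detectable_change(reader_1, reader_2)
progressed = proportion_progressed(reader_1, sdc)
print(round(sdc, 2), progressed)
```

Because the SDC is expressed in the units of the score itself, it can also be reported as a percentage of the maximum score to compare methods with different ranges.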


Results

At baseline the mean age was 60.2 years and 70 patients (78%) were female. The observed status and change scores are shown in online supplementary appendix 2. There were differences between readers, especially for change scores.

Intrareader and inter-reader ICCs for status scores were high with little difference between methods (table 1, online supplementary appendix 2 for separate joint groups). For change scores the intrareader SDCs were good, with reader 3 showing higher SDCs than the other two readers (table 2). Over both follow-up periods the method with the best reliability varied between readers. Inter-reader SDCs were lowest for VV, although differences from the other methods were small (table 2). Looking at separate reader pairs showed heterogeneity among readers with one reader scoring differently from the others (data not shown). Analysis in separate joint groups showed comparable results for comparison between methods (online supplementary appendix 3).

Table 1

Reliability for status scores for the Kellgren–Lawrence grading scale (KL), OARSI atlas (OARSI) and Verbruggen–Veys anatomical phase score (VV) expressed by intraclass correlation coefficient (ICC)

Table 2

Reliability for change scores and sensitivity to change assessed by the smallest detectable change (SDC) and percentage of patients with progression above the SDC for the Kellgren–Lawrence grading scale (KL), OARSI atlas (OARSI) and Verbruggen–Veys anatomical phase score (VV)

Based on the inter-reader SDC KL detected most progression (table 2). This was found for all three readers, although the percentages of progression varied between them. The results in the separate joint groups were similar (online supplementary appendix 4).

The global scoring methods, KL and especially VV, were fastest to perform and scoring individual features with OARSI took more time (table 3). Each method took more time to perform in patients with higher levels of structural abnormalities.

Table 3

Performance time for each set of three hand radiographs and the association between performance time and radiographic score for the Kellgren–Lawrence grading scale (KL), OARSI atlas (OARSI) and Verbruggen–Veys anatomical phase score (VV)


Discussion

This study on the reliability, sensitivity to change and feasibility of three radiographic scoring methods for hand OA shows minor differences between the methods. Reliability was high and sensitivity to change was good over both time periods, with slightly higher values for KL. There were differences in change scores and proportions of progression between readers, despite use of methods to enhance consistency. VV was the quickest method to perform.

To our knowledge, only one previous study has compared the clinimetric properties of radiographic scoring methods in hand OA, showing equal performance for reliability and sensitivity to change over 1 year.11 Reliability was high in that study. Sensitivity to change expressed by standardised response means was low, whereas we found it to be good based on the SDC. Because different methods were used, meaningful comparison is difficult.

We used the SDC to assess reliability of change scores since it was more suitable than the ICC. The ICC is a measure of relative agreement reflecting the signal-to-noise ratio. It is therefore sensitive to relatively subtle inter-reader discrepancies when the total range of scores is narrow, as was the case in this study.
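
The dependence of the ICC on the range of scores can be illustrated with a small sketch. It assumes a two-way random-effects, absolute-agreement, single-measure ICC, often written ICC(2,1); the text does not state which ICC form was used. The data are hypothetical: two readers disagree by exactly one point on every patient in both samples, and only the spread of the underlying scores differs.

```python
def icc_2_1(scores):
    """ICC(2,1) from rows of scores (one row per patient, one column
    per reader), computed from two-way ANOVA mean squares."""
    n, k = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)              # between-patient variance
    ms_cols = ss_cols / (k - 1)              # between-reader variance
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Same one-point disagreement per patient, different score ranges.
wide = [(0, 1), (10, 9), (20, 21), (30, 29), (40, 41), (50, 49)]
narrow = [(0, 1), (1, 0), (2, 3), (3, 2), (4, 5), (5, 4)]

icc_wide = icc_2_1(wide)
icc_narrow = icc_2_1(narrow)
print(round(icc_wide, 3), round(icc_narrow, 3))
```

The absolute disagreement is identical in both samples, yet the ICC is lower when patients differ less from each other, whereas a measure such as the SDC, which is based only on the differences between readings, would be the same.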

We found that the global scoring methods VV and KL were faster to perform than OARSI. Recently, it was shown that scoring osteophytes, JSN, malalignment and erosions may be sufficient to differentiate subjects according to disease severity.16 This may improve the ease of use of OARSI.

There were differences between readers, despite a training session before starting the study, discussion sessions and use of atlases. The multicentre international study design might have contributed to this finding. The differences did not lead to inconsistency in the comparison of methods. Clinical trials frequently involve multiple international centres, and the use of a central reading centre for radiographs therefore seems appropriate. The question remains: what is the true amount of structural abnormalities in OA? Experts in the field involved in this study scored a range of radiographic OA pathology together and concluded that it is challenging to define a true score owing to variation in interpretation between readers. The use of quantitative measures, for instance measurement of joint space width, reduces variation in interpretation between readers considerably. Mean scores from multiple readers will on average be close to the ‘truth’ and will increase precision and generalisability.
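
The gain in precision from averaging readers can be illustrated with a small simulation. It rests on simplifying assumptions that the study does not make or test: reader errors are independent, unbiased and normally distributed with the same SD. The true score, the reader SD and the function name are hypothetical.

```python
import random
from statistics import mean, pstdev


def spread_of_reader_mean(k, true_score=20.0, reader_sd=3.0,
                          n_patients=5000, seed=1):
    """SD of the error in the mean score of k simulated readers."""
    rng = random.Random(seed)
    errors = [
        mean(rng.gauss(true_score, reader_sd) for _ in range(k)) - true_score
        for _ in range(n_patients)
    ]
    return pstdev(errors)


single = spread_of_reader_mean(k=1)
triple = spread_of_reader_mean(k=3)
print(round(single, 2), round(triple, 2))
```

Under these assumptions the error of the mean shrinks roughly with the square root of the number of readers. Systematic differences between readers, such as those observed in this study, would not average out in the same way.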

This study has a number of potential limitations. First, the level of radiographic abnormalities at baseline was relatively low compared with other samples from patients with hand OA. Although this has no effect on the comparison of methods, they may perform differently in other hand OA phenotypes. Second, we scored in chronological order. This may lead to overestimation of progression, but also to higher sensitivity to change.17 Since potential overestimation will occur for all scoring methods it has no influence on the conclusions.

In conclusion, based on our findings it is not possible to recommend one of the scoring methods. Rather, based on the different character of the methods, the choice depends on the study objective. Further research on the validity of radiographic scoring methods as well as possibilities for their modification in order to enhance reliability, sensitivity to change and ease of use is warranted.


Acknowledgments

The authors would like to acknowledge support of the cooperating hospitals (Bronovo Hospital, The Hague: Dr ML Westedt; Jan van Breemen Instituut, Amsterdam: Dr D van Schaardenburg; Leyenburg Hospital, The Hague: Dr HK Ronday and Dr LN Coene; Reinier de Graaf Gasthuis, Delft: Dr AJ Peeters; Rijnland Hospital, Leiderdorp: Dr EJ van Langelaan) and referring rheumatologists, orthopaedic surgeons and general practitioners.




  • Handling editor Johannes WJ Bijlsma

  • Funding The GARP study was financially supported by the Dutch Arthritis Association and Pfizer (Groton, Connecticut, USA).

  • Ethics approval This study was conducted with the approval of the Leiden University Medical Center.

  • Competing interests Hans Bijlsma was the handling editor for this article.

  • Provenance and peer review Not commissioned; externally peer reviewed.
