Introduction An ACR/EULAR task force released new criteria to classify rheumatoid arthritis at an early stage. This study evaluates the diagnostic performance of these criteria and algorithms by van der Helm and Visser in REACH.
Methods Patients with symptoms ≤12 months from REACH were used. Algorithms were tested on discrimination, calibration and diagnostic accuracy of proposed cut-points. Two patient sets were defined to test robustness; undifferentiated arthritis (UA) (n=231) and all patients including those without synovitis (n=513). The outcomes evaluated were methotrexate use and persistent disease at 12 months.
Results In UA patients all algorithms had good areas under the curve 0.79, 95% CI 0.73 to 0.83 for the ACR/EULAR criteria, 0.80, 95% CI 0.74 to 0.87 for van der Helm and 0.83, 95% CI 0.77 to 0.88 for Visser. All calibrated well. Sensitivity and specificity were 0.74 and 0.66 for the ACR/EULAR criteria, 0.1 and 1.0 for van der Helm and 0.59 and 0.93 for Visser. Similar results were found in all patients indicating robustness.
Conclusion The ACR/EULAR 2010 criteria showed good diagnostic properties in an early arthritis cohort reflecting daily practice, as did the van der Helm and Visser algorithms. All were robust. To promote uniformity and comparability the ACR/EULAR 2010 criteria should be used in future diagnostic studies.
This paper is freely available online under the BMJ Journals unlocked scheme, see http://ard.bmj.com/info/unlocked.dtl
Statistics from Altmetric.com
Recently an American College of Rheumatology/European League Against Rheumatism (ACR/EULAR) task force released new classification criteria for rheumatoid arthritis (RA) at an early stage.1 These criteria might also have diagnostic value early in the disease process although this has not yet been evaluated. Early diagnosis is important to improve patient outcome by early treatment to prevent joint damage and functional impairment.2
The previous classification criteria for RA (the 1987 ACR criteria) proved inadequate in the early stages of disease.3 4 This led to the development of other diagnostic algorithms.5 6 These algorithms showed good diagnostic performance and identified patients at an early stage of the disease.7,–,9
Diagnostic algorithms tend to be overoptimistic in their capabilities when only tested in the population they were derived from.10 For instance, if a high erythrocyte sedimentation rate (ESR) is an important predictor for RA but in the derivation cohort by chance only a few patients had a high ESR, the data-driven way in which these algorithms are build will not identify this predictor. Therefore before use in practice the discriminative abilities of such algorithms should be tested in another cohorts with similar patients (similar incidence rate). In addition, the robustness of algorithms to variation of incidence rates can be tested in cohorts with different previous disease probabilities.11,–,13
We aim to evaluate the diagnostic performance of the ACR/EULAR 2010 criteria and two diagnostic algorithms simultaneously to predict methotrexate use or persistent disease in the Rotterdam Early Arthritis Cohort (REACH). In addition, we will test robustness after defining two patient sets in the same cohort resulting in different previous probabilities of developing RA.
Clinical data used were from REACH. This ongoing, prospective, inception cohort study was set up in the greater Rotterdam area in July 2004. Patients were recruited either via their general practitioner, or via the outpatient rheumatology clinic of three hospitals at first consultation. Patients were included in case of one or more swollen joint or, in the absence of joint swelling, if they had two or more joints with pain or loss of movement with two or more of the following criteria: morning stiffness for more than 1 h; unable to clench a fist in the morning; pain when shaking someone's hand; pins and needles in the fingers; difficulties wearing rings or shoes; a family history of RA; unexplained fatigue for less than 1 year. Patients were excluded if their symptoms resulted from trauma or overexertion, were for over 12 months, or if they were younger than 16 years.
A trained research nurse at the REACH clinic took a standardised history and conducted a physical examination at baseline, 6 and 12 months, including blood and urine samples. For the current analysis data from baseline and 1 year were used. Physical examination included the measurement of tender and swollen joints, using a 44 joint count. Laboratory variables included IgM-rheumatoid factor (ELISA), anti-cyclic citrullinated peptide (Elia CCP on immunoCAP 250; Phadia Freiburg, Germany),C-reactive protein (local standards) and ESR (local standards). x-Rays of hands and feet were assessed for bony erosions at baseline. For a detailed description of REACH, see Geuskens et al.14
To asses overall performance the prediction algorithms were tested on discrimination and calibration.15 Discrimination is the ability of an algorithm to differentiate correctly between patients with and without the disease. Calibration reveals the ability to estimate the probability of the diagnosis for individuals correctly by comparing the probability predicted by the algorithm and the observed probability. To evaluate discrimination receiver operating characteristic curves, including corresponding areas under the curve (AUC), were calculated. Calibration was evaluated using calibration plots and the Hosmer–Lemeshow test.15 The latter indicates good calibration if a non-significant result appears. To assess diagnostic performance of the algorithms sensitivity, specificity, positive predictive values (PPV) and negative predictive values (NPV) were estimated at cut-points proposed for treatment initiation among patients at risk of RA. For the ACR/EULAR 2010 criteria and Visser algorithm a score of 6 or more1 16 was used and for van der Helm a score of 8 or more5 was used. To test robustness this analysis was repeated among all patients included in REACH. This group had a lower previous disease probability by a case-mix of synovitis and inflammatory joint complaints without synovitis. Synovitis was defined as joint swelling.
As a classifier for correct diagnosis two outcomes were evaluated at 1 year: the use of methotrexate and persistent disease, defined as synovitis present at physical examination after 1 year, or the use of disease-modifying antirheumatic drugs (DMARD) including biological agents. Patients with a definite alternative diagnosis such as gout were not classified as persistent disease. A complete case analysis was done.
Up to 31 October 2008, 875 patients were referred to REACH and had 1-year follow-up. One hundred and 13 patients did not fulfil the inclusion criteria and 31 patients were lost to follow-up at baseline (see supplementary figure S1, available online only). Patients used in the development of the ACR/EULAR 2010 criteria were excluded (n=216).1 Table 1 reports baseline characteristics of all patients (n=513). Patients had a mean age of 50 years, 73% were women and the median symptom duration was 106 days (range 1–366 days). At baseline 48% (n=246) presented with synovitis. After 1 year, 148 of 513 used methotrexate, of whom 22 did not have synovitis at baseline, and 231 of 513 patients had persistent disease, of whom 59 did not have synovitis at baseline.
Table 2 shows AUC of each diagnostic algorithm for both outcomes. In undifferentiated arthritis (UA) patients (n=231) the AUC for methotrexate use were comparable, with overlapping 95% CI, 0.79 (95% CI 0.73 to 0.83) for the ACR/EULAR 2010 criteria, 0.80 (95% CI 0.74 to 0.87) for the van der Helm algorithm and 0.83 (95% CI 0.77 to 0.88) for the Visser algorithm. For persistent disease the AUC were 0.77 (95% CI 0.71 to 0.85) for the ACR/EULAR 2010 criteria, 0.78 (95% CI 0.71 to 0.85) for the van der Helm algorithm and 0.77 (95% CI 0.71 to 0.83) for the Visser algorithm. In all patients (n=513) the AUC were comparable for both outcomes, with slightly better performance of the van der Helm algorithm; 0.88 (95% CI 0.84 to 0.91) and 0.83 (95% CI 0.79 to 0.87).
Calibration plots of all diagnostic algorithms are shown in figure S2 (see supplementary figure S2, available online only). In UA patients (n=513) calibration was worse than in all patients, although the Hosmer–Lemeshow test was not significant for any of the calibration plots. All algorithms calibrated well in all patients (n=513), confirmed by the Hosmer–Lemeshow test.
Evaluating diagnostic performance using proposed cut-points
To identify patients in need of treatment proposed cut-points were tested in UA patients. The ACR/EULAR criteria showed a sensitivity of 0.74 (95% CI 0.65 to 0.82) and a specificity of 0.66 (95% CI 0.54 to 0.76), with the cut-point of 6 or higher using methotrexate as a classifier for correct diagnosis (table 3). The Visser algorithm and the van der Helm algorithm had a lower sensitivity, 0.47 and 0.59 for the Visser algorithm for both outcomes and 0.08 and 0.10 for the van der Helm algorithm. Specificity was higher: 0.93 for the Visser algorithm and 1.0 for the van der Helm algorithm.
The PPV is the probability that a patient has the disease if the test is positive. The van der Helm algorithm had the highest PPV; 1.0. The NPV is the opposite probability and was highest for the ACR/EULAR criteria with 0.63 for methotrexate use and 0.46 for persistent disease, slightly higher than the Visser algorithm.
The results of our study show that the new ACR/EULAR 2010 criteria could aid diagnostics in early arthritis patients. They had good overall performance, with a sufficiently high AUC and good performance of the proposed cut-point of 6 for persistent disease, which could be considered RA. The other algorithms performed well when tested for discriminatory properties (AUC and calibration), although the van der Helm algorithm failed to detect cases at the proposed cut-point. To promote uniformity and comparability of studies we would suggest using the ACR/EULAR 2010 criteria in future diagnostic studies.
The cut-point of 6 in the ACR/EULAR 2010 criteria was well chosen and showed good diagnostic performance, even though it was not intended for diagnostic purposes.1 Choosing a cut-point is a trade-off between harm of treatment in non-cases (overtreatment) and harm of no treatment in cases (undertreatment).13 17 Ideally a cut-point has a high sensitivity to prevent undertreatment and a high specificity to prevent overtreatment. However, a high specificity is often accompanied by a low to moderate sensitivity and vice versa. For the ACR/EULAR 2010 criteria both sensitivity and specificity were approximately 70%. Using this cut-point of 6 to start treatment, in this study 30% of persistent patients would not be treated, whereas 30% of the non-persistent patients would have been. Lowering the cut-point to 4 increases sensitivity to 0.92 at the cost of specificity (0.33). Increasing it to 7 had a sensitivity of 0.53 and a specificity of 0.85. Perhaps creating a low, intermediate and high-risk group for disease using dual cut-points would enable treatment with different intensities.
The cut-point of 6 was chosen using the AUC of three cohorts, including our own. In this study the AUC for methotrexate use, 0.79, was similar to that in the derivation article (0.66–0.82), indicating consistency. It was also similar (0.77) for persistent disease. It could be argued that this is a direct result of the use of our data in the derivation cohort. However, the decision to use 6 as cut-point was based on expert opinion and two other cohorts. Furthermore, patients included in the derivation of the ACR/EULAR 2010 criteria were removed from analyses.
The strengths of our study include the heterogeneity of patients’ subsets to test robustness of the algorithms and simultaneous evaluation of three diagnostic algorithms in one study sample. We showed that the ACR/EULAR criteria and both algorithms were robust in a case-mix of synovitis and non-synovitis patients. Calibration was good for all algorithms, but not perfect. Calibration and robustness have not been evaluated before by others, but discrimination was. The van der Helm algorithm showed AUC of 0.82–0.88 and the Visser algorithm an AUC of 0.70, both similar to the AUC in the present study.7,–,9
This study should be interpreted in the light of current developments in diagnostic research in RA. Current diagnostic studies within RA are faced with defining a suitable outcome. We defined two outcomes; methotrexate use similar to the definition of the ACR/EULAR 2010 and persistent disease (either synovitis or DMARD use at 12 months).1 This may have led to misclassification in two ways. Patients could be classified as true positive because they were still using methotrexate or other DMARD at 12 months, whereas in fact some patients may not need treatment. Likewise, patients may have had episodes of arthritis with no episode or DMARD use at 12 months, while later on they developed persistent arthritis.
In conclusion, the new ACR/EULAR 2010 criteria showed good diagnostic properties in an early arthritis cohort reflecting daily clinical practice, as did the van der Helm and Visser algorithms. All were robust. To promote uniformity and comparability we would suggest using the ACR/EULAR 2010 criteria in future diagnostic studies.
The authors would like to thank all patients and general practitioners who participated as well as the research nurses, trainee rheumatologists and rheumatologists (other than the authors) who contributed to the REACH cohort.
Review history and Supplementary material
Funding Data management was partly funded by the Dutch Arthritis Association.
Competing interests None.
Ethics approval This study was conducted with the approval of the medical ethics committee at Erasmus MC.
Provenance and peer review Not commissioned; externally peer reviewed.
If you wish to reuse any or all of this article please use the link below which will take you to the Copyright Clearance Center’s RightsLink service. You will be able to get a quick price and instant permission to reuse the content in many different ways.