Performance of the 2010 ACR/EULAR classification criteria for rheumatoid arthritis: a systematic literature review
- 1Department of Medicine III, Division of Rheumatology, Medical University Vienna, Vienna, Austria
- 2Sections of Clinical Epidemiology Research and Training Unit, and Rheumatology, Department of Medicine, Boston University School of Medicine, Boston, Massachusetts, USA
- 3Second Department of Medicine, Hietzing Hospital, Vienna, Austria
- Correspondence to Dr Helga Radner, Division of Rheumatology, Department of Medicine III, Medical University Vienna, Währinger Gürtel 18-20, Vienna 1090, Austria
- Accepted 19 March 2013
- Published Online First 16 April 2013
Background The 2010 ACR/EULAR classification criteria for rheumatoid arthritis (RA) were developed to improve the identification of individuals for studies of RA. We aimed to summarise the performance of the criteria based on the published literature.
Methods We performed a systematic literature search to identify all studies that investigated the 2010 criteria and reported data from which sensitivity (SENS), specificity (SPEC), and positive and negative predictive values could be calculated. Where possible, meta-analysis was performed.
Results Seventeen full articles (total 6816 patients) and 17 meeting abstracts (total 4004 patients) fulfilled the inclusion criteria. Pooled sensitivity and specificity for RA (defined by different reference standards) were 0.82 (95% CI 0.79–0.84) and 0.61 (0.59–0.64). Results were comparable for different reference standards: for initiation of methotrexate pooled sensitivity was 0.85 (0.83–0.86) and specificity was 0.52 (0.49–0.54); for initiation of any disease modifying antirheumatic drug they were 0.80 (0.79–0.82) and 0.65 (0.61–0.68), respectively; and for expert opinion 0.88 (0.86–0.90) and 0.48 (0.45–0.52). No differences were observed for use of different types of joint counts. Eight studies and five meeting abstracts directly compared 1987 and 2010 criteria using different reference standards within different target populations, showing higher overall sensitivity (+0.11 compared with 1987 criteria) at the cost of lower overall specificity (–0.04).
Conclusions Two years after their publication, the 2010 ACR/EULAR criteria have been widely tested in the community. They are sensitive to detect cases of RA among various target populations, independent of how the latter is referenced.
Rheumatoid arthritis (RA) is the most prevalent chronic systemic inflammatory rheumatic disease.1 ,2 Appropriate and reliable classification is relevant for any study done in this disease. As evidence regarding the benefits of timely treatment initiation has accumulated in the past two decades, studies have shifted focus to early stages of the disease. As a consequence, in 2010, new classification criteria were published,3 which differed from the prior 1987 criteria4 in several ways. Most importantly, the existence of joint damage as a classification feature was no longer required, since the aim of treatment now was to avoid this adverse outcome. Rheumatoid nodules were also no longer included, since they likewise signify established, long-standing RA. Morning stiffness was eliminated because of its lack of specificity for RA, while acute-phase reactants and a level-dependent consideration of autoantibody markers, including anticitrullinated peptide antibodies (ACPA), were newly included in the RA classification criteria. Importantly, the approach to evaluation of joint symptoms was modified in the 2010 criteria, but similarly to the 1987 criteria, more weight was given to small joints (finger/foot joints) than large ones, and to a greater number of affected joints.
Given the differences in comparison with the 1987 criteria, understanding the performance of the 2010 criteria is highly relevant, as populations identified by these for future studies are likely to be different from those classified by the 1987 criteria, and thus, trial and cohort study results may not be easily comparable with the existing literature. In particular, the 2010 criteria include several specifics that may be relevant for their application, but their effects on subject classification are not known. These specifics include the target population that is tested with the criteria, which has been defined in the criteria as patients with at least one clinically swollen joint (ie, evidence of definite synovitis) which cannot be better explained by another diagnosis; or the fact that criteria can be fulfilled not only at a single time point, but also cumulatively over time. Further, the effect of the extent of joint assessment (eg, 28 joint count (JC) vs 66/68 JC) on the performance of the criteria is not clear, nor is the effect of using additional imaging tools to ascertain joint involvement, such as ultrasonography or MRI, as optionally suggested.3
Since the publication of the 2010 ACR/EULAR classification criteria, numerous articles and conference proceedings have examined their performance. This systematic literature review has three main objectives: first, to identify the performance of the 2010 criteria if correctly applied as suggested in the original publication; second, to explore the effects of modifications in target populations and the detailed assessment of the criteria items on classification results; and finally, to directly compare the results obtained upon classification when using the 2010 or the 1987 criteria.
A systematic literature review was performed in three main databases, Medline, EMBASE and the Cochrane Central Register of Controlled Clinical Trials, using OVID. A comprehensive search strategy, limited to articles published since the 2010 ACR/EULAR criteria appeared, was developed based on the PICO concept (for search strategy, see online supplementary table S1). In addition, abstracts of the Annual Meetings of the American College of Rheumatology (ACR) from 2010 to 2011, as well as of the European League Against Rheumatism (EULAR) from 2010 to 2012, were searched to identify relevant studies that had not yet been fully published. The literature search was last updated on 27 October 2012. One reviewer (HR) screened the title and abstract of each study identified, and reasons for exclusion were documented. In case of uncertainty about inclusion or exclusion criteria, a second reviewer (DA) was involved. We included all studies in which the 2010 ACR/EULAR classification criteria were used and compared with various reference standards, or in which different approaches to the ascertainment and/or use of the criteria were investigated, such as different JCs, use of MRI, cumulative fulfilment, or different target populations.
Information from all included studies regarding study design, characteristics of the study population, performance of the new classification criteria, as well as reference standards, were extracted using standardised data extraction forms. Authors were contacted for additional information or data provision when possible. The methodological quality assessment was performed (for published papers only) according to the modified version of the Quality Assessment of Diagnostic Accuracy (QUADAS) tool, which was adopted by the Cochrane Collaboration for use in systematic reviews5–7 (see online supplementary figure S1).
The raw data of patients fulfilling different criteria and reference standards were extracted, and 2 × 2 contingency tables were constructed in order to calculate sensitivity, specificity, as well as positive and negative predictive values. Homogeneous groups of studies were summarised by meta-analysis, and pooled estimates of sensitivity and specificity including their 95% CIs were calculated. For most analyses, we present results also stratified by reference standard. If in subgroup analyses no pooling was possible due to lack of sufficient raw data we calculated mean sensitivity/specificity and their SDs. We used Review Manager V.5.1 for data preparation and meta-analyses, and used forest plots to depict the variation of sensitivity and specificity.
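The calculation described above can be illustrated with a minimal sketch. It assumes each study is reduced to a 2 × 2 table of true positives, false positives, false negatives and true negatives, and it pools by simply summing the cells, which ignores between-study heterogeneity. The function names, the Wilson interval and the example counts are illustrative assumptions; the text does not state which pooling model Review Manager applied.

```python
from math import sqrt

def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV and NPV from one 2 x 2 table."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

def wilson_ci(successes, total, z=1.96):
    """Wilson score interval (95% by default) for a proportion."""
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

def pool_tables(tables):
    """Naive pooling (an assumption): sum each cell across studies."""
    tp, fp, fn, tn = (sum(cell) for cell in zip(*tables))
    return tp, fp, fn, tn

# Hypothetical counts (tp, fp, fn, tn), not taken from any included study.
tables = [(80, 30, 20, 70), (45, 25, 5, 25)]
pooled = pool_tables(tables)
metrics = diagnostic_metrics(*pooled)
sens_low, sens_high = wilson_ci(pooled[0], pooled[0] + pooled[2])
print(metrics, (sens_low, sens_high))
```

Predictive values computed this way depend on the prevalence of RA in each cohort, so they are less transferable across studies than sensitivity and specificity.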
The primary analysis was on all studies that had applied the 2010 classification criteria exactly as suggested in the original publication,3 that is, in patients with at least one clinically swollen joint, in whom synovitis was not better explained by another disease, and in whom presence of typical erosions directly classified RA even if the result on the scoring algorithm was negative. We then investigated whether the choice of target populations other than as defined above affects the overall sensitivity and specificity of the 2010 ACR/EULAR criteria.
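The decision flow of the full algorithm, as described in the paragraph above, can be sketched as follows. This is a hedged illustration only: the function and parameter names are hypothetical, the summed score of the criteria items is taken as an input rather than computed, and the cut-off of 6 out of 10 points follows the published criteria.

```python
def classify_2010(swollen_joints, better_explained_by_other_disease,
                  typical_erosions, algorithm_score):
    """Sketch of the full 2010 algorithm as summarised in the text.

    Returns 'not eligible', 'RA' or 'not RA'. All names are illustrative.
    """
    # Target population: at least one clinically swollen joint,
    # with synovitis not better explained by another disease.
    if swollen_joints < 1 or better_explained_by_other_disease:
        return "not eligible"
    # Typical erosions classify RA directly, even with a negative score.
    if typical_erosions:
        return "RA"
    # Otherwise the scoring algorithm decides (cut-off: 6 of 10 points).
    return "RA" if algorithm_score >= 6 else "not RA"

print(classify_2010(2, False, False, 7))  # scored as RA
print(classify_2010(1, False, True, 3))   # erosions override a low score
```

Studies that dropped one of the first two checks were, in effect, applying the criteria to a broader target population than the one intended.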
Since reference standards and case definitions of RA differed considerably across the different studies, we next summarised and compared the diagnostic test properties of the 2010 criteria with regards to all different reference standards used. In a subsequent step, we also examined the impact of the use of different JCs, various imaging techniques to identify synovitis, or evaluation of the criteria cross-sectionally versus cumulatively over time on the test performance characteristics of the 2010 criteria. Finally, we evaluated the performance of both the 2010 and 1987 criteria directly against the same reference standards.
Study characteristics and quality
A total of 750 articles were retrieved by the search strategy. After first title and abstract screening, 703 articles were excluded and, after detailed review another 34 articles were excluded, while another four articles were found via hand search of reference lists and included (see online supplementary figure S2). Reasons for exclusion were lack of application of the 2010 ACR/EULAR criteria, or lack of comparison to another reference standard. In total, this resulted in 17 full articles for analysis.8–22 In addition, 19 meeting abstracts were found that had not yet resulted in publications as full articles, of which 17 presented sufficient data to be included for further analyses.23–40
The 17 full articles comprised 6688 patients, and the 17 meeting abstracts 4488 patients. Nine full articles9–13 20–22 ,41 and 11 meeting abstracts24 ,25 ,27 ,29 ,30 ,32–35 ,37 ,38 were on patients with early arthritis (disease duration <24 months); four articles8 ,17 ,18 ,42 and three abstracts23 ,26 ,40 on heterogeneous cohorts of patients ‘with joint symptoms’ (including tenderness, pain, loss of movement, or morning stiffness with or without swelling; see table 1); one article14 and two abstracts31 ,39 on cohorts of patients with undifferentiated arthritis; three articles15 ,16 ,19 and two abstracts25 ,36 on cohorts of patients with RA according to the 1987 ACR criteria.
These studies used different reference standards for assessing the performance of the 2010 criteria (table 1): initiation of methotrexate treatment8–10 ,13 ,20 ,24 ,37 ,41; initiation of any disease modifying antirheumatic drug (DMARD)10 ,13 ,20 ,22 ,38 ,41; diagnosis of RA by a specialist24 ,27 ,29 ,30 ,32 ,34 ,35 ,40–42; diagnosis by an office-based rheumatologist plus initiation of DMARD therapy17 ,21; confirmed RA diagnosis after 20 years16; persistent disease (definition see table 1) after one year,8 five years20 or 10 years10; diagnosis of RA by experts plus fulfilment of 1987 ACR criteria.18 Some publications offered more than one reference standard for analysis. Study characteristics including their populations and reference standards applied are summarised in table 1. Quality of the included published studies varied, but 14/17 studies (82.3%) demonstrated good quality in the majority of the 11 items of the modified QUADAS (detailed quality assessment shown in online supplementary figure S1). Similar quality assessments of the abstracts could not be performed due to insufficient information.
Results of main analysis
Five full articles9 ,11 ,20 ,21 ,41 and one meeting abstract35 (among other analyses) provided data on the performance of the 2010 criteria if used in the appropriate intended population and with the consideration of x-rays (full algorithm).3 The reference standards used were expert opinion after 10 years11 or 6 years,35 expert opinion plus initiation of DMARD therapy after 2 years,21 erosive disease after 3 years,9 and either initiation of methotrexate (MTX)/DMARD or expert opinion.9 ,20 ,41 Pooled sensitivity and specificity of full articles irrespective of reference standard were 0.82 (0.79–0.84) and 0.61 (0.59–0.64), respectively (the meeting abstract did not provide sufficient data for pooling).
Effects of use of the 2010 criteria in different target populations
To compare sensitivities and specificities of the 2010 ACR/EULAR criteria within different target populations using all studies (irrespective of gold standard used), we calculated mean values, as pooling was not possible due to lack of raw data from conference abstracts and due to heterogeneity. Mean sensitivities and specificities (±SDs) were 0.68±0.17 and 0.69±0.15 when performed within patients with any joint symptoms, with or without presence of arthritis (four full articles,8 ,17 ,18 ,42 three meeting abstracts23 ,26 ,40); they were 0.79±0.10 and 0.73±0.11 when performed within arthritis patients only (four full articles,11 ,17 ,21 ,22 five meeting abstracts27 ,32–34 ,38); mean sensitivity was 0.78±0.10 and mean specificity 0.68±0.07 for performance within arthritis patients in whom synovitis was not better explained by another disease (five full articles,10 ,12 ,17 ,20 ,22 two meeting abstracts24 ,30); and finally 0.78±0.13 and 0.59±0.16 within arthritis patients in whom synovitis was not better explained by another disease, and considering x-rays (full algorithm) (six full articles,9 ,11 ,13 ,20 ,21 ,41 one meeting abstract35) (figure 1).
Three articles directly addressed, within the same study, the question of how the use of criteria in different target populations may affect their performance.11 ,17 ,21 They found no major differences in sensitivity and specificity across different target populations. Their detailed results are summarised in online supplementary text S1.
Performance of the criteria with use of different reference standards
We summarise the effects of the use of different reference standards on the performance of the criteria below. These analyses were not limited to those studies that used the appropriate target population (as shown in the main analysis).
Initiation of MTX treatment
Seven full articles8–10 ,13 ,20 ,22 ,41 and two meeting abstracts24 ,37 comprising 5369 patients with early arthritis used initiation of MTX within 6 months,13 12 months8 ,9 ,20 ,22 ,41 or 18 months10 as reference standards. We found a range of sensitivities from 0.68 to 0.88, and of specificities from 0.32 to 0.72; the pooled sensitivity and specificity, for which we were able to include only full articles (n=3493), were 0.85 (0.83–0.86) and 0.52 (0.49–0.54), respectively (figure 2). The pooled estimates have to be interpreted carefully, though, as they do not account for the heterogeneity of other aspects in these different studies, such as the types of JCs used (28 joints,9 ,13 ,20 40 joints,8 66/68 joints10 ,22 or 76 joints41; meeting abstracts: not stated); or the time point (6–18 months) of MTX initiation.
Initiation of any DMARD
In total, five full articles and one meeting abstract38 comprising 4040 patients with early arthritis using initiation of any DMARD therapy within 6 months,13 12 months20 ,22 ,38 ,41 or 18 months10 as reference standard were included, demonstrating a range of sensitivity from 0.62 to 0.86 and specificity from 0.46 to 0.78. The pooled estimates of all full articles (n=2869) were 0.80 (0.78–0.82) for sensitivity and 0.65 (0.61–0.68) for specificity (figure 2). Again, the pooled estimates did not take aspects of additional heterogeneity into account.
Diagnosis of RA based on expert opinion
In total, nine full articles and eight meeting abstracts used expert opinion as reference standard, of which 13 studies were performed in early arthritis cohorts,9 ,11 ,13 ,22 ,24 ,27 ,29 ,30 ,32 ,34 ,35 ,40 ,41 one in patients with undifferentiated arthritis,14 one in patients with joint symptoms (see table 1),42 and two in patients with confirmed RA.16 ,19 Given the large heterogeneity in populations, sensitivities ranged from 0.66 to 0.97, and specificities ranged from 0.38 to 0.97. The pooled estimates for full studies within early arthritis patients were 0.88 (0.86–0.90) for sensitivity and 0.48 (0.45–0.52) for specificity (figure 2).
Fulfilment of 1987 ACR criteria
Four published studies12 ,15 ,19 ,41 and three meeting abstracts31 ,36 ,39 used 1987 ACR criteria as reference standards in different cohorts. Jung et al15 and Rintelen et al36 applied the 2010 criteria in a population already fulfilling 1987 ACR criteria demonstrating a sensitivity of 0.92 and 0.59, respectively. Tamai et al39 and Meric et al31 used undifferentiated arthritis (not fulfilling 1987 ACR criteria at baseline; Meric et al31 used a small cohort of 32 ACPA positive patients with undifferentiated arthritis) and calculated sensitivity and specificity of the 2010 criteria at baseline for fulfilment of the 1987 criteria after 1 year (0.67 and 0.54) or 2 years (0.86 and 0.1). When the 2010 ACR/EULAR criteria were applied in cohorts of early arthritis patients, Salehi et al,19 de Hair et al,12 and Reneses et al41 found a sensitivity of 0.76, 0.88 and 0.94; and specificity of 0.73, 0.83 and 0.53, respectively, for fulfilment of the 1987 ACR criteria.
Other reference standards used
Several other reference standards were applied, including diagnosis by an office-based rheumatologist plus DMARD initiation,17 ,21 rheumatologists’ diagnosis after 20 years,16 persistent disease after 1,8 5,20 or 10 years11; or diagnosis of RA by experts plus fulfilment of 1987 ACR criteria as reference standard.18 All sensitivities and specificities, positive and negative predictive values of these subgroup analyses are summarised in table 1.
Performance of the criteria according to different methods to assess the criteria components
Use of different JCs
Three full articles used a 28 JC9 ,13 ,20; two full articles16 ,21 and one meeting abstract25 used a 40 JC; three articles8 ,11 ,20 and one abstract36 used a 44 JC; eight full articles10 ,12 ,14 ,17–20 ,22 and one meeting abstract26 performed 66/68 JC; and one full article41 used a 76 JC. Two articles15 ,42 and 13 meeting abstracts did not state which JC was used (table 1). To avoid multiple subgroup analysis with few studies, we performed the pooled analysis of all studies in early arthritis patients that used initiation of MTX as reference standard. Among these, we pooled sensitivities and specificities across different types of JCs. As can be seen from figure 3, there were no major differences, although, again, these analyses did not account for other aspects of heterogeneity.
Using MRI synovitis in addition to count involved joints
Two meeting abstracts26 ,38 addressed the question of whether using information from MRI findings when calculating JCs would change sensitivity or specificity of the 2010 ACR/EULAR criteria in undifferentiated arthritis patients. Both abstracts showed similar or slightly higher sensitivity but lower specificity when using MRI findings in addition to the normal clinical JC (table 1).
Use of cross-sectional versus cumulative fulfilment of the criteria
Two studies,10 ,12 both in early arthritis cohorts, applied the 2010 ACR/EULAR criteria cross-sectionally at baseline, and compared that with ascertainment of the criteria cumulatively over time. Both studies found slightly higher sensitivity but lower specificity when criteria were applied cumulatively compared with cross-sectionally (see online supplementary text S3).
Comparison of the 2010 ACR/EULAR criteria with the 1987 ACR criteria
Eight full articles9–11 ,17 ,20 ,21 ,41 ,42 and four meeting abstracts24 ,27 ,30 ,35 compared sensitivity and specificity of the 2010 ACR/EULAR criteria with those of the 1987 ACR criteria in different cohorts using different reference standards. Calculating the difference of sensitivity and specificity between the 1987 ACR criteria and 2010 ACR/EULAR criteria within each study, we found a range of 6% lower to 27% higher sensitivity of 2010 ACR/EULAR criteria; and of 30% lower to 10% higher specificity of 2010 ACR/EULAR criteria compared with 1987 ACR criteria (results are summarised in figure 4). On average, in an unselected patient population, there were no differences of sensitivity and specificity between the two criteria; when excluding patients with other diagnosis, the 2010 ACR/EULAR criteria demonstrated almost 21% higher sensitivity compared with 1987 ACR criteria, whereas specificity was 16% lower. Application of the full algorithm (arthritis patients excluding patients with synovitis better explained by another disease, and taking into consideration RA-typical erosions) leads to higher sensitivity (+11%) and comparable specificity (−4%).
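The within-study comparison described above amounts to subtracting the 1987 estimates from the 2010 estimates for each study and then averaging the differences. A minimal sketch follows, using hypothetical study labels and values rather than data from the included studies.

```python
from statistics import mean

def criteria_differences(studies):
    """Per-study difference (2010 minus 1987) in sensitivity and specificity.

    `studies` maps a label to (sens_2010, spec_2010, sens_1987, spec_1987).
    """
    return {
        name: (round(s10 - s87, 2), round(p10 - p87, 2))
        for name, (s10, p10, s87, p87) in studies.items()
    }

# Hypothetical values for illustration only.
studies = {
    "study A": (0.85, 0.60, 0.70, 0.70),
    "study B": (0.80, 0.65, 0.75, 0.65),
}
diffs = criteria_differences(studies)
mean_sens_diff = mean(d[0] for d in diffs.values())
mean_spec_diff = mean(d[1] for d in diffs.values())
print(diffs, mean_sens_diff, mean_spec_diff)
```

A positive sensitivity difference together with a negative specificity difference corresponds to the pattern reported for the 2010 criteria.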
Only 2 years after publication of the 2010 ACR/EULAR Classification Criteria for Rheumatoid Arthritis, a large number of studies investigating their diagnostic testing properties have been published or presented at conferences. However, these assessments have been performed in various ways and in different populations, leading to substantial heterogeneity and making the evaluation of their performance difficult for the community. In our study, the overall sensitivity of the criteria was 82% and overall specificity was 61% when applied to the intended target population. The effect of different reference standards was small, that is, within a 10% margin for both sensitivity and specificity: sensitivity was highest with regard to expert diagnosis, and specificity was highest with regard to DMARD use/initiation. The standard used in the original publication, use/initiation of MTX, was in between for both.
Many studies used a target population that was divergent from the one stipulated in the original publication of the 2010 criteria. From the data analysed, it seems that the most important step is the limitation of the target population to patients with clinical arthritis. Both sensitivity and specificity were higher when compared with a target population with any joint symptoms (with or without synovitis). The limitation to patients in whom arthritis is not better explained by another disease, or the consideration of x-rays did not seem to have a major impact on the performance, although the latter feature was likely applied in a very heterogeneous manner in the different studies: in the 2010 criteria, the presence of RA erosions was recognised as prima facie evidence of RA, precluding the need for applying additional criteria, and a respective algorithm has been provided.43 The impact of erosive changes on classification may, however, be minimal, as the frequency of erosive patients with a score of <6/10 in the 2010 criteria is very low.44
Furthermore, the 2010 ACR/EULAR criteria are relatively robust towards the use of different JCs. This may be because accepting a tender joint as an ‘active’ joint already makes joint assessment sensitive, even when reduced JCs are used. Likewise, although this is based on only two meeting abstracts to date, the consideration of synovitis by MRI did not substantially improve the sensitivity or specificity of the 2010 criteria.
The impetus for developing new classification criteria was to facilitate the recognition of RA at an earlier stage compared with the 1987 ACR criteria, thereby enabling earlier enrolment into studies. Studies directly comparing the two criteria applying the full algorithm revealed that sensitivity was 11% higher and specificity was 4% lower for the 2010 criteria than for the 1987 criteria. In other words, nine patients need to be tested with the new 2010 criteria (as opposed to the 1987 criteria) to identify an additional true positive, and 25 patients need to be tested with 1987 criteria (as opposed to the 2010 criteria) to identify an additional true negative. Given that spontaneous remission is rare in patients with arthritis of ≥8 weeks duration,45 and that patients with persistent inflammatory arthritis that is not RA still require appropriate therapy, the negative impact of potential overtreatment may be considered acceptable, while the positive impact of introducing effective treatment in otherwise false negative patients is more than timely and the core of the rationale that was the basis for development of the 2010 criteria.
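The "number needed to test" figures above follow from taking the reciprocal of the absolute difference in sensitivity or specificity. A minimal sketch of that arithmetic is given below; the function name is hypothetical. Strictly, the sensitivity figure applies among patients who have RA by the reference standard, and the specificity figure among those who do not.

```python
def number_needed_to_test(difference):
    """Reciprocal of an absolute difference in a proportion, rounded."""
    return round(1 / abs(difference))

extra_true_positive = number_needed_to_test(0.11)   # sensitivity +11%
extra_true_negative = number_needed_to_test(-0.04)  # specificity -4%
print(extra_true_positive, extra_true_negative)
```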
There are several challenges and limitations in interpreting so-called validation studies for criteria that have the purpose of classification of disease. We showed in several sensitivity analyses that the impact of heterogeneity on the results is small, although many of these small changes may potentially add up. In particular, applying criteria to unintended populations could conceivably alter the performance of the criteria. A further challenge is that there is no true gold standard to compare with: studies using physician/rheumatologist opinion as the gold standard can be expected to represent the 1987 criteria, given that physicians’ conceptual paradigm of the disease entity is largely influenced by these criteria. For this very reason, the 2010 ACR/EULAR classification criteria for RA avoided physician/rheumatologist opinion as a gold standard to circumvent a potential circularity in reasoning. The proxy of MTX or DMARD initiation is also not accurate in all cases, but proved to be a good compromise in the sensitivity analyses here. Finally, with appropriate treatment, those who fulfil the 2010 criteria may never progress further to fulfil the 1987 criteria, and indeed, it was one of the aims in the development of the 2010 criteria to allow for prevention of joint damage and extra-articular RA by using early RA cohorts for their derivation. Thus, the performance of the 2010 criteria against fulfilment of the 1987 criteria should be interpreted cautiously, even if, as seen in this meta-analysis, they withstood that challenge. In general, one limitation was that many meeting abstracts, which provided only limited data about study quality and design, were included in our study, and could not be integrated into summary estimates.
Taken together, the 2010 criteria appear to be relatively robust to different ways of their application, and are more sensitive than the 1987 criteria at the cost of a slight decrement in specificity, which might increase the possibility that a few non-RA patients are classified as RA patients and, for example, entered into clinical trials. Nevertheless, to even more definitively appreciate their value, further research will be needed, for example, to understand the relevance/importance of MRI or ultrasound on the detection of additional joints, for which at this stage, only abstracts are available. For future studies it will be particularly interesting to learn how patient populations identified by the old and the new criteria differ in their therapeutic responses in clinical trials, as this is the predominant usage of classification criteria. In conclusion, the performance of the 2010 criteria in a large number of international studies, and in early arthritis cohorts in particular, suggests that the major intention of their development has already materialised.
The authors thank the authors and coauthors of the included studies for kindly providing raw data and answering further questions.
Handling editor Tore K Kvien
Contributors TN: analyses, writing the manuscript; JSS: analyses, writing the manuscript; DA: SLR, analyses, writing the manuscript.
Competing interests None.
Provenance and peer review Not commissioned; externally peer reviewed.