Background To describe rheumatoid arthritis (RA) worsening that leads to change or re-initiation of treatment, several Disease Activity Score 28 (DAS28)-based flare criteria have been described, but none validated.
Methods Six previously published DAS28-based flare criteria ((1) increase in DAS28 >1.2, or >0.6 if DAS28 >5.1; (2) increase in DAS28 >1.2, or >0.6 if DAS28 ≥3.2; (3) increase >0.6 or DAS28 >3.2; (4) increase in DAS28 >1.2; (5) DAS28 >3.2; (6) DAS28 >2.6) were tested against five hypotheses concerning criterion and construct validity: (1+2) sensitivity and specificity >70% compared with the patient's and the physician's judgment; (3) difference in proportion with disease-modifying antirheumatic drug (DMARD) or corticosteroid initiation/increase >0.2; (4) mean difference in C-reactive protein (CRP) >10 mg/l; and (5) no statistical difference in Short Form-36 (SF-36) Mental Health subscale change. Three different RA patient databases in which flare might occur were used. Sensitivity/specificity, χ2 and two-sample Student t test analyses were done.
Results The analyses included 51, 147 and 744 RA patients, from the three databases. Criterion 2 fulfilled most hypotheses: 4 out of 5. Sensitivity and specificity varied between 63%–78% and 84%–92%. Construct validity was demonstrated with 23% more treatment change, higher mean CRP (11.4 mg/l) and depression scale change of −5. Criteria 3, 5 and 6 were more sensitive, criteria 1, 2 and 4 more specific.
Conclusions An increase in DAS28 >1.2 or >0.6 if DAS28 ≥3.2 appears most discriminating and valid by our predefined validation criteria. Considering the other criteria, sensitivity and specificity shown here might facilitate use in different settings.
In rheumatoid arthritis (RA) assessment, the emphasis has been placed on determining response to therapy and measurement of states of adequate disease activity control, rather than on worsening of disease activity (flare). Indeed, several validated criteria for improvement are being used in randomised controlled trials (RCTs) and longitudinal observational studies (LOS), among them criteria based on relative improvement (American College of Rheumatology (ACR) response) and criteria based on relative and absolute improvement from baseline (European League Against Rheumatism (EULAR) response). Next to improvement criteria, absolute targets for low disease activity and remission have also been validated (Disease Activity Score (DAS), DAS28, ACR remission criteria).1–8 However, increasing numbers of RCTs and LOS are evaluating treatment strategies that include optimisation, tapering and withdrawal of biologic and traditional DMARDs, and thus, there is a need for valid measures to determine RA flare or worsening.1 ,7 ,8
Flare criteria in RA are essential for some types of studies, and important for other types described below. First, flare criteria are necessary to evaluate duration of response, with subsequent need for and timing of retreatment after ‘fire-and-forget’ types of treatment like rituximab.9 In addition, when evaluating two different treatment regimens, the number of flare events (including duration and intensity) could be a valuable secondary outcome measure in RCTs, besides response at a certain time point. Most importantly, validated flare criteria are indispensable for evaluating the effects of dose down-titration or withdrawal of medication in patients with low disease activity or remission.10 This is a topic of growing importance as more widespread use of biologics is limited by costs, and as there is continuing uncertainty about possible long-term side effects.11 In addition to the use of such criteria in RCTs and LOS, they could also provide evidence and guidance for treatment optimisation and changes in clinical practice.
Several unvalidated RA flare criteria have been used in clinical studies or have been proposed in the literature; these criteria vary considerably. Examples include an increase in arthritis activity determined by physician's decision to change treatment as described by Bingham et al,1 or any disease exacerbation whether transient or persistent as described by Berthelot et al,12 or worsening of components of the ACR response criteria, or worsening based on the inverse of EULAR response criteria. For flare criteria based on the DAS for a 28-joint count (DAS28), several variations have likewise been described, but until now no validation of these criteria has been published.13–20 Therefore, we set out to examine the performance on criterion and construct validity of DAS28-based flare criteria that have been used in published studies.
To investigate criterion and construct validity of the DAS28-based flare criteria, we examined four candidate databases to determine their usefulness, defined as the availability of DAS28 data and of at least one non-DAS28 indicator of RA worsening.15 ,21–23 Three databases met our criteria. Five hypotheses were formulated using the quality criteria for measurement properties by Terwee et al24 as guidance. Each hypothesis was tested in at least one database:
▸ Database 1: Infliximab observational down-titration study of the Sint Maartenskliniek Nijmegen (n=51 patients), where infliximab was de-escalated in RA patients with low DAS28 until disease activity worsened or infliximab was stopped.21
▸ Database 2: A longitudinal, observational cohort of RA patients treated with infliximab for at least 6 months (n=147 patients).23
In both databases 1 and 2, data were collected at each visit on DAS28 and a transition question about disease activity.
▸ Database 3: The NOR-DMARD database, which is a registry of Norwegian patients with rheumatic diseases starting treatment with synthetic and/or biologic DMARDs (patients with RA n=3612).22
Further detail on the databases used is provided in table 1.
The following DAS28-based flare criteria have been used in published studies and were therefore included in this validation study:
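1. increase in DAS28 >1.2, or >0.6 if DAS28 >5.1
2. increase in DAS28 >1.2, or >0.6 if DAS28 ≥3.2
3. increase of >0.6 or a DAS28 >3.2 (ref. 16)
4. increase in DAS28 >1.2 (ref. 17)
5. reaching a DAS28 >3.2 (ref. 18)
6. reaching a DAS28 >2.6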
These criteria were identified by means of a literature search using the terms: flare, worsening, RA, DAS28. Additional references were identified from bibliographies within these publications, from ACR and EULAR meetings and from individual investigators.1 Each flare criterion was tested against five hypotheses (as described below). We postulated that at least four out of five hypotheses should be met to conclude that a flare criterion has sufficient construct and criterion validity.24
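As an illustration only, the six DAS28-based criteria (as worded in the abstract) can be sketched as simple rules on two consecutive DAS28 values. The names `Visit` and `flare_flags` are hypothetical, and the sketch assumes that the conditional thresholds (>5.1, ≥3.2) apply to the DAS28 at the current visit, which the published wording does not state explicitly.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Visit:
    """Two consecutive DAS28 measurements (hypothetical container)."""
    previous_das28: float
    current_das28: float

    @property
    def increase(self) -> float:
        # Positive values mean worsening since the previous visit.
        return self.current_das28 - self.previous_das28


def flare_flags(visit: Visit) -> Dict[int, bool]:
    """Return, per criterion number, whether the visit counts as a flare.

    Assumption: 'if DAS28 >5.1' and 'if DAS28 >=3.2' refer to the
    current DAS28 value.
    """
    inc, das = visit.increase, visit.current_das28
    return {
        1: inc > 1.2 or (inc > 0.6 and das > 5.1),
        2: inc > 1.2 or (inc > 0.6 and das >= 3.2),
        3: inc > 0.6 or das > 3.2,
        4: inc > 1.2,
        5: das > 3.2,
        6: das > 2.6,
    }


if __name__ == "__main__":
    # Illustrative values, not study data.
    print(flare_flags(Visit(previous_das28=2.4, current_das28=3.3)))
```

Written this way, the state-only criteria (5 and 6) ignore the previous visit entirely, whereas criterion 4 ignores the absolute level.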
To investigate criterion validity, we set out to compare the DAS28-based flare criteria to a concurrent gold standard. Since, unfortunately, no gold standard is available, a transition question completed by patient and physician was used as a proxy. The transition question asked whether disease activity had worsened, remained unchanged or improved compared with the last visit by means of a 7-point Likert scale (databases 1+2).
Two hypotheses were formulated:
Hypothesis 1: Sensitivity and specificity for the different DAS28-based flare criteria exceed 70% compared with the judgement of the patient, operationalised by the transition scales of ‘worse’ or ‘much worse’.
Hypothesis 2: As hypothesis 1, but compared with the physician's judgement.
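Hypotheses 1 and 2 amount to a 2×2 comparison of each flare criterion against the transition question. A minimal sketch follows, assuming one boolean per visit for the criterion and one for the proxy reference (‘worse’ or ‘much worse’); the function names are hypothetical and the data are invented for illustration.

```python
from typing import Iterable, Tuple


def sensitivity_specificity(
    criterion_flare: Iterable[bool], reference_worse: Iterable[bool]
) -> Tuple[float, float]:
    """Sensitivity and specificity of a flare criterion against a reference."""
    pairs = list(zip(criterion_flare, reference_worse))
    tp = sum(1 for flag, ref in pairs if flag and ref)
    fn = sum(1 for flag, ref in pairs if not flag and ref)
    tn = sum(1 for flag, ref in pairs if not flag and not ref)
    fp = sum(1 for flag, ref in pairs if flag and not ref)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("reference must contain both worsened and non-worsened visits")
    return tp / (tp + fn), tn / (tn + fp)


def meets_hypothesis(sensitivity: float, specificity: float, threshold: float = 0.70) -> bool:
    """Both measures must exceed the predefined 70% threshold."""
    return sensitivity > threshold and specificity > threshold


if __name__ == "__main__":
    # Invented example: ten visits.
    flags = [True, True, True, False, False, False, False, False, True, False]
    worse = [True, True, True, True, False, False, False, False, False, False]
    sens, spec = sensitivity_specificity(flags, worse)
    print(sens, spec, meets_hypothesis(sens, spec))
```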
Construct validity: discriminative validity
The construct validity of the DAS28 for measuring disease activity state in RA has been extensively established, and clinically important improvement has also been validated in the development of EULAR moderate and good response definitions.4 ,25 However, clinically relevant disease activity worsening has not been validated. Because the proposed flare criteria included several different cut-off levels for absolute disease activity and change in disease activity, we chose to also reassess construct validity. To investigate discriminative validity, we hypothesised that in RA patients changing therapy a significantly higher proportion will fulfil the flare criteria, operationalised as follows using database 3:
Hypothesis 3: A significantly higher proportion of patients fulfilling the flare criteria had DMARD initiation/increase or corticosteroid initiation/increase than patients not fulfilling the flare criteria. An acceptable higher proportion was defined as 0.2 or more.
This analysis included only patients with DAS28 >3.2 at entry in the NOR-DMARD database (start of new DMARD treatment) and who had subsequently responded after 3 months of treatment, defined as answering the transition question with ‘improved’ or ‘much improved’ compared with baseline. For this purpose, change in DMARD treatment was defined as DMARD initiation/increase (other than increased dose/reduced interval) and/or any initiation/increase in systemic corticosteroids.
At 6 months, patients with a flare were identified by the number of DAS28-based flare criteria fulfilled and whether or not a treatment change had occurred. According to a sample size estimation with α=0.05 and an anticipated statistical power of 80% to demonstrate a proportion difference of at least 0.2, 62 patients were necessary in each group. χ2 analyses were done to test for homogeneity and to explore whether there was a statistically significant difference between the groups in the proportion of patients fulfilling the flare criteria, estimated using the absolute difference in proportions (Δ=π1−π2≈p1−p2).
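The comparison for hypothesis 3 can be illustrated with a 2×2 table of flare status against treatment change. The sketch below computes the absolute difference in proportions and a Pearson χ2 statistic without continuity correction; whether the authors applied a correction is not stated, so this is an assumption, as are the function names and the invented counts.

```python
import math
from typing import Tuple


def proportion_difference(changed_a: int, total_a: int, changed_b: int, total_b: int) -> float:
    """Absolute difference in proportions, p1 - p2."""
    return changed_a / total_a - changed_b / total_b


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square (1 df, no continuity correction) for [[a, b], [c, d]].

    Rows: flare / no flare. Columns: treatment change / no change.
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    # Invented counts: 30/100 flare patients and 10/100 non-flare patients changed treatment.
    print(proportion_difference(30, 100, 10, 100))
    print(chi_square_2x2(30, 70, 10, 90))
```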
Construct validity: convergent validity
To assess whether patients classified as having a flare according to the DAS28-based criteria also show change in other variables that would be expected to converge with the RA flare construct, we chose to use change in C-reactive protein (CRP).
Hypothesis 4: In patients fulfilling the flare criteria the mean change in CRP level between present and previous visit is significantly and relevantly higher (mean difference >10 mg/l) compared with patients not fulfilling the flare criteria.
This hypothesis was tested in database 3 in which changes in CRP level between 3 and 6 months were compared between patients meeting flare criteria and those not meeting the criteria at 6 months using a two-sample Student t test. Besides a statistical difference we expected the difference in mean change of CRP between the flare visits and the non-flare visits to be at least 10 mg/l.24
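A minimal sketch of the comparison behind hypothesis 4 is given below, assuming the classical pooled-variance form of the two-sample Student t test. It returns the mean difference, the t statistic and the degrees of freedom; the p value is left out because it needs a t distribution that the Python standard library does not provide. The function name and the CRP changes are invented for illustration.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def pooled_t_statistic(group_a: Sequence[float], group_b: Sequence[float]) -> Tuple[float, float, int]:
    """Mean difference, Student t statistic (pooled variance) and degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    if pooled_var == 0:
        raise ValueError("no variability in either group")
    diff = mean(group_a) - mean(group_b)
    std_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return diff, diff / std_error, df


if __name__ == "__main__":
    # Invented CRP changes (mg/l) between the 3-month and 6-month visits.
    crp_change_flare = [12.0, 14.0, 16.0, 18.0]
    crp_change_no_flare = [1.0, 3.0, 5.0, 7.0]
    diff, t_stat, df = pooled_t_statistic(crp_change_flare, crp_change_no_flare)
    # Hypothesis 4 also requires a relevant difference of more than 10 mg/l.
    print(diff, t_stat, df, diff > 10)
```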
Construct validity: discriminant validity
To test that the DAS28-based flare criteria are not influenced in a relevant way by other constructs besides disease activity, we tested the hypothesis that there is no relevant association between changes in depression and experiencing a flare or not.
Hypothesis 5: There is no statistically significant and relevant difference in change in depression state, measured with the Mental Health subscale of the SF-36, between patients fulfilling the flare criteria and patients not fulfilling the flare criteria.
In patients in the NOR-DMARD database, the SF-36 was completed during follow-up. As a measure of feelings of depression or nervousness, the Mental Health subscale was used. Mean change in Mental Health subscale score was compared between patients meeting and not meeting flare criteria with a two-sample t test. To ensure sufficient power to find a difference that is considered relevant (5 points), at least 74 patients per group had to be included, according to a sample size calculation with α=0.05, an anticipated power of 80% and an expected SD of 10.8.26
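The figure of 74 patients per group is consistent with the standard normal-approximation formula for comparing two means, n = 2((z₁₋α/₂ + z₁₋β)·SD/Δ)², rounded up. The method actually used for the calculation is not stated, so the sketch below is one plausible reconstruction, with a hypothetical function name.

```python
import math
from statistics import NormalDist


def n_per_group(delta: float, sd: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Patients per group for a two-sided two-sample comparison of means.

    Normal approximation with equal group sizes and a common SD.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


if __name__ == "__main__":
    # Relevant difference of 5 points on the Mental Health subscale, SD 10.8.
    print(n_per_group(delta=5, sd=10.8))
```

With a relevant difference of 5 points and an SD of 10.8, the unrounded value is about 73.2, which rounds up to the 74 patients per group reported in the text.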
Demographic data on patients in the three databases used in the validation analyses are described in table 2. Included patients were predominantly female, rheumatoid factor positive (68%–82%) and anti-cyclic citrullinated peptide antibody positive (69%–73%). Cohorts 1 and 2 consisted of RA patients who, as expected, had longer disease duration and had been treated with more DMARDs. In cohort 3, patients had shorter disease duration and had been treated with fewer DMARDs; there were 744 patients fulfilling the inclusion criteria for the analyses as well as having available 6-month follow-up data. In all, 93 of these patients had a change in DMARD/systemic corticosteroid treatment, as defined above for hypothesis 3. Since inclusion in cohort 3 coincided with starting or changing DMARD treatment, a higher mean DAS28 at baseline was observed, but after 3 months of treatment the mean DAS28 was comparable with the other databases (table 2).
In tables 3 and 4, the absolute and relative results for the six different DAS28-based flare criteria in fulfilling the five hypotheses are presented. None of the investigated DAS28-based flare criteria met all hypotheses; criterion 2, however, fulfilled the predefined target of at least four out of five hypotheses. When considering criterion validity for criterion 2, sensitivity and specificity varied between 63%–78% and 84%–92%, respectively. Discriminative validity was demonstrated with a higher proportion of change in RA treatment (difference in proportion of 0.23). Convergent validity was shown by a mean CRP change that was 11.4 mg/l (SD 1.7) higher in patients defined as having a flare. Last, the difference in depression scale change (−5 (SD 1.2)) was just at the predefined limit of relevance, supporting discriminant validity.
When considering criterion validity, most flare criteria seem to be either very sensitive or very specific. The criteria only using an absolute DAS28 cut-off to define a flare are very sensitive, but they lack specificity (criteria 5 and 6). If only the amount of worsening in DAS28 is used in the criterion the opposite is observed, with high specificity and lower sensitivity (criterion 4). The criteria using a combination of a threshold and a change over time are more sensitive or more specific depending on whether a threshold of 3.2 or 5.1 is used and whether the change in DAS28 is more than one or two times the measurement error (0.6 or 1.2) (criteria 1, 2 and 3). In addition, there is a rather high correlation between the transition questions of patients and physicians used in hypotheses 1 and 2 (cohort 1: r=0.76, p<0.0001; cohort 2: r=0.82, p<0.0001), resulting in similar sensitivity and specificity results between these hypotheses.
Flare criteria with DAS28 change of more than two times the measurement error fulfil hypotheses 3 and 4 (discriminative and convergent validity) in contrast to criteria not including a change measure or including a change measure of >0.6 (table 4). Criteria including cut-offs for DAS28 state, with or without a small DAS28 change, fulfilled hypothesis 5, which assesses discriminant validity. Discriminant validity of flare criterion 2 is just at the predefined acceptable limit.
To our knowledge, this is the first study to examine the performance of the DAS28-based flare criteria that have been used in published studies in terms of criterion and construct validity. Performance varied considerably, with the only DAS28-based flare criterion that met the predefined validation threshold being an increase in DAS28 >1.2, or >0.6 if DAS28 ≥3.2. In addition, flare criteria including a more stringent DAS28 change measure have higher discriminative and convergent validity, suggesting a better signal to noise ratio. In contrast, the criteria with cut-offs for DAS28 state with or without a small DAS28 change measure performed better on testing of discriminant validity. Overall, construct validity revealed a comparable trend to criterion validity, where the criteria using a DAS28 change measure of >1.2 are more specific and if a cut-off with or without a small DAS28 change of >0.6 is used the criteria have higher sensitivity.
Our study has several strengths. First, a stringent study protocol was developed and followed, closely adhering to validation techniques described by Terwee et al24 and using prespecified endpoints. In addition, four different databases were assessed for use in our analyses. After selection based on availability of necessary data and appropriateness of data collection for our predefined analyses, three databases were used to test the DAS28-based flare criteria. In these databases, RA patients with short and long disease duration were included which in our opinion increases generalisability of our results to determine clinically relevant worsening of RA disease activity.
Given the detailed insight into the performance of the different flare criteria, an evidence-based selection can now be made depending on the goals for the use of the flare criterion. For example, in an RA cohort with patients in remission there might be a need for a more sensitive criterion than in, for example, RA patients treated with rituximab who have not yet reached remission but experience a flare after a good initial response. However, since none of the flare criteria fulfilled all our predefined hypotheses, further research and development of a flare measurement are warranted, with emphasis on the potential role of patient-reported outcomes, especially since a disease flare may not coincide with an assessment visit.
Some limitations of our study and of validation studies in general should be mentioned. First and foremost, no gold standard for RA flare is available. In assessing criterion validity, transition questions completed by patients and physicians were used, which might be considered a reasonable proxy for a gold standard. We considered using radiographic damage as a gold standard for criterion validity. However, this reflects a late consequence of flaring of disease activity, thus qualifying more for construct validity. Also, radiographic damage is detected less frequently, owing to the limited sensitivity of the measure and better disease control in most RA cohorts. Hypothetically, loss of function could also be used as a gold standard for flare. In the NOR-DMARD database the Modified Health Assessment Questionnaire (MHAQ) was collected, as a validated questionnaire for function.27 In this database, treatment change and patient-reported worsening correlate with MHAQ worsening.28 However, no publications are available on a threshold for MHAQ change that defines a flare. Finally, intensity of morning stiffness was also correlated with patient-reported worsening in NOR-DMARD, but again there is so far no evidence that this parameter is a gold standard for flare. So, in conclusion, the patient's judgement on disease activity deterioration seemed the most feasible and reasonable option as a ‘gold standard’.
Another shortcoming could be the lack of a definition for persistence of a flare in terms of duration of DAS28 worsening. However, none of the databases available included data to enable this analysis. If the OMERACT definition of flare in RA is taken into account, a flare needs to include duration of disease worsening. We analysed all flares based on single visits. If duration is also included, with for example a second measurement after 1–2 weeks, the reported flare criteria will probably increase in specificity, but lose sensitivity. This requires further study.
In addition, the absolute results for hypothesis 3 are perhaps somewhat disappointing. One might have expected an even higher proportion of patients fulfilling a flare criterion to have had a change in treatment. As anticipated, doctors and patients can be reluctant to change medication for many reasons; for example, waiting for improvement because medication has just started or fearing side effects of alternative medications. Also, attitudes to therapeutic strategies have changed since patient inclusion in the NOR-DMARD database started several years ago. Nowadays, a tight control-based approach is much more common than previously, which will have influenced the proportion of treatment changes in this database. Finally, more hypotheses could have been generated to test validity more precisely, and other flare criteria (eg, DAS/CDAI/SDAI/ACR based) could have been included.
In conclusion, based on these results, an increase in DAS28 >1.2, or >0.6 if the DAS28 ≥3.2 seems an adequate flare criterion when considering criterion and construct validity. Also, it appears that DAS28 criteria for disease worsening can be selected according to study requirements for sensitivity and specificity to guide treatment strategies. In the case of studies evaluating dose-titration and withdrawal, the duration of worsening that should trigger resumption of medication at the previous level is unclear and warrants further study.
The authors would like to thank other members of the OMERACT RA Flare Definition Working Group for their suggestions and comments on the study design, including scouting for appropriate databases to test our hypotheses. In addition, we would like to thank all collaborators in the NOR-DMARD registry and in the two databases of the department of rheumatology of Sint Maartenskliniek Nijmegen.
Handling editor Hans WJ Bijlsma
Contributors AvdM contributed to the study design, data collection, data analysis and interpretation, and drafting the manuscript. EL also contributed to the study design, data collection, data analysis and interpretation. In addition, EL revised the manuscript critically. TW and AB contributed to the study design, data collection, data interpretation and they revised the article critically. RC contributed to the study design, data analysis and revised the manuscript critically. EC, YdM and PvR contributed to the design of the study and critical revision of the manuscript. All authors gave their final approval of the version to be published.
Funding The Norwegian Disease-Modifying Antirheumatic Drug database has received unrestricted grant support from Abbott, Amgen, Wyeth, Aventis, MSD, Schering-Plough/Centocor, BristolMyers Squibb, UCB, Roche and the Norwegian Directorate for Health and Social Affairs.
Competing interests Thasia Woodworth, Robin Christensen and Ernest Choy are members of the OMERACT RA Flare Group Steering Committee.
Patient consent Obtained.
Ethics approval For the data collection in the three separate databases, medical ethical approval was obtained where necessary according to local legislation.
Provenance and peer review Not commissioned; externally peer reviewed.