Aims To compare the British Isles Lupus Assessment Group (BILAG) 2004, the Safety of Estrogens in Lupus Erythematosus National Assessment (SELENA) flare index (SFI) and physician's global assessment (PGA) in assessing flares of disease activity in patients with systemic lupus erythematosus (SLE).
Methods Sixteen patients with active SLE were assessed by a panel of 16 rheumatologists. The order in which the patients were seen was randomised using a 4×4 Latin square design. Each patient's flare status was determined at each assessment using the BILAG 2004 activity index, the SFI and a PGA. A group of five specialists designated each patient into severe, moderate, mild or no flare categories.
Results The rate of complete agreement (95% CI) of the four individual examining physicians for any flare versus no flare was 81% (55% to 94%), 75% (49% to 90%) and 75% (49% to 90%) for the BILAG 2004 index, SELENA flare instrument and PGA, respectively. The overall agreement between flare defined by BILAG 2004 and the SFI was 81% and when type of flare was considered was 52%. Intraclass correlation coefficients (95% CI), as a measure of internal reliability, were 0.54 (0.32 to 0.78) for BILAG 2004 flare compared with 0.21 (0.08 to 0.48) for SELENA flare and 0.18 (0.06 to 0.45) for PGA. Severe flare was associated with good agreement between the indices but mild/moderate flare was much less consistent.
Conclusions The assessment of flare in patients with SLE is challenging. No flare and severe flare are identifiable but further work is needed to optimise the accurate ‘capture’ of mild and moderate flares.
The recent failure of rituximab and other biological agents to meet their primary end points has highlighted the need to optimise clinical trials design in patients with systemic lupus erythematosus (SLE).1 2 Disease activity in different patients with SLE can be manifested as symptoms and/or signs in a variety of different organs or systems. Therefore, in deciding whether subjects treated with the trial medication have better clinical outcomes than a control group, it is necessary to be able to quantify not only the decrease in disease activity but also an increase in activity in a way that includes all these possible manifestations. The concept of a ‘flare’ of disease activity is very useful in this respect and has been utilised in previous and ongoing studies. This prompts the critical question of how best to define a flare of SLE. Global score systems such as the systemic lupus erythematosus disease activity index (SLEDAI)3 and the systemic lupus activity measure4 have the advantage of simplicity, in that the clinical features in each organ/system are assigned numerical scores that are summated to give a total score for disease activity. The presence of a flare can then be defined according to a prespecified increase in this total score. Therefore, if a patient were to increase her/his score by, say, 4 points, a flare might be deemed to have occurred.5 There are, however, problems with this approach. For example, in the global score system, points are awarded for clinical features if present, but do not distinguish those features that are improving from those that are deteriorating or those that are unchanged. The original SLEDAI only included features that were new or worse, but did not capture ongoing activity, which is clearly a major limitation in a clinical trial.
For this reason, the SLEDAI was revised to take account of ongoing activity, and two versions have been used: the Safety of Estrogens in Lupus Erythematosus National Assessment–SLEDAI score (SELENA–SLEDAI)5 and SLEDAI 2000.6 The SELENA–SLEDAI flare index was developed for use in clinical trials by the SELENA group with the intention of distinguishing severe flares from those that are only mild or moderate. The SELENA group has recently devised a more comprehensive instrument that distinguishes mild from moderate flares, provides separate analysis of flares in different organ systems, and collects treatment data as part of the evaluation.7 The revised SELENA flare index is organ-system based, and is not linked to the SLEDAI. For each organ system, suggested clinical manifestations are given but the categorisation is dictated by the treatment decision. In particular, a ‘mild flare’ is assigned if there is either no treatment, or if there is initiation of hydroxychloroquine, prednisone 7.5 mg per day or less or a non-immunosuppressive therapy. Definition of a ‘moderate flare’ requires the use of prednisone greater than 7.5 mg per day but less than 0.5 mg/kg per day, or immunosuppressive therapy (other than cyclophosphamide), and ‘severe flare’ is defined as prednisone (or equivalent) 0.5 mg/kg per day or greater, cyclophosphamide, biological treatment, or hospitalisation.
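The treatment thresholds just described can be written as a small decision rule. The following Python sketch is illustrative only: the function name, the parameter names and the use of body weight to convert the 0.5 mg/kg threshold into a daily dose are assumptions made for the example, and the sketch is not the published SELENA flare instrument.

```python
def selena_treatment_category(prednisone_mg_per_day, weight_kg, *,
                              immunosuppressive=False, cyclophosphamide=False,
                              biological=False, hospitalised=False):
    """Map a treatment decision to a flare grade using the thresholds in the text.

    Hypothetical helper: it assumes a flare has already been identified and
    that `immunosuppressive` means an immunosuppressive other than
    cyclophosphamide.
    """
    severe_dose = 0.5 * weight_kg  # 0.5 mg/kg per day, expressed in mg per day
    # Severe: prednisone >= 0.5 mg/kg/day, cyclophosphamide, biological or hospitalisation
    if cyclophosphamide or biological or hospitalised or prednisone_mg_per_day >= severe_dose:
        return "severe"
    # Moderate: prednisone > 7.5 mg/day (but below the severe dose) or other immunosuppressive
    if immunosuppressive or prednisone_mg_per_day > 7.5:
        return "moderate"
    # Mild: no treatment, hydroxychloroquine, prednisone <= 7.5 mg/day or non-immunosuppressive therapy
    return "mild"
```

For a hypothetical 60 kg patient, 20 mg per day of prednisone falls below the 30 mg per day severe threshold and would be graded moderate by this sketch.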
The SELENA flare instrument thus shares concepts with the British Isles Lupus Assessment Group (BILAG)6–9 index, which for 25 years has been used to define disease activity on the basis of the ‘physician's intention to treat’ in individual organs or systems. Scoring in the BILAG index distinguishes clinical features that are improving from those that are getting worse, staying the same, or are new or recurrent. Instead of giving a single score covering all systems, the BILAG index gives individual scores (from A to E, where A represents the highest disease activity) for eight different systems. There were problems, however, with the ‘classic’ BILAG index, which incorporated a small number of items that were more clearly due to damage than to disease activity and that failed to capture adequately disease activity in the gastrointestinal or ophthalmic systems. The substantially revised version, the BILAG 2004 index, has now been validated and shown to be reliable and sensitive to change.10–14 Using the BILAG 2004 index a flare can be defined in terms of the number of systems scoring A or B based on items recorded as new or worse. On this basis one might define a severe flare as occurring in a patient with an A score in any system or a moderate flare as B scores in at least two systems.
A working party organised by the Lupus Foundation of America (LFA) recently defined a flare as a measurable increase in disease activity in one or more organ systems involving new or worse clinical signs and symptoms and/or laboratory measurements. It must be considered clinically significant by the assessor, and usually there would be at least consideration of initiation or increase in treatment.
Future clinical trials may use the BILAG 2004 index, the SELENA flare instrument or a combination of the two to define flares. The main objective of the SLE International Collaborating Clinics flare study was to determine whether both these instruments can reliably identify flare when used by the same physicians to assess the same patients in a clinic setting. Secondary objectives were to determine whether or not these instruments can distinguish severe, moderate and mild flare from each other and from no flare. The present study compared flare defined by the BILAG 2004 index and the SELENA flare instrument using ‘live patients’, the majority of whom were experiencing flares of their disease at the time of the study. The study also incorporated a measure of the physician's global assessment (PGA) of flare and a panel assessment.
Sixteen patients (designated by letters from A to P) gave written informed consent to take part in the study, which was approved by the University College London Ethics Committee. Each patient met four or more of the revised classification criteria for SLE of the American College of Rheumatology.15 16 These patients were drawn from the lupus clinics held at University College Hospital London and St Thomas' Hospital, and were selected from patients seen in those clinics during the previous 3 weeks. Sixteen assessors (designated by numbers 1–16) from North America and Europe, all experienced rheumatologists with a special interest in SLE, volunteered to take part in the study. Each patient was seen for up to 1 h by four different physicians. Each physician assessed four different patients. The order in which patients were seen by the assessors was randomised according to four separate 4×4 Latin squares. Careful account was taken of assessors coming from both geographical locations to obtain as balanced a mix of assessors as possible.
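The scheduling constraint of a single 4×4 Latin square (each assessor sees each of four patients once, and each patient is seen by exactly one assessor in every time slot) can be sketched in Python as follows. This is a minimal illustration and not the randomisation procedure actually used in the study: the function names are hypothetical, the grouping of patients A to D with assessors 1 to 4 is assumed for the example, and shuffling a cyclic square does not sample uniformly from all possible Latin squares.

```python
import random


def random_latin_square(n, rng):
    """Return an n x n Latin square built by shuffling a cyclic square."""
    # Cyclic base square: row r, column c holds (r + c) mod n
    square = [[(r + c) % n for c in range(n)] for r in range(n)]
    rng.shuffle(square)  # permute rows (time slots)
    cols = list(range(n))
    rng.shuffle(cols)    # permute columns (assessors)
    return [[row[c] for c in cols] for row in square]


def schedule(patients, assessors, rng):
    """One dict per time slot, mapping each assessor to the patient seen."""
    labels = list(patients)
    rng.shuffle(labels)  # permute which patient each symbol stands for
    square = random_latin_square(len(labels), rng)
    return [
        {assessors[c]: labels[square[slot][c]] for c in range(len(assessors))}
        for slot in range(len(square))
    ]


# Hypothetical first block of the study: patients A-D, assessors 1-4
plan = schedule(["A", "B", "C", "D"], [1, 2, 3, 4], random.Random(2011))
for slot_number, slot in enumerate(plan, start=1):
    print(slot_number, slot)
```

Repeating the procedure for four disjoint blocks of four patients and four assessors would give the overall layout described, under the assumptions stated.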
A possible source of variability in the trial design described above was the fact that each group of four patients was seen by different assessors. It was therefore important to include a separate assessment of activity in which each of the 16 patients was seen by the same clinicians. To achieve this we used a panel of five lupus specialists, separate from the 16 assessors described above. Importantly, this group included clinicians from the host hospitals who knew each patient well. Each patient was thus assessed by two clinicians who knew him/her and three who did not. This panel of five reviewed the patients' histories and examined them together in the hour before the start of the assessment by assessors 1–16. By a consensus decision, this panel designated each patient into one of the following groups: severe flare, moderate flare, mild flare, no flare, according to the principles in the LFA definition.
On entering the clinic room occupied by the patient, the assessing physician was provided with the patient's history up to 1 month previously, current laboratory data (haematology, renal and immunological parameters) and sets of forms to enable them to complete the BILAG 2004 index, the SELENA flare instrument and a four-point PGA of flare (none, mild, moderate, severe). The LFA definition of flare was also provided. Detailed glossaries for these indices were available in each room and training in the use of the BILAG 2004 index and the SELENA flare instrument had been provided on the previous day. These indices are shown in supplementary material available online only. Each physician was given up to 1 h to examine the patient and complete the forms.
On leaving the room the physicians handed their forms to two other experienced physicians who checked to ensure that the forms had been filled in completely. The forms were passed next to a panel of three BILAG assessors (members of the BILAG) who converted the BILAG 2004 index assessments on the forms to the BILAG 2004 index A to E scores, as described elsewhere,8–10 and these scores were independently checked against the forms by CG.
It was agreed in advance that, under the BILAG 2004 index flare definition, a severe flare would be recorded in any patient developing a BILAG 2004 ‘A’ score in any system due to items that are new or worse. A moderate flare would be recorded in any patient with two or more ‘B’ scores due to items that are new or worse. A mild flare would be recorded in any patient with a single ‘B’ score due to items that are new or worse or in those with three or more ‘C’ scores due to items that are new or worse. Any patient meeting none of these criteria would be categorised as no flare.
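The prespecified BILAG 2004 flare rule above can be expressed as a short Python function. This is a sketch for illustration: the function name and the input format (one letter per system, counting only scores attributable to items that are new or worse) are assumptions made for the example.

```python
def bilag_2004_flare(new_or_worse_scores):
    """Grade a flare from per-system BILAG 2004 scores ('A' to 'E').

    `new_or_worse_scores` is assumed to contain only the scores that are
    due to items recorded as new or worse.
    """
    a_count = new_or_worse_scores.count("A")
    b_count = new_or_worse_scores.count("B")
    c_count = new_or_worse_scores.count("C")
    if a_count >= 1:
        return "severe"    # any A score in any system
    if b_count >= 2:
        return "moderate"  # two or more B scores
    if b_count == 1 or c_count >= 3:
        return "mild"      # a single B score, or three or more C scores
    return "none"
```

For example, a hypothetical patient with one B score and two C scores would be graded as a mild flare, whereas two C scores alone would be graded as no flare.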
The SELENA flare definition was determined by the highest category of the clinical features recorded on a system-based approach or by the highest treatment recommended by the physician for a system (the latter superseding the clinical scoring if there was discrepancy). The severe, moderate and mild flare definitions are set out in the introduction.
For the individual PGA, each physician having seen the patient was asked to tick a box on a sheet distinguishing severe/moderate/mild/no flare, in the knowledge of the flare definition proposed by the LFA as given above.
For each assessor (1–16) it was possible to compare his or her assessment of the presence and grade of flare in each patient with the opinion of the other three clinicians in the group of four who saw the same patients and with the opinion of the panel of five who saw all 16 patients. From logistic regression models, 95% CI for the percentage agreement were estimated. When percentages included multiple measurements by individual physicians, possible clustering was accounted for by including a random effect for physician in the regression analysis. A percentage agreement was also calculated between the BILAG 2004 and SELENA flare scores, the BILAG 2004 and physicians' assessment scores and the SELENA flare scores and the physicians' assessment. In addition, assessment of the internal reliability of the three forms of clinical assessment was based on the calculation, for the four groups combined, of intraclass correlation coefficients (ICC) with 95% CI.
Table 1 provides a detailed summary of the results of the study. For each patient the number of each type of flare assessment is given as well as the assessment of the panel. With respect to the assessment of any flare compared with no flare, there was complete agreement by the four assessing physicians based on the BILAG 2004 index for 13 (81%; 55% to 94%) patients, with the SELENA flare instrument for 12 (75%; 49% to 90%) patients and with PGA for 12 (75%; 49% to 90%) patients. With one exception, all these patients were assessed as having a flare by the panel. In addition (data not shown), all four assessing physicians agreed in the separation of no or mild flare from moderate or severe flare for 13 (81%; 55% to 94%) patients based on the BILAG 2004 index, for 10 (63%; 38% to 82%) patients using the SELENA flare instrument and for nine (56%; 32% to 78%) patients based on PGA.
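For the patient-level rates above, in which each patient contributes a single observation, the interval from an intercept-only logistic regression is a Wald interval computed on the logit scale and transformed back to a proportion. The Python sketch below illustrates that calculation and appears to reproduce the quoted intervals for 13 of 16 and 12 of 16 patients. It is an assumption that this is how the published intervals were obtained, the function name is hypothetical, and the sketch does not include the random effect for physician used for the clustered comparisons.

```python
import math


def logit_wald_ci(agreements, total, z=1.96):
    """Proportion and 95% CI from an intercept-only logistic model.

    The intercept estimate is log(k / (n - k)) with standard error
    sqrt(1/k + 1/(n - k)); the interval is undefined when k is 0 or n.
    """
    k, n = agreements, total
    if not 0 < k < n:
        raise ValueError("interval is undefined when k is 0 or n")
    logit = math.log(k / (n - k))
    se = math.sqrt(1 / k + 1 / (n - k))

    def expit(x):
        return 1 / (1 + math.exp(-x))

    return k / n, expit(logit - z * se), expit(logit + z * se)


# Complete agreement on any flare versus no flare, BILAG 2004 index
estimate, lower, upper = logit_wald_ci(13, 16)
print(round(100 * estimate), round(100 * lower), round(100 * upper))  # 81 55 94
```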
For the one patient not assessed as having a flare by the panel, two physicians recorded flares using the BILAG 2004 index, two using the SELENA instrument and three physicians using PGA. For the four patients with a panel assessment of mild flare, three physicians made an assessment of no flare using the BILAG 2004 instrument, but none made this assessment with the SELENA instrument or PGA. For the six patients with a panel assessment of moderate flare, three physicians assessed no flare using each of the instruments. For the five patients with a panel assessment of severe flare, none, two and one physician assessed no flare with the three instruments, respectively.
For the four, six and five patients with panel assessments of mild, moderate and severe flare, there were four physician assessments for each patient, leading to 16, 24 and 20 assessments, respectively, in total. For the BILAG 2004 index, disagreements with the panel on the type of flare occurred for eight of 16 assessments for panel assessments of mild flare, 17 of 24 with moderate flare and one of 20 with severe flare. The comparable numbers of disagreements for mild, moderate and severe flare were 11, 10 and four for the SELENA flare instrument, and eight, nine and five for PGA. Table 2 gives the full distribution of physician assessments using the three instruments for each type of panel assessment.
Assessment of internal reliability of flare as determined by the BILAG 2004 index, SELENA flare and PGA was based on an ICC analysis that gives weight to the magnitude of the disagreements between physicians. The BILAG 2004 index flare was found to have the highest calculated ICC (95% CI) at 0.54 (0.32 to 0.78), compared with the SELENA flare instrument at 0.21 (0.08 to 0.48) and PGA at 0.18 (0.06 to 0.45).
Based on table 3, it can be seen that, for the 16 (patients)×four (physicians)=64 assessments of any flare versus no flare, the BILAG 2004 index and SELENA flare agreed for 52 of 64 (84%; 73% to 91%) assessments. There were 33 (52%; 39% to 63%) with precise agreement if the type of flare (including no flare) was taken into account. Similar comparisons for the BILAG 2004/PGA and SELENA flare/PGA results gave agreement rates of 85% (75% to 93%), 50% (38% to 62%) and 98% (90% to 99%), 82% (67% to 92%), respectively.
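The two kinds of agreement reported here (any flare versus no flare, and exact agreement on the type of flare) can be computed from paired assessments as in the Python sketch below. The function name and the four example assessments are hypothetical and are not study data. The sketch gives raw proportions only, whereas the percentages and intervals in the text come from regression models with a random effect for physician.

```python
def agreement_rates(first, second):
    """Return (any-flare agreement, exact agreement) for paired grades.

    Each grade is one of 'none', 'mild', 'moderate' or 'severe'.
    """
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty sequences of equal length")
    pairs = list(zip(first, second))
    # Agreement on presence of flare: both 'none' or both some grade of flare
    any_flare = sum((a != "none") == (b != "none") for a, b in pairs)
    # Exact agreement on the grade, including 'none'
    exact = sum(a == b for a, b in pairs)
    return any_flare / len(pairs), exact / len(pairs)


# Hypothetical paired assessments by two instruments
bilag = ["severe", "mild", "none", "moderate"]
selena = ["severe", "moderate", "mild", "moderate"]
print(agreement_rates(bilag, selena))  # (0.75, 0.5)
```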
In order to understand some of the discrepancies, further analysis was undertaken. One physician, in particular, and to a lesser extent two others, seemed quite at odds in their clinical judgement compared with their peers in general, and there was more disparity in clinical judgement for some patients than others for reasons that were not clear. As an example, in patient A, whose history was one of a recurrence of epileptic convulsions (three in the past month, having had no convulsions for the previous 6 months), there was a wide scatter of clinical assessment, from no flare (two physicians using the SELENA flare and PGA tools) and mild flare (two physicians using BILAG 2004) to moderate flare (two physicians by SLEDAI and PGA) and severe flare (one physician by BILAG 2004). Some disparity was also evident in certain patients when using the different tools. For example, patient K was judged by the panel of five to be having a severe flare and using the BILAG 2004 flare index tool all four physicians who saw the patient agreed. Surprisingly, when assessing the patient by the SELENA flare index only one physician placed the patient in the severe category, two designated no flare and one a moderate flare. This particular discrepancy may be explained by some confusion among the observers as to whether the patient's treatment had already been altered, even within the last few days, as opposed to an intention to alter the treatment. This would affect scoring by the SELENA instrument (in which treatment supersedes clinical scoring) but not by the BILAG 2004 index, in which the scoring is predetermined for given items based on the principle of the physician's intention to treat.
To help determine the appropriateness of the assessment by the panel of five who saw all 16 patients, the real-life outcome of the patients in the ensuing month was reviewed subsequently from the patients' notes, and the results of this exercise are shown in table 4. This table also indicates which organs/systems were involved in each patient historically and in the current flare. Of the eight patients thought to have a severe flare by the panel, six had treatments that would be consistent with that assessment (B-cell depletion, intravenous methylprednisolone or high-dose oral corticosteroids). One had her dose of methotrexate increased, two had their steroids increased to 20 mg per day of prednisolone and mycophenolate was started, and the final patient was given a bolus of intramuscular steroids and tacrolimus cream. The single patient judged to have no flare had no change in therapy. Of the two patients judged to have a mild flare, one had no change in therapy and one had a modest increase in her prednisolone dose. Of the five patients judged to have a moderate flare, in one case the arthritis resolved spontaneously, in another, her dose of anti-convulsants was increased. In a third, the prednisolone was increased to 20 mg per day. In the fourth, the patient was admitted for intravenous methyl-prednisolone, and in the final patient no change in therapy was recorded.
The assessment of flare in patients with SLE remains a challenge. The definition of flare proposed by the LFA provides a useful basis to proceed, but more precise definitions are required for use in clinical trials. This study attempted to use disease activity ‘tools’ optimised to capture flare. In retrospect, we might have involved more non-flaring patients to increase the power of the study's ability to show that the ‘tools’ being assessed really do distinguish flare from no flare. However, this would have significantly increased the complexity of the arrangements, and our main focus was to study mild, moderate and severe flare. The physicians agreed that a flare was present in at least 75% of patients with a flare, irrespective of the methodology used and there was approximately 80% agreement between physicians recording flare using the BILAG 2004 index and the SELENA flare instrument. No flare and severe flare were easily distinguished but mild and moderate flare less easily so. This was evident both from the close agreement between panel assessment and real-life outcomes for patients with severe and no flare (but not mild or moderate flares) and from the comparisons of panel assessment with the assessments using BILAG, SELENA and PGA. Given the ongoing interest in major clinical trials in lupus,17 the need to identify the best method of capturing flare of all types remains paramount. Clearly, severe flares can be recorded and defined most objectively with the greatest agreement. These flares are the most important ones to determine due to the risk of severe disease activity causing damage.18
The ICC analysis showed better results for the BILAG 2004 defined flares with respect to consistency across physicians than the SELENA flare and PGA tools. This might be explained by greater familiarity with the BILAG 2004 index, which is more established and has already been shown to be reliable for assessing disease activity in the clinic setting.12 The poorer performance of the SELENA flare instrument may reflect the fact that it has only very recently been developed. Nevertheless, analysis of the results indicates that all three flare instruments are good at distinguishing severe flare and no flare. The BILAG 2004 index has the advantage that it can be used for monitoring disease response in both directions including detecting flares, and it has been demonstrated to be valid and sensitive to change in the assessment of disease activity.13 17 Problems arise primarily with respect to the capturing and defining of mild and moderate flare. In addition, problems clearly remain with respect to individual physician judgement and the analysis of some patients. It was hard, for example, to understand how a patient who gave a clear history of the recurrence of her convulsions could be judged by three physicians using BILAG 2004, one physician using the SLEDAI flare tool and one using the PGA tool to have had no increase in disease activity, unless there was a belief that the problem might have related more to damage than increasing activity. It is also possible that the medical history obtained over the course of four assessments may have differed between physicians. In this and in other ways the ‘live’ patient protocol used in the current study is more demanding than the use of ‘paper’ patients frequently employed in validation experiments.
Intent to treat seems to be more workable as a concept if defined, as it currently is, by the clinical features in the BILAG 2004 index, in which physicians have considered the level of treatment that would be reasonable ‘if all else were equal’ in determining the score for that clinical condition and level of severity. In the BILAG 2004 assessment, the actual treatment given on that visit does not trump the descriptor chosen, as occurs in the SELENA flare index. This may be a critical distinction because changes in treatment can be either more or less aggressive in real life than would be applicable to the scoring of an individual organ or system, and may vary depending on what treatment the patient is already taking. Similarly, and especially in the context of a clinical trial protocol, patients may forgo interventions while waiting for a study treatment to take effect or waiting for the end of the study in order to become eligible for open-label treatment (given the possibility that they may have received placebo). Both the SELENA flare instrument and the BILAG 2004 flare instrument allow a descriptor of more severity than the treatment allocated in real life to count as severe flare. In the case of the SELENA flare instrument this might require an exemption from the ‘treatment trumps’ rule; however, there is no doubt that clinical exceptions occur. A treatment might not be immediately increased, even for a significant flare, when there is patient refusal, toxicity, infection, a slow-acting treatment that was recently begun, or in the context of a trial protocol. It is understood that although ‘intent to treat’ may be the most appropriate landmark for gauging a physician's assessment of flare severity, mismatches in actual treatment will occur and should not always override a descriptor. We will be evaluating these hypotheses with these preliminary data and with data we will obtain from the larger scale studies currently being performed.
Reviewing decisions in which there were obvious discrepancies, we will try to develop decision rules with explanatory notes to ‘govern’ such difficulties in the future.
The authors gratefully acknowledge the support of the Lupus Foundation of America, Merck Serono, Roche, Wyeth and Novo Nordisk, which made this study possible.
Funding This study was funded by the Lupus Foundation of America, Merck Serono, Roche, Wyeth and Novo Nordisk. INB is supported by the Manchester Academic Health Sciences Centre and the Manchester NIHR Biomedical Research Centre.
Competing interests None.
Ethics approval This study was conducted with the approval of the University College Hospital Medical Ethics Committee.
Provenance and peer review Not commissioned; externally peer reviewed.