Objective: To test whether, in rheumatoid arthritis (RA) trials, treatment efficacy versus control is better detected in patients with lower tender joint counts (TJC) or swollen joint counts (SJC) than in those with higher counts.
Methods: Using data from six large multicentre trials (N = 2002) and an intent-to-treat approach at 6 months, two subtrials were created within each trial: a lower disease activity group (defined by TJC less than the overall median) and a higher disease activity group. The same approach was used for SJC. Active treatment was compared with control using several RA trial outcome measures: ACR20, EULAR response and ACRHybrid. Sample sizes needed for higher TJC and SJC RA trials versus lower TJC and SJC trials were compared.
Results: Subtrials of subjects with lower TJC were found to have much higher sensitivity to change than those of subjects with higher TJC across all trials and outcome measures. A trial with lower TJC patients would require a smaller sample size than one with higher TJC patients. Results were not consistent for SJC subgroups. Three reasons were found for the greater sensitivity to change with lower TJC: compared with those with higher TJC, subjects with lower TJC showed a greater response to active treatment, and subjects with higher TJC on control treatment had both greater percentage improvement and more variable responses than those in the lower TJC group.
Conclusions: In RA trials, patients with lower disease activity within the range of current trial eligibility are more likely to show treatment efficacy than patients with higher disease activity. Lowering thresholds especially for TJC in trials may make it easier to detect treatment effects in RA.
In trials of treatments for rheumatoid arthritis (RA), it is customary for high levels of tender joint counts (TJC) and swollen joint counts (SJC) to be required for patient eligibility. Most trials require at least 10 tender joints and eight swollen joints, and the mean number of active joints in patients enrolled is far higher than this. The disease activity of patients with RA in the USA and in some western European countries has fallen1 and, at least as measured by disease activity scores (DAS), is now quite low on average.2 Disease activity has thus fallen to a point at which most RA patients are probably not eligible for most trials testing treatment. Increasingly, trial patients originate from South America, eastern Europe and elsewhere, where evidence suggests2 that the disease on average is still very active. If new treatments are tested outside of the USA and western Europe because of the absence of high disease activity patients, this may suggest that new treatments might not be generalisable to those with lower disease activity. Furthermore, because trials provide the central evidence on treatment efficacy, our failure to test treatments in patients such as those we see in practice in the USA and western Europe raises concerns about whether this evidence is relevant to our patients.
One major assumption behind the high threshold for eligibility in trials is that effective treatments suppress very active disease and individuals with more active disease would be likely to improve more than those with less disease activity, and therefore they would be better subjects for trials. However, the relative response of individuals with higher compared with those with lower disease activity has not been examined in trials of patients with RA. At any rate, the efficacy of treatment in a trial is tested not by the response of patients to active treatment alone, but by their response relative to that of patients given placebo or control treatment. The efficacy of treatment in a trial is thus a function of both the response to active treatment and the response to placebo. So, even if subjects with lower disease activity are less likely to respond to active treatment, they may also be less likely to show a response to placebo, making their comparative response profile similar to, or even better than, that of subjects with greater disease activity.
With these considerations in mind and using a dataset of multiple large randomised trials in RA, we tested whether treatment efficacy in RA trials would be better detected if trials were conducted in patients with higher disease activity than if they were conducted in patients with lower disease activity.
The datasets used were made available through the effort of the American College of Rheumatology (ACR) to redefine and re-evaluate response criteria for RA, an effort that has led to the promulgation of the hybrid ACR measure.3 Eleven multicentre randomised trials were used in that effort, most of them of tumour necrosis factor alpha inhibitors with some of conventional disease-modifying antirheumatic drugs.
To define lower and higher disease activity, we identified the overall median SJC across all trials and then created two subtrials for each trial, one limited to those subjects with higher disease activity (top 50%) and the other limited to those subjects with lower disease activity (bottom 50%). We alternatively defined disease activity by TJC, classifying patients as having higher TJC (at or above the overall median for all trials) or lower TJC (below the overall median for all trials). Because some of these trials did not have sufficient numbers of subjects in both the higher and lower disease activity subtrials to make a valid comparison of treatments, we restricted our analyses of higher versus lower disease activity to trials with at least 200 subjects, sufficient numbers in our view for the evaluation of higher versus lower disease activity subgroups. In our 11 datasets, six trials met these criteria. Agreements with the industry sources that provided these data prohibit us from identifying specific trials.
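The median split described above can be sketched in a few lines of Python. This is only an illustration: the record layout (`trial`, `arm`, `tjc`) and the toy values are hypothetical and are not trial data.

```python
from statistics import median

def split_by_overall_median(patients, key="tjc"):
    """Split every trial into lower/higher subtrials using ONE overall median.

    Patients below the overall median go to the "lower" subtrial; those at
    or above it go to the "higher" subtrial, as in the text.
    """
    overall = median(p[key] for p in patients)
    subtrials = {}
    for p in patients:
        group = "lower" if p[key] < overall else "higher"
        subtrials.setdefault((p["trial"], group), []).append(p)
    return overall, subtrials

# Hypothetical toy records pooled across two made-up trials.
patients = [
    {"trial": "A", "arm": "active", "tjc": 12},
    {"trial": "A", "arm": "control", "tjc": 30},
    {"trial": "A", "arm": "active", "tjc": 35},
    {"trial": "B", "arm": "control", "tjc": 20},
    {"trial": "B", "arm": "active", "tjc": 27},
    {"trial": "B", "arm": "control", "tjc": 40},
]
overall, subtrials = split_by_overall_median(patients)
for (trial, group), members in sorted(subtrials.items()):
    print(trial, group, [m["tjc"] for m in members])
```

Because the cut-off is the pooled median rather than a trial-specific one, the two subtrials within a given trial need not be of equal size.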
To test sensitivity to change, we used three different candidate measures of response in RA trials that are either widely used or that permitted us readily to test the relative sensitivity for the detection of treatment effects: the ACR20, EULAR good response (present if the DAS in 28 joints is 3.2 or less and change in DAS or DAS in 28 joints is greater than 1.2 from baseline) and the ACRHybrid. The first two are dichotomous measures (yes/no for each patient) and the third is a continuous measure of improvement taking on values between 0 (no improvement) and 1 (100% improvement in core set measures). To evaluate the sensitivity to change of higher versus lower disease activity, we compared active versus control in each of the subtrials using an intent-to-treat approach at 6 months of follow-up with a Student’s t-test (for ACRHybrid) or χ2 test (ACR20 and EULAR good response) to compare treatments. We translated these findings to estimates of sample sizes needed to undertake clinical trials conducted in the lower and higher disease activity subtrials.4–6 For the dichotomous outcome measures (ACR20 and EULAR good response), to estimate sample size, we used a hypothetical response rate of 30% in the control group and the observed odds ratio of treatment response from the clinical trials for treatment versus control. For the ACRHybrid we used a difference of 30% (Δ = 0.3) between treatment and control and the observed standard deviation from the clinical trials. All sample size estimates are based on tests with two-sided α = 0.05 and 80% power. We tried additional assumptions for control response rate and treatment control differences and the results were similar.
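The sample size reasoning above can be illustrated with standard normal-approximation formulas. The sketch below is an assumption about the general approach, not necessarily the exact method of the cited references; the odds ratio of 3.0 and the standard deviation of 0.4 in the usage lines are hypothetical inputs, not observed values.

```python
from math import ceil
from statistics import NormalDist

def z_total(alpha=0.05, power=0.80):
    """z quantile for a two-sided alpha plus z quantile for the power."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)

def treated_rate(control_rate, odds_ratio):
    """Response rate on active treatment implied by a control rate and an odds ratio."""
    odds = odds_ratio * control_rate / (1 - control_rate)
    return odds / (1 + odds)

def n_per_arm_binary(control_rate, odds_ratio, alpha=0.05, power=0.80):
    """Per-arm sample size for comparing two proportions (normal approximation)."""
    p0 = control_rate
    p1 = treated_rate(p0, odds_ratio)
    variance_sum = p0 * (1 - p0) + p1 * (1 - p1)
    return ceil(z_total(alpha, power) ** 2 * variance_sum / (p1 - p0) ** 2)

def n_per_arm_continuous(delta, sd, alpha=0.05, power=0.80):
    """Per-arm sample size for comparing two means with a common SD."""
    return ceil(2 * (z_total(alpha, power) * sd / delta) ** 2)

# Hypothetical inputs: 30% control response with an odds ratio of 3.0,
# and a difference of 0.3 on a continuous score with SD 0.4.
print(n_per_arm_binary(0.30, 3.0))
print(n_per_arm_continuous(0.30, 0.4))
```

In both formulas the required sample size falls as the treatment-control difference grows and rises with the variability of response, which is why the subtrial comparison turns on these two quantities.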
To understand our results better, we investigated, for each of the higher and lower disease activity subtrials, the response in the active treatment and placebo groups and the variation around that response. For example, if the higher disease activity groups needed a larger sample size, we anticipated two explanations: the difference in response between active and placebo treatments would be smaller in that disease activity subgroup and/or the variability in response would be greater in the placebo group and less in the active treatment group. The difference and the variation would thus determine the sample size requirement. The effect size, which equals the difference between treatments divided by the pooled standard deviation of the differences, is a standardised measure of the effect of a treatment compared with control. We used it to compare the efficacy of treatment compared with control in each subtrial within each trial to get a sense of the consistency of our findings across trials.
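As a minimal sketch of the effect size calculation, the function below assumes the conventional standardised mean difference, in which the pooled standard deviation is computed from the two treatment groups. Whether this matches the authors' exact pooling is an assumption, and the response values are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance

def effect_size(active, control):
    """Difference in mean response divided by the pooled standard deviation."""
    n1, n2 = len(active), len(control)
    pooled_var = ((n1 - 1) * variance(active) + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return (mean(active) - mean(control)) / sqrt(pooled_var)

# Hypothetical ACRHybrid-like responses on a 0 to 1 scale.
active = [0.5, 0.6, 0.7]
control = [0.2, 0.3, 0.4]
print(effect_size(active, control))
```

A larger effect size in one subtrial than in another means that the same treatment difference stands out more clearly against the background variability.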
Of the six multicentre trials, two were trials of tumour necrosis factor alpha inhibitor compared with placebo, two were trials of conventional second-line drugs compared with placebo and two compared combinations of drugs with single drugs. All were reported as positive for either the active treatment or the combinations of treatments being tested.
All of these trials excluded patients with joint counts below a certain threshold, a threshold that varied by trial. The minimum TJC permitted ranged from 5 to 12 and the minimum SJC ranged from 3 to 10 and agreed with the trial protocols. We normalised joint counts to 68 tender and 66 swollen (some trials had potential joint counts less than this). Using the 68 joint TJC and a 66 joint SJC, median counts for patients in trials were 27 tender joints and 18 swollen joints. Subtrials were created with subjects below and above these normalised medians in each trial.
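The text does not say how counts from reduced joint sets were normalised. One simple possibility, shown here purely as an assumption, is proportional scaling to the 68 or 66 joint denominator; the function name and example values are hypothetical.

```python
def normalise_count(observed, joints_assessed, target):
    """Scale a joint count proportionally to a target number of joints.

    Assumption: simple proportional scaling, e.g. 14 of 28 joints becomes
    34 of 68. The trials' actual normalisation method is not stated.
    """
    if not 0 <= observed <= joints_assessed:
        raise ValueError("observed count must lie between 0 and joints_assessed")
    return observed * target / joints_assessed

# Hypothetical patient assessed on a 28-joint count.
print(normalise_count(14, 28, 68))  # tender joints scaled to 68
print(normalise_count(9, 28, 66))   # swollen joints scaled to 66
```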
When we divided subjects in the trials according to the median TJC (see table 1), we found that, regardless of how we assessed the outcome, trials of those with TJC below the median required smaller sample sizes than those with counts at or above the median. For example, if we used the ACRHybrid as our outcome measure, restricting trials to those with TJC below the median, a trial with 80% power to detect a treatment difference of 0.30 on the ACRHybrid score would require a sample size per treatment arm of 30, whereas for patients with higher TJC, the sample size requirement would be 53 per treatment arm (see table 2). We found similar results when we used dichotomous outcomes—EULAR good response and ACR20. For all the outcomes we tested, trials had greater sensitivity to change and required smaller sample sizes if patients with lower TJC were studied.
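Because the ACRHybrid example fixes the treatment difference at 0.30, the gap between 30 and 53 subjects per arm must come from different response variability in the two subgroups. The sketch below back-calculates the standard deviations implied by those sample sizes under a normal-approximation formula. Both the formula and the resulting values are assumptions for illustration; they are approximate because the reported sample sizes are rounded, and they are not figures reported by the trials.

```python
from math import sqrt
from statistics import NormalDist

def implied_sd(n_per_arm, delta, alpha=0.05, power=0.80):
    """SD implied by a per-arm sample size under n = 2 * (z * sd / delta) ** 2."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)
    return delta * sqrt(n_per_arm / 2) / z

sd_lower = implied_sd(30, 0.30)   # lower TJC subtrials
sd_higher = implied_sd(53, 0.30)  # higher TJC subtrials
variance_ratio = (sd_higher / sd_lower) ** 2
print(round(sd_lower, 3), round(sd_higher, 3), round(variance_ratio, 2))
```

Under this approximation the variance ratio equals the ratio of the two sample sizes, so a higher TJC trial needing 53 rather than 30 subjects per arm corresponds to response variance that is roughly three-quarters larger.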
We explored why lower TJC increased sensitivity to change by examining treatment and control responses in each of the trials. For the ACRHybrid, a continuous measure of response, we found that the control group with higher TJC showed a greater median response than the control group with lower TJC for five out of six trials (see fig 1), and for active treatment groups, the median treatment response was actually higher for those with lower TJC in five out of six trials.
We calculated the effect size of treatment in each subtrial. In four out of six trials, effect sizes were greater in the lower than in the higher TJC subgroups, and in one of the other two the effect size was almost identical (table 3).
For ACR20 and EULAR good response, response rates for active treatment were higher in the lower than the higher TJC group in most trials (see fig 2). This created a larger difference in response rate comparing subjects on active treatment with control subjects.
For SJC, the findings were by no means as clear (see table 3). The sensitivity to change did not differ greatly in those with SJC above versus below the median. For EULAR good response, sample size requirements were slightly higher for the subgroup with lower SJC, whereas for ACR20, sample size needs were slightly greater for the subgroup with higher counts. There were no consistent trends in trials of lower versus higher SJC (see figs 3 and 4). In some subgroups, patients with higher SJC had a greater active treatment response rate, whereas in others, those with lower SJC did better on active treatment. Control group responses also varied across trials.
In an analysis of multiple RA trials, we found that if trials were restricted to patients with lower TJC, treatments could be more easily detected as statistically significantly superior to placebo or to a comparator than if trials were restricted to those with higher TJC. We did not find the same trend with SJC. The reasons behind this result include a higher response rate to the control treatment (usually placebo) in individuals with higher TJC and greater variability of response for the placebo-treated patients in those with higher TJC. Even the response of the active treatment group tended to be greater among individuals with lower TJC, regardless of whether this response was assessed as a percentage change in disease activity (eg, ACRHybrid or ACR20) or as absolute change in activity (EULAR good).
Although patients with higher disease activity might be expected to have a greater absolute response in their disease activity when treated with active treatments, this was not consistently true. Furthermore, we suggest that those with lower disease activity may be better candidates for trials. The reason is not intuitively obvious and does not necessarily derive from their better or worse response to active treatment. Rather, it derives partly from the placebo (or control) group’s response in these situations. For ACRHybrid and ACR20, placebo groups tended to show greater response if patients started with higher disease activity. That makes it harder in higher disease activity patients to distinguish between active treatment and placebo. On the other hand, among individuals with lower disease activity, placebo responses were worse (four out of six trials for ACR20) and treatment responses were often better.
To achieve a EULAR good response, a patient must both experience an absolute decrease in the DAS and reach a low DAS threshold. EULAR good response rates were uniformly higher in those with lower TJC than in those with higher counts, regardless of whether they were in the active treatment or placebo groups (see fig 2B). Because patients with lower joint counts start closer to that threshold than those with higher counts, the high EULAR good response rates in those with baseline lower joint counts may reflect this. The net decrease in counts was greater in active treatment than placebo, and the consequence was that, for EULAR good response also, sample size requirements were less for those with lower TJC than for those with higher counts.
One reason underlying our results may be regression to the mean. Patients with higher disease activity may not consistently have such higher levels of disease activity, but rather may enter a trial when they are at an apex in terms of their disease activity level. Their natural course may be to regress to their own mean and have lower disease activity. This will occur whether they are treated with placebo or active treatment and will make it hard to detect the added effect of active treatment on their improvement. The variability of placebo response suggests that some patients with higher disease activity are experiencing regression to the mean whereas others are not.
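Regression to the mean can be illustrated with a small simulation. Every number here is hypothetical: a usual TJC of about 20, fluctuations with a standard deviation of 6 and an entry threshold of 27. No treatment effect is simulated, so any fall in counts among enrolled patients arises purely from selecting them at a high point.

```python
import random
from statistics import mean

def simulate_regression_to_mean(n=20000, threshold=27, seed=0):
    """Enrol patients whose entry count is high; compare entry and follow-up means."""
    rng = random.Random(seed)
    baseline, followup = [], []
    for _ in range(n):
        usual = rng.gauss(20, 6)            # patient's own long-run mean count
        at_entry = usual + rng.gauss(0, 6)  # fluctuation on the screening day
        later = usual + rng.gauss(0, 6)     # independent fluctuation at follow-up
        if at_entry >= threshold:           # eligibility requires a high entry count
            baseline.append(at_entry)
            followup.append(later)
    return mean(baseline), mean(followup)

base, follow = simulate_regression_to_mean()
print(round(base, 1), round(follow, 1))
```

In this toy model the enrolled group improves on average without any treatment, which is the pattern that would inflate control group responses among patients selected for high counts.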
Why do our results differ for swollen and tender joints? Swollen joints are more stable and vary less with improvement.7 That may make them less susceptible to regression to the mean.
We also explored whether the subgroup analyses were valid, whether differences observed between treatment and control in the two subgroups were due purely to sample size differences (we used the overall median count rather than the trial-specific median), and whether the initial randomisations were preserved to make the statistical comparison valid. Even though we used an overall median count for all trials, there were almost equal numbers of subjects in each trial in the TJC higher versus lower subgroups and in the SJC higher versus lower subgroups. Furthermore, the treatment-control assignment ratios almost matched the original assignment ratios, showing little evidence of violating randomisation.
In an era when RA in the USA is becoming milder (perhaps due to better treatments), our results may have important implications for both practice and the design and conduct of RA trials. Our results suggest that recruiting more patients with milder disease (defined as TJC in the lower end of current inclusion criteria) will make it easier to detect treatment effects.
The major limitation in our ability to address the relation of low joint counts to treatment response definitively is that trials that we analysed were performed with restrictive inclusion criteria, prohibiting us from testing the sensitivity to change of treatment versus placebo in patients with substantially lower disease activity than was present in these trials. Furthermore, even though we took large trials and divided them in half, because of relatively small numbers, we are unable to subdivide this activity level into smaller increments to extrapolate our results to even lower disease activity levels, such as TJC less than six.
In conclusion, RA trials would probably be more efficient in detecting the efficacy of treatments if they included patients with lower, not higher, disease activity, especially lower TJC. This intuitively paradoxical finding is based on the high variability in placebo response in those with higher disease activity and on a higher placebo response rate in this group. Moreover, the response to active treatment is at least as robust in those with lower joint counts as in those with higher counts.
Competing interests: None.
Funding: This study was supported by NIH AR47785 and by a grant from the American College of Rheumatology to re-evaluate response criteria in rheumatoid arthritis.
Please address reprint requests to Dr B Zhang at Suite 200, 650 Albany Street, Boston University School of Medicine, Boston, MA 02118, USA.