## Abstract

To increase the contrast between groups, randomised clinical trials (RCTs) often include patients at high risk for a particular outcome, using inclusion criteria based on predictors of that outcome. This increases the statistical power, so that fewer patients are required for the RCT. The way in which patient selection influences the power, and thus the sample size required, depends on how an intervention reduces the individual risk: by an absolute or a relative risk reduction model.

- ARR, absolute risk reduction
- CsA, ciclosporin A
- DMARDs, disease modifying antirheumatic drugs
- MTX, methotrexate
- RA, rheumatoid arthritis
- RCT, randomised controlled trial
- RRR, relative risk reduction
- SSZ, sulfasalazine

- rheumatoid arthritis
- radiography
- joint damage
- sample size
- risk reduction
- randomised clinical trials

To increase the contrast between an intervention and a control group, randomised clinical trials (RCTs) often include patients with a high prior risk for the outcome of interest by selecting according to baseline predictors for that outcome. The general view is that such a strategy limits the number of patients required, by increasing the statistical power. In RCTs in rheumatoid arthritis (RA), one important outcome is progression of radiological damage. It is known that joint damage present at baseline (prior risk) is one of the strongest predictors for further progression of radiological joint damage.^{1–4} Therefore, we expected that fewer patients would be needed for an RCT if patients were selected for the presence of joint damage at baseline. Somewhat surprisingly, power calculations showed that this expectation was false. To elucidate this counterintuitive observation, we performed a literature search, and found that statistical power is determined by a critical relationship between the prior risk for a particular dichotomous outcome (this is the baseline risk, independent of treatment) and the way in which a particular treatment exerts its effect with respect to this prior risk. We will explain this relationship step by step, and illustrate it with hypothetical and authentic data.

## ABSOLUTE AND RELATIVE RISK REDUCTION

The risk of an individual patient for a particular outcome can be reduced in two different ways, according to the following two models: the absolute risk reduction (ARR) model, and the relative risk reduction (RRR) model. To understand these models it is necessary to be familiar with the terms ARR and RRR.

Consider an RCT comparing an active treatment with placebo. The outcome of interest is binomial (*event* or *no event*) and the active treatment can reduce the probability of that event. So, the event rate in the treatment group (P_{t}) is supposed to be lower than the event rate in the placebo group (P_{p}). ARR is the difference in the event rates between the placebo and the treatment groups (P_{p}−P_{t}). Relative risk is defined here as the ratio of the event rate in the treatment group to that in the placebo group (P_{t}/P_{p}). RRR is defined as the reduction of the event rate in the treatment group, in proportion to that in the placebo group, or: (P_{p}−P_{t})/P_{p}. This is mathematically equivalent to 1 minus the relative risk, that is, (1 − P_{t}/P_{p}). RRR is usually expressed as a percentage, that is, (1 − P_{t}/P_{p})×100%.
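The definitions above can be written out as a short Python sketch. The function name `risk_reductions` and the event rates passed to it are illustrative assumptions, not values from any trial.

```python
def risk_reductions(p_placebo: float, p_treatment: float) -> dict:
    """Return ARR, relative risk, and RRR for two event rates given as proportions."""
    if not (0 < p_placebo <= 1 and 0 <= p_treatment <= 1):
        raise ValueError("event rates must be proportions, with p_placebo > 0")
    arr = p_placebo - p_treatment   # absolute risk reduction: P_p - P_t
    rr = p_treatment / p_placebo    # relative risk: P_t / P_p
    rrr = 1 - rr                    # relative risk reduction: (P_p - P_t) / P_p
    return {"ARR": arr, "RR": rr, "RRR": rrr}


# Illustrative (hypothetical) event rates: 60% on placebo, 30% on treatment.
result = risk_reductions(0.60, 0.30)
print({name: f"{value:.0%}" for name, value in result.items()})
```

Multiplying the returned proportions by 100 gives the percentages used in the text.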

If a treatment works according to an RRR model, the RRR remains constant, irrespective of the prior risk, whereas the ARR varies with the prior risk. If a treatment works according to an ARR model, the ARR remains constant irrespective of prior risk and then the RRR varies with different prior risks.

## APPLICATION OF RRR AND ARR IN THREE SUBGROUPS

Suppose that we distinguish three subgroups, defined by the prior risk for a particular outcome: one group with a low prior risk, one group with an intermediate prior risk, and one group with a high prior risk. Note that prior risk does not directly refer to the event rate in the control group, but rather to patient characteristics (such as age, sex, disease aetiology, concomitant conditions, or disease status) measured at baseline, and known for their ability to influence the probability of the outcome. Suppose also that without any (adequate) treatment the outcome will occur in 20%, 60%, and 90% of the patients in each group, respectively. Assume that treatment works purely according to the RRR model (table 1.1), providing an RRR of 50%. So, treatment will lower the event rate from 20% to 10% (50% reduction) in the low risk group, from 60% to 30% (50% reduction) in the intermediate risk group, and from 90% to 45% (50% reduction) in the high risk group. The ARR is now 10%, 30%, and 45%, respectively. These figures can be used to calculate estimated sample sizes for future clinical trials. We calculated the sample sizes for two sided statistical testing using the power calculator of the UCLA department of statistics with the two sample arcsine approximation of the binomial distribution, with α set at 0.05 and β at 0.20 (that is, 80% power).^{5} From table 1.1 (sample sizes in each group) it is obvious that the baseline risk strongly influences the required sample size. In other words: in this scenario, selecting patients at high risk will increase the statistical power of a trial.
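The RRR scenario can be explored with a minimal sketch of the two sample arcsine approximation. This is the textbook form of that approximation (effect size h = 2·asin√p₁ − 2·asin√p₂, with n per group = 2·((z₁₋α/₂ + z₁₋β)/h)²), not the UCLA calculator itself, so rounding may differ from the tabulated values. The function name `n_per_group_arcsine` is an assumption made for illustration.

```python
from math import asin, ceil, sqrt
from statistics import NormalDist


def n_per_group_arcsine(p_control, p_treated, alpha=0.05, power=0.80):
    """Approximate patients per arm for a two sided test of two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two sided critical value
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - beta
    # Effect size on the variance-stabilised (arcsine square root) scale.
    h = 2 * asin(sqrt(p_control)) - 2 * asin(sqrt(p_treated))
    return ceil(2 * ((z_alpha + z_beta) / h) ** 2)


RRR = 0.50  # constant relative risk reduction, as assumed in the text
prior_risks = {"low": 0.20, "intermediate": 0.60, "high": 0.90}

sizes = {}
for label, p_control in prior_risks.items():
    p_treated = p_control * (1 - RRR)  # 20%->10%, 60%->30%, 90%->45%
    sizes[label] = n_per_group_arcsine(p_control, p_treated)
    print(f"{label:>12}: {p_control:.0%} -> {p_treated:.0%}, n per group = {sizes[label]}")
```

Under a constant RRR the required number per group falls as the prior risk rises, which is the pattern described above.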

Now assume that a treatment works purely according to the ARR model (table 1.2), providing an ARR of 10%. Based on the same prior risk percentages, treatment will lower the event rate from 20% to 10% (10% reduction) in the low risk group, from 60% to 50% (10% reduction) in the intermediate risk group, and from 90% to 80% (10% reduction) in the high risk group. The RRR is now 50% ((1 – (10%/20%))×100) in the low risk group, 17% ((1 – (50%/60%))×100) in the intermediate risk group, and 11% ((1 – (80%/90%))×100) in the high risk group, respectively. If sample sizes are again estimated using these figures, it becomes clear that the same sample sizes are required in the low and high risk groups to statistically demonstrate an ARR of 10%. But to demonstrate an ARR of 10% in the intermediate risk group, a much larger sample size is needed. This can be explained by the fact that in intermediate risk groups the probability of a negative outcome is nearly equal to the probability of a positive outcome, so that the binomial variance, p(1 − p), is close to its maximum. Note that an intermediate risk group may either include patients with an intermediate prior risk only, or a mix of patients with a low, high, and intermediate risk.
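The ARR scenario can be sketched in the same way. The code below again uses the textbook two sample arcsine approximation rather than the UCLA calculator, and the function name `n_per_group_arcsine` is an illustrative assumption. Because asin√(1 − p) = π/2 − asin√p, the effect sizes for 20% → 10% and 90% → 80% are identical on this scale, which is why the low and high risk groups need the same sample size.

```python
from math import asin, ceil, sqrt
from statistics import NormalDist


def n_per_group_arcsine(p_control, p_treated, alpha=0.05, power=0.80):
    """Approximate patients per arm for a two sided test of two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two sided critical value
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - beta
    # Effect size on the variance-stabilised (arcsine square root) scale.
    h = 2 * asin(sqrt(p_control)) - 2 * asin(sqrt(p_treated))
    return ceil(2 * ((z_alpha + z_beta) / h) ** 2)


ARR = 0.10  # constant absolute risk reduction, as assumed in the text
prior_risks = {"low": 0.20, "intermediate": 0.60, "high": 0.90}

sizes = {}
for label, p_control in prior_risks.items():
    p_treated = p_control - ARR  # 20%->10%, 60%->50%, 90%->80%
    rrr = ARR / p_control        # RRR now varies with the prior risk
    sizes[label] = n_per_group_arcsine(p_control, p_treated)
    print(f"{label:>12}: RRR = {rrr:.0%}, n per group = {sizes[label]}")
```

With a constant ARR, the intermediate risk group requires the largest number of patients per arm.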

To summarise: depending on how a treatment reduces a person’s risk of a particular negative outcome (according to an ARR or an RRR model), the patient selection for the trial with respect to the prior risk may influence the statistical power, and accordingly the sample size required for that trial. If a treatment (mainly) acts according to the RRR model, trials including patients with prognostic variables for a negative outcome (high prior risk) will yield more statistical power, and require a smaller sample size. If a treatment acts mainly according to the ARR model, selection of patients according to prognostic variables has a different effect on sample size. From a statistical point of view, it would then be wise to avoid groups of patients with an average prior risk for the outcome of approximately 50%. Trials with such patient groups provide less statistical power than trials with patient groups with a lower or a higher prior risk.

In the sparse literature that has evaluated the relationship between treatment effect and prior risk, it is suggested that the RRR is constant across the usual spectrum of prior risks for the vast majority of treatments (that is, they follow the RRR model).^{6,7} Nevertheless, exceptions have been described. A large study evaluating the stroke risk in patients treated with aspirin showed a decreasing RRR with increasing prior risk.^{8} However, the ARR also decreased with increasing baseline risk, so a mix of both models seemed to be operative. Obviously, treatments do not always act strictly according to the RRR or the ARR model: mixed models can also be found.

## PRIOR RISKS, MODELS OF RISK REDUCTION, AND TREATMENT EFFECT

To the best of our knowledge, no research has been performed in rheumatology to examine the relationship of prior risks and models of risk reduction with treatment effect. Recently, radiographic data of an RCT comparing methotrexate (MTX) + ciclosporin A (CsA) versus CsA monotherapy have become available.^{9} We used these data to investigate this relationship. Treatment effect was defined here as the reduction in the proportion of patients with the outcome, namely radiographic progression greater than or equal to the median group level at 1 year. The prior risk for radiographic progression (baseline risk) was based on the radiological damage at baseline, above or below the median. Table 1.3 shows that the RRR was not similar in the two baseline risk groups (50% in the low risk group *v* 37% in the high risk group). The addition of MTX to CsA appeared not to follow a pure RRR model, but rather an ARR model: the ARR was approximately similar in both risk groups (25% *v* 29%).

We further explored the radiographic data of the COBRA trial (table 1.4), in order to confirm the phenomenon of a decreasing RRR by an increasing baseline risk. In the COBRA trial,^{10} combination treatment with step down prednisolone, MTX, and sulfasalazine (SSZ) was compared with SSZ monotherapy. We again defined treatment effect as the reduction in the number of patients with radiographic progression above the median level in 1 year.

Table 1.4 shows that the ARR (combination treatment, compared with SSZ alone) was similar in the two risk groups (22% *v* 25%), and that the RRR decreased by increasing baseline risk (54% *v* 30%).

We statistically compared the treatment effects in both subgroups of baseline damage, in order to test the null hypothesis that the relative risk reductions were similar (test of interaction) as recommended by Matthews and Altman, and Altman and Bland.^{11,12} In neither of the studies could a significant interaction be demonstrated (p = 0.29 for the difference in treatment effects in the MTX + CsA study, and p = 0.17 for the difference in treatment effect in the COBRA study). For the test of interaction, as well as for an interpretation and discussion of the lack of statistical interaction, we refer to appendix 1.

These observations cannot, of course, be generalised to the effect of all disease modifying antirheumatic drugs (DMARDs) on radiological joint damage or on other outcome measures. However, if it is true that some DMARDs decrease radiological progression by an ARR rather than an RRR model, selection of patients with an average prior risk for radiological progression of about 50% should be avoided when an RCT with such a DMARD is carried out (table 1.2). But only a proportion of all patients with RA who eventually show radiological progression will have radiological joint damage at inclusion. This makes it difficult to selectively enrol patients with a high baseline risk, and the actual patient accrual may include patients with intermediate (∼50%) rather than with high baseline risk.

## FURTHER COMMENTS

We should like to make a few additional remarks. Firstly, a person who is developing a trial has to make a choice between aiming at a mixture of high, intermediate, and low risk patients, and focusing on just one category. For generalisability one may choose to include patients at all types of risk. However, we showed here that this might lead to larger sample sizes. On the other hand, one should consider whether the preferred inclusion of high risk patients is feasible. If high risk patients are difficult to include for any reason, the argument of an appropriate recruitment rate may outweigh the argument of limited sample sizes by the selective inclusion of high risk patients.

Patient selection in RCTs is often based on characteristics that are predictive of a certain outcome. The aim of this report was partly to show that statistical power is dependent on the level of that prior risk, as well as on how treatment actually reduces that risk. This is a different approach from selecting patients on the individual likelihood of *responding* to a particular treatment. Selecting patients with a high probability of responding to a particular treatment will also result in a larger treatment effect and thereby an increasing ARR and RRR. This means more statistical power, and as a consequence, a smaller required sample size. Increased responsiveness at a group level in RCTs can be promoted in two different ways.^{13} The first is by taking measures that result in a general increase of a patient’s compliance. Those who take their medicine (appropriately) might respond better than those who do not take their medicine (appropriately). The second way to promote responsiveness is by identifying subgroups of patients who are intrinsically responsive to the particular treatment. But patient characteristics that are predictive of a high response to any treatment have hardly been identified in rheumatology up to now. This will be an important research area in the future.

Our final remark refers to the models used to describe the relationship between the prior risk and the treatment effect. These models are based on discrete binomial outcome measures. The primary outcome measure of a trial can also be a continuous variable. Although the models cannot simply be translated to continuous measures, there are no arguments as to why treatment effects should act differently when measured on a continuous scale as compared with a dichotomous one.

We conclude that selecting patients according to factors predictive of the outcome that treatment is intended to influence has an impact on the statistical power of RCTs with dichotomous outcome measures. The precise direction of that impact depends not only on the level of prior risk but also on whether the ARR or the RRR remains constant irrespective of the prior risk. As a rule of thumb, statistical power is best guaranteed by selecting high risk patients, because this scenario removes the dependency on the type of risk reduction. Better insight into prediction of individual responsiveness may further increase statistical power and decrease the required sample size.

## APPENDIX 1

The statistical proof of whether an ARR or RRR model is operative is difficult and circumstantial. The statistical inference refers to the comparison of treatment effects across subgroups of an RCT. The null hypothesis is that the difference of relative risks in both subgroups is zero (the so-called test of interaction).

Treatment effects are represented here as relative risks (for example, the risk of having radiographic progression above the median when treated with CsA + MTX in relation to that risk when treated with CsA + placebo). The two subgroups are the low baseline risk group and the high baseline risk group.

In an ARR model, the absolute risk reductions are similar in both subgroups, whereas the relative risk reductions in both subgroups are different. The latter effect refers to interaction, and can be tested statistically, as shown below. The equivalence of absolute risk reductions is difficult to prove. Theoretically, the proof of different relative risk reductions does not suffice, because this does not prove that absolute risk reductions in both subgroups are similar.

Here we show the test of interaction performed in both clinical trials that we have used to corroborate the argument of the ARR model in our article. Tables 2.1 and 2.2 show the raw data of both trials, which are also shown in the article, but now tabulated in a different order, in order to improve readability with respect to the inferences below. Note that we have calculated the relative risk here as a starting point for the inferences. We have also added, as an illustration, the absolute and relative risk reductions calculated in the article.

Table 3 shows the statistical inferences necessary to test for interaction in both trials. The inference is complicated because of the logarithmic transformation (and the necessary back-transformation). The test of interaction is not statistically significant in either of the trials, indicating that the null hypothesis that both relative risk reductions are similar cannot be rejected (*or*: it is not proved that the relative risk reductions in both subgroups are different). Because the subgroup sizes are small (lack of statistical power), because a test of interaction is quite conservative, and because the difference of relative risk reductions is considerably higher than the difference of absolute risk reductions in both trials, the failure to prove interaction here does not disqualify our statement that an ARR rather than an RRR model is operative here. (Note in table 3 that the ratio of relative risks (sometimes abbreviated as RRR) is not the same as the relative risk reduction in this article (abbreviated by us as RRR).)
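The structure of such a test can be sketched as follows, assuming the usual large sample approach: compare the two subgroup relative risks on the log scale, with the standard error of each log relative risk taken as √(1/a − 1/n₁ + 1/c − 1/n₂). The counts below are hypothetical and are not the data of either trial, and the function names are illustrative assumptions.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def log_rr_and_se(events_trt, n_trt, events_ctl, n_ctl):
    """Log relative risk (treatment v control) and its large sample standard error."""
    rr = (events_trt / n_trt) / (events_ctl / n_ctl)
    se = sqrt(1 / events_trt - 1 / n_trt + 1 / events_ctl - 1 / n_ctl)
    return log(rr), se


def interaction_test(subgroup_1, subgroup_2):
    """Compare two subgroup relative risks on the log scale (test of interaction)."""
    log_rr_1, se_1 = log_rr_and_se(*subgroup_1)
    log_rr_2, se_2 = log_rr_and_se(*subgroup_2)
    diff = log_rr_1 - log_rr_2          # difference of log relative risks
    se_diff = sqrt(se_1**2 + se_2**2)   # subgroups are independent
    z = diff / se_diff
    p = 2 * (1 - NormalDist().cdf(abs(z)))  # two sided p value
    # Back-transform to obtain the ratio of relative risks and its 95% CI.
    ci = (exp(diff - 1.96 * se_diff), exp(diff + 1.96 * se_diff))
    return {"ratio_of_rr": exp(diff), "ci95": ci, "z": z, "p": p}


# Hypothetical counts: (events treated, n treated, events control, n control).
low_risk = (10, 40, 20, 40)    # relative risk 0.50
high_risk = (20, 40, 30, 40)   # relative risk 0.67
outcome = interaction_test(low_risk, high_risk)
print(outcome)
```

A confidence interval for the ratio of relative risks that includes 1 corresponds to a non-significant test of interaction at the 5% level.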