
## Abstract

It is quite common to investigate multiple hypotheses in a single study. For example, a researcher may want to investigate the effect of a treatment on several outcome variables or at different time points, compare more than two groups, or undertake separate analyses for subgroups. Each additional test increases the probability of a type I error. Different procedures for multiplicity adjustment are available to control this probability. In the present article, we describe some methods for multiplicity adjustment, along with recommendations.


## Adjustment methods based on p values

Consider a study where six hypothesis tests are conducted. If all tests are made at a significance level of 5%, each of them will have a 5% probability of making a type I error, that is, erroneously rejecting the null hypothesis. The probability of a type I error in at least one of the hypothesis tests, also referred to as the familywise error rate (FWER), will then be substantially higher than 5%, and at worst almost 30%. Often it is desirable to keep the FWER within a predefined threshold, for example, 5%.
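The figures above are easy to verify: for m independent tests at significance level α, the FWER is 1 − (1 − α)^m, and the Bonferroni bound m·α (here 30%) caps the worst case. A minimal sketch in Python:

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """FWER for m independent tests at level alpha:
    P(at least one type I error) = 1 - (1 - alpha)^m."""
    return 1 - (1 - alpha) ** m

# Six independent tests at the 5% level:
print(round(familywise_error_rate(0.05, 6), 3))  # 0.265
```

With dependence between the tests, the exact FWER differs, but it never exceeds the Bonferroni bound of 6 × 5% = 30%.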

The simplest adjustment method is the Bonferroni correction: each p value is multiplied by the number of null hypotheses tested, in this case six. Each resulting ‘Bonferroni-adjusted’ p value can then be used to decide whether the corresponding null hypothesis can be rejected at the chosen significance level (usually 5%). However, the Bonferroni correction is very conservative, which means that the statistical power, and thereby the probability of detecting true effects, is greatly reduced. The Šidák correction achieves only a marginal improvement. Alternative methods, in order of increasing statistical power, are Holm’s step-down correction, Hochberg’s step-up correction and the Hommel correction.1 These alternative methods preserve the FWER for one-sided tests under some additional, but rather general, assumptions and can be generally recommended.
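All of these corrections are available in standard software. As an illustration (the six p values below are made up, not those of the article's table 1), a sketch using `multipletests` from statsmodels:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical p values from six hypothesis tests, sorted ascending.
pvals = np.array([0.005, 0.010, 0.020, 0.030, 0.040, 0.050])

# Methods in roughly increasing order of statistical power.
for method in ["bonferroni", "sidak", "holm", "simes-hochberg", "hommel"]:
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(f"{method:15s}", np.round(p_adj, 3))
```

For these values, Bonferroni rejects only the two smallest p values at the 5% level, while Hochberg's step-up correction rejects all six, illustrating the difference in power.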

In some situations, a very large number of hypotheses are tested. For example, genetics studies may involve several hundred thousand hypotheses, making it impractical to control the FWER. Instead, we have to content ourselves with controlling the false discovery rate (FDR). This means that we accept that a certain proportion, usually 5%, of the findings we declare significant may be false positives. When controlling the FWER, on the other hand, we would not ‘accept’ even a single false-positive finding. The most common method for controlling the FDR is the Benjamini-Hochberg correction.2 Controlling the FDR can also be relevant in studies with as few as, for example, 8–16 hypothesis tests, although its benefits are greater when testing a larger number of hypotheses.2
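The Benjamini-Hochberg step-up procedure is simple enough to sketch directly: sort the m p values, find the largest rank i with p(i) ≤ (i/m)·q, and reject the i hypotheses with the smallest p values. A minimal implementation (the example p values are made up):

```python
def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at level q.
    Returns a list of booleans: True where the null hypothesis is rejected."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Largest rank i (1-based) for which p_(i) <= (i/m) * q.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * q:
            k = rank
    # Reject the k hypotheses with the smallest p values.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k:
            reject[idx] = True
    return reject

# All five are rejected, whereas Bonferroni (threshold 0.05/5 = 0.01)
# would reject only the first.
print(benjamini_hochberg([0.005, 0.015, 0.025, 0.035, 0.045]))
```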

Let us look at an (imaginary) study that tested six hypotheses. Suppose the p values are as shown in table 1, ordered by increasing size. For the smallest p value, the effect of adjustment is practically the same across the different methods. For the subsequent (larger) p values, the adjustment has less impact as we move from left to right in the table, that is, towards the more powerful methods. The final column, with p values adjusted by the Benjamini-Hochberg correction, controls only the FDR. With only six hypothesis tests, a method preserving the FWER would usually be preferred.

## Pairwise comparisons between groups

In some studies, the researcher wants to compare three or more groups. This could, for example, be a randomised controlled trial including several treatments. Then, it is often relevant to conduct pairwise comparisons between the groups.

If the study includes three groups—A, B and C—up to three pairwise comparisons can be conducted. And, if the study includes four groups—A, B, C and D—up to six pairwise comparisons are possible: A–B, A–C, A–D, B–C, B–D and C–D. In order to control the FWER, one could calculate a p value for each of the pairwise comparisons and then adjust these p values using one of the methods listed above. However, there exist methods that take the pairwise structure of the hypotheses into account and have substantially higher statistical power. Choosing among them is not straightforward: an overview in Kirk3 lists a total of 16 recommended methods for different sets of assumptions. For example, if the data are normally distributed, Tukey’s test is recommended for all pairwise comparisons, and Dunnett’s test for comparisons with a control group only. But even this recommendation is valid only for groups of approximately equal size and equal variance. Choosing the most appropriate method for more than three groups can be difficult, even when the data are normally distributed.
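Where the assumptions for Tukey's test hold (roughly normal data, groups of approximately equal size and variance), it is readily available in software. A minimal sketch on simulated data (the group means and sizes are illustrative) using SciPy's `tukey_hsd`:

```python
import numpy as np
from scipy.stats import tukey_hsd

rng = np.random.default_rng(42)
# Three hypothetical groups of equal size; group C has a shifted mean.
a = rng.normal(loc=0.0, scale=1.0, size=30)
b = rng.normal(loc=0.0, scale=1.0, size=30)
c = rng.normal(loc=1.0, scale=1.0, size=30)

res = tukey_hsd(a, b, c)
# res.pvalue is a symmetric 3x3 matrix of FWER-adjusted pairwise p values.
print(np.round(res.pvalue, 3))
```

Recent SciPy versions also provide `scipy.stats.dunnett` for the situation where several groups are compared against a single control group.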

## Comparisons between three groups

If the study includes only three groups, as is often the case, there is a much simpler procedure that does not rely on any assumptions concerning, for example, distribution or group size: first, the global p value is calculated for the null hypothesis that all three groups are identical. If this p value is significant, an unadjusted p value is calculated separately for each of the three pairwise comparisons, and the pairs with p value below the (unadjusted) significance level are declared significantly different. If it is desired to report actual p values, each of these three p values is adjusted by replacing it with the global p value if the global p value is higher.4 This procedure always controls the FWER,5 but many researchers seem to be unaware of this fact. Even when the data are normally distributed and Tukey’s test could be used, this simple method gives a statistical power at least as high as Tukey’s test for three groups.6

If the data are normally distributed, we can estimate the global p value in a one-way analysis of variance and then make pairwise comparisons with t-tests. If non-parametric methods are used, we can first perform a global Kruskal-Wallis test, followed by pairwise Wilcoxon-Mann-Whitney tests. And, if the data are categorical, we can first perform Pearson’s χ² test for three groups and then Pearson’s χ² tests for each of the three pairwise comparisons. It must be emphasised that the method described is restricted to three groups. For example, when three different treatment groups are compared with a control group, four groups are involved, and this procedure will not control the FWER.
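The non-parametric variant of the three-group procedure can be sketched as follows (function name and simulated data are illustrative): a global Kruskal-Wallis test, then unadjusted pairwise Wilcoxon-Mann-Whitney tests, with each pairwise p value reported as the maximum of itself and the global p value:

```python
import numpy as np
from scipy.stats import kruskal, mannwhitneyu

def three_group_comparisons(x, y, z):
    """Closed three-group procedure, non-parametric variant.
    Returns the global Kruskal-Wallis p value and the three
    pairwise p values, each replaced by the global p value
    whenever the global p value is higher."""
    p_global = kruskal(x, y, z).pvalue
    pairs = {"x-y": (x, y), "x-z": (x, z), "y-z": (y, z)}
    adjusted = {}
    for name, (g1, g2) in pairs.items():
        p_pair = mannwhitneyu(g1, g2, alternative="two-sided").pvalue
        adjusted[name] = max(p_pair, p_global)
    return p_global, adjusted

# Simulated example: group z has a shifted distribution.
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 25)
y = rng.normal(0.0, 1.0, 25)
z = rng.normal(1.2, 1.0, 25)

p_global, p_adj = three_group_comparisons(x, y, z)
print(round(p_global, 4), {k: round(v, 4) for k, v in p_adj.items()})
```

Taking the maximum of the pairwise and global p values encodes the two-step decision rule directly: no pairwise comparison can be declared significant unless the global test is.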

## Always adjust?

Do we always need to adjust for multiple hypotheses? This is somewhat controversial. The famous epidemiologist Rothman argues against adjusting for multiplicity in some contexts.7 To put this in perspective: imagine a researcher who studies the effect of a treatment on three different outcome variables. Does he need to adjust for multiplicity if he splits the results into three different publications with only one hypothesis in each? Or should he perhaps adjust for all the hypotheses he has tested during his career? There are some alternatives to adjustment. In a study with several outcome variables, it is common to specify one of them as the primary outcome. Hypothesis tests are then performed without adjustment, but less weight is placed on findings for the secondary outcome variables. In other situations, it may be reasonable to choose a pragmatic solution, such as setting the significance level at 1% rather than 5%. This gives some protection against false positives, usually without reducing statistical power as much as a formal adjustment would.

The European Agency for the Evaluation of Medicinal Products8 and the US Food and Drug Administration9 have issued guidelines for handling multiplicity in clinical trials. These guidelines describe situations where no formal adjustment is needed, as well as other situations where a formal adjustment should be carried out. For example, if the treatment effect needs to be present for two or more primary outcome variables in order to be declared effective, no formal adjustment is needed.

In some settings, there is no general consensus regarding whether, and if so, how, one should adjust for multiple hypotheses. In any case, the choice of procedure must be specified in advance in the protocol or analysis plan in order to avoid ‘fishing’ for significant findings. And the number of statistical tests involved must be clearly stated.


## Acknowledgments

This article is partly based on these two earlier works by the author:

Lydersen S. Adjustment of p values for multiple hypotheses. Journal of the Norwegian Medical Association 2021;141(13). doi: 10.4045/tidsskr.21.0357.

Lydersen S. Pairwise comparisons of three groups. Journal of the Norwegian Medical Association 2021;141(14). doi: 10.4045/tidsskr.21.0359.

## Footnotes

Handling editor Josef S Smolen

Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

Competing interests None declared.

Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.

Provenance and peer review Not commissioned; externally peer reviewed.