Original articles
Should we always choose a nonparametric test when comparing two apparently nonnormal distributions?
Introduction
An appropriate statistical analysis of clinical data demands choosing an adequate statistical method. One decision that has to be made is whether to employ a parametric or a nonparametric method. The methods are based on different assumptions, and if these are violated, the statistical analysis may lead to erroneous conclusions (e.g., through misleading P-values). The parametric methods make assumptions regarding the shape of the distribution of observations, whereas nonparametric methods are commonly believed to make no such assumptions. The latter are therefore often referred to as distribution free.
If the assumptions made in the model formulation are incompatible with the data being analysed, the true significance level may be far from the nominal level that was specified. The common choice of nominal level is 5%. Obviously, we do not want to use a method with a higher probability of a false-positive result than planned. It may at first sight seem less obvious that a reduced level is a disadvantage, but a conservative test will usually lead to loss of power and the risk of drawing a false-negative conclusion will thus increase. In order to maintain the desired significance level and power of a test, it is therefore important to choose a test that applies to the problem at hand.
Clinical studies often compare the means of two independent groups of patients. If the observations are measured on a continuous scale, the statistical analysis is usually performed by the two-sample t test or the Wilcoxon–Mann–Whitney (WMW) test. There is no uniform agreement as to the choice between them, but the traditional recommendation seems to be a t test if the observations seem to be reasonably normally distributed, or if the number of patients in each sample is large, whereas the WMW test is recommended if sample sizes are small and the distributions seem to be skew.
Unfortunately, the choice between methods is not necessarily straightforward. Simple two-sample tests, whether parametric or not, make one important assumption: the two distributions that are compared are assumed to have the same shape and variance. This is referred to as a pure shift model or homoscedasticity. If the variances of the two distributions are not equal or if the distributions have different shapes, this assumption is violated. Tests allowing unequal variances of two normal distributions were developed more than 60 years ago [1]. It has previously been shown that the WMW test and the t test can have true significance levels that differ substantially from the nominal levels when the two population variances are not equal [2], [3], [4], [5], [6], [7]. This work has mostly been published in statistical journals. Some interest in the subject has been demonstrated in the behavioural sciences [8], and recommendations regarding the choice between tests based on power under certain distributional assumptions have been published (e.g., in Ref. [9]). The problem of heteroscedasticity (unequal shape or variance) seems generally not to be acknowledged in the medical literature and has received little attention in textbooks of applied statistics.
The aim of this study is to compare the properties of commonly used statistical tests. We have restricted the comparison to tests implemented in standard software: the two different versions of the t test (assuming equal and unequal variances, respectively) and the WMW test. Their properties are demonstrated in several situations that are typically met when clinical data are analysed. Various combinations of sample sizes, as well as shapes and variance of distributions are examined. The test properties are compared by stochastic simulation. A guide to choosing an appropriate test is also given.
A clinical example
After treatment with high-dose chemotherapy (HDT), a total of 35 patients with malignant lymphoma received peripheral blood progenitor cells (PBPC) mobilised with MIME/G-CSF [10]. Ten patients had Hodgkin's disease and 25 had non-Hodgkin's lymphoma. Time to neutrophil recovery was defined as the time from reinfusion of stem cells until the number of neutrophils exceeded 0.5 × 10⁹/L. Fig. 1 shows the time to neutrophil recovery in each diagnosis group.
Table 1 shows the results of different
Definition of the tests
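To test the hypothesis of equality of the means of two distributions, the two-sample t test is applied. Let $x_i$ denote the observations in Group A and $y_j$ the observations in Group B. The number of observations in each group is $m$ and $n$, respectively. Then
$$\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i$$
and
$$\bar{y} = \frac{1}{n}\sum_{j=1}^{n} y_j$$
are the estimated means of the two distributions, and
$$s_x^2 = \frac{1}{m-1}\sum_{i=1}^{m} (x_i - \bar{x})^2$$
and
$$s_y^2 = \frac{1}{n-1}\sum_{j=1}^{n} (y_j - \bar{y})^2$$
the estimated variances.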
The t test is based on the statistic and is known to be the best test if
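The statistic itself is not reproduced in this excerpt. As a rough illustration, the sketch below computes the textbook forms of the two t statistics from the group means and variances described above: the pooled-variance statistic (equal variances assumed) and the unequal-variance statistic with Welch-Satterthwaite degrees of freedom. That the unequal-variance form matches what the article calls the Welch U test is an assumption, and the function name `two_sample_statistics` is hypothetical.

```python
import math
import statistics


def two_sample_statistics(x, y):
    """Return (pooled t, unequal-variance t, approximate df for the latter)."""
    m, n = len(x), len(y)
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    # statistics.variance uses the divisors m - 1 and n - 1.
    var_x, var_y = statistics.variance(x), statistics.variance(y)

    # Pooled-variance t statistic (assumes equal variances), df = m + n - 2.
    pooled_var = ((m - 1) * var_x + (n - 1) * var_y) / (m + n - 2)
    t_pooled = (mean_x - mean_y) / math.sqrt(pooled_var * (1 / m + 1 / n))

    # Unequal-variance statistic; df from the Welch-Satterthwaite approximation.
    se2_x, se2_y = var_x / m, var_y / n
    t_unequal = (mean_x - mean_y) / math.sqrt(se2_x + se2_y)
    df_unequal = (se2_x + se2_y) ** 2 / (
        se2_x ** 2 / (m - 1) + se2_y ** 2 / (n - 1)
    )
    return t_pooled, t_unequal, df_unequal


# Made-up illustrative data, not taken from the clinical example.
t_pooled, t_unequal, df_unequal = two_sample_statistics(
    [1, 2, 3, 4, 5], [2, 4, 6, 8, 10]
)
print(t_pooled, t_unequal, df_unequal)
```

With equal group sizes the two statistics coincide and only the degrees of freedom differ; with unequal group sizes the statistics themselves diverge as the variances diverge.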
Different models
Fig. 2A shows a pure shift model of two normal distributions. The distributions have the same variance; only the means differ. For this model, the two-sample t test is known to be the best test. Fig. 2B shows a pure shift model where distributions are skew with a heavy right tail (gamma distributions with shape parameter a=3). For situations similar to the one illustrated in Fig. 2B, the WMW test is recommended if sample sizes are small.
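A pure shift model with skew distributions can be sketched by drawing from one gamma distribution and adding a constant to one group. The example below uses shape parameter 3 as in Fig. 2B; the scale of 1, the shift of 2, the sample size, and the helper name `shifted_gamma_sample` are illustrative assumptions, not values from the article.

```python
import random
from statistics import fmean, variance


def shifted_gamma_sample(size, shift, rng, shape=3.0, scale=1.0):
    """Draw `size` gamma(shape, scale) observations and add a constant shift."""
    return [rng.gammavariate(shape, scale) + shift for _ in range(size)]


rng = random.Random(2024)
group_a = shifted_gamma_sample(20000, shift=0.0, rng=rng)
group_b = shifted_gamma_sample(20000, shift=2.0, rng=rng)

# Under a pure shift the means differ by roughly the shift,
# while the variances (and the shape) stay roughly the same.
print(fmean(group_b) - fmean(group_a))
print(variance(group_b) / variance(group_a))
```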
Fig. 2C shows a situation with two normal distributions that do
Simulations
The properties of the three tests have been examined by stochastic simulation. The simulation programs were written in SIMULA [11] and executed on a SUN computer at the University of Oslo.
In the simulation program, independent samples were drawn from two distributions with the same shape and mean, but possibly different variances. The ratio between the variances was varied between 1/9 and 9. In terms of S.D., this corresponds to ratios between 1/3 and 3. The parameter of importance is this ratio; the actual
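The simulation idea can be sketched in a few lines of Python. This is not the original SIMULA program; it is a minimal illustration under stated assumptions: normal distributions with mean 0, m = n = 10, a hard-coded two-sided 5% critical value of 2.101 for Student's t with 18 degrees of freedom, and a normal approximation with continuity correction for the WMW test. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean, variance

# Two-sided 5% critical value of Student's t with 18 df; valid only for m + n - 2 = 18.
T_CRIT_DF18 = 2.101


def pooled_t_rejects(x, y, t_crit=T_CRIT_DF18):
    """Equal-variance t test decision at the 5% level (assumes m + n - 2 = 18)."""
    m, n = len(x), len(y)
    pooled = ((m - 1) * variance(x) + (n - 1) * variance(y)) / (m + n - 2)
    t = (fmean(x) - fmean(y)) / math.sqrt(pooled * (1 / m + 1 / n))
    return abs(t) > t_crit


def wmw_rejects(x, y, alpha=0.05):
    """WMW decision via the normal approximation (assumption: no ties)."""
    m, n = len(x), len(y)
    u = sum(1 for xi in x for yj in y if xi > yj)  # pairs with x_i > y_j
    mu = m * n / 2
    sigma = math.sqrt(m * n * (m + n + 1) / 12)
    z = max(0.0, abs(u - mu) - 0.5) / sigma  # continuity correction
    return 2 * (1 - NormalDist().cdf(z)) < alpha


def estimated_level(rejects, sd_ratio, m=10, n=10, reps=4000, seed=1):
    """Share of rejections when both groups have the same mean (null is true)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(m)]
        y = [rng.gauss(0.0, sd_ratio) for _ in range(n)]
        hits += rejects(x, y)
    return hits / reps


for ratio in (1 / 3, 1.0, 3.0):
    print(ratio, estimated_level(pooled_t_rejects, ratio), estimated_level(wmw_rejects, ratio))
```

Because both groups share the same mean, every rejection is a false positive, so the rejection share estimates the true significance level for the given S.D. ratio. Unequal sample sizes would need a different critical value for the t test, and the unequal-variance test needs a t quantile for fractional degrees of freedom, which is left out of this sketch.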
Sample sizes equal, m=n=10
Fig. 3 shows estimated significance levels for the three tests when the observations are sampled from normal distributions. For graphical purposes, the S.D. ratio is presented rather than the ratio between variances. All three tests obtain the nominal (desired) level when variances are equal. Welch's U test, developed for situations with unequal variances, maintains the nominal level (0.05) throughout. The t test and the WMW test, however, have somewhat higher significance levels than desired
Choosing an appropriate test
A guide to choosing an appropriate test is given in Table 2. The effect of differences in variances is much more striking than the sensitivity to different types of distribution.
Based on the numerical studies above, recommendations can be given as to which P-value to report in the comparison of time to neutrophil recovery in Hodgkin's and non-Hodgkin's lymphoma. It has been demonstrated that the Welch U test is a better test than the other two when both shapes and variances differ. The
Discussion
Two slightly different versions of significance tests that allow unequal variances were proposed by Welch [1]. It has previously been shown that the properties of the so-called Welch's V test are marginally better than those of the U test [2], [12]. Nevertheless, we have chosen to present the U test here, as the U test is implemented in most standard statistical software and therefore used in practice.
Simulations have been performed with a number of different distributions. In addition to the
References (12)
- et al. Increasing physicians' awareness of the impact of statistics on research outcomes: comparative power of the t-test and Wilcoxon rank-sum test in small samples applied research. J Clin Epidemiol (1999)
- The significance of the difference between two means when the population variances are unequal. Biometrika (1937)
- Distributions related to comparison of two means and two regression coefficients. Ann Math Stat (1950)
- The Wilcoxon test and non-null hypotheses. J Roy Statist Soc (Series B) (1960)
- On the robustness of Wilcoxon's two-sample test
- Robustness of some procedures for the two-sample location problem. JASA (1964)