Should we always choose a nonparametric test when comparing two apparently nonnormal distributions?

doi:10.1016/S0895-4356(00)00264-X

Journal of Clinical Epidemiology

Volume 54, Issue 1, January 2001, Pages 86-92

Abstract

When clinical data are subjected to statistical analysis, a common question is how to choose an appropriate significance test. When two independent groups with observations measured on a continuous scale are compared, the question is typically whether to choose the two-sample t test or the Wilcoxon–Mann–Whitney (WMW) test. Similar results are often obtained, but which conclusion can be drawn if the significance tests give highly different P-values? The t test is optimal for normally distributed observations with common variance and is robust to deviations from normality if sample sizes are not very small. The WMW test makes no distributional assumptions, but depends heavily on equal shape and variance of the two distributions (homoscedasticity). We have compared the properties of the traditional two-sample t test, a modified t test allowing unequal variance, and the WMW test by stochastic simulation. All show acceptable behaviour when the two distributions have similar variance. When variances differ, the modified t test is superior to the other two.

Introduction

An appropriate statistical analysis of clinical data demands choosing an adequate statistical method. One decision that has to be made is whether to employ a parametric or a nonparametric method. The methods are based on different assumptions, and if these are violated, the statistical analysis may lead to erroneous conclusions (i.e., P-values). The parametric methods make assumptions regarding the shape of the distribution of observations, whereas one is made to believe that nonparametric methods do not make such assumptions. The latter are therefore often referred to as distribution free.

If the assumptions made in the model formulation are incompatible with the data being analysed, the true significance level may be far from the nominal level that was specified. The common choice of nominal level is 5%. Obviously, we do not want to use a method with a higher probability of a false-positive result than planned. It may at first sight seem less obvious that a reduced level is a disadvantage, but a conservative test will usually lead to loss of power and the risk of drawing a false-negative conclusion will thus increase. In order to maintain the desired significance level and power of a test, it is therefore important to choose a test that applies to the problem at hand.

Clinical studies often compare the means of two independent groups of patients. If the observations are measured on a continuous scale, the statistical analysis is usually performed by the two-sample t test or the Wilcoxon–Mann–Whitney (WMW) test. There is no uniform agreement as to the choice between them, but the traditional recommendation seems to be a t test if the observations seem to be reasonably normally distributed, or if the number of patients in each sample is large, whereas the WMW test is recommended if sample sizes are small and the distributions seem to be skew.

Unfortunately, the choice between methods is not necessarily straightforward. Simple two-sample tests, whether parametric or not, make one important assumption: the two distributions that are compared are assumed to have the same shape and variance. This is referred to as a pure shift model or homoscedasticity. If the variances of the two distributions are not equal or if the distributions have different shapes, this assumption is violated. Tests allowing unequal variance of two normal distributions were developed more than 60 years ago [1]. It has previously been shown that the WMW test and the t test can have true significance levels that differ substantially from the nominal levels when two population variances are not equal [2], [3], [4], [5], [6], [7]. This work has mostly been published in statistical journals. Some interest in the subject has been demonstrated in behavioural sciences [8], and recommendations regarding the choice between tests based on power under certain distributional assumptions have been published (e.g., in Ref. [9]). The problem of heteroscedasticity (unequal shape or variance) seems generally not to be acknowledged in the medical literature and has received little attention in textbooks of applied statistics.

The aim of this study is to compare the properties of commonly used statistical tests. We have restricted the comparison to tests implemented in standard software: the two different versions of the t test (assuming equal and unequal variances, respectively) and the WMW test. Their properties are demonstrated in several situations that are typically met when clinical data are analysed. Various combinations of sample sizes, as well as shapes and variance of distributions are examined. The test properties are compared by stochastic simulation. A guide to choosing an appropriate test is also given.

Section snippets

A clinical example

After treatment with high-dose chemotherapy (HDT), a total of 35 patients with malignant lymphoma received peripheral blood progenitor cells (PBPC) mobilised with MIME/G-CSF [10]. Ten patients had Hodgkin's disease and 25 had non-Hodgkin's lymphoma. Time to neutrophil recovery was defined as the time from reinfusion of stem cells until the number of neutrophils exceeded 0.5 × 10⁹/L. Fig. 1 shows the time to neutrophil recovery in each diagnosis group.

Table 1 shows the results of different

Definition of the tests

To test the hypothesis of equality of the means of two distributions, the two-sample t test is applied. Let $x_i$ denote the observations in Group A and $y_j$ the observations in Group B. The number of observations in each group is $m$ and $n$, respectively. Then

$\bar{x} = \frac{\sum_{i=1}^{m} x_i}{m}$

and

$\bar{y} = \frac{\sum_{j=1}^{n} y_j}{n}$

are the estimated means of the two distributions, and

$s_x^2 = \frac{\sum_{i=1}^{m} (x_i - \bar{x})^2}{m-1}$

and

$s_y^2 = \frac{\sum_{j=1}^{n} (y_j - \bar{y})^2}{n-1}$

the estimated variances.

The t test is based on the statistic $t = \frac{\bar{x} - \bar{y}}{\sqrt{\frac{s_x^2 (m-1) + s_y^2 (n-1)}{m+n-2} \left( \frac{1}{m} + \frac{1}{n} \right)}}$ and is known to be the best test if
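The statistic defined in this section can be computed directly from the group means and variances. The following is a minimal sketch in Python using only the standard library. The function names and the input numbers are made up for illustration and are not data from the study. The unequal-variance statistic is given in its common textbook form (separate variance estimates with Welch–Satterthwaite degrees of freedom); since the definition is cut off in this snippet, treating that form as the modified t test discussed here is an assumption.

```python
import math
from statistics import mean, variance


def pooled_t(x, y):
    """Two-sample t statistic with pooled variance; returns (t, df)."""
    m, n = len(x), len(y)
    # statistics.variance divides by (size - 1), matching s_x^2 and s_y^2
    pooled = (variance(x) * (m - 1) + variance(y) * (n - 1)) / (m + n - 2)
    t = (mean(x) - mean(y)) / math.sqrt(pooled * (1 / m + 1 / n))
    return t, m + n - 2


def unequal_variance_t(x, y):
    """Unequal-variance statistic (assumed textbook form); returns (t, df)."""
    m, n = len(x), len(y)
    vx, vy = variance(x) / m, variance(y) / n
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (m - 1) + vy ** 2 / (n - 1))
    return t, df


# Illustrative numbers only: equal group sizes, variances 2.5 and 10
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]

t_pooled, df_pooled = pooled_t(group_a, group_b)
t_unequal, df_unequal = unequal_variance_t(group_a, group_b)
print(f"pooled:  t = {t_pooled:.4f}, df = {df_pooled}")
print(f"unequal: t = {t_unequal:.4f}, df = {df_unequal:.3f}")
```

With equal group sizes the two statistics coincide and only the degrees of freedom differ, so any difference in the resulting P-values comes from the reference distribution. With unequal group sizes the statistics themselves differ as well.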

Different models

Fig. 2A shows a pure shift model of two normal distributions. The distributions have the same variance; only the means differ. For this model, the two-sample t test is known to be the best test. Fig. 2B shows a pure shift model where the distributions are skew with a heavy right tail (gamma distributions with shape parameter a=3). For situations similar to the one illustrated in Fig. 2B, the WMW test is recommended if sample sizes are small.
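To make the notion of a pure shift model concrete, the sketch below draws two skew samples of the kind shown in Fig. 2B and checks that they differ in location only. It is an illustration under assumptions: the scale parameter, the size of the shift, the sample size, and the seed are all hypothetical choices, and only the shape parameter a=3 comes from the text.

```python
import random
from statistics import mean, variance

rng = random.Random(1)      # fixed seed so the sketch is repeatable
shape, scale = 3.0, 1.0     # shape a = 3 as in Fig. 2B; scale 1 is an assumption
shift = 2.0                 # hypothetical difference in location
size = 20000                # large sample so the summaries are stable

# Group B has the same distribution as Group A, moved to the right by `shift`
group_a = [rng.gammavariate(shape, scale) for _ in range(size)]
group_b = [rng.gammavariate(shape, scale) + shift for _ in range(size)]

mean_diff = mean(group_b) - mean(group_a)            # close to the shift
var_ratio = variance(group_b) / variance(group_a)    # close to 1
print(f"difference in means: {mean_diff:.3f}")
print(f"ratio of variances:  {var_ratio:.3f}")
```

Adding a constant changes the mean but leaves the shape and the variance untouched, which is exactly the assumption that is violated when the variances or the shapes of the two distributions differ.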

Fig. 2C shows a situation with two normal distributions that do

Simulations

The properties of the three tests have been examined by stochastic simulation. The simulation programs were written in SIMULA [11] and executed on a SUN computer at the University of Oslo.

In the simulation program independent samples were drawn from two distributions with the same shape and mean, but possibly different variance. The ratio between the variances was varied between 1/9 and 9. In terms of S.D. this corresponds to 1/3 and 3. The parameter of importance is this ratio; the actual

Sample sizes equal, m=n=10

Fig. 3 shows estimated significance levels for the three tests when the observations are sampled from normal distributions. For graphical purposes, the S.D. ratio is presented rather than the ratio between variances. All three tests obtain the nominal (desired) level when variances are equal. Welch's U test, developed for situations with unequal variances, maintains the nominal level (0.05) throughout. The t test and the WMW test, however, have somewhat higher significance levels than desired
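The kind of experiment summarised for Fig. 3 can be sketched as follows. This is not the authors' SIMULA program. It is a small Python illustration that assumes NumPy and SciPy are available, uses an arbitrary number of replications and seed, and takes SciPy's `ttest_ind(..., equal_var=False)` as a stand-in for Welch's U test. The function name `estimate_levels` is hypothetical.

```python
import numpy as np
from scipy import stats


def estimate_levels(sd_ratio, m=10, n=10, reps=2000, alpha=0.05, seed=0):
    """Estimate the true significance level of three tests when both groups
    are normal with the same mean and the given ratio between the S.D.s."""
    rng = np.random.default_rng(seed)
    rejections = {"t": 0, "welch": 0, "wmw": 0}
    for _ in range(reps):
        # The null hypothesis holds: equal means, possibly unequal spread
        x = rng.normal(0.0, 1.0, size=m)
        y = rng.normal(0.0, sd_ratio, size=n)
        if stats.ttest_ind(x, y, equal_var=True).pvalue < alpha:
            rejections["t"] += 1
        if stats.ttest_ind(x, y, equal_var=False).pvalue < alpha:
            rejections["welch"] += 1
        if stats.mannwhitneyu(x, y, alternative="two-sided").pvalue < alpha:
            rejections["wmw"] += 1
    # The proportion of false rejections estimates the true level of each test
    return {name: count / reps for name, count in rejections.items()}


levels = {ratio: estimate_levels(ratio) for ratio in (1 / 3, 1.0, 3.0)}
for ratio, result in levels.items():
    print(f"S.D. ratio {ratio:.2f}: {result}")
```

Each estimate carries simulation error of roughly 0.005 at this number of replications, so small departures from 0.05 should not be over-interpreted. Replacing `rng.normal` with a skew distribution, or choosing unequal `m` and `n`, gives the other scenarios examined in the study.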

Choosing an appropriate test

A guide to choosing an appropriate test is given in Table 2. The effect of differences in variances is much more striking than the sensitivity to different types of distribution.

Based on the numerical studies above, recommendations can be given as to which P-value to report in the comparison of time to neutrophil recovery in Hodgkin's and non-Hodgkin's lymphoma. It has been demonstrated that the Welch U test is a better test than the other two when both shapes and variances differ. The

Discussion

Two slightly different versions of significance tests that allow unequal variances were proposed by Welch [1]. It has previously been shown that the properties of the so-called Welch's V test are marginally better than those of the U test [2], [12]. Nevertheless, we have chosen to present the U test here, as the U test is implemented in most standard statistical software and therefore used in practice.

Simulations have been performed with a number of different distributions. In addition to the

References (12)

P.D. Bridge et al.
Increasing physicians' awareness of the impact of statistics on research outcomes: comparative power of the t-test and Wilcoxon rank-sum test in small samples applied research
J Clin Epidemiol
(1999)
B.L. Welch
The significance of the difference between two means when the population variances are unequal
Biometrika
(1937)
U. Chand
Distributions related to comparison of two means and two regression coefficients
Ann Math Stat
(1950)
G.B. Wetherill
The Wilcoxon test and non-null hypotheses
J Roy Statist Soc (Series B)
(1960)
H.R. Van der Vaart
On the robustness of Wilcoxon's two-sample test
J.W. Pratt
Robustness of some procedures for the two-sample location problem
JASA
(1964)

There are more references available in the full text version of this article.
