On how to not misuse hierarchical clustering on principal components to define clinically meaningful patient subgroups. Response to: ‘On using machine learning algorithms to define clinical meaningful patient subgroups’ by Pinal-Fernandez and Mammen
Alain Meyer (1,2), Lionel Spielmann (3), François Séverac (4,5)

  1. Exploration Fonctionnelle Musculaire, Service de physiologie, Hôpitaux Universitaires de Strasbourg, Strasbourg, France
  2. Centre National de Référence des Maladies Auto-Immunes Systémiques Rares de l'Est et du Sud-Ouest, Service de rhumatologie, Hôpitaux Universitaires de Strasbourg, Strasbourg, France
  3. Service de Rhumatologie, Hôpitaux Civils de Colmar, Colmar, France
  4. Service de Santé Publique, GMRC, CHU de Strasbourg, Strasbourg, France
  5. iCUBE, UMR 7357, équipe IMAGeS, Université de Strasbourg, Strasbourg, France

Correspondence to Dr Lionel Spielmann, Service de Rhumatologie, Hospices Civils de Colmar, Colmar 68024, Alsace (Région), France; lionel.spielmann{at}ch-colmar.fr

We thank Pinal-Fernandez and Mammen for their interesting methodological comment on our work in which we used hierarchical clustering on principal components to define clinically meaningful subgroups of patients with anti-Ku antibodies.1 2

We fully agree with the conclusion of the authors: ‘machine learning methods may be fundamentally flawed if a cornerstone of the analysis depends upon the incorrect use of a complex biostatistical technique’.

In this regard, the example of hierarchical clustering on principal components they provide in their comment illustrates how this statistical tool can be misused to generate false discoveries:

  1. First, hierarchical clustering on principal components is a descriptive method suited to describing heterogeneous datasets. Prior reports in the literature indicate that patients with anti-Ku antibodies represent a heterogeneous condition. In contrast, in their example, Pinal-Fernandez and Mammen (deliberately) used a multivariate normal distribution, which is entirely unsuited to hierarchical clustering on principal components.

  2. Second, the suitability of a dataset for clustering analysis on principal components can be further tested statistically, for example using Bartlett's test of sphericity, as performed in our study. Bartlett's test is an objective benchmark used to minimise the danger of interpreting factor analytic results that can be attributed entirely to chance.3 In accordance with the first point, in our dataset of anti-Ku patients the p value was <0.001, indicating that hierarchical clustering on principal components was very likely to be useful. In contrast, we ran Bartlett's test of sphericity on 1000 datasets generated as proposed by Pinal-Fernandez and Mammen and found that the p value was >0.05 in 95% of cases (as expected, since the variables were simulated independently).

  3. Third, principal components analysis (PCA) makes it possible to distinguish structure from noise in the data. Choosing the number of dimensions retained to determine the principal components is a crucial step of the analysis.4 To this end, we used K-fold cross-validation and retained the number of dimensions leading to the smallest mean square error of prediction (MSEP). In contrast, in their example, Pinal-Fernandez and Mammen arbitrarily fixed the number of dimensions at two. Yet, in their dataset, PCA produces four components that each explain ~25% of the variance, because there is no structure in the dataset. Accordingly, K-fold cross-validation applied to their dataset indicated that the number of dimensions leading to the smallest MSEP is zero, which means that there is no principal component in their dataset, only noise. In such a case, hierarchical clustering must not be undertaken.
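The second point can be illustrated with a minimal sketch of Bartlett's test of sphericity applied to independently simulated normal variables. This is not the code used in our study: the sample size of 100, the seed and all function names are assumptions chosen for illustration, and only the number of variables (four) and the number of simulated datasets (1000) come from the discussion above. The statistic is computed as -(n - 1 - (2p + 5)/6) ln|R|, where R is the correlation matrix, and compared with a chi-square distribution with p(p - 1)/2 degrees of freedom. To stay within the standard library, the tail probability uses a closed form that is only valid for an even number of degrees of freedom (six when p = 4).

```python
import math
import random


def correlation_matrix(rows):
    """Pearson correlation matrix of a dataset given as a list of rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    norms = [math.sqrt(sum(c[j] ** 2 for c in centered)) for j in range(p)]
    return [[sum(c[i] * c[j] for c in centered) / (norms[i] * norms[j])
             for j in range(p)] for i in range(p)]


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    size, det = len(a), 1.0
    for k in range(size):
        pivot = max(range(k, size), key=lambda i: abs(a[i][k]))
        if a[pivot][k] == 0.0:
            return 0.0
        if pivot != k:
            a[k], a[pivot] = a[pivot], a[k]
            det = -det
        det *= a[k][k]
        for i in range(k + 1, size):
            factor = a[i][k] / a[k][k]
            for j in range(k, size):
                a[i][j] -= factor * a[k][j]
    return det


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df % 2:
        raise ValueError("closed form only valid for an even number of df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i)
                                 for i in range(df // 2))


def bartlett_sphericity(rows):
    """Return (statistic, degrees of freedom, p value) of Bartlett's test."""
    n, p = len(rows), len(rows[0])
    det = determinant(correlation_matrix(rows))
    # Clamp at zero to guard against rounding when R is close to the identity.
    statistic = max(0.0, -(n - 1 - (2 * p + 5) / 6.0) * math.log(det))
    df = p * (p - 1) // 2
    return statistic, df, chi2_sf_even_df(statistic, df)


def simulate_independent(n, p, rng):
    """n observations of p independent standard normal variables."""
    return [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]


def share_not_significant(n_datasets, n, p, seed, alpha=0.05):
    """Fraction of simulated datasets in which the p value exceeds alpha."""
    rng = random.Random(seed)
    p_values = [bartlett_sphericity(simulate_independent(n, p, rng))[2]
                for _ in range(n_datasets)]
    return sum(pv > alpha for pv in p_values) / n_datasets


if __name__ == "__main__":
    share = share_not_significant(n_datasets=1000, n=100, p=4, seed=0)
    print(f"p > 0.05 in {share:.0%} of the simulated datasets")
```

Under these assumptions, the share of non-significant tests should be close to the nominal 95%, since the null hypothesis of no correlation between variables is true by construction. A dataset with genuinely correlated variables would instead give a small p value.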

As illustrated by the more than 5000 original articles referenced on PubMed using this method, hierarchical clustering is a powerful tool for describing datasets.

As cleverly illustrated by Pinal-Fernandez and Mammen, several rules must be followed to avoid misusing hierarchical clustering on principal components.

Footnotes

  • Handling editor Prof Josef S Smolen

  • Correction notice This article has been corrected since it was published Online First. The title has been amended.

  • Contributors AM supervised the project. AM and LS wrote the manuscript with support from FS. FS verified the analytical method. AM, LS and FS contributed to the final version of the manuscript.

  • Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests No, there are no competing interests for any author.

  • Patient consent for publication Not required.

  • Provenance and peer review Commissioned; internally peer reviewed.
