On using machine learning algorithms to define clinically meaningful patient subgroups
  1. Iago Pinal-Fernandez,
  2. Andrew Lee Mammen
  1. Muscle Disease Unit, National Institute of Arthritis and Musculoskeletal and Skin Diseases, National Institutes of Health, Laboratory of Muscle Stem Cells and Gene Expression, Bethesda, Maryland, USA
  1. Correspondence to Dr Iago Pinal-Fernandez, National Institute of Arthritis and Musculoskeletal and Skin Diseases, National Institutes of Health, Muscle Disease Unit. Laboratory of Muscle Stem Cells and Gene Expression, Bethesda, MD 20892, USA; pinalfernandei{at}nih.gov; Dr Andrew Lee Mammen, National Institute of Arthritis and Musculoskeletal and Skin Diseases, National Institutes of Health, Muscle Disease Unit. Laboratory of Muscle Stem Cells and Gene Expression, Bethesda, MD, United States; andrew.mammen{at}nih.gov

We have read with interest the study from Spielmann et al in which three distinct subgroups of patients with anti-Ku autoantibodies were identified.1 To accomplish this, the authors acquired large amounts of clinical data from a heterogeneous cohort of anti-Ku-positive patients. Next, they used a factorial map to reduce the dimensions of their data and then applied the Ward agglomerative hierarchical clustering method to the resulting dimensions. Subsequently, to determine the number of clusters of anti-Ku patients, they used FactoMineR to determine the elbow point in the inertia graph.2 This revealed three distinct clusters of anti-Ku-positive patients with statistically significant differences in clinical features.

We are concerned that the approach used by Spielmann and colleagues to determine the number of clusters (ie, the number of clinically meaningful groups of anti-Ku patients) may be flawed. Using the same methodology, we can show that apparent clusters will be identified even in a data set that does not contain meaningful clusters. For example, we generated a data set with 1000 observations and four variables following a multivariate normal distribution (a generalisation of the one-dimensional normal distribution to higher dimensions). Although this randomly generated data set contains no meaningful clusters, three clusters were still obtained by applying the same methodology used by the authors (figure 1). Importantly, because the algorithm assigns similar observations to the same cluster, the features of the clusters are guaranteed to differ significantly provided that the sample size is sufficient.

Figure 1

Machine learning algorithms define clusters in data sets that do not contain meaningful clusters. A data set with a multivariate normal distribution with 1000 observations and four variables was generated. The Ward agglomerative hierarchical clustering algorithm was applied to this data set and three clusters were defined (coloured in green, red and black). Panel A shows the hierarchical clustering, panel B the hierarchical clustering projected on the factor map, panel C the uncoloured factor map and panel D the coloured factor map.
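
The simulation described above can be approximated with the short sketch below. It is not the code used for figure 1: it uses Python with NumPy and SciPy rather than FactoMineR, assumes an identity covariance matrix and a fixed random seed (neither is stated in the letter), and skips the factorial map step by clustering the four variables directly. It cuts a Ward dendrogram into three clusters and then runs a one-way ANOVA on each variable across those clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.stats import f_oneway

# Assumption: four independent standard normal variables (identity covariance).
# The letter does not state the covariance or the seed; both are illustrative.
rng = np.random.default_rng(seed=0)
X = rng.multivariate_normal(mean=np.zeros(4), cov=np.eye(4), size=1000)

# Ward agglomerative hierarchical clustering on data with no real clusters,
# then cut the tree into three flat clusters.
Z = linkage(X, method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")

# One-way ANOVA across the three clusters, one test per variable.
p_values = [
    f_oneway(*[X[labels == k, j] for k in np.unique(labels)]).pvalue
    for j in range(X.shape[1])
]

print("cluster sizes:", np.bincount(labels)[1:])
print("ANOVA p-values per variable:", p_values)
```

Because the clusters are built by grouping nearby observations, comparing them on the same variables that were used to build them is circular, so small p-values here say nothing about whether the groups are real.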

As this example illustrates, hierarchical clustering-derived techniques (like the elbow method included in the FactoMineR package)2 are highly unreliable for detecting the number of clusters in a complex data set. In fact, determining the number of clusters is a machine learning problem that has no optimal solution.3 Consequently, this approach cannot be used to reliably determine the number of clinically meaningful clusters of patients within a heterogeneous patient population.
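
To illustrate why an elbow-type rule cannot signal the absence of clusters, the sketch below computes the within-cluster inertia of a Ward tree for 1 to 10 clusters on the same kind of structureless data. The elbow rule used here (the largest second difference of the inertia curve) is a hypothetical stand-in chosen for simplicity and is not claimed to match the criterion implemented in FactoMineR. The data-generating settings are again assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Assumption: same illustrative data as a cluster-free multivariate normal sample.
rng = np.random.default_rng(seed=0)
X = rng.multivariate_normal(mean=np.zeros(4), cov=np.eye(4), size=1000)
Z = linkage(X, method="ward")


def within_inertia(data, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(
        ((data[labels == k] - data[labels == k].mean(axis=0)) ** 2).sum()
        for k in np.unique(labels)
    )


# Within-cluster inertia for k = 1..10 clusters cut from the same tree.
ks = np.arange(1, 11)
inertia = np.array(
    [within_inertia(X, fcluster(Z, t=k, criterion="maxclust")) for k in ks]
)

# Hypothetical elbow rule: the k at which the curve bends most sharply.
# By construction it always returns some k >= 2, so it can never report
# that the data contain no clusters.
elbow_k = int(ks[1:-1][np.argmax(np.diff(inertia, n=2))])

print("within-cluster inertia:", np.round(inertia, 1))
print("elbow suggests k =", elbow_k)
```

The inertia never increases as the tree is cut into more clusters, whether or not the data have any structure, so a rule that only looks for the sharpest bend will nominate a cluster count regardless.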

It is noteworthy that Spielmann and colleagues are not alone in using hierarchical clustering methods to try to identify meaningful patient subgroups. For example, Mariampillai et al used a similar clustering approach to identify the number of myositis subgroups within a large heterogeneous patient population. The authors subsequently proposed new myositis classification criteria based on the number of patient subgroups defined by their algorithm.4

Although software packages make the application of machine learning methods increasingly easy for experts and non-experts alike, understanding the limitations of these approaches is critical. Studies that take advantage of machine learning methods may be fundamentally flawed if a cornerstone of the analysis depends on the incorrect use of a complex biostatistical technique.

References

Footnotes

  • Contributors IP-F and ALM wrote the letter together. IP-F performed the simulation.

  • Funding This research was supported in part by the Intramural Research Program of the National Institute of Arthritis and Musculoskeletal and Skin Diseases of the National Institutes of Health.

  • Competing interests None declared.

  • Patient consent for publication Not required.

  • Provenance and peer review Not commissioned; internally peer reviewed.
