We have read with interest the study from Spielmann et al in which three distinct subgroups of patients with anti-Ku autoantibodies were identified.1 To accomplish this, the authors acquired large amounts of clinical data from a heterogeneous cohort of anti-Ku-positive patients. Next, they used a factorial map to reduce the dimensions of their data and then applied the Ward agglomerative hierarchical clustering method to the resulting dimensions. Subsequently, to determine the number of clusters of anti-Ku patients, they used FactoMineR to determine the elbow point in the inertia graph.2 This revealed three distinct clusters of anti-Ku-positive patients with statistically significant differences in clinical features.
We are concerned that the approach used by Spielmann and colleagues to determine the number of clusters (ie, the number of clinically meaningful groups of anti-Ku patients) may be flawed. Using the same methodology, we can show that apparent clusters will be identified even in a data set that does not contain meaningful clusters. For example, we generated a data set with 1000 observations and four variables following a multivariate normal distribution (a generalisation of the one-dimensional normal distribution to higher dimensions). Although this randomly generated data set contains no meaningful clusters, three clusters were still obtained by application of the same methodology used by the authors (figure 1). Importantly, because the observations within each cluster are similar to one another, the features of the clusters are guaranteed to differ significantly provided that the sample size is sufficient.
Figure 1 Machine learning algorithms define clusters in data sets that do not contain meaningful clusters. A data set with a multivariate normal distribution with 1000 observations and four variables was generated. The Ward agglomerative hierarchical clustering algorithm was applied to this data set and three clusters were defined (coloured in green, red and black). Panel A shows the hierarchical clustering, panel B the hierarchical clustering projected on the factor map, panel C the uncoloured factor map and panel D the coloured factor map.
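The simulation described above can be approximated in a few lines. The following is a minimal sketch in Python with NumPy and SciPy, not the FactoMineR workflow used for figure 1. The random seed, the identity covariance matrix, the function name `cluster_structureless_data` and the choice of a one-way analysis of variance as the between-cluster test are all assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import f_oneway


def cluster_structureless_data(n_obs=1000, n_vars=4, n_clusters=3, seed=0):
    """Cluster multivariate normal noise and test the clusters for differences.

    Hypothetical helper for illustration; the defaults mirror the sizes quoted
    in the text, and everything else is an assumption.
    """
    rng = np.random.default_rng(seed)
    # One homogeneous population: independent standard normal variables.
    data = rng.multivariate_normal(np.zeros(n_vars), np.eye(n_vars), size=n_obs)
    # Ward agglomerative hierarchical clustering on Euclidean distances.
    tree = linkage(data, method="ward")
    # Cut the tree so that n_clusters groups remain (labels start at 1).
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    # One-way ANOVA for each variable across the clusters that were found.
    p_values = []
    for j in range(n_vars):
        groups = [data[labels == k, j] for k in np.unique(labels)]
        p_values.append(f_oneway(*groups).pvalue)
    return labels, np.array(p_values)


labels, p_values = cluster_structureless_data()
print("cluster sizes:", np.bincount(labels)[1:])
print("ANOVA p values per variable:", p_values)
```

Because the algorithm partitions the observations by their position in the variable space, the resulting groups differ on at least some variables by construction, so small p values here say nothing about whether real subgroups exist.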
As this example illustrates, hierarchical clustering-derived techniques (like the elbow method included in the FactoMineR package)2 are highly unreliable for detecting the number of clusters in a complex data set. In fact, determining the number of clusters is a machine learning problem that has no optimal solution.3 Consequently, this approach cannot be used to reliably determine the number of clinically meaningful clusters of patients within a heterogeneous patient population.
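To see why an inertia graph can be misleading, it may help to compute the within-cluster inertia of a Ward tree built on pure noise. The sketch below is written in Python with SciPy and does not reproduce the rule that FactoMineR applies to suggest a cut. The helper `within_cluster_inertia`, the seed and the range of 1 to 10 clusters are assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def within_cluster_inertia(data, labels):
    """Sum of squared distances from each observation to its cluster centroid."""
    total = 0.0
    for k in np.unique(labels):
        members = data[labels == k]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total


# Structureless data: 1000 observations of four independent normal variables.
rng = np.random.default_rng(1)
data = rng.standard_normal((1000, 4))
tree = linkage(data, method="ward")

# Within-cluster inertia when the same tree is cut into 1, 2, ..., 10 clusters.
inertia = np.array([
    within_cluster_inertia(data, fcluster(tree, t=k, criterion="maxclust"))
    for k in range(1, 11)
])
# Inertia removed by each additional cluster (from k - 1 to k clusters).
gains = -np.diff(inertia)

for k, gain in zip(range(2, 11), gains):
    print(f"{k} clusters: inertia {inertia[k - 1]:.1f}, gain {gain:.1f}")
```

With Ward linkage the inertia falls with every additional cluster and the earliest splits remove the most, so the curve flattens progressively even for noise. Any bend read off such a curve therefore needs to be compared against what a structureless data set of the same size would produce.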
It is noteworthy that Spielmann and colleagues are not alone in using hierarchical clustering methods to try to identify meaningful patient subgroups. For example, Mariampillai et al used a similar clustering approach to identify the number of myositis subgroups within a large heterogeneous patient population. The authors subsequently proposed new myositis classification criteria based on the number of patient subgroups defined by their algorithm.4
Although software packages make the application of machine learning methods increasingly easy for experts and non-experts alike, understanding the limitations of these approaches is critical. Studies that take advantage of machine learning methods may be fundamentally flawed if a cornerstone of the analysis depends on the incorrect use of a complex biostatistical technique.
Footnotes
Contributors IP-F and ALM wrote the letter together. IP-F performed the simulation.
Funding This research was supported in part by the Intramural Research Program of the National Institute of Arthritis and Musculoskeletal and Skin Diseases of the National Institutes of Health.
Competing interests None declared.
Patient consent for publication Not required.
Provenance and peer review Not commissioned; internally peer reviewed.