
On using machine learning algorithms to define clinically meaningful patient subgroups
  1. Iago Pinal-Fernandez,
  2. Andrew Lee Mammen
  1. Muscle Disease Unit, National Institute of Arthritis and Musculoskeletal and Skin Diseases, National Institutes of Health, Laboratory of Muscle Stem Cells and Gene Expression, Bethesda, Maryland, USA
  1. Correspondence to Dr Iago Pinal-Fernandez, National Institute of Arthritis and Musculoskeletal and Skin Diseases, National Institutes of Health, Muscle Disease Unit, Laboratory of Muscle Stem Cells and Gene Expression, Bethesda, MD 20892, USA; pinalfernandei@nih.gov; Dr Andrew Lee Mammen, National Institute of Arthritis and Musculoskeletal and Skin Diseases, National Institutes of Health, Muscle Disease Unit, Laboratory of Muscle Stem Cells and Gene Expression, Bethesda, MD, USA; andrew.mammen@nih.gov


We have read with interest the study from Spielmann et al in which three distinct subgroups of patients with anti-Ku autoantibodies were identified.1 To accomplish this, the authors acquired large amounts of clinical data from a heterogeneous cohort of anti-Ku-positive patients. Next, they used a factorial map to reduce the dimensionality of their data and then applied the Ward agglomerative hierarchical clustering method to the resulting components. Subsequently, to determine the number of clusters of anti-Ku-positive patients, they used FactoMineR to locate the elbow point in the inertia graph.2 This revealed three distinct clusters of anti-Ku-positive patients with statistically significant differences in clinical features.
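For readers less familiar with this workflow, the sketch below reproduces its general shape in Python using scikit-learn and SciPy. It is only an illustration under our own assumptions: the original analysis used FactoMineR's implementation in R, the data matrix here is a random placeholder standing in for a patients-by-features clinical table, and the numbers of components and clusters are arbitrary.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Placeholder matrix standing in for a patients-by-features clinical table.
X = rng.normal(size=(100, 20))

# Step 1: project onto a small number of principal components
# (the "factorial map" step).
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=5).fit_transform(X_scaled)

# Step 2: Ward agglomerative hierarchical clustering on the reduced data.
Z = linkage(X_reduced, method="ward")

# Step 3: inspect the drop in within-cluster inertia at each merge;
# the "elbow" heuristic picks the number of clusters where this gain
# falls off sharply.
merge_heights = Z[:, 2][::-1]   # largest merges first
print(merge_heights[:10])       # look for an elbow by eye

# Cut the tree at the chosen number of clusters (three in the study).
labels = fcluster(Z, t=3, criterion="maxclust")
```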

We are concerned that the approach used by Spielmann and colleagues to determine the number of clusters (ie, the number of clinically meaningful groups of anti-Ku patients) may be flawed. Using the same methodology, we can show that apparent clusters will be identified even in a data set that does not contain meaningful clusters. For example, we generated a data set with 1000 observations and four variables following a multivariate normal distribution (a generalisation of the one-dimensional normal distribution to higher dimensions). Although this randomly generated data set contains no meaningful clusters, three clusters were still obtained by application of the same methodology used by the authors (figure 1). Importantly, because the observations within each cluster are, by construction, similar to one another, the features of the clusters are guaranteed to differ significantly provided that the sample size is sufficient.
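A minimal sketch of this kind of simulation is shown below, written in Python with SciPy rather than the R tooling discussed above; the seed, the library calls and the ANOVA check at the end are our own illustrative choices and do not reproduce the letter's exact simulation code.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import f_oneway

rng = np.random.default_rng(42)

# 1000 observations of 4 variables drawn from a single multivariate
# normal distribution: by construction there are no true clusters.
X = rng.multivariate_normal(mean=np.zeros(4), cov=np.eye(4), size=1000)

# Apply Ward agglomerative hierarchical clustering and cut the
# dendrogram into three groups, as in the example above.
Z = linkage(X, method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.bincount(labels)[1:])   # three sizeable "clusters" appear

# Because the groups were carved out of the very data being tested,
# the cluster means differ "significantly" on most variables even
# though the data are pure noise (circular analysis / double dipping).
for j in range(X.shape[1]):
    groups = [X[labels == k, j] for k in (1, 2, 3)]
    print(f"variable {j}: ANOVA p = {f_oneway(*groups).pvalue:.2e}")
```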

Figure 1

Machine learning algorithms define clusters in data sets that do not contain meaningful clusters. A data set with a multivariate normal distribution with 1000 observations and four variables was generated. The Ward agglomerative hierarchical clustering algorithm was applied to this data set and three clusters were defined (coloured in green, red and black). Panel A shows the hierarchical clustering, panel B the hierarchical clustering projected on the factor map, panel C the uncoloured factor map and panel D the coloured factor map.

As this example illustrates, heuristics derived from hierarchical clustering (such as the elbow method included in the FactoMineR package)2 are highly unreliable for detecting the number of clusters in a complex data set. In fact, determining the number of clusters is a machine learning problem that has no optimal solution.3 Consequently, this approach cannot be used to reliably determine the number of clinically meaningful clusters of patients within a heterogeneous patient population.
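One way to see this instability concretely is to score several candidate numbers of clusters on the same noise data with an internal validity index such as the silhouette coefficient. The snippet below is a hedged illustration of that idea, not part of the original analysis; other indices (for example the gap statistic or the Calinski-Harabasz criterion) will often favour a different number of clusters on the same data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
X = rng.multivariate_normal(mean=np.zeros(4), cov=np.eye(4), size=1000)
Z = linkage(X, method="ward")

# Score candidate numbers of clusters with the silhouette index.
# On data with no real structure the scores stay low (well below what
# well-separated clusters would yield), so no choice of k is
# compellingly supported by the data themselves.
for k in range(2, 9):
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(k, round(silhouette_score(X, labels), 3))
```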

It is noteworthy that Spielmann and colleagues are not alone in using hierarchical clustering methods to try to identify meaningful patient subgroups. For example, Mariampillai et al used a similar clustering approach to identify the number of myositis subgroups within a large heterogeneous patient population. The authors subsequently proposed new myositis classification criteria based on the number of patient subgroups defined by their algorithm.4

Although software packages make the application of machine learning methods increasingly easy for experts and non-experts alike, understanding the limitations of these approaches is critical. Studies that take advantage of machine learning methods may be fundamentally flawed if a cornerstone of the analysis depends on the incorrect use of a complex biostatistical technique.

References

Footnotes

  • Contributors IP-F and ALM wrote the letter together. IP-F performed the simulation.

  • Funding This research was supported in part by the Intramural Research Program of the National Institute of Arthritis and Musculoskeletal and Skin Diseases of the National Institutes of Health.

  • Competing interests None declared.

  • Patient consent for publication Not required.

  • Provenance and peer review Not commissioned; internally peer reviewed.
