Predicting rapid progression in knee osteoarthritis: a novel and interpretable automated machine learning approach, with specific focus on young patients and early disease
Simone Castagno1, Mark Birch1, Mihaela van der Schaar2, Andrew McCaskie1

1 Department of Surgery, University of Cambridge, Cambridge, UK
2 Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, UK

Correspondence to Dr Simone Castagno; sc2257{at}cam.ac.uk

Abstract

Objectives To facilitate the stratification of patients with osteoarthritis (OA) for new treatment development and clinical trial recruitment, we created an automated machine learning (autoML) tool predicting the rapid progression of knee OA over a 2-year period.

Methods We developed autoML models integrating clinical, biochemical, X-ray and MRI data. Using two data sets within the OA Initiative (the Foundation for the National Institutes of Health OA Biomarker Consortium for training and hold-out validation, and the Pivotal Osteoarthritis Initiative MRI Analyses study for external validation), we employed two distinct definitions of clinical outcomes: multiclass (categorising OA progression into pain and/or radiographic) and binary. Key predictors of progression were identified through advanced interpretability techniques, and subgroup analyses were conducted by age, sex and ethnicity with a focus on early-stage disease.

Results Although the most reliable models incorporated all available features, simpler models including only clinical variables achieved robust external validation performance, with area under the precision-recall curve (AUC-PRC) 0.727 (95% CI: 0.726 to 0.728) for multiclass predictions; and AUC-PRC 0.764 (95% CI: 0.762 to 0.766) for binary predictions. Multiclass models performed best in patients with early-stage OA (AUC-PRC 0.724–0.806) whereas binary models were more reliable in patients younger than 60 (AUC-PRC 0.617–0.693). Patient-reported outcomes and MRI features emerged as key predictors of progression, though subgroup differences were noted. Finally, we developed web-based applications to visualise our personalised predictions.

Conclusions Our novel tool’s transparency and reliability in predicting rapid knee OA progression distinguish it from conventional ‘black-box’ methods and are more likely to facilitate its acceptance by clinicians and patients, enabling effective implementation in clinical practice.

  • Knee Osteoarthritis
  • Machine Learning
  • Arthritis
  • Osteoarthritis

Data availability statement

Data are available in a public, open access repository. Data and/or research tools used in the preparation of this manuscript were obtained and analysed from the controlled access data sets distributed from the Osteoarthritis Initiative (OAI), a data repository housed within the National Institute of Mental Health (NIMH) Data Archive. OAI is a collaborative informatics system created by the NIMH and the National Institute of Arthritis, Musculoskeletal and Skin Diseases to provide a worldwide resource to quicken the pace of biomarker identification, scientific investigation and OA drug development. (DOI: 10.15154/1vhq-h028). Data provided from the FNIH OA Biomarkers Consortium Project (available at https://nda.nih.gov/oai/) made possible through grants and direct or in-kind contributions by: AbbVie; Amgen; Arthritis Foundation; Artialis; Bioiberica; BioVendor; DePuy; Flexion Therapeutics; GSK; IBEX; IDS; Merck Serono; Quidel; Rottapharm | Madaus; Sanofi; Stryker; the Pivotal OAI MRI Analyses study, NIH HHSN2682010000 21C; and the Osteoarthritis Research Society International. The OAI is a public-private partnership comprised of five contracts (N01-AR-2-2258; N01-AR-2-2259; N01AR-2-2260; N01-AR-2-2261; N01-AR-2-2262) funded by the National Institutes of Health. Funding partners include Merck Research Laboratories; Novartis Pharmaceuticals, GlaxoSmithKline; and Pfizer. Private sector funding for the consortium and OAI is managed by the Foundation for the National Institutes of Health. Code availability. The AutoPrognosis V.2.0 open-source software package is available at https://www.autoprognosis.vanderschaar-lab.com/.

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.

WHAT IS ALREADY KNOWN ON THIS TOPIC

  • Osteoarthritis (OA) is a common degenerative joint disease which causes pain, stiffness and reduced joint motion. It affects over 500 million people globally, leading to significant healthcare costs.

  • The heterogeneity of OA makes the development of effective clinical therapies challenging with current treatments focusing on symptomatic relief and, in advanced cases, joint replacement.

  • Identifying patients at risk for rapid OA progression is a fundamental aspect of accurate patient stratification, allowing for the development of new treatments and more strategic clinical trial recruitment (especially in younger patients and those with early-stage disease).

  • Machine learning (ML) is recognised as having significant potential in early-stage disease prediction.

WHAT THIS STUDY ADDS

  • We developed an autoML tool for predicting rapid knee OA progression, using clinical, X-ray, MRI and biochemical data.

  • We demonstrated robust performance of models which included only clinical or ‘core’ variables, facilitating their practical implementation in clinical settings where extensive data collection is not always feasible.

  • We identified key predictors of OA progression, such as patient-reported outcome measures (PROMs) and MRI features, enhancing model transparency and potential for clinical adoption.

  • To the best of our knowledge, we are the first to apply these predictive models and assess feature importance in multiple subgroups of patients with OA.

  • We also developed web-based applications called clinical demonstrators to facilitate understanding and visualisation of our models’ personalised predictions.

HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY

  • Our predictive models for knee OA progression could significantly impact clinical practice by enabling earlier and more accurate identification of high-risk patients (particularly younger individuals and those with early-stage OA), thereby guiding timely and targeted interventions.

  • By integrating PROMs, clinical, biochemical, and imaging data into our models, our approach has the potential to be extrapolated to other complex degenerative diseases, which often rely on this information for patient monitoring and management.

Introduction

Osteoarthritis (OA) is a degenerative joint disease whose primary symptoms are pain, stiffness and reduced joint motion.1 2 OA affects over 500 million people worldwide3 and its direct healthcare costs are estimated to be 1–2.5% of national gross domestic product.4 The heterogeneity of the disease presents a significant challenge in developing effective clinical therapies.5–7 As a result, there is a clear global unmet clinical need with no approved treatments to halt or reverse disease progression.8 The primary treatment options remain focused on providing symptomatic relief and, in advanced cases, resorting to prosthetic joint replacement.9

Identifying patients at risk for rapid OA progression is crucial for accurate patient stratification, particularly in the early stages of the disease and among younger patients, a demographic increasingly affected by OA.10 11 This stratification is key to successful patient selection in clinical trials to develop and evaluate new treatments.12 Younger individuals frequently face a ‘treatment gap’,13 where conservative management often falls short in managing symptoms and arthroplasty, though potentially beneficial, may not suit their active lifestyles and carries a higher risk of aseptic loosening and future revision surgery. To optimise non-surgical and surgical approaches ahead of joint replacement (including regenerative therapies aimed at joint preservation), a stratified approach is necessary. We hypothesise that machine learning (ML), a branch of artificial intelligence (AI) that uses algorithms to learn from data and make predictions without being explicitly programmed to do so,14 15 can be leveraged to identify individuals with OA who are at risk of rapid progression, especially in early stages of disease.

In this study, we introduce and validate an innovative, interpretable automated ML (autoML) tool to predict the rapid progression of knee OA (the most common form of OA4 16), focusing on early-stage disease and a younger patient demographic.

Methods

Data sets

Data used in this study are from the Osteoarthritis Initiative (OAI),17 a multicentre, longitudinal, prospective observational study of 4796 men and women aged 45–79, designed to identify biomarkers and risk factors for the development and progression of OA. OAI data are publicly available and can be accessed at https://nda.nih.gov/oai/.

ML models were developed and trained using data from the Foundation for the National Institutes of Health (FNIH) OA Biomarkers Consortium Project,18 19 a nested case-controlled study of 600 patients (one index knee per participant) selected from the OAI. Patients in this study were followed up for a total of 4 years at 1-year intervals, and inclusion criteria included the presence of at least one knee with a Kellgren and Lawrence grade (KLG) of 1–3 at baseline.12 Data collected included clinical, radiographic (X-ray), MRI, blood and urine biospecimen data17 in a tabular format.

Additionally, patients from the Pivotal Osteoarthritis Initiative MRI Analyses (POMA) study20–22 were used to further validate our models. POMA is a nested case-controlled study within the OAI, aimed at understanding the progression of OA using MRI.

Preprocessing and class definition

We followed a similar methodology to that used by Widera et al.23 For each patient, we used all available periods that were 2 years in length (ie, baseline to year 2, year 1 to year 3 and year 2 to year 4); each instance therefore represented a period rather than a patient. For each period, we defined four outcome classes:

  • Class 0: No disease progression.

  • Class 1: Pain-only progression—based on Western Ontario and McMaster (WOMAC) pain scores (ranging 0–20).

  • Class 2: Radiographic-only progression—based on minimum medial joint space width (JSW) and KLG

  • Class 3: Both pain and radiographic progression.

The exact definitions of pain and radiographic progression are as follows:

Pain progression (Eq. 1):

An increase of at least 2 points in the WOMAC pain scale over a 2-year period AND substantial pain at the end of the period

OR

A rapid increase in pain AND a lower end pain

OR

Sustained substantial pain throughout the period.

(Eq. 1; the numerical thresholds for each condition appeared in equation images that were not recovered.)

Radiographic progression (Eq. 2):

A decrease in minimum medial JSW of at least 0.6 mm over a 2-year period

OR

A KLG of 4 at the end of the period (this condition was introduced to identify patients with radiographic ‘end-stage’ OA at the end of the period, independently of medial JSW narrowing).

$$\text{Radiographic progression} = \left(\text{JSW}_{\text{start}} - \text{JSW}_{\text{end}} \geq 0.6\ \text{mm}\right) \lor \left(\text{KLG}_{\text{end}} = 4\right) \quad \text{(Eq. 2)}$$
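The class assignment described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function and parameter names are hypothetical, the two radiographic conditions are assumed to be alternatives (joined by OR), and pain progression is passed in as a ready-made flag because its numerical thresholds are not stated in the text.

```python
def radiographic_progression(jsw_start_mm, jsw_end_mm, klg_end):
    """Radiographic progression as described in the text (assumed OR of two conditions):
    a loss of at least 0.6 mm in minimum medial JSW, or KLG 4 at the end of the period."""
    jsw_loss_mm = jsw_start_mm - jsw_end_mm
    return jsw_loss_mm >= 0.6 or klg_end == 4


def outcome_class(pain_progressed, radiographic_progressed):
    """Map the two progression flags to the four outcome classes (0 to 3)."""
    if pain_progressed and radiographic_progressed:
        return 3  # both pain and radiographic progression
    if radiographic_progressed:
        return 2  # radiographic-only progression
    if pain_progressed:
        return 1  # pain-only progression
    return 0      # no disease progression


# Hypothetical period: 1.0 mm JSW loss, KLG 2 at the end, no pain progression.
example_label = outcome_class(
    pain_progressed=False,
    radiographic_progressed=radiographic_progression(4.0, 3.0, klg_end=2),
)
```

In this hypothetical example the period would be labelled as radiographic-only progression (Class 2).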

Periods were excluded if the outcome class could not be assigned due to missing values, resulting in a total of 1691 instances. Variables with more than 85% missing values and those not relevant to our analysis, such as patient ID, visit number, dates and barcodes were also removed. This resulted in a total of 304 features for analysis. Online supplemental table 1 shows all variables with their definitions.

The above process was then repeated using binary class labels only, with Class 0 representing ‘non-progressors’ and Class 1 ‘progressors’.

We performed an 80–20 training-testing split on the data set, ensuring that instances with the same patient ID were consistently placed in either the training or testing set. This resulted in a training set with 1353 instances and a hold-out (or testing) set with 338. Model development and training were exclusively conducted on the training set while the testing set was held out for further validation (figure 1 shows a schematic overview of our study methodology).
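A patient-level split of this kind can be illustrated with the standard library alone. The sketch below is an assumption about how such a split might be implemented, not the authors' procedure: the function name, the seed and the toy data are hypothetical, and the 20% fraction is applied to patients rather than to instances, so the instance-level ratio is only approximately 80–20.

```python
import random


def split_by_patient(patient_ids, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) such that no patient appears in both sets."""
    unique_ids = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_test = max(1, round(len(unique_ids) * test_fraction))
    test_patients = set(unique_ids[:n_test])
    train_idx = [i for i, pid in enumerate(patient_ids) if pid not in test_patients]
    test_idx = [i for i, pid in enumerate(patient_ids) if pid in test_patients]
    return train_idx, test_idx


# Toy data: 10 hypothetical patients, each contributing three 2-year periods.
patient_ids = [pid for pid in range(10) for _ in range(3)]
train_idx, test_idx = split_by_patient(patient_ids)
```

Grouping by patient matters here because each patient can contribute up to three overlapping periods; splitting at the instance level would let near-duplicate information leak from the training set into the hold-out set.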

Figure 1

Methodology overview. This figure delineates our methodical approach towards model development and validation. Initially, our data set underwent a random partitioning: 80% allocated to the training set (Ntraining=1353) and 20% to the hold-out (or testing) set (Nhold-out=338). The training phase was strictly confined to the training set, preserving the testing set for subsequent validation. Predictive models for rapid knee OA progression were built using AutoPrognosis V.2.0, and key predictors of progression were identified through post-hoc interpretability analysis. Model reliability was rigorously evaluated via internal validation on the training set and hold-out validation on the testing set. Additionally, further validation was conducted by testing our clinical and streamlined models (incorporating only the top five predictors) on patients from the POMA study. (Created with BioRender.com). FNIH, Foundation for the National Institutes of Health; OA, osteoarthritis; POMA, Pivotal Osteoarthritis Initiative MRI Analyses.

Model development using AutoPrognosis V.2.0

AutoPrognosis V.2.0 was used to develop models predicting accelerated knee OA progression. The framework, which is an updated and enhanced version of the original AutoPrognosis,24 uses advanced optimisation techniques to automatically create a weighted ensemble of ML pipelines, tailored to the specific variables and outcomes of the study population.25 These pipelines include choices for data imputation, feature processing and classification algorithms, along with their respective hyperparameters. AutoPrognosis V.2.0 design space encompasses 7 feature scaling algorithms, 7 feature selection algorithms, 12 imputation algorithms and 23 classification algorithms (full list in online supplemental table 2). In this study, to enhance computational efficiency, we used the default classification algorithms of AutoPrognosis V.2.0 (highlighted in bold in online supplemental table 2), selected for their speed and efficiency.
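To make the idea of a weighted ensemble concrete, the following sketch combines class-probability vectors from several pipelines for a single instance. It assumes the ensemble is a weighted average of predicted probabilities, which the text does not confirm; the function name, weights and probabilities are hypothetical and do not come from AutoPrognosis.

```python
def ensemble_predict_proba(pipeline_probas, weights):
    """Weighted average of per-pipeline class-probability vectors for one instance."""
    if len(pipeline_probas) != len(weights):
        raise ValueError("one weight is required per pipeline")
    total_weight = sum(weights)
    n_classes = len(pipeline_probas[0])
    return [
        sum(w * proba[c] for w, proba in zip(weights, pipeline_probas)) / total_weight
        for c in range(n_classes)
    ]


# Three hypothetical pipelines scoring one instance over the four outcome classes.
probas = [
    [0.7, 0.1, 0.1, 0.1],
    [0.4, 0.1, 0.4, 0.1],
    [0.2, 0.1, 0.6, 0.1],
]
combined = ensemble_predict_proba(probas, weights=[0.5, 0.3, 0.2])
predicted_class = max(range(len(combined)), key=combined.__getitem__)
```

Dividing by the total weight keeps the combined vector a valid probability distribution even when the weights do not sum to one.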

Our model was trained by conducting 100 iterations of Bayesian optimisation.24 At each iteration, the algorithm searched for a new ML pipeline and optimised its hyperparameters. Area under the precision-recall curve (AUC-PRC) was used to evaluate the performance of each pipeline and three weighted ML pipelines were combined to produce the final model. AUC-PRC was chosen as a pipeline evaluation metric because it can be applied to both binary and multi-label classification tasks, effectively addresses the class imbalance in our data set and enables performance comparison independent of classification thresholds. Additionally, AUC-PRC allows for the detection and differentiation of both positive and negative cases, providing a more comprehensive evaluation of model performance.26
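For readers unfamiliar with the metric, the sketch below computes a step-wise area under the precision-recall curve (often called average precision) for a binary task. It is a simplified illustration that assumes no tied scores; the implementation used by AutoPrognosis, and the way per-class curves are averaged in the multiclass setting, may differ. The function name and toy data are hypothetical.

```python
def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve for binary labels (1 = positive).

    Instances are ranked by descending score; precision is accumulated at each
    rank where a positive is found, then averaged over all positives.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            area += true_pos / rank  # precision at this recall step
    return area / n_pos


# Toy example: two progressors and two non-progressors with hypothetical scores.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

Because precision depends on the proportion of positives, the score of an uninformative classifier is close to the prevalence of the positive class rather than to 0.5, which is one reason the metric is informative for imbalanced outcomes.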

Model development and training were performed for various data subsets: (1) Clinical data including demographic information, patient-reported outcome measures (PROMs) and simple X-ray features such as KLG, joint space narrowing and medial minimum JSW (detailed in online supplemental table 1); (2) clinical and X-ray data with advanced X-ray features such as fractal bone trabecular integrity; (3) biochemical markers; (4) MRI data; and (5) the entire data set. The whole analysis was performed for both multiclass and binary predictions.

Additionally, streamlined models were built using only five ‘core’ variables, identified in our post-hoc interpretability analysis as pivotal in influencing model predictions.

Model interpretation

Another benefit of AutoPrognosis V.2.0 is its integration of advanced model interpretability tools that enable the evaluation of variables’ contributions to model predictions.

A post-hoc interpretability tool called ‘KernelSHAP’ was employed to agnostically assess the relative importance of features used to build our models. ‘KernelSHAP’ uses a weighted linear regression model to compute the importance of each feature.27 The five most highly ranked attributes were selected as ‘core’ variables and used for the development of new prediction models.
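The selection of ‘core’ variables can be sketched as a ranking step applied to a matrix of per-instance attributions. The sketch assumes that features are ranked by mean absolute attribution across instances, a common convention that the text does not state explicitly; the function name, feature names and values are hypothetical.

```python
def top_features(attributions, feature_names, k=5):
    """Rank features by mean absolute attribution across instances; return the top k names."""
    n_instances = len(attributions)
    mean_abs = [
        sum(abs(row[j]) for row in attributions) / n_instances
        for j in range(len(feature_names))
    ]
    ranked = sorted(zip(feature_names, mean_abs), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical attributions: two instances (rows) by three features (columns).
attributions = [
    [0.2, -0.5, 0.1],
    [-0.4, 0.3, 0.0],
]
core = top_features(attributions, ["feat_a", "feat_b", "feat_c"], k=2)
```

Taking absolute values before averaging prevents positive and negative contributions from cancelling out, so a feature that pushes predictions strongly in both directions still ranks highly.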

Validation of model performance

Stratified 10-fold cross-validation with three random seeds was conducted on our training set. The models, optimised for AUC-PRC during the development phase, were evaluated using multiple metrics: AUC-PRC, area under the receiver operating characteristic curve (AUC-ROC), weighted precision (or positive predictive value), weighted recall (also known as sensitivity or true-positive rate) and weighted F1-score (which is the harmonic mean of precision and recall). By ‘weighted’ we mean the average metric for all labels, weighted by the number of true instances for each label. For simplicity, in the rest of this paper we have omitted the word ‘weighted’ when discussing these metrics. For each metric, AutoPrognosis V.2.0 also allowed the calculation of 95% CIs.
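The weighting described above can be illustrated as follows. This is a minimal sketch with hypothetical names and toy labels, in which a label that is never predicted is assigned a precision of zero (an assumption; libraries differ on this edge case).

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Precision, recall and F1 averaged over labels, weighted by each label's true count."""
    support = Counter(y_true)
    n = len(y_true)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_predicted = sum(1 for p in y_pred if p == label)
        p_label = tp / n_predicted if n_predicted else 0.0
        r_label = tp / count
        denom = p_label + r_label
        f_label = 2 * p_label * r_label / denom if denom else 0.0
        weight = count / n
        precision += weight * p_label
        recall += weight * r_label
        f1 += weight * f_label
    return precision, recall, f1


# Toy labels for three outcome classes.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]
w_precision, w_recall, w_f1 = weighted_scores(y_true, y_pred)
```

With this weighting, recall reduces to the overall proportion of correctly classified instances, so frequent classes such as non-progressors dominate the reported precision, recall and F1-score.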

Further validation was conducted on the hold-out set (representing unseen data excluded from model development and training) and the external data set containing baseline data from the POMA study (figure 1). For this data set, knee OA outcomes were assessed at the 2-year follow-up time point. From the 1170 patients in the POMA study, 183 were also part of the FNIH OA Biomarkers Consortium and were therefore excluded from our validation set. Consequently, the validation cohort consisted of 987 patients encompassing 601 right and 502 left knees (1103 instances in total). Knees lacking sufficient data for outcome class assignment due to missing values were omitted. When data for both knees were available for a patient, only one knee was randomly selected, resulting in a total of 705 patients (383 right, 322 left knees).
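The final step of the cohort construction, keeping one randomly chosen knee per patient, might look like the sketch below. The function name, seed and toy records are hypothetical; only the cohort counts in the last lines are taken from the text.

```python
import random


def pick_one_knee_per_patient(knees, seed=0):
    """knees: list of (patient_id, side) tuples. Returns {patient_id: chosen_side}."""
    sides_by_patient = {}
    for patient_id, side in knees:
        sides_by_patient.setdefault(patient_id, []).append(side)
    rng = random.Random(seed)
    # Sorting makes the selection reproducible for a given seed.
    return {
        patient_id: rng.choice(sorted(sides))
        for patient_id, sides in sorted(sides_by_patient.items())
    }


# Toy records: one patient with both knees available, two with a single knee.
knees = [("p1", "right"), ("p1", "left"), ("p2", "left"), ("p3", "right")]
selected = pick_one_knee_per_patient(knees)

# Consistency check of the cohort counts reported in the text.
remaining_patients = 1170 - 183   # patients left after removing the FNIH overlap
candidate_knees = 601 + 502       # right plus left knees before filtering
final_knees = 383 + 322           # one knee per patient after filtering
```

Restricting each patient to a single knee keeps the external validation instances independent of one another, which correlated left and right knees from the same person would not be.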

Subgroup analysis

Subgroup analyses by age (age<60 vs age≥60), sex and ethnicity were conducted using the hold-out set. Evaluations were then performed on three distinct subgroups within the external data set: patients under 60 years, patients without initial X-ray signs of OA (KLG 0, a demographic not included in our training set) and patients displaying early-stage OA (KLG 0–1). Online supplemental figure 1 illustrates our subgroup analysis schematically.

Clinical demonstrators

A working prototype of a web-based application or ‘clinical demonstrator’ was also developed to illustrate the practical application of our tool to predict rapid knee OA progression (although it is not currently intended for use on any individual, including in any clinical or medical setting). This demonstrator was built and deployed using ‘Streamlit’ (https://streamlit.io/).

Results

Study population

The complete data set included 1691 instances, of which 41% were men and 59% were women, with ages ranging between 45 and 81 (online supplemental table 3). 60.6% of instances were OA non-progressors (Class 0), 7.7% pain-only progressors (Class 1), 25.9% radiographic-only progressors (Class 2) and 5.7% both pain and radiographic progressors (Class 3).

Model performance

Table 1 shows the predictive performance of all our models developed with AutoPrognosis V.2.0 while the final ML pipeline ensembles of each model are illustrated in online supplemental table 4.

Table 1

Models’ performance on 10-fold cross-validation. Cross-validation performance of autoML models created using AutoPrognosis V.2.0 for multiclass and binary predictions of rapid knee OA progression

The highest performance was achieved when all 304 variables were included (models AP5_mu and AP5_bi in table 1) with AUC-PRC 0.678 (95% CI: 0.676 to 0.680) for multiclass predictions; and AUC-PRC 0.635 (95% CI: 0.629 to 0.641) for binary predictions. The lowest performance was observed in models AP3_mu and AP3_bi, trained solely on biochemical marker data, with AUC-PRC 0.600 (95% CI: 0.597 to 0.603) and 0.523 (95% CI: 0.509 to 0.537), respectively. AUC-PRC and AUC-ROC were higher in multiclass models, whereas F1-score, precision and recall were higher in binary models.

Additionally, models AP5_top5_mu and AP5_top5_bi created using only five ‘core’ variables (identified by post-hoc interpretability analysis as the strongest contributors to the models’ predictions as outlined in figure 2 and online supplemental table 5), achieved performance scores similar to those of the larger models (AUC-PRC 0.648 (95% CI: 0.647 to 0.649) and 0.618 (95% CI: 0.613 to 0.623), respectively).

Figure 2

Overall feature importance. This figure illustrates the overall importance of features in models AP5_mu (left) and AP5_bi (right). All 304 variables were included in the analysis. A full description of each feature is outlined in online supplemental table 1.

Notably, models AP1_mu and AP1_bi (including clinical data and simple X-ray features like KLG) also demonstrated robust performance: AP1_mu achieved AUC-PRC 0.648 (95% CI: 0.646 to 0.650); whereas AP1_bi yielded AUC-PRC 0.613 (95% CI: 0.605 to 0.621).

Model interpretation

Figure 2 illustrates the overall impact of features in models AP5_mu and AP5_bi (encompassing all 304 variables) ranked according to their contributions to predictive outcomes. WOMAC pain and disability scores as well as MRI features such as MRI Osteoarthritis Knee Score (MOAKS) and percentage area of subchondral bone denuded of cartilage, emerged as the strongest predictors. Detailed descriptions of the five ‘core’ variables used to create our streamlined models are presented in online supplemental table 5, while descriptions of all other variables are shown in online supplemental table 1. Feature MOSFMA (a component of MOAKS used to assess the size of osteophytes at the femur's medial anterior trochlear region) was used as a ‘core’ feature in model AP5_top5_bi in place of biochemical marker Urine_alpha_NUM (urine CTX-1a) as the latter was not available in the external data set.

The impact distribution and average impact magnitude for the most important features across each outcome class in these models are illustrated in figures 3 and 4.

Figure 3

Assessment of feature impact on multiclass predictions. Assessment of feature impact on non-progression (Class 0, panel A), pain-only progression (Class 1, panel B), radiographic progression (Class 2, panel C) and both pain and radiographic progression (Class 3, panel D), using ‘Kernel-SHAP’ for multiclass predictions with model AP5_mu. Left—impact distribution of the most important features. The colour represents the feature value (red=high, blue=low). A positive SHAP value represents a positive impact on class prediction. Right—average impact magnitude of the most important features on class prediction.

Figure 4

Assessment of feature impact on binary predictions. Assessment of feature impact on non-progression (Class 0) and progression (Class 1), using “Kernel-SHAP” for binary predictions with model AP5_bi. Left and middle—impact distribution of the most important features for Class 0 (left) and Class 1 (middle). The colour represents the feature value (red=high, blue=low). A positive SHAP value represents a positive impact on class prediction. Right—average impact magnitude of the most important features on class prediction.

For multiclass predictions, MRI features and WOMAC scores were the most significant contributors across all outcome classes (figure 3). Urine CTX-1a (Urine_alpha_NUM) emerged as the most important biochemical marker significantly affecting the prediction of all classes. Pain-only progression (Class 1) was also influenced by BTI_H2 (an imaging biomarker used to assess the microstructural integrity of trabecular bone in the horizontal plane), the use of medication for knee pain, aching or stiffness (P01KPMEDCV) and age (figure 3B).

Similar results were observed for binary predictions except for a stronger contribution from urine CTX-1a and serum hyaluronic acid (Serum_HA_NUM) (figure 4).

Validation of model performance

Hold-out validation

Models AP1_mu and AP1_bi (only clinical features), AP5_mu and AP5_bi (all available features) and AP5_top5_mu and AP5_top5_bi (five ‘core’ features) were validated on the hold-out set. All models obtained similar performance scores to those from internal cross-validation, as shown in table 2. Again, multiclass models yielded higher AUC-PRC and AUC-ROC scores while binary models had greater F1-score, precision and recall.

Table 2

Validation of models’ performance. Hold-out and external validation performance scores for multiclass and binary predictions. External validation was conducted on patients from the POMA study

Interestingly, clinical models AP1_mu and AP1_bi, and streamlined models AP5_top5_mu and AP5_top5_bi achieved similar or better performance than the most comprehensive models.

Precision-recall curves (PRCs) and confusion matrices for each model are displayed in online supplemental figure 2 and online supplemental figure 3.

External validation

Due to the absence of several features in the POMA data set (including biochemical markers and complex X-ray features), only clinical models AP1_mu and AP1_bi and streamlined models AP5_top5_mu and AP5_top5_bi were further validated on this external set.

The POMA data set exhibited comparable proportions across outcome classes to our training set; however, it included a much greater fraction of patients with KLG 1 (32.9% vs 11.0%), and, notably, a substantial number of patients with KLG 0 (23.4%), a group absent from our training set (online supplemental table 6).

Highest performance was achieved by models AP1_mu and AP1_bi with AUC-PRC 0.727 (95% CI: 0.726 to 0.728) and 0.764 (95% CI: 0.762 to 0.766), respectively. All external validation results are presented in table 2 while PRCs and confusion matrices for each model are displayed in online supplemental figure 4 and online supplemental figure 5.

Subgroup analysis

Hold-out subgroups

The demographic profiles of the hold-out subpopulations studied are presented in online supplemental table 7. Only White and Black ethnicities were analysed due to the small number of patients belonging to the other groups.

The performance scores achieved by models AP5_mu and AP5_bi (our most comprehensive models) for each subgroup are presented in table 3. Model AP5_mu demonstrated improved performance in patients aged≥60 with AUC-PRC 0.685 compared with 0.644 in those aged<60. Sex differences were also evident with female patients achieving an AUC-PRC of 0.707, significantly higher than the 0.619 observed in male patients. Ethnicity-related performance varied with patients of White ethnicity showing an AUC-PRC of 0.706, markedly higher than the 0.558 observed in Black and African-American patients. Similarly, model AP5_bi showed higher AUC-PRC scores for female patients (0.702) compared with male patients (0.522) and better performance in younger patients (AUC-PRC 0.676 for age<60) compared with older ones (AUC-PRC 0.564 for age≥60). Ethnicity disparities persisted in binary models with White patients achieving an AUC-PRC of 0.563 versus 0.632 for Black patients.

Table 3

Models’ performance in subgroup analysis. Models’ performance in various subgroups from the hold-out and external validation sets

The results of our post-hoc interpretability analyses of each subgroup are illustrated in figure 5. For multiclass predictions, WOMAC pain and disability scores were particularly significant for all subgroups, especially for younger, female and Black patients. MRI features, including MOAKS, cartilage thickness and the percentage area of subchondral bone denuded of cartilage, also consistently ranked highly across all subgroups. Urine CTX-1a emerged once again as the most important biochemical marker, especially for patients of Black ethnicity.

Figure 5

Overall feature importance in the hold-out subgroup analysis. This figure illustrates the overall importance of features in models AP5_mu (left) and AP5_bi (right) for the following subgroups in the hold-out data set: (A) Age<60; (B) Age≥60; (C) Male; (D) Female; (E) White ethnicity; and (F) Black ethnicity.

For binary predictions, WOMAC disability score and MRI features remained important predictors across all subgroups. Urine CTX-1a also demonstrated a very strong contribution while serum hyaluronic acid emerged as an additional important predictor, especially in young patients. WOMAC pain, on the other hand, was significantly less influential in binary models compared with multiclass models.

Online supplemental figures 6–17 illustrate the impact distribution and average impact magnitude of the most important features across each outcome class for all subgroups.

External subgroups

Online supplemental file 8 shows the demographic characteristics of the subpopulations in the external validation set. Notably, the young cohort exhibited significantly higher proportions of knees classified as KLG 0 or 1 (27.8% and 41.3%, respectively), in comparison to our training data set (0% and 11.0%). Additionally, subgroups with early-stage OA (KLG 0–1) and no initial radiographic signs of OA (KLG 0) demonstrated substantially greater rates of non-progression (74.9% and 74.4%) than observed in our training set (60.6%).

Performances of models AP1_mu, AP1_bi, AP5_top5_mu and AP5_top5_bi on these subgroups are presented in table 3. Both multiclass models achieved high predictive performance, particularly in the KLG 0–1 and KLG 0 subgroups (AUC-PRC 0.724–0.806).

In contrast, binary models exhibited comparatively lower AUC-PRC and AUC-ROC scores, but higher F1-score, precision and recall. Although performance remained robust in the young subgroup (AUC-PRC 0.617–0.693), both binary models showed mixed results in the KLG 0–1 and KLG 0 cohorts with low AUC-PRC and AUC-ROC values, but high F1-score, precision and recall.

Clinical demonstrators

Clinical demonstrators were built using our clinical models and can be accessed via these links:

WOMAC pain and disability scores were not included as variables in these prototypes to prevent any possible copyright infringement.

These clinical demonstrators allow intuitive and streamlined visualisation of our models’ predictions for individual patients along with the relative impact of each feature on these personalised predictions, as elucidated by Kernel SHAP.

Discussion

We developed autoML models to predict rapid knee OA progression over 2 years. Our most reliable models incorporated clinical, X-ray, MRI and biochemical features resulting in an ‘information gain’ compared with models using only a subset of these data. Additionally, AutoPrognosis V.2.0 introduced a ‘modelling gain’, by selecting the most suitable algorithms in a fully data-driven manner, without prior assumptions. In light of this ‘modelling gain’, model performance was not significantly affected when only ‘core’ variables were used. This is important as it facilitates the translation of our models to clinical practice where it may not be feasible, nor logical, to measure over 300 variables for each patient.

Although several studies have previously attempted to predict knee OA progression using baseline biomarker data (with or without ML), direct comparison with our research is not possible, primarily because definitions of OA progression and validation methods are inconsistent across studies. For instance, Hunter et al 19 employed logistic regression models with imaging and biochemical markers achieving AUC-ROC 0.716–0.732 for radiographic progression and AUC-ROC 0.668–0.694 for both pain and radiographic progression over 4 years. Widera et al,23 in contrast, constructed random forest models to predict progression over 2 years, using similar class definitions to ours but relying solely on clinical and X-ray data, resulting in F1-scores of 0.560–0.698.

Unlike earlier research, we took a completely data-driven approach to model development by employing AutoPrognosis V.2.0. We incorporated a wider variety of data types, including clinical data, PROMs, X-rays, MRIs and biochemical markers which enriched our predictive models. We also made significant efforts to enhance the transparency of our models through post-hoc interpretability analysis and the development of clinical demonstrators.

We agnostically identified key predictors of rapid knee OA progression, particularly PROMs like WOMAC pain and disability scores, MRI features such as MOAKS and area of subchondral bone denuded of cartilage and biochemical markers such as urine CTX-1a. We believe this transparency will help build trust among clinicians and patients, potentially accelerating healthcare adoption. Furthermore, our analysis highlights the importance of PROMs in prognostic modelling of a complex condition like knee OA, reflecting a critical step towards the humanisation of AI in healthcare.28 By incorporating PROMs, our tool assimilates the patients’ own perceptions of their symptoms, empowering collaborative, informed healthcare decisions.

To the best of our knowledge, this study is the first to apply these predictive models and assess feature importance in multiple OA patient subgroups, including patients under 60 who constitute a significant proportion of the knee OA population and may particularly benefit from early intervention.10 11 13 29 Subgroup analysis is essential to identify and address potential biases ensuring the models’ accuracy and applicability across diverse populations.30 31

A critical component of the study was the thorough validation of our models using multiple performance metrics alongside techniques such as stratified 10-fold cross-validation, hold-out validation and external validation with the POMA cohort, representing a separate, nested study within the broader OAI framework. Although this methodology confirmed our models’ reliability, future research in diverse clinical settings and new cohorts is essential to assess their clinical utility and generalisability across diverse patient demographics.32
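As an illustration of the stratified 10-fold cross-validation mentioned above, the following minimal sketch splits a labelled data set so that each fold preserves the class proportions as closely as possible. It is a generic pure-Python illustration and not the implementation used by AutoPrognosis. The function name, class labels and class sizes are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs with class proportions
    roughly preserved in every fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(idx for j, fold in enumerate(folds)
                           if j != i for idx in fold)
        yield train_idx, test_idx


# Invented, imbalanced toy labels (not the study's class distribution).
labels = ["none"] * 60 + ["pain"] * 30 + ["both"] * 10

splits = list(stratified_k_fold(labels, k=10, seed=0))
fold_counts = [Counter(labels[i] for i in test_idx)
               for _, test_idx in splits]
for number, counts in enumerate(fold_counts, start=1):
    print(f"fold {number}: {dict(counts)}")
```

Stratification matters most for the smallest classes: with a plain random split, a fold could contain no examples of a rare progression type at all, which would make per-class metrics on that fold undefined or unstable.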

Even though our training cohort included only patients with radiographic evidence of knee OA (KLG 1–4), our models demonstrated robust performance when validated on the POMA data set which has a high proportion of patients with KLG 0–1. Interestingly, models using only clinical variables showed the strongest external validation performance, although missing features in the external data set meant that the most comprehensive models could not be externally validated. Relying on clinical features is advantageous in clinical practice as they are inexpensive and easily collected. This is particularly relevant in resource-constrained environments where comprehensive data collection might be challenging.

Our multiclass models demonstrated high predictive performance in younger patients and those with early-stage OA, offering the dual advantage of reliability in high-risk groups and patient phenotyping based on progression type. This aligns with our aim to predict early disease progression, providing a potential ‘window of opportunity’ for interventions (ranging from lifestyle modifications and rehabilitation, to reparative and regenerative therapies) to arrest or slow down disease progression.33 In contrast, our binary models, while performing well on the entire POMA study cohort, showed mixed performance across metrics when applied to early-OA subgroups. This underscores the need to refine these models by incorporating data specifically from patients in the early stages of OA.

Our study has other limitations that should be addressed in future work. The use of data sets from the same overall study (OAI) for both training and validation may restrict generalisability despite employing cross-validation techniques and conducting validation on multiple data sets and subgroups. Future research should validate these models on completely independent data sets from diverse geographic and demographic backgrounds to ensure broader applicability. Additionally, although WOMAC scores are commonly used in research, their copyright protection may limit their use in clinical practice. Finally, when validating our models, confusion matrices revealed that classes with the smallest sample sizes were less accurately predicted, especially in the multiclass models. However, the accuracy of these minority classes can be significantly improved by adjusting the probability threshold during class assignment (as demonstrated by the PRCs). Moreover, the same models achieved high AUC-PRC and AUC-ROC, indicating strong overall performance independent of classification thresholds.
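The effect of adjusting the probability threshold during class assignment can be sketched as follows. This is an illustrative example under stated assumptions, not the procedure used in the study. The class names, predicted probabilities, the 0.25 threshold and the helper functions are all hypothetical.

```python
def assign_class(probabilities, minority_class=None, threshold=None):
    """Return the predicted class for one sample.

    Without a threshold, the class with the highest probability wins.
    With a threshold, the minority class is assigned whenever its
    probability reaches that threshold, even if another class is higher.
    """
    if minority_class is not None and threshold is not None:
        if probabilities[minority_class] >= threshold:
            return minority_class
    return max(probabilities, key=probabilities.get)


def class_recall(y_true, y_pred, target):
    """Fraction of samples of class `target` that were predicted as such."""
    predicted = [p for t, p in zip(y_true, y_pred) if t == target]
    if not predicted:
        return 0.0
    return sum(1 for p in predicted if p == target) / len(predicted)


# Invented predicted probabilities; "both" plays the minority class.
y_true = ["both", "both", "both", "none", "none", "pain"]
y_prob = [
    {"none": 0.50, "pain": 0.20, "both": 0.30},
    {"none": 0.30, "pain": 0.10, "both": 0.60},
    {"none": 0.45, "pain": 0.27, "both": 0.28},
    {"none": 0.70, "pain": 0.20, "both": 0.10},
    {"none": 0.60, "pain": 0.10, "both": 0.30},
    {"none": 0.30, "pain": 0.60, "both": 0.10},
]

default_pred = [assign_class(p) for p in y_prob]
tuned_pred = [assign_class(p, minority_class="both", threshold=0.25)
              for p in y_prob]

recall_default = class_recall(y_true, default_pred, "both")
recall_tuned = class_recall(y_true, tuned_pred, "both")
print(f"recall for 'both': default={recall_default:.2f}, "
      f"tuned={recall_tuned:.2f}")
```

In this toy case the lower threshold recovers the minority-class samples that the default highest-probability rule missed, at the cost of one additional false positive. That trade-off between recall and precision is what a precision-recall curve traces out.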

While these limitations highlight the need for model refinement and further training prior to clinical implementation, this study demonstrates the significant potential of a fully data-driven autoML approach and the utility of biomarker identification and subgroup analysis in predicting knee OA progression.

We believe our approach is not only applicable to OA but could also serve as a model for other complex degenerative conditions (such as multiple sclerosis and Parkinson’s disease) which share common challenges, including chronicity, unmet clinical needs and difficulties in early diagnosis.34 35 By tailoring data inputs and fine-tuning models to these diseases, our method holds significant potential for the prediction and monitoring of such conditions. This ML application represents a step towards a more tailored and precise approach to healthcare, addressing on the one hand the personalised needs of the individual patient while on the other delivering impact on a societal scale.

Data availability statement

Data are available in a public, open access repository. Data and/or research tools used in the preparation of this manuscript were obtained and analysed from the controlled access data sets distributed from the Osteoarthritis Initiative (OAI), a data repository housed within the National Institute of Mental Health (NIMH) Data Archive. OAI is a collaborative informatics system created by the NIMH and the National Institute of Arthritis, Musculoskeletal and Skin Diseases to provide a worldwide resource to quicken the pace of biomarker identification, scientific investigation and OA drug development (DOI: 10.15154/1vhq-h028). Data provided from the FNIH OA Biomarkers Consortium Project (available at https://nda.nih.gov/oai/) were made possible through grants and direct or in-kind contributions by: AbbVie; Amgen; Arthritis Foundation; Artialis; Bioiberica; BioVendor; DePuy; Flexion Therapeutics; GSK; IBEX; IDS; Merck Serono; Quidel; Rottapharm | Madaus; Sanofi; Stryker; the Pivotal OAI MRI Analyses study, NIH HHSN2682010000 21C; and the Osteoarthritis Research Society International. The OAI is a public-private partnership comprised of five contracts (N01-AR-2-2258; N01-AR-2-2259; N01-AR-2-2260; N01-AR-2-2261; N01-AR-2-2262) funded by the National Institutes of Health. Funding partners include Merck Research Laboratories; Novartis Pharmaceuticals, GlaxoSmithKline; and Pfizer. Private sector funding for the consortium and OAI is managed by the Foundation for the National Institutes of Health.

Code availability. The AutoPrognosis V.2.0 open-source software package is available at https://www.autoprognosis.vanderschaar-lab.com/.

Ethics statements

Patient consent for publication

Ethics approval

This study was performed retrospectively using data from human subjects which was openly accessible through the Osteoarthritis Initiative (OAI) (https://nda.nih.gov/oai). Since the OAI had already secured ethical approval and obtained informed consent from the participants and the data was released under an open access permission group, there was no need for additional ethical approval for our study. All individuals had provided informed consent prior to their inclusion in the OAI study.

Acknowledgments

We extend our gratitude to the participants of the Osteoarthritis Initiative for their invaluable contributions to this research. Their willingness to share data and experiences has been instrumental in advancing our understanding of osteoarthritis. Additionally, we acknowledge the dedicated members of the patient and public involvement panel at Addenbrooke’s Hospital, Cambridge, UK, for their insights and guidance which have greatly enriched the scope and relevance of our study. A previous version of our work was presented at the 2023 European Orthopaedic Research Society and British Orthopaedic Research Society conferences.

References

Footnotes

  • Handling editor Josef S Smolen

  • Contributors All authors contributed to the conceptualisation and design of the study. SC contributed to the curation and analysis of the data. MB, MvdS and AM supervised the study. All authors contributed to the interpretation of the data, the drafting of the article and final approval of the version to be submitted. AM is the guarantor of the study. ChatGPT, an AI language model developed by OpenAI, was used exclusively to assist in improving the clarity and legibility of a few sentences in the initial drafting of the manuscript, though these sections have been substantially revised by the authors to generate the final version. It did not contribute to the creation of content or the analysis of data.

  • Funding SC is supported by the Louis and Valerie Freedman Studentship in Medical Sciences from Trinity College Cambridge, the ORUK/Versus Arthritis: AI in MSK Research Fellowship (G124606) and the Addenbrooke’s Charitable Trust (ACT) Research Advisory Committee grant (G123290). At the start of the study, SC was also supported by the National Institute for Health and Care Research (NIHR) (ACF-2021-14-003). AM and MB are supported by the NIHR Cambridge Biomedical Research Centre (NIHR203312) and receive funding from Versus Arthritis (grant 21156). The views expressed are those of the authors and not necessarily those of the NIHR or the Department of Health and Social Care. The funders of the study were not involved in the design, data collection, analysis, interpretation or writing of this study.

  • Competing interests None declared. We confirm that we have read the journal’s position on issues involved in ethical publication and affirm that this report is consistent with those guidelines.

  • Patient and public involvement Patients and the public were involved early in our research, contributing to the development of our research questions and outcome measures. Their input, gathered through a focus group with the Patient and Public Involvement team at Addenbrooke’s Hospital, Cambridge, UK, informed the design of our study and our clinical demonstrators. While direct involvement in recruitment and study conduct was not applicable due to the nature of our data, their perspectives on the usability and implications of our research were integral. Our dissemination strategy includes regular interactions with this group, collaborations with patient groups and relevant charities (such as Osteoarthritis Research UK (ORUK) and Versus Arthritis), and public-friendly summaries of our findings to ensure ongoing, reciprocal communication and feedback.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.