Stratification of knee osteoarthritis: two major patient subgroups identified by genome-wide expression analysis of articular cartilage

Introduction Osteoarthritis (OA) is a heterogeneous and complex disease. We have used a network biology approach based on genome-wide analysis of gene expression in OA knee cartilage to seek evidence for pathogenic mechanisms that may distinguish different patient subgroups. Methods Results from RNA-Sequencing (RNA-Seq) were collected from intact knee cartilage at total knee replacement from 44 patients with OA, from 16 additional patients with OA and 10 control patients with non-OA. Results were analysed to identify patient subsets and compare major active pathways. Results The RNA-Seq results showed 2692 differentially expressed genes between OA and non-OA. Analysis by unsupervised clustering identified two distinct OA groups: Group A with 24 patients (55%) and Group B with 18 patients (41%). A 10 gene subgroup classifier was validated by RT-qPCR in 16 further patients with OA. Pathway analysis showed increased protein expression in both groups. PhenomeExpress analysis revealed group differences in complement activation, innate immune responses and altered Wnt and TGFβ signalling, but no activation of inflammatory cytokine expression. Both groups showed suppressed circadian regulators and whereas matrix changes in Group A were chondrogenic, in Group B they were non-chondrogenic with changes in mechanoreceptors, calcium signalling, ion channels and in cytoskeletal organisers. The gene expression changes predicted 478 potential biomarkers for detection in synovial fluid to distinguish patients from the two groups. Conclusions Two subgroups of knee OA were identified by network analysis of RNA-Seq data with evidence for the presence of two major pathogenic pathways. This has potential importance as a new basis for the stratification of patients with OA for drug trials and for the development of new targeted treatments.

(5μm thickness) and stained with 0.1% safranin O-fast green for histological grading using a modified Mankin score, as previously described.
[1] The proteoglycan (GAG) content of cartilage tissue from the PLC and DMC was determined after overnight digestion in papain at 60 C using the dimethylmethylene blue (DMMB) assay with absorbance read at 570 nm and the total DNA was determined on the same digest using Hoechst method.[4,5]

RNA extraction
Total RNA was extracted from each patient sample of 200 to 400 mg of cartilage using TRIzol (LifeTechnologies) reagent and homogenisation (Braun Mikrodismembrator) following freezing in liquid nitrogen. The RNA was purified using RNeasy Qiagen clean-up columns (Qiagen) and for sequencing had a RIN score of >6 (2200 TapeStation, Agilent Technologies).

RT-qPCR
cDNA was synthesised from 0.5 to 1μg of total RNA using MLV reverse transcriptase and random hexamers (Life Technologies). For RT-qPCR analysis primer details are listed in Supplementary   Table 9. Gene expression was normalised to an average of CANX (calnexin), CSDE1 (cold shock domain containing E1) and EIF4G2 (eukaryotic translation initiation factor 4 gamma 2). Reference genes were chosen by identifying the least variable genes between the OA patients in our data.
Relative gene expression levels were determined using the 2 −ΔΔCt analysis method.
[6] Differences in expressed genes were identified using the Mann-Whitney U test where P-values ≤0.05 were considered significant.

Network NMF
The human ConsensusPathDB (v30) network of experimentally derived protein-protein interactions was downloaded and ID conversion performed using BioMart [11,12] An adjacency matrix of a nearest neighbours network derived from the graph Laplacian of an influence distance matrix was created. [13] Network NMF was implemented as a model in the NMF package framework within R.
[14] The parameter lambda which regulates the relative influence of the network on the clustering solution was found to produce stable results within a reasonable range as previously reported. A final lambda value of 25 was used. For comparison, standard NMF with the Lee algorithm was performed using the same expression data. [15] Consensus clustering with random sampling of 90% of the patients and 90% of the genes each run was performed. [16] This procedure was repeated 500 times and clustering outcomes were kept as a co-clustering matrix which records the frequency of each patient pair observed with membership in the same subtype, over all clustering runs in which both patients of the pair were sampled. The final clustering solutions were produced by hierarchical clustering of these co-clustering matrices. The number of subgroups (k) was trialled from k=2-4. The mean silhouette score were used to assess cluster stability and the patient fit to a subgroup. [17] The silhouette score was calculated using the euclidean distance. For patient i assigned to subgroup j the silhouette score of that patient S(i) was calculated as follows: For a patient sample i Then: S(i) therefore lies between 1 and -1 which indicate strong cluster similarity and dissimilarity respectively. The mean silhouette score for each subgroup was calculated as a measure of subgroup stability. Those patients showing poor association to their assigned cluster (negative silhouette score) were removed from subsequent analysis.

Reactome pathway analysis
Differentially expressed genes were used with the R package GOSeq to identify dysregulated Reactome pathways. [18,19] Pathways with Benjamini-Hochberg (BH) corrected p-values of ≤0.05 were regarded as significant.

PhenomeExpress sub-network identification
The Cytoscape PhenomeScape app was used to identify groups of interacting differentially expressed genes related to OA phenotypes. [20] The default human network and PhenomeExpress settings were used with a size parameter of 7. Phenotypes relevant to OA were chosen from the UberPheno ontology (MP:0003436, HP:0001387, MP:0003724, HP:0002758).

Creation of RT-qPCR panel
The Bioconductor package pamr was used with default settings to perform shrunken centroid clustering with the normalised RNA-Seq data and the identified patient subgroup labels to find the minimal gene set required to accurately classify the samples. [21] The RNA-Seq data was further filtered to keep only strongly expressed genes (median count 5000 reads across all the samples) so to allow robust RT-qPCR quantification. 5-fold cross-validation was performed using a nearest centroid classifier and the smallest gene set that provided the highest class prediction accuracy compared with the classifications made by the complete RNA-Seq dataset was used to inform the gene panel selection.

Classifiers for RNA-Seq and RT-qPCR
The R package caret was used to train support vector machine (SVM) classifiers. [22] The parameter c was tuned by 100 repeats of 5-fold cross-validation. The normalised, batch effect corrected RNA-Seq data and the RT-qPCR 2 −ΔΔCt values were used to train the classifiers and subsequently used to predict the unknown samples using each classifier. The R pROC package was used to calculate the AUC for the ROC of the RT-qPCR predictions to the RNA-Seq classifier predictions. [23] Identification of secreted proteins