Molecular pathways in patients with systemic lupus erythematosus revealed by gene-centred DNA sequencing

Objectives Systemic lupus erythematosus (SLE) is an autoimmune disease with extensive heterogeneity in disease presentation between patients, which is likely due to an underlying molecular diversity. Here, we aimed at elucidating the genetic aetiology of SLE from the immunity pathway level to the single variant level, and stratify patients with SLE into distinguishable molecular subgroups, which could inform treatment choices in SLE. Methods We undertook a pathway-centred approach, using sequencing of immunological pathway genes. Altogether 1832 candidate genes were analysed in 958 Swedish patients with SLE and 1026 healthy individuals. Aggregate and single variant association testing was performed, and we generated pathway polygenic risk scores (PRS). Results We identified two main independent pathways involved in SLE susceptibility: T lymphocyte differentiation and innate immunity, characterised by HLA and interferon, respectively. Pathway PRS defined pathways in individual patients, who on average were positive for seven pathways. We found that SLE organ damage was more pronounced in patients positive for the T or B cell receptor signalling pathways. Further, pathway PRS-based clustering allowed stratification of patients into four groups with different risk score profiles. Studying sets of genes with priors for involvement in SLE, we observed an aggregate common variant contribution to SLE at genes previously reported for monogenic SLE as well as at interferonopathy genes. Conclusions Our results show that pathway risk scores have the potential to stratify patients with SLE beyond clinical manifestations into molecular subsets, which may have implications for clinical follow-up and therapy selection.


Subjects and DNA samples
The Swedish SLE cohorts included 1,167 SLE patients recruited at the Rheumatology clinics at the

Targeted DNA sequencing
Targeted DNA sequencing was performed in the Swedish SLE case-control cohorts. The design of the sequence capture panel and the library preparation has been described elsewhere.(4) In brief, a custom SeqCap EZ Choice XL library (Roche NimbleGen, Basel, Switzerland) was designed to target 1,853 genes, selected based on their known or suspected roles in immunological or autoimmune diseases in humans or model organisms.(4) The genomic intervals for all alternative transcripts were retrieved from NCBI36/hg18. Besides the coding exons, 5′ and 3′ UTRs, potential promoter regions were also included, as well as regions of mammalian conservation within 100kb up-and downstream of the genes. (5) In total, the designed probes covered 32.3 Mbp. Sequencing libraries were prepared by ultrasonication of up to 1 μg of high molecular weight DNA into around 400 bp fragments (Covaris E220, Woburn, MA, USA), that were then barcoded (NEXTflex-96 DNA barcode adapters, Bioo Scientific, Austin, TX, USA). Samples were pooled in batches of eight, hybridized (Roche NimbleGen) and sequenced with 100-bp paired-end reads using Illumina HiSeq 2500 version 3 or 4 chemistry (Illumina Inc, San Diego, CA, USA). An average sequencing depth of 35× per sample was achieved.

Alignment and variant calling
A pipeline based on GATK "best practices" was used for variant discovery.(6) Raw reads were mapped to the hg19 human reference genome using the Burrows-Wheeler aligner 0.7.12 (7) and duplicate reads marked by Picard 1.92. GATK 3.3.0 was applied for realignment around indels, base quality score recalibration, SNP and indel discovery and joint genotyping. Prior to genotyping, alignment quality was evaluated by Samtools flagstat (7) and Picard tools CalculateHSMetrics and samples with mean target coverage less than 10x were excluded. From this point on, only bi-allelic single nucleotide variants (SNVs) were considered. SNV quality scores were recalibrated using GATK 3.3.0 VariantRecalibrator and filtered at tranche level 99.0. Using VCFtools,(8) genotype calls with depth less than 8 reads and genotype Phred quality score less than 20 were excluded. (9) Sample and variant level quality control Study population genetic structure was analysed by the LASER software using default parameters and the Human Genome Diversity Project (HGDP) as reference population (online supplementary Figure   S7a).(10, 11) Population outliers were defined using the following criteria: 1) study subjects falling more than five standard deviations outside of the mean of the European sub-population of the HGDP reference set were excluded, 2) mean and standard deviation were calculated for the remaining study subjects and any additional subjects falling more than five standard deviations outside of the BMJ Publishing Group Limited (BMJ) disclaims all liability and responsibility arising from any reliance Supplemental material placed on this supplemental material which has been supplied by the author(s) Relatedness among study subjects was determined using the KING software, applying default thresholds for duplicate and first degree relationships. (12) Extreme sample outliers were identified based on several quality control (QC) measures, as suggested by Do et al. (13) These QC parameters included rate of missing data, heterozygosity ratio, transition-transversion ratio and singleton counts.
Further, samples were excluded if they exhibited discordance between reported sex and that inferred from sequence data or if they exhibited discordance between genotypes inferred from sequence data and a genotype dataset from a previous study. (14) Lastly, it was required that samples had a minimum call rate of 80%.
A number of filters were applied to exclude low quality variants. Heterozygous calls were included only if their allelic balance across all samples was between 0.2 and 0.8. Positions deviating from Hardy-Weinberg equilibrium (P <1x10 -6 , calculated on controls) were excluded as well as monomorphic sites. Finally, a minimum of 90% variant call rate was required. The remaining variant positions were investigated for differential missingness between cases and controls using PLINK (15), and significantly different variants were excluded (P <0.05 Bonferroni corrected). An overview of the QC steps can be found in online supplementary Figure Figure S8).

Variant level annotation
Variant annotation was performed using SnpEff v4.2.(16) Non-synonymous variants were defined as SNVs annotated as missense or nonsense variants. Non-coding SNVs were defined as SNVs annotated BMJ Publishing Group Limited (BMJ) disclaims all liability and responsibility arising from any reliance Supplemental material placed on this supplemental material which has been supplied by the author(s)  (17) Evolutionarily constrained positions were defined as having a Genomic Evolutionary Rate Profiling (GERP) rejected substitutions (RS) score >2. (18) In analyses of rare SNVs, variants with MAFs <0.01 were included, and for common SNVs variants with MAFs ≥0.05 were included.
Aggregate association testing was performed separately for each variant-set using SKAT-O with the inclusion of the first three principal components generated in EIGENSOFT as covariates. (32) We employed a weighted linear kernel using the default weights as calculated internally by the beta distribution with parameters a=1 and b=25, giving higher weight to rare variants. To ensure reproducible outcomes we set the random number seed value in R to 1,337 before running SKAT-O.
For P-value calculation we used the "hybrid" approach that selects the optimal method based on the

Risk scores and cluster analysis
Cumulative pathway SLE polygenic risk scores (pathway PRSs) were assigned to each individual based on SNVs associated with SLE at nominal significance (P <0.05) in the SLE case-control single variant association study. The Plink function "clump" was used to remove SNV in high LD (r 2 > 0.2) within 250 kbp and to only retain those variants with the highest phenotype association. 1,296 SLE associated SNVs were retained. Then, for each SNV, the natural logarithm of the OR for SLE susceptibility was

Replication study and meta-analysis
The replication study included Norwegian and Danish SLE cohorts recruited at the Oslo University Hospital, Rigshospitalet in Copenhagen, Odense University Hospital and Aarhus University Hospital.
Only SLE patients fulfilling the ACR SLE classification criteria and of self-reported European ancestry were included in association analyses. Norwegian and Danish control individuals from the University Hospitals in Stavanger, Bergen, Odense, Aalborg and Rigshospitalet in Copenhagen were also included. All subjects provided informed consent to participate in the study, and the study was approved by the regional ethics boards. 20 SNVs representing association signals at three loci (CAPN13, IFNK/MOB3B, HAL) or their proxies (LD r 2 ≥0.99) were either genotyped or extracted/imputed from existing sequencing or GWAS array data.
Genotyping was performed using the iPLEX chemistry on pipeline.(40) Imputed genotype calls with genotype probabilities below 0.9 were set to missing and SNVs with a MAF below 0.01, a Hardy-Weinberg equilibrium P<0.0001, a call rate below 95% or an imputation probability score below 0.8 were removed, as were individuals with a call rate below 95%. The Swedish SLE case-control study was expanded to include genotypes from an additional 1,000 control individuals from the SweGen project version September 4 th , 2017 generated by Science for Life Laboratory.(23) Genotypes for proxy variants that were part of the replication study genotyping, but which were not directly called in the targeted sequencing data, were imputed and qualitycontrolled as above. Single variant association analyses were performed separately for the expanded Swedish, the Norwegian and the Danish case-control studies in PLINK using a logistic regression model. Meta-analysis of the three association studies results was performed in PLINK assuming a random effects model. The meta-analysis included a total of 1,794 Scandinavian SLE patients and 3,241 control individuals.