Genetics of rheumatoid arthritis: 2018 status
  1. Yukinori Okada1,2,
  2. Stephen Eyre3,
  3. Akari Suzuki2,
  4. Yuta Kochi2,
  5. Kazuhiko Yamamoto2
  1. 1 Department of Statistical Genetics, Osaka University Graduate School of Medicine, Osaka, Japan
  2. 2 Laboratory for Autoimmune Diseases, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
  3. 3 Division of Musculoskeletal and Dermatological Sciences, School of Biological Sciences, The University of Manchester, Manchester, UK
  1. Correspondence to Dr Kazuhiko Yamamoto, Laboratory for Autoimmune Diseases, RIKEN Center for Integrative Medical Sciences, Yokohama 230-0045, Japan; kazuhiko.yamamoto{at}


Study of the genetics of rheumatoid arthritis (RA) began about four decades ago with the discovery of HLA-DRB1. Since the beginning of this century, a number of non-HLA risk loci have been identified through genome-wide association studies (GWAS). We now know that over 100 loci are associated with RA risk. Because genetic information implies a clear causal relationship to the disease, research into the pathogenesis of RA should be promoted. However, only 20% of GWAS loci contain coding variants, with the remaining variants occurring in non-coding regions, and therefore, the majority of causal genes and causal variants remain to be identified. The use of epigenetic studies, high-resolution mapping of open chromatin, chromosomal conformation technologies and other approaches could identify many of the missing links between genetic risk variants and causal genetic components, thus expanding our understanding of RA genetics.

  • gene polymorphism
  • rheumatoid arthritis
  • autoimmune diseases


Rheumatoid arthritis (RA) is an inflammatory rheumatic disease that causes chronic synovial inflammation, eventually leading to disabling joint destruction as well as systemic complications.1 Most epidemiologic studies indicate that the prevalence of RA is 0.5%–1.0%. Between 70% and 80% of patients with RA have autoantibodies such as rheumatoid factor and anti-citrullinated protein antibodies (ACPA), suggesting that RA is an autoimmune disease.1

The majority of rheumatic diseases involve complex traits in which multiple genetic and environmental factors interact. Twin studies have estimated that the heritability of RA is ~60%.2 This applies primarily to patients with RA who are positive for ACPAs, whereas the heritability of seronegative RA appears to be lower. Since 2007, rapid advances in genome-wide association study (GWAS) technologies have facilitated the identification of hundreds of genetic risk factors for many complex diseases.3 To date, more than 100 genetic loci have been associated with RA.4 However, the relationship of these loci to the disease remains to be elucidated.

As a genetic factor has a clear causal relationship to RA, it is important to understand the pathologic process from a genomic standpoint. Recent studies of complex trait diseases have indicated that many disease susceptibility variants regulate the expression levels of a number of genes that function in a cell-specific manner.5 Furthermore, the epigenome is thought to play an important role in this phenomenon. Obtaining a more thorough understanding of this complex regulatory network is vital to determining which genes and cell types play pivotal roles in RA, thus helping to identify key pathways that drive RA and enable stratification of patients into groups based on the causative pathways. Here, we describe the state of genetic research to date, envisaging a better understanding of the pathogenesis of RA.

The current status of RA genetics

Studies investigating the correlation between variations in human genome sequences and RA case–control phenotypes have identified a number of genetic variants associated with RA susceptibility. Here, we briefly review the history of RA genetics research (table 1). The first RA risk locus was identified around 1980, and this research elucidated the role of HLA-DRB1 alleles in the major histocompatibility complex (MHC) locus.6 In the early 2000s, the International HapMap Project consolidated the map of human genome sequence variations in multiple populations,7 which enabled the unbiased genome-wide screening of genetic variants (mostly represented as single nucleotide polymorphisms (SNPs)) associated with human phenotypes.8 In a visionary early GWAS of RA,9 10 PADI4 was identified as an initial non-MHC RA risk locus in the Japanese population.9 Subsequently, large-scale GWAS using commercial microarrays were conducted for a wide range of human complex traits, including autoimmune diseases such as RA.11–13

Table 1

Overview of the history of RA genetics discoveries

While early RA GWASs were conducted separately for each cohort, association signals in the RA risk loci were largely replicated among the multiple cohorts,14 15 suggesting that meta-analyses of multiple GWASs would increase the statistical power. By imputing in silico the SNPs that were not genotyped in the GWAS data, using an independent reference panel of high-density SNPs,7 16 genotype data for a unified set of millions of genome-wide SNPs can be obtained and then used for GWAS meta-analysis.17 Since 2010, several collaborative efforts have been initiated with the goal of organising data from multiple RA GWASs, and meta-analyses of these data identified a number of RA risk genes.18–20 These initiatives also contributed to the construction of a reliable network within the community of RA genetics researchers.
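To illustrate why pooling cohorts increases power, the following is a minimal sketch of a fixed-effect, inverse-variance-weighted meta-analysis for a single SNP. The text does not state which meta-analysis model the cited studies used, so the choice of model, the function name `fixed_effect_meta` and all numbers are assumptions made for illustration only.

```python
import math

def fixed_effect_meta(betas, ses):
    """Pool per-cohort log odds ratios for one SNP by inverse-variance weighting.

    betas: per-cohort effect estimates (log odds ratios)
    ses:   per-cohort standard errors of those estimates
    Returns (pooled_beta, pooled_se, z, two_sided_p).
    """
    if len(betas) != len(ses) or not betas:
        raise ValueError("betas and ses must be non-empty and of equal length")
    # Each cohort is weighted by the inverse of its variance.
    weights = [1.0 / (se ** 2) for se in ses]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    # The pooled standard error shrinks as cohorts are added.
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    # Two-sided p value under a standard normal null distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z, p

# Hypothetical per-cohort estimates for the same SNP (not real data).
cohort_betas = [0.10, 0.12, 0.08]
cohort_ses = [0.05, 0.04, 0.06]
pooled_beta, pooled_se, z, p = fixed_effect_meta(cohort_betas, cohort_ses)
print(pooled_beta, pooled_se, z, p)
```

In this sketch the pooled standard error is always smaller than that of any single cohort, which is the sense in which a meta-analysis gains statistical power over separate per-cohort analyses.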

The accumulation of RA GWAS meta-analysis results in each of these populations provided evidence of a shared genetic background among patients with RA in different populations. The researchers found multiple genetic loci that confer significant RA risk in multiple ancestries (eg, reidentification of PADI4 as an RA risk locus in Europeans)20 and its interaction with population-specific environmental factors.21 Further, numerous genome-wide SNPs weakly but significantly shared disease risk between the populations.19 22 It was also suggested that single-population GWAS were mostly underpowered, and transethnic data integration was warranted to gain statistical power to identify the trait-associated loci and to unveil hidden disease aetiology. These observations provided strong motivation for a transethnic study to integrate multiple populations.18–25 Thus, we conducted an initial transethnic RA GWAS meta-analysis involving >100 000 subjects from European and Asian populations.4 This study identified >100 RA risk genetic loci, demonstrating the value of human transethnic GWAS. Figure 1 shows a list of presumed RA risk genes identified to date, grouped according to chromosomal position. This plot helps us visually grasp the ‘landscape’ of RA genetics; that is, the studies in European and Asian populations both substantially contributed to the identification of RA risk genes. However, we must note that more recent transethnic studies have focused only on Europeans and Asians (mostly from East Asia), and coverage of worldwide populations remains limited. Moreover, the current transethnic GWAS are still underpowered to dissect the overall genetic architecture of human complex traits.26 Future efforts should therefore integrate other ethnicities.27–30

Figure 1

A current catalogue of rheumatoid arthritis (RA) risk gene loci. A list of the 106 RA risk gene loci identified to date along with a Manhattan plot of the transethnic genome-wide association study (GWAS) meta-analysis of RA.4 18–20 22–25 The significance of each single nucleotide polymorphism (SNP) in the RA GWAS is indicated on a logarithmic scale (X-axes). RA risk genes initially identified in the European, Asian and transethnic studies are coloured red, blue and green, respectively.

Another approach to expand understanding of RA genetics is the use of cross-trait analyses to identify genetic correlations with other human complex traits. It has been reported that genetic variants associated with RA likely also confer risk to other diseases and traits (ie, pleiotropy).31 These include autoimmune diseases, allergic diseases, biomarkers (eg, neutrophil count and C-reactive protein) and cancers.4 An interesting GWAS meta-analysis examining RA and coeliac disease identified a number of loci with pleiotropic effects.32 A recently developed method, polygenic analysis, enables evaluation of not only the top-associated variants but also numerous genome-wide SNPs with relatively small effect sizes.33 Specifically, the introduction of the polygenic risk score (PRS)34 and linkage disequilibrium score regression (LDSC)35 methods enabled quantification of genetic correlations among different phenotypes. PRS calculates a disease risk score for each subject included in the GWAS by integrating the genotype with the corresponding susceptibility risks of genome-wide SNPs. The LDSC method does not assess an individual’s genotype and instead uses summary statistics of genome-wide SNPs in the GWAS results (ie, ORs and p values) and estimates shared genetic backgrounds between phenotypes and their cell-type specificity. Application of the PRS and LDSC methods pointed to a negative genetic correlation between RA and schizophrenia,36 37 which could partly explain the relatively low comorbidity between these two diseases previously highlighted by epidemiologic studies.38
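The PRS calculation described above can be sketched as a weighted sum of risk-allele counts. This is only an illustrative form, assuming an additive model in which each SNP is weighted by the natural logarithm of its GWAS odds ratio; the function name `polygenic_risk_score` and the example values are hypothetical and are not taken from the cited studies.

```python
import math

def polygenic_risk_score(dosages, odds_ratios):
    """Additive polygenic risk score for one subject.

    dosages:     risk-allele counts per SNP (0, 1 or 2, or imputed fractional dosages)
    odds_ratios: per-SNP odds ratios for the risk allele, taken from GWAS results
    """
    if len(dosages) != len(odds_ratios):
        raise ValueError("dosages and odds_ratios must have the same length")
    # Each SNP contributes its dosage multiplied by its log odds ratio.
    return sum(d * math.log(odds) for d, odds in zip(dosages, odds_ratios))

# Hypothetical subject carrying 1, 2 and 0 risk alleles at three SNPs.
example_score = polygenic_risk_score([1, 2, 0], [1.2, 1.1, 1.5])
print(example_score)
```

A subject carrying no risk alleles scores zero under this sketch. In practice, SNP selection, correction for linkage disequilibrium and score standardisation also matter, and none of these is modelled here.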

MHC complex: genetics and biology

The MHC region on chromosome 6 confers a distinctive and strong genetic risk when compared with other RA risk loci, explaining 30%–50% of the total genetic risk of RA.39 Within the MHC, a class II classical human leucocyte antigen (HLA) gene, HLA-DRB1, explains the majority of RA risk. Which combinations of HLA-DRB1 variants could best explain disease risk has been a long-standing subject of debate. In 1987, the shared epitope (SE) hypothesis was introduced to show the risk associated with the specific amino acid sequence at positions 70–74 of HLA-DRβ1.6 Although the SE hypothesis was supported by data from populations around the world, there was controversy regarding the RA risk associated with non-SE HLA alleles (eg, HLA-DRB1*09:01 in Asians).40 GWAS have shown that ACPA-positive and ACPA-negative RA differ dramatically in their genetic backgrounds, analogous to the heterogeneity in their clinical manifestations, and that this difference in association signals is most apparent at the MHC region.41

The recent development of an HLA imputation method led to fine-mapping of genetic risks within the MHC for a variety of immune-related diseases.42 43 Similar to SNP genotype imputation, the HLA imputation method computationally imputes unobserved HLA gene variants from neighbouring GWAS SNP genotypes in the MHC and an imputation reference panel (figure 2A). By applying HLA imputation, one can assess both alleles and amino acid polymorphisms of all HLA genes included in the reference panel for all samples with available GWAS data.43

Figure 2

Rheumatoid arthritis (RA) genetic risk in the major histocompatibility complex (MHC) region revealed using the human leucocyte antigen (HLA) imputation method. Illustration of the roles of the MHC region and HLA genes in RA genetics. (A) The MHC region at 6p21 harbours numerous immune-related genes, including HLA genes. One can computationally estimate genotypes of the HLA variants using the HLA imputation method without any additional cost other than that associated with single nucleotide polymorphism (SNP) microarray typing. (B) Amino acid polymorphisms at specific positions in the classical HLA genes confer risk of RA (eg, positions 11 and 13 at HLA-DRβ1), which are generally shared among multiple populations (Asians (ASN), Europeans (EUR) and Africans (AFR)). It is interesting that different residues at the same amino acid positions confer differential risk of anti-citrullinated protein antibody (ACPA)-positive and ACPA-negative RA. (C) Dosage change in the non-classical HLA gene also confers risk of ACPA-positive RA.

Application of HLA imputation to large-scale RA GWAS data produced several interesting findings. (1) Most of the risk of ACPA-positive RA could be explained by amino acid polymorphisms at positions 11 and 13 of HLA-DRβ1, rather than at the well-known positions 71 and 74 as implicated by the SE hypothesis (figure 2B).44 Residues at HLA-DRβ1 positions 11 and 13 tag several SE alleles such as HLA-DRB1*01 and *04,45 implying that the amino acid model could be interpreted as an extension of the SE hypothesis. (2) Although the MHC-associated genetic risks of ACPA-positive and ACPA-negative RA were found to be heterogeneous, they could be explained by the same HLA-DRβ1 amino acid positions but different risk-conferring residues.46 This may suggest that autoimmune responses other than ACPA contribute to ACPA-negative RA. (3) Risk HLA variants were found to be shared among populations more than expected (eg, Europeans, Asians and African Americans), which closed the debate regarding risk and ethnic heterogeneity in HLA alleles.45 47 (4) In addition to HLA-DRB1, amino acid polymorphisms in other classical HLA genes, such as HLA-DPB1, HLA-B and HLA-A, confer risk for ACPA-positive RA.44–46 (5) Finally, a coding variant in a non-classical HLA gene (HLA-DOA) that alters the gene’s expression level also confers risk for ACPA-positive RA (figure 2C).48

Next-generation sequencing (NGS) technology represents a promising tool for use in future MHC fine-mapping studies.49 Current imputation reference panels include limited numbers of classical HLA genes. However, the MHC includes a number of HLA-related genes, including non-classical HLA genes (eg, HLA-E/F/G), HLA-like genes (eg, MICA) and pseudo-HLA genes (eg, HLA-DRB6), as well as key immunity-related genes (eg, TNF and C4A-C4B). NGS-based approaches could identify variants with higher resolution, thus warranting their incorporation in reference panels.

What can we learn from genetics of RA?

Missense variants, which alter the amino acid sequence of a coding gene, are common functional variants that can be pathogenic. The most important and well-characterised missense risk variant in RA may be PTPN22 R620W (1858C→T) although this risk variant is extremely rare in east Asian populations.50 PTPN22 encodes a protein tyrosine phosphatase that is expressed in haematopoietic cells and acts as a negative regulator of antigen receptor signalling in T and B cells.51 The risk allele 620W is a gain-of-function variant, as both TCR and BCR signalling are reduced in cells of risk allele carriers.52 53 This attenuation in antigen receptor signalling affects the clonal selection of lymphocytes and the appearance of autoreactive cells.54 Reduced TCR signalling resulting in impaired regulatory function has also been observed in regulatory T cells.55 Moreover, the PTPN22 variant has also been associated with reduced TLR7 signalling in plasmacytoid dendritic cells56 as well as hypercitrullination in peripheral blood mononuclear cells.57 Thus, the variant’s effect essentially depends on cellular context, and these compound effects in multiple cell types may contribute at each step of pathogenesis. Surprisingly, knock-in mice with the corresponding allele (Ptpn22 619W) exhibit enhanced antigen receptor signalling,58 perhaps due to enhanced degradation of the Ptpn22 619W product.59 The phenotype is contrary to that of human lymphocytes but similar to that of lymphocytes in Ptpn22-deficient mice, indicating that Ptpn22 619W is a loss-of-function variant in mice.

Although functional analysis of missense variants is straightforward, only ~20% of GWAS RA loci encompass coding variants.4 In the majority of GWAS loci, as described above, disease-causing variants regulate gene expression. Indeed, ~50% of RA risk SNPs colocalise with expression quantitative trait loci (eQTL) found in peripheral blood mononuclear cells.4 eQTL are defined as genetic variants that alter the expression levels or splicing patterns of a specific gene. Missense variants in the risk loci are sometimes not causal but simply tag the true causal variants having eQTL effects. Therefore, it is rational to integrate data from GWAS and eQTL studies to elucidate the disease mechanism. This approach enables connection of a GWAS variant to a responsible gene and identification of the responsible cell types, as observed eQTL effects are sometimes cell-type specific (figure 3A). Furthermore, eQTL variants can also indicate the direction of gene regulation (figure 3B). For example, upregulation of eQTL risk genes such as STAT4 60 and CCR6 61 has been linked to upregulated production of inflammatory cytokines in patients. Because the individual effects of eQTL variants are small, one study attempted to examine their overall effect on the disease by combining whole data sets from a GWAS and an eQTL study.62 This analysis evaluated the effects of multiple variants on inflammatory cytokine pathways in each lymphocyte subset and demonstrated that polygenic effects of eQTL genes upregulate the tumour necrosis factor (TNF)-α pathway in CD4+ T cells.
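As a rough illustration of how an eQTL effect and its direction can be estimated, the sketch below regresses a gene's expression level on risk-allele dosage by ordinary least squares. It assumes a simple additive model without covariates; the function names and values are hypothetical, and real eQTL studies use more elaborate models.

```python
def eqtl_slope(dosages, expression):
    """Least-squares slope of expression on risk-allele dosage (additive model)."""
    n = len(dosages)
    if n != len(expression) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_g = sum(dosages) / n
    mean_e = sum(expression) / n
    # Sum of squared deviations of the genotype dosages.
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    if sxx == 0:
        raise ValueError("no genotype variation among samples")
    # Sum of cross-products between genotype and expression deviations.
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(dosages, expression))
    return sxy / sxx

def regulation_direction(slope):
    """Label the direction in which the risk allele shifts expression."""
    if slope > 0:
        return "upregulated"
    if slope < 0:
        return "downregulated"
    return "no effect"

# Hypothetical samples: risk-allele dosage and expression of one gene in one cell type.
sample_dosages = [0, 1, 2, 0, 1, 2]
sample_expression = [1.0, 1.5, 2.0, 1.0, 1.5, 2.0]
slope = eqtl_slope(sample_dosages, sample_expression)
print(slope, regulation_direction(slope))
```

Running the same regression separately on expression data from different cell types is one simple way to see whether an effect appears only in some cell types.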

Figure 3

Integration of GWAS and eQTL data enhances understanding of the disease mechanism. (A) Identification of disease-associated genes and cells in the GWAS loci using eQTL data. In this case, the eQTL SNP, which is in linkage disequilibrium with the GWAS SNP, has an eQTL effect for gene B (the responsible gene) in CD4+ T cells (the responsible cells). (B) Polygenic effects of disease-associated eQTL genes on disease-related cells and pathways. Upregulated genes are eQTL genes whose expression is upregulated in the risk allele and could be targeted by an antagonist (vice versa for downregulated genes). eQTL, expression quantitative trait loci; GWAS, genome-wide association studies; IFN, interferon; IL, interleukin; SNP, single nucleotide polymorphism; TNF, tumour necrosis factor.

While researchers believe that disease risk variants are responsible for the heterogeneous aetiology of RA, the use of genetic data in the prediction of clinical phenotype is challenging. Previous GWAS examining the response to biologics (mainly anti-TNF therapy) provided unsatisfactory evidence,63–67 which may suggest that genetic background of RA onset and that of clinical response are distinct. A crowd-sourced collaborative assessment of SNP data to predict anti-TNF treatment response was performed.68 However, no significant genetic contribution towards prediction accuracy has been obtained.

Further genetic assignment of RA susceptibility

A surprising GWAS finding is that approximately 80% of RA risk variants occur in non-coding regions (figure 4A).4 This non-coding location, coupled with data demonstrating that associated SNPs are often highly correlated with a large number of other variants, means that GWAS alone have not conclusively identified a causal gene or causal variant.

Figure 4

Roles of coding and non-coding risk variants in the genetics of RA. (A) Of the >100 RA risk variants, only 20% are attributable to coding variants such as PTPN22 (R620W) and others.4 To understand the roles of the remaining 80% of the risk variants, we need to integrate RA GWAS results with a variety of omics layer information constructed using the latest technologies (eg, eQTL study involving RNA-seq for exploring gene expression profiles). (B) Demonstration of how omics data can be incorporated into GWAS findings to give an insight into the causal variant and gene. The lead GWAS single nucleotide polymorphism (SNP) (purple dot, row 1) is highly correlated with a number of other SNPs (red dots, row 1), each equally likely to be causal. These SNPs are in an open region of chromatin (yellow, row 2), marked by modified histones (blue, orange and green stars, row 2). The region is open (DNaseHS peaks (purple), row 3), and flanked by modified histones (blue peaks, row 4). This open region (yellow, row 2) is interacting (CHiC, pink bars, row 6) with an expressed gene (pink peaks, row 5). This is not found in a different cell type (row 7). This provides evidence for the gene and cell type implicated by the disease GWAS findings. CHiC, Capture Hi-C; ChIP, chromatin immunoprecipitation; eQTL, expression quantitative trait loci; GWAS, genome-wide association studies; RA, rheumatoid arthritis.

Studies are now well under way to provide this missing link between SNP, gene and mechanism.69 As discussed, eQTL studies can correlate a specific associated allele with gene expression or isoform splicing pattern, linking GWAS variants to causal genes with implication in functional aetiology, in particular cell types and stimulatory conditions. In addition, epigenetics and molecular biology are now routinely employed to annotate putative GWAS causal SNPs to active DNA and genes. Resources such as Encyclopedia of DNA Elements (ENCODE),70 Roadmap71 and Blueprint72 use techniques such as DNaseHS and ATAC-seq to map open active DNA in a range of cell types. These technologies distinguish active, open chromatin from the inactive, closed form because open chromatin is more readily cut (DNaseHS) or more readily accessible (ATAC-seq), creating sequencing libraries enriched for open DNA. Histone modifications, markers for gene regulators, can be mapped in different cell types using chromatin immunoprecipitation (ChIP). Histones are either methylated or acetylated when demarcating chromatin states. Antibodies against these marks of activation, for example, H3K4me1 or H3K27ac, are therefore used on fixed DNA to precipitate out active histones and associated DNA. Sequencing of these regions (ChIP-seq) can identify all active regions within different cell types and states. Interaction between disease-associated variants and their target genes, often situated a long distance apart in a linear view of the chromosome, can be mapped by chromosome conformation technologies, such as 3C, Hi-C and Capture Hi-C73 74 (figure 4B). Here the DNA is first fixed within the nucleus in its three-dimensional (3D) conformation. The DNA is then cut and religated, such that regions that are close together spatially are allowed to join together. Sequencing these reformed regions gives an indication of how close the DNA was in 3D space, such that enhancers are linked to genes through physical interaction. In this way, using both public data and data generated by individual laboratories, it is possible to determine the associated SNPs that are in active regions, the cell types in which this occurs, the gene that is interacting with the region and the effect on expression (figure 4B). For example, GWAS variants have implicated T cell involvement via their physical location in two ways: they are proximal to genes important in T cell immunity, such as HLA, PTPN22, STAT4, TRAF1 and IL2RA; and studies have shown that RA genetic variants are enriched in regions that are open and active in T cells.69
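One step of the annotation described above, determining which associated SNPs fall within open chromatin in a given cell type, can be sketched as an interval lookup. This is a minimal sketch assuming that peaks on a single chromosome are given as non-overlapping half-open intervals; the function name, coordinates and cell-type labels are hypothetical and do not come from ENCODE, Roadmap or Blueprint.

```python
from bisect import bisect_right

def snps_in_peaks(snp_positions, peaks):
    """Return the SNP positions that lie inside any open-chromatin peak.

    snp_positions: base-pair positions on one chromosome
    peaks:         (start, end) half-open intervals on the same chromosome,
                   assumed not to overlap one another
    """
    ordered = sorted(peaks)
    starts = [start for start, _ in ordered]
    hits = []
    for pos in snp_positions:
        # Index of the last peak whose start is at or before this position.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < ordered[i][1]:
            hits.append(pos)
    return hits

# Hypothetical candidate SNPs in one locus and peaks from two cell types.
candidate_snps = [50, 100, 150, 200, 550, 700]
peaks_by_cell_type = {
    "cell type A": [(100, 200), (500, 600)],
    "cell type B": [(650, 750)],
}
for cell_type, peaks in peaks_by_cell_type.items():
    print(cell_type, snps_in_peaks(candidate_snps, peaks))
```

Comparing the hits across cell types gives a first indication of the cell types in which a risk locus may be active, which can then be followed up with histone mark, chromosome conformation and expression data.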

Further validation is now being provided by studies employing genetic engineering techniques, specifically CRISPR technology, to directly perturb the DNA regions associated with diseases and measure downstream consequences on cell phenotype. In this way, we are beginning to understand the molecular consequences of carrying a disease risk variant, whether it upregulates or downregulates a gene or pathway, whether this is dependent on cell type or stimulation, and the ultimate consequences on immune function.75


As described above, tremendous progress has been made in understanding the genetics of RA. However, we have yet to fully elucidate the pathogenic mechanism of RA. Genetics-based drug response prediction is currently challenging, in part because the responsible variants and genes are not fully understood. Therefore, we need to move forward applying the various approaches to identify the missing links.

In addition, the problem of missing heritability remains to be solved. GWAS are based on the ‘common disease, common variant’ hypothesis, which posits that common diseases are attributable to allelic variants present in more than 1%–5% of a population. Rare variants may also contribute to missing heritability, but few examples have been identified thus far. Data from a GWAS of rare variants are difficult to analyse statistically because of the variants' low frequency, even when their effects are strong. To understand which risk variants are associated with pathogenesis, statistical associations between genetic variation and disease must be linked to functionality and causality. In addition, the accumulation of data from ongoing whole-genome sequencing studies could overcome these problems in the near future.



  • Handling editor Josef S Smolen

  • Contributors All authors (YO, SE, AS, YK and KY) wrote a part of the manuscript and YO and KY have integrated them. All have reviewed and approved the final manuscript.

  • Funding This study was funded by Japan Society for the Promotion of Science (grant number 18H05285), Japan Agency for Medical Research and Development (grant number 15545436) and Takeda Pharmaceutical.

  • Competing interests KY has received honoraria or research grants from AbbVie, Astellas, AYUMI, Bristol-Myers Squibb, Chugai, Eisai, Janssen, Mitsubishi Tanabe, Ono and UCB. YK has received research grants from Takeda Pharm, Chugai Pharm and Pfizer.

  • Patient consent for publication Not required.

  • Ethics approval RIKEN Ethical Committee.

  • Provenance and peer review Commissioned; externally peer reviewed.