Three articles1, 2, 3 in the present issue of the journal expand on an emerging theme in autoimmunity genetics, the overlap in genetic effects of common variants in disparate diseases. The articles also raise other common issues in our approach to both autoimmune genetics and the genetics of other complex diseases: overlap in cases and controls, population stratification, correction for multiple comparisons (thresholds for significance) and combined versus individual publications.

First, the practical issue of overlap between studies. Multiple studies in autoimmune diseases conducted over the last year have used case and control sets that in part overlap. This phenomenon, while not new to type 1 diabetes (T1DM), where international efforts have led to widely available patient samples, has been extended to recent high-profile studies including rheumatoid arthritis (RA),4, 5, 6, 7 systemic lupus erythematosus (SLE),8, 9 and in the present issue of the journal, multiple sclerosis (MS). It is often difficult to combine such studies, even when there is open communication among research groups, because of the practical necessity of appropriately distributing credit to individual investigators and groups. However, such overlap affects interpretation of the results: the concordance of results is not an independent confirmation, and it may be difficult for outside groups to combine results in, for example, meta-analyses. If the genotyping results can be combined (that is, they use the same allele genotyping definitions), the statistical issues posed by this problem can be readily addressed. While combined single publications may be preferable for journals and readers in the scientific community, we believe that the present reports are worthy of three individual articles, because they highlight and report results from different populations and different overlaps in genetic etiology among autoimmune diseases.

The very large overlap between the IMSGC study2 and that from Hafler et al.1 (see Table 1 footnote in IMSGC study) indicates that the results of these two studies cannot be considered as replicates. In fact, the combined results for the two groups showed only a marginal gain in significance for the CD226 SNP (single-nucleotide polymorphism) rs763361 (overall P-value 1.1E−08 compared to 5.4E−08 in the IMSGC study2) (S Sawcer, personal communication). In contrast, the two groups (IMSGC2 and Zoledziewska et al.3) reporting the association of CLEC16A polymorphisms did not have overlapping subjects. For these studies, the cumulative results were assessed by combining P-values, as the SNPs analyzed did not overlap. The combined P-value of 5.0E−19 using Fisher's method (S Sawcer, personal communication) does provide additional confidence in this association. Another approach that can be taken for combining results from independent studies that use different SNPs is to infer SNP genotypes that are common to both studies,28 but this may be difficult when a unique population (in this case the Sardinian population) may have different undefined haplotypes, making imputation problematic.
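As an illustration of how P-values from non-overlapping studies can be combined, the following is a minimal sketch of Fisher's method, which assumes that the contributing tests are independent (hence its unsuitability for studies that share subjects). The function name and the input values are hypothetical and are not the P-values from the studies discussed here.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent P-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null hypothesis. For an even
    number of degrees of freedom the upper tail has a closed form, so no
    external statistics library is needed.
    """
    if not p_values:
        raise ValueError("at least one P-value is required")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("P-values must lie in (0, 1]")

    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # this is X / 2

    # Upper tail of chi-square with 2k df:
    #   exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total


# Hypothetical P-values from two studies with no shared subjects.
combined = fisher_combined_p([1e-6, 1e-4])
print(f"combined P-value: {combined:.2e}")
```

Because the method treats each input as independent evidence, applying it to studies with overlapping cases or controls would overstate the combined significance.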

Table 1 Partial list of genetic associations implicated in multiple autoimmune diseases

Two of the present studies raise another issue regarding overlap: the use of a single control group for two different disease studies. In the studies reported by Hafler et al.1 and the IMSGC,2 the population control subjects overlap with those used in the study of T1DM. Similarly, use of the same common controls may be a partial concern in evaluating the strength of association between STAT4 and primary Sjögren's syndrome, where the controls overlapped with those used in RA and SLE studies.29 These studies raise the question of how we can adjust our thresholds for significant results to account for overlaps in which the controls for multiple studies are not independent. This issue may become increasingly problematic as large genotyping sets become publicly available and are used in many studies. In fact, at present several thousand population sets are potentially available from multiple sources (for example, iControlDB (http://www.illumina.com/pages.ilmn?ID=231) and dbGaP (http://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?login=&page=login)). If the population groups are large enough then perhaps the issue might become moot, as the genotype frequencies may accurately reflect the population and have little fluctuation when issues of population structure and substructure are addressed. For this approach to be valid, however, studies should begin to adopt a common means for scoring individuals according to genetic background. For example, the recent publication of Nelson et al.30 described the development of a large collection of controls that identified eight eigenvectors for defining major ethnic ancestries as well as more minor ancestry, such as the north–south European cline. Variation within Europe can be further subdivided,31 and ongoing studies are most likely to provide additional reference populations and ancestry marker sets when genome-wide studies are not being performed.32 In the short term, some care will be necessary in considering these issues.
If comparison of shared and non-shared control population genotypes shows no substantive difference, there is some assurance that the results do not simply reflect a demographic or genotyping artifact among certain control subject collections. Thankfully, for the present studies, similar results were obtained from multiple independent sample groups, lending support to the conclusion that the findings are real.
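One simple way to compare shared and non-shared control collections at a given SNP is a chi-square test on allele counts. The sketch below is illustrative only: the function name and the counts are hypothetical, and it uses a plain Pearson chi-square with one degree of freedom, without any adjustment for population structure.

```python
import math


def allele_chi_square(a1, a2, b1, b2):
    """Pearson chi-square (1 df) for a 2x2 table of allele counts.

    a1, a2: counts of allele 1 and allele 2 in the shared controls
    b1, b2: counts of allele 1 and allele 2 in the non-shared controls
    Returns (statistic, p_value).
    """
    n = a1 + a2 + b1 + b2
    row_a, row_b = a1 + a2, b1 + b2
    col_1, col_2 = a1 + b1, a2 + b2
    if min(row_a, row_b, col_1, col_2) == 0:
        raise ValueError("every row and column must contain at least one allele")

    stat = n * (a1 * b2 - a2 * b1) ** 2 / (row_a * row_b * col_1 * col_2)
    # Upper tail of chi-square with 1 df expressed via the complementary
    # error function: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical allele counts at one SNP in two control collections.
stat, p = allele_chi_square(620, 1380, 905, 2095)
print(f"chi-square = {stat:.3f}, P = {p:.3f}")
```

A non-significant result at the associated SNPs is reassuring but not conclusive; in practice such a comparison would be made across many markers and alongside an assessment of ancestry.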

Another analytic issue raised by these studies is the often discussed question of appropriate statistical thresholds. For a ‘candidate’ SNP study, what is a reasonable P-value? For example, how do we assess the P-value of 6.7E−5 reported in the Sardinian study? The HapMap consortium33 found that within each 500 kb region, there were the equivalent of 150 independent allele-based tests in Caucasian and Asian populations and about 350 independent tests in Yorubans. Correcting for the number of independent tests leads to thresholds of around 5E−8 for Caucasian and Asian populations and 2.5E−8 for Yorubans. These highly conservative thresholds may be appropriate for genome-wide analyses but seem excessively conservative when there is excellent motivation to test specific SNPs. Adjusting for the total number of SNPs that have been tested would be one easily applied alternative approach to limit excess false positives, but extracting the total number of SNPs that have been tested from investigators can be difficult, in part because of shifting priorities in laboratories. This problem is further compounded by studies performed by multiple different laboratories and the bias towards only reporting positive results, which particularly affects smaller studies.34 Applying a false-discovery paradigm is an alternative approach, which seems appealing, although this approach has not yet been widely adopted,35, 36 and studies should report false discovery rates along with significance levels. Another alternative approach would be to weight the previous information and obtain a posterior probability of association allowing for the cost of false-negative and false-positive discoveries, using a Bayesian approach.37 This approach has the advantage of incorporating uncertainty about the reliability of previous information.
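The arithmetic behind the genome-wide thresholds, and a false-discovery alternative, can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in any of the cited studies: the genome length of 3.0E9 bp and the significance level of 0.05 are assumptions not given in the text, the function names are hypothetical, and the false-discovery step uses the Benjamini-Hochberg procedure as one common choice.

```python
def genome_wide_threshold(tests_per_region, region_bp=500_000,
                          genome_bp=3.0e9, alpha=0.05):
    """Bonferroni-style threshold from a count of independent tests per region.

    genome_bp and alpha are assumed values, not taken from the text.
    """
    n_tests = tests_per_region * genome_bp / region_bp
    return alpha / n_tests


def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of tests declared significant at false discovery rate q.

    Step-up rule: find the largest rank r (1-based, P-values sorted in
    ascending order) with p_(r) <= q * r / m, then reject the r smallest.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= q * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# 150 and 350 independent tests per 500 kb, as quoted above.
print(f"{genome_wide_threshold(150):.1e}")
print(f"{genome_wide_threshold(350):.1e}")

# Hypothetical P-values from a small candidate-SNP panel.
panel = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(panel, q=0.05))
```

Under these assumptions the first calculation gives roughly 5.6E−8 and 2.4E−8, in line with the approximate thresholds quoted above. The false-discovery step depends on the full set of P-values tested, which is why knowing the total number of SNPs examined matters for either approach.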

Finally, how does the overlap in susceptibility alleles in different autoimmune diseases define risk or insight into the pathogenesis of myriad diseases? A wealth of studies has clearly shown that allelic variations at the same or closely linked loci are critical genetic risk factors for multiple autoimmune diseases. A partial list of the more cogent overlaps for non-major histocompatibility ‘genes’ is shown in Table 1. For some, the same haplotype or even same putative causative amino acid variation appears to be implicated (for example, haplotypes for STAT4 in SLE and RA, and the PTPN22 Arg620Trp variation in T1DM and RA). However, for other implicated genes (or small genomic intervals) the story is much less clear. This includes TNFAIP3, where it appears that at least one component of the RA risk is different from the SLE risk haplotype.17, 18, 19

The shared risk factors appear to have unique overlaps between different autoimmune diseases. For example, the PTPN22 Arg620Trp variant that is shared between several autoimmune diseases, including T1DM and RA, has been shown not to be a risk factor in MS, whereas in the present studies CD226 and CLEC16A variations are shared risk factors between MS and T1DM. Interpreting these relationships may also be further complicated by ethnic differences in the frequency and/or risk of particular variants for these diseases.

Moving forward, we will need a much clearer definition of the risk associated with each of the genetic variations in different autoimmune diseases within different population groups. This information can also further draw on the emerging integration of expression information and molecular pathways. An understanding of how genetic variation affects these pathways will most likely provide a clearer picture of the pathophysiology of these complex diseases. In this regard, a recent review providing a diagram of a potential synthesis of SLE and RA molecular mechanisms/pathways may be instructive.23 Arguably, even the modest effect (amount of genetic variation explained) of many of the emerging genetic risk factors will be critical in our understanding of the etiopathogenesis of these diseases. Some caution is of course needed, as in many cases the actual gene(s) affected by specific sequence or haplotype variations is not yet clear. This may be the case for the STAT4 association in RA and SLE, where the present paucity of functional information has not excluded the possibility that the associated sequence variation could be affecting the transcription of the closely linked STAT1 gene, even though the responsible haplotype resides within the STAT4 genomic region.

The present studies in the journal provide additional information that may facilitate an understanding of the intersection of molecular pathways resulting in MS and T1DM. However, at present the limited understanding of the role of the CD226 and CLEC16A variants precludes strong speculations. For CD226, the Gly307Ser mutation can explain the present associations and may allow focused experimental studies to determine the altered mechanism in T cells, B cells or natural killer cells that presumably predisposes individuals to MS and/or T1DM. For CLEC16A, the lack of a clear functionally relevant variation does not at present exclude the possibility that the functional SNP(s) could, in possible analogy to the situation discussed for STAT4, be important in the regulation of the closely located MHC2TA gene. Thus, the difficult step of defining the functional mechanisms by which the variations modify risk is a critical bottleneck in developing cogent hypotheses to explain the common roles of specific gene variations in immunity. However, the common genes for autoimmune diseases will most likely provide important insights into the complex interactions that result in aberrant immunologic activity causing autoimmune disease, and this is most likely to be a recurring theme in Genes and Immunity.