Introduction

Population-based association studies provide a powerful approach to the identification of susceptibility genes underlying common diseases.1, 2 A very large amount of information about genetic variants in the human genome has been accumulated through the International Human Genome Sequencing Project and the International HapMap Project.3, 4, 5, 6 Combined with the establishment of high-throughput single-nucleotide polymorphism (SNP) typing systems, genome-wide association (GWA) studies have been widely applied.7 Accordingly, gene–disease associations have been reported.

Replication studies were extensively implemented to establish the credibility of the initial positive findings. However, comprehensive reviews of the published literature in the era of the candidate gene approach show that most of the initial positive associations were not reproduced in the subsequent replication studies.8, 9, 10, 11, 12, 13 These findings suggest that a large number of the original findings were false-positive reports; another possibility is that most of the studies were underpowered to detect small genetic effects.8, 9 Furthermore, inconsistency or between-study heterogeneity in the results of genetic association studies can be observed regardless of whether the associations are true or not,10, 14 and it may be attributed to population stratification, genotyping errors, differences in the pattern of linkage disequilibrium (LD) structure and other factors.15, 16 In the era of GWA studies, this problem remains one of the most difficult issues in genetic association studies.10, 15, 16 For example, a large-scale international study of Parkinson's disease failed to replicate 13 SNPs identified by a previous GWA study.17

In these circumstances, meta-analysis can be a useful tool to combine both statistically significant and nonsignificant results from individual studies on the same research question. In case–control studies, the odds ratios (ORs) from individual studies are combined to calculate a summary OR. Meta-analysis improves the estimation of a summary OR and 95% confidence interval (CI) and increases the statistical power to detect gene–disease associations.18 Therefore, conclusions from a meta-analysis are more robust than those from a single small study. In addition, meta-analysis is useful for investigating the consistency or heterogeneity of the associations across studies. Testing for and quantifying between-study heterogeneity is an important aim of meta-analyses, as it helps to determine whether there are differences underlying the results of the studies.19, 20 Addressing the observed between-study heterogeneity could generate new insight into the gene–disease association.20

In this review, we begin by describing the process of meta-analysis of genetic association studies. We then briefly review the statistical background, methodological issues and sources of between-study heterogeneity in meta-analysis of genetic association studies. Finally, we present the results of our simulation study to illustrate the effect of between-study heterogeneity on the conclusions of meta-analyses.

Literature-based meta-analysis

In a basic meta-analysis, data are retrospectively collected from the published literature to assess whether a gene–disease association of interest is true or not.18 When planning a meta-analysis, it is important to define a precise search strategy beforehand.21 If relevant studies are excluded or inadequate studies are included, the conclusions of the meta-analysis may be biased.22 The literature search is conducted in databases such as PubMed and EMBASE. The HuGe Published Literature database (http://www.cdc.gov/genomics/hugenet/) is also useful, as it includes published literature on genetic associations and other human genome epidemiology.23 It is important to collect the largest possible number of studies; therefore, appropriate key words should be used. Once the search has been completed, the bibliographies of retrieved articles should be examined for further relevant publications.

These processes make up an essential part of the methods section of a meta-analysis, because literature-based meta-analysis is subject to bias caused by the difficulty of identifying and including all conducted and relevant studies,13, 24 and small differences in the selected literature may alter the conclusions of meta-analyses on the same genetic association.25 However, the essential features of the search strategy have not been fully reported in most meta-analyses of genetic association studies.26 To avoid such biases, it is recommended that two or more different researchers conduct the same search.21 When conducting and reporting a literature-based meta-analysis, a flowchart detailing the exclusion and inclusion criteria and the number of studies excluded and included at each step of the literature search is useful (Figure 1).

Figure 1

Flowchart detailing the exclusion and inclusion criteria and the number of studies excluded and included at each step of the literature search.

Meta-analysis of genetic association studies may be subject to publication bias.18, 26 Publication bias tends to occur when small studies showing negative or nonsignificant results remain unpublished, and it may result in overestimation of the genetic effect. If statistical tests suggest the presence of publication bias,27, 28 conclusions from the meta-analysis should be reported cautiously and the potential impact of the publication bias should be mentioned.18

The results obtained from the meta-analysis should be assessed on the following: (i) the size of the summary OR; (ii) the extent and possible cause of between-study heterogeneity; and (iii) the sufficiency and stability of the meta-analysis, evaluated using the cumulative and recursive cumulative meta-analysis approaches.29, 30, 31 In the cumulative meta-analysis, studies are sorted chronologically and the summary OR is recalculated each time a new study is added.29 As a result, we can present how the summary OR has shifted over time. The recursive cumulative meta-analysis is an extension of the cumulative meta-analysis, in which the relative change in the summary OR caused by adding a new study is evaluated.30, 31
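The cumulative and recursive cumulative approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `(year, log_or, variance)` tuple layout, the function names and the example numbers are hypothetical, inverse-variance fixed effects pooling is assumed for the summary OR, and the relative change is taken to be the ratio of consecutive summary ORs.

```python
import math

def cumulative_summary_ors(studies):
    """Return [(year, summary OR)] after adding each study in chronological order.

    studies: iterable of (year, log_or, variance) tuples (hypothetical layout).
    Pooling uses inverse-variance weights, which is an assumption of this sketch.
    """
    sum_w = 0.0
    sum_w_theta = 0.0
    trajectory = []
    for year, log_or, variance in sorted(studies, key=lambda s: s[0]):
        weight = 1.0 / variance
        sum_w += weight
        sum_w_theta += weight * log_or
        trajectory.append((year, math.exp(sum_w_theta / sum_w)))
    return trajectory

def relative_changes(trajectory):
    """Ratio of each cumulative summary OR to the previous one."""
    ors = [summary_or for _, summary_or in trajectory]
    return [later / earlier for earlier, later in zip(ors, ors[1:])]

# Made-up studies: a strong, imprecise first report followed by larger replications
example = [
    (2003, math.log(1.1), 0.02),
    (2001, math.log(2.5), 0.20),
    (2002, math.log(1.3), 0.05),
]
trajectory = cumulative_summary_ors(example)
changes = relative_changes(trajectory)
```

Relative changes that stay close to 1 as studies accumulate would suggest that the summary OR has stabilized.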

Consortium-based meta-analysis

Consortium-based meta-analysis is the meta-analysis of individual patient data through the collaboration of a consortium of investigators. Consortium-based meta-analysis has attracted increasing attention,32, 33, 34 because the integration of several GWA data sets has been carried out and new susceptibility genes have been discovered.35, 36, 37, 38, 39 Although meta-analysis of GWA studies can be implemented using reported ORs and 95% CIs or P-values from different GWA studies, it is preferable to reanalyze several GWA data sets with individual patient data.35 In the latter case, one can use imputation techniques for missing data when SNPs have been genotyped on some platforms but not on others.40 Barrett et al.39 used imputation methods to conduct a meta-analysis of three GWA data sets for Crohn's disease that had been genotyped on different platforms. The combined GWA data sets included 635 547 SNPs in 3230 cases and 4829 controls. They used the GWA data sets at the screening stage. The power of the meta-analysis to detect associations with a per-allele OR of 1.2 and a risk allele frequency of 0.2 at the significance level of $P = 1.0 \times 10^{-5}$ was reported to be 0.74. The meta-analysis of the GWA data sets and additional replication data sets confirmed 11 previously reported loci and identified genome-wide significant signals for 21 novel loci.

Genetic association study-specific methodological issues

There are methodological issues relevant to meta-analysis of genetic association studies: (i) assessment of Hardy–Weinberg equilibrium (HWE) and (ii) definition of genetic models.

Testing for deviation from HWE in control samples is the most commonly used check for genotyping error.41 However, the test for HWE has relatively low statistical power to detect genotyping error.42 Furthermore, SNPs that are not in HWE can be used for inference about the genetic model of disease susceptibility at the locus.43 Although there is no consensus on how meta-analyses should handle studies that are not in HWE, three strategies have been applied: including all studies regardless of departure from HWE;44 performing sensitivity analyses to evaluate whether the genetic effects differ between subgroups of studies classified according to the test for HWE;26, 45, 46, 47 and excluding studies showing statistically significant departure from HWE.18 Reporting the extent of departure from HWE, as measured by indices such as α,48 the inbreeding coefficient49 and the disequilibrium parameter,50 is also useful.44
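As an illustration of the HWE assessment, the following minimal sketch computes the usual one-degree-of-freedom chi-square goodness-of-fit statistic and the inbreeding coefficient from control genotype counts. The function name and the counts are hypothetical; the sketch assumes a biallelic SNP with both alleles present, and it is not necessarily the exact test used in any of the cited studies.

```python
import math

def hwe_test(n_AA, n_Aa, n_aa):
    """Chi-square test for HWE (1 df) plus the inbreeding coefficient.

    Raises ZeroDivisionError if the sample is monomorphic (one allele absent).
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square distribution with 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    # Inbreeding coefficient: 1 - observed / expected heterozygotes
    inbreeding = 1.0 - n_Aa / expected[1]
    return chi2, p_value, inbreeding

# Hypothetical control genotype counts (AA, Aa, aa)
chi2, p_value, inbreeding = hwe_test(30, 410, 1560)
```

A positive inbreeding coefficient indicates a deficit of heterozygotes relative to HWE expectations, and a negative value indicates an excess.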

In a genetic association study, subjects are classified into three exposure groups (AA, Aa and aa). Let A be the susceptibility allele. There are several methods of dichotomizing these exposure groups for conducting a meta-analysis:26 comparing allele frequencies, assuming a specific mode of inheritance (recessive, dominant, complete overdominant or codominant) and performing multiple pairwise comparisons. All these methods, with the exception of multiple pairwise comparisons, assume a particular genetic model. When performing multiple pairwise comparisons or testing multiple genetic models, the results of all analyses undertaken should be reported. To choose the genetic model most likely to describe the genetic architecture underlying a disease of interest, Minelli et al.51 presented a ‘genetic model free’ approach. Their procedure is based on estimating the ratio (λ) of the log OR of Aa versus aa to the log OR of AA versus aa. λ is 0 under a recessive model, 0.5 under a codominant model and 1 under a dominant model.
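To make the meaning of λ concrete, the sketch below computes a crude plug-in value of the ratio from the genotype counts of a single study. This is only an illustration of the definition: it is not the estimation procedure of Minelli et al., the function name and counts are hypothetical, and the ratio is undefined when the AA versus aa log OR is zero.

```python
import math

def lambda_ratio(cases, controls):
    """Plug-in ratio of log OR(Aa vs aa) to log OR(AA vs aa).

    cases, controls: genotype counts as (n_AA, n_Aa, n_aa).
    """
    case_AA, case_Aa, case_aa = cases
    ctrl_AA, ctrl_Aa, ctrl_aa = controls
    log_or_het = math.log((case_Aa * ctrl_aa) / (case_aa * ctrl_Aa))
    log_or_hom = math.log((case_AA * ctrl_aa) / (case_aa * ctrl_AA))
    return log_or_het / log_or_hom

# Hypothetical counts chosen so that OR(Aa vs aa) = 2 and OR(AA vs aa) = 4
lam = lambda_ratio(cases=(40, 20, 10), controls=(10, 10, 10))
```

With these counts the heterozygote log OR is half the homozygote log OR, which corresponds to the codominant case described above.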

Estimation of a summary OR and test for and measure of between-study heterogeneity

The statistical methods for combining the results of different studies are described below. We consider a meta-analysis of k separate genetic association studies to estimate the genetic effect ($\theta$) for a dichotomous disease outcome, quantified by the log OR. Let $\theta_i$ and $\hat{\theta}_i$ be the true and observed log OR for the ith case–control study, respectively (i=1, … ,k). Let $v_i$ denote the variance of $\hat{\theta}_i$; the weight for the ith study is given by $w_i = 1/v_i$ (that is, the inverse of the variance). The OR for each study is given by $\mathrm{OR}_i = a_i d_i/(b_i c_i)$, so that $\hat{\theta}_i = \log(\mathrm{OR}_i)$. $v_i$ is defined as $v_i = 1/a_i + 1/b_i + 1/c_i + 1/d_i$, where $a_i$ and $b_i$ correspond to the numbers of affected individuals with and without the susceptible genotype, respectively, and $c_i$ and $d_i$ correspond to the numbers of unaffected individuals with and without the susceptible genotype, respectively.

There are two commonly used procedures for combining $\hat{\theta}_i$: the ‘fixed effects model’ (FEM) and the ‘random effects model’ (REM). FEM assumes that the $\theta_i$ are homogeneous across studies (that is, $\theta_1 = \theta_2 = \dots = \theta_k$) and that all differences are due to chance. The inverse-variance, Mantel-Haenszel52 and Peto's53 methods are commonly used for FEM meta-analysis. Using the inverse-variance method for combining the results across studies, a summary log OR under FEM is calculated as a weighted average of the study estimates: $\hat{\theta}_{FEM} = \sum_{i=1}^{k} w_i \hat{\theta}_i / \sum_{i=1}^{k} w_i$. The variance of $\hat{\theta}_{FEM}$ is given by $v_{FEM} = 1/\sum_{i=1}^{k} w_i$.
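The inverse-variance fixed effects calculation can be sketched as follows, assuming the standard weighted-average form of the summary log OR with variance equal to the reciprocal of the summed weights. The function names and the 2×2 tables are hypothetical, and every cell count is assumed to be non-zero.

```python
import math

def log_or_and_variance(a, b, c, d):
    """Log OR and its variance from a 2x2 table.

    a, b: cases with / without the susceptible genotype
    c, d: controls with / without the susceptible genotype
    """
    log_or = math.log((a * d) / (b * c))
    variance = 1 / a + 1 / b + 1 / c + 1 / d
    return log_or, variance

def fixed_effects(log_ors, variances):
    """Inverse-variance fixed effects summary: (summary log OR, its variance)."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    estimate = sum(w * t for w, t in zip(weights, log_ors)) / total
    return estimate, 1.0 / total

# Hypothetical 2x2 tables (a, b, c, d) for three studies
tables = [(60, 140, 40, 160), (120, 280, 90, 310), (30, 70, 25, 75)]
stats = [log_or_and_variance(*t) for t in tables]
theta_fem, v_fem = fixed_effects([s[0] for s in stats], [s[1] for s in stats])

summary_or = math.exp(theta_fem)
half_width = 1.96 * math.sqrt(v_fem)
ci = (math.exp(theta_fem - half_width), math.exp(theta_fem + half_width))
```

The summary is computed on the log scale and exponentiated at the end, so the CI is asymmetric around the summary OR.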

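The assumption underlying FEM should be examined with the test for heterogeneity, Cochran's Q test.54 The test statistic of Cochran's Q test is

$$Q = \sum_{i=1}^{k} w_i \left(\hat{\theta}_i - \hat{\theta}_{FEM}\right)^2,$$

where $\hat{\theta}_{FEM}$ is the summary log OR under FEM.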

Under the null hypothesis of homogeneity (that is, $\theta_1 = \theta_2 = \dots = \theta_k$), this statistic approximately follows a $\chi^2$ distribution with k−1 degrees of freedom. Cochran's Q test has relatively low statistical power to detect between-study heterogeneity, especially when the number of studies is small;55 therefore, the test is usually performed at the significance level of 0.1.56
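A minimal sketch of Cochran's Q test is given below, taking Q as the weighted sum of squared deviations of the study log ORs from the inverse-variance summary. To stay within the Python standard library, the chi-square tail probability is computed with a simple series for the incomplete gamma function, which is adequate for moderate values of Q; the function names and input values are hypothetical.

```python
import math

def cochran_q(log_ors, variances):
    """Weighted sum of squared deviations from the fixed effects summary."""
    weights = [1.0 / v for v in variances]
    theta_fem = sum(w * t for w, t in zip(weights, log_ors)) / sum(weights)
    return sum(w * (t - theta_fem) ** 2 for w, t in zip(weights, log_ors))

def chi2_sf(x, df):
    """Upper tail probability of a chi-square distribution.

    Uses the power series of the regularized lower incomplete gamma function;
    intended for moderate x only (very large x would overflow).
    """
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= z / (a + n)
        total += term
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

# Hypothetical study-level log ORs and variances
log_ors = [0.10, 0.35, 0.60, 0.20]
variances = [0.02, 0.03, 0.05, 0.04]
q = cochran_q(log_ors, variances)
p_value = chi2_sf(q, df=len(log_ors) - 1)
heterogeneous = p_value < 0.1  # conventional threshold for this test
```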

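REM assumes that the genetic effects may vary across studies because of genuine differences and/or differential biases. The estimate of the between-study variance ($\tau^2$) is incorporated into the weight as $w_i^* = 1/(v_i + \hat{\tau}^2)$. A summary log OR under REM is estimated as follows: $\hat{\theta}_{REM} = \sum_{i=1}^{k} w_i^* \hat{\theta}_i / \sum_{i=1}^{k} w_i^*$. The variance of $\hat{\theta}_{REM}$ is approximated as $v_{REM} = 1/\sum_{i=1}^{k} w_i^*$.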

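In the DerSimonian and Laird57 REM meta-analysis, $\tau^2$ is estimated as follows:

$$\hat{\tau}^2 = \frac{Q - (k-1)}{\sum_{i=1}^{k} w_i - \sum_{i=1}^{k} w_i^2 \big/ \sum_{i=1}^{k} w_i}.$$

When $Q < k-1$, $\hat{\tau}^2$ takes a negative value. In practice, $\max\{0, \hat{\tau}^2\}$ is used. Therefore, the precision of a summary log OR with REM ($1/v_{REM}$) can never exceed that with FEM ($1/v_{FEM}$).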
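The DerSimonian and Laird procedure can be sketched as below, using the standard moment estimator of the between-study variance truncated at zero and weights equal to the reciprocal of the within-study variance plus that estimate. The function name and the input values are hypothetical.

```python
import math

def dersimonian_laird(log_ors, variances):
    """Random effects summary: (summary log OR, its variance, tau^2 estimate)."""
    k = len(log_ors)
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    theta_fem = sum(wi * ti for wi, ti in zip(w, log_ors)) / sum_w
    q = sum(wi * (ti - theta_fem) ** 2 for wi, ti in zip(w, log_ors))
    scale = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / scale)  # truncate negative estimates at 0
    w_star = [1.0 / (v + tau2) for v in variances]
    sum_w_star = sum(w_star)
    theta_rem = sum(wi * ti for wi, ti in zip(w_star, log_ors)) / sum_w_star
    return theta_rem, 1.0 / sum_w_star, tau2

# Hypothetical study-level log ORs and variances
log_ors = [0.10, 0.35, 0.60, 0.20]
variances = [0.02, 0.03, 0.05, 0.04]
theta_rem, v_rem, tau2 = dersimonian_laird(log_ors, variances)

z = theta_rem / math.sqrt(v_rem)  # test statistic for the genetic effect
ci = (theta_rem - 1.96 * math.sqrt(v_rem), theta_rem + 1.96 * math.sqrt(v_rem))
```

Because the truncated estimate is never negative, each random effects weight is at most the corresponding fixed effects weight, which is why the REM variance cannot be smaller than the FEM variance.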

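The 95% CI for $\theta$ is given by $\hat{\theta} \pm 1.96\sqrt{v}$, where $(\hat{\theta}, v)$ denotes $(\hat{\theta}_{FEM}, v_{FEM})$ under FEM or $(\hat{\theta}_{REM}, v_{REM})$ under REM. The test statistic for the genetic effect is given by $Z = \hat{\theta}/\sqrt{v}$. Under the null hypothesis ($\theta = 0$), Z follows a standard normal distribution.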

Higgins and Thompson58 proposed three measures of heterogeneity (H, R and $I^2$), which have the following desirable characteristics: (i) dependence on the extent of heterogeneity, (ii) scale invariance (that is, comparisons can be made across meta-analyses with different scales and different outcomes) and (iii) size invariance (that is, independence of the number of studies included). $H^2 = Q/(k-1)$ is the relative excess of Q over its degrees of freedom. Mittlböck and Heinzl59 proposed $H_M^2 = (Q-(k-1))/(k-1)$ as a modification of H. $H_M^2$ is the ratio of between-study variance to within-study variance. In practice, $\max\{0, H_M^2\}$ is used. $H_M^2$ values over 1.0 indicate considerable heterogeneity.59 $R = \sqrt{v_{REM}/v_{FEM}}$ is the ratio of the standard error of a summary effect with REM to the standard error with FEM. R represents the inflation of the CI for REM compared with FEM. H and R coincide when all studies have equal weight.58 $I^2 = (Q-(k-1))/Q$. $I^2$ can take a negative value, but $\max\{0, I^2\}$ is used in practice. $I^2$ represents the proportion of between-study variance in the total variation in study estimates and ranges from 0 to 100%. $I^2$ is the most widely used measure of heterogeneity. $I^2$ values over 50% indicate large heterogeneity.58, 60 A potential drawback of $I^2$ is that its CIs are very wide, especially when the number of studies is small.61
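The heterogeneity measures can be computed directly from Q, the number of studies and the two summary variances. The sketch below assumes the usual definitions (H squared as Q over its degrees of freedom, the Mittlböck and Heinzl modification as the excess of Q over its degrees of freedom divided by the degrees of freedom, I squared as that excess divided by Q, and R squared as the ratio of the REM to the FEM variance); the function name, dictionary keys and input values are hypothetical.

```python
def heterogeneity_measures(q, k, v_fem, v_rem, truncate=True):
    """Return H^2, H_M^2, I^2 (as a fraction) and R^2.

    truncate=True applies max{0, .} to H_M^2 and I^2, as done in practice;
    truncate=False returns the unrestricted values.
    """
    df = k - 1
    hm2 = (q - df) / df
    i2 = (q - df) / q if q > 0 else 0.0
    if truncate:
        hm2, i2 = max(0.0, hm2), max(0.0, i2)
    return {"H2": q / df, "HM2": hm2, "I2": i2, "R2": v_rem / v_fem}

# Hypothetical inputs: Q = 12 from k = 5 studies
measures = heterogeneity_measures(q=12.0, k=5, v_fem=0.010, v_rem=0.025)
i2_percent = 100.0 * measures["I2"]
```

Note that an H_M^2 above 1.0 and an I^2 above 50% are the same condition under these definitions, since both hold exactly when Q exceeds twice its degrees of freedom.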

If heterogeneity is present or suspected by the statistical test or measures, there are several commonly used approaches: (i) performing sensitivity analysis by excluding one or more studies showing outlier effect size, (ii) stratifying the studies into homogeneous subgroups such as racial groups and applying FEM for each subgroup and (iii) implementing REM when observed heterogeneity could not be addressed. Some researchers recommend that the use of REM is preferable compared with FEM, because both models give similar summary effects when there is no between-study heterogeneity, FEM gives narrower CI for summary effect compared with REM when between-study heterogeneity exists and a negative result of test for heterogeneity does not always indicate homogeneity when the number of studies is small.25

Source of heterogeneity

A number of reasons have been advanced for heterogeneity in the genetic effects across the results of various studies.8, 13, 14, 47 False-positive results in the initial studies and false-negative results in small replication studies are implicated as the most likely reasons for non-replication.8, 9, 10, 13, 14 Inconsistency and between-study heterogeneity may be caused by biases or by genuine differences in the genetic effects across populations. We briefly review both in the following sections.

Biases

Differential biases due to population stratification, misclassification of clinical outcome, genotyping error and overestimation of genetic effect in the first study can be sources of between-study heterogeneity.

The presence of population stratification tends to produce spurious associations. It arises when there are undetected genetically different subgroups within a study population and disease prevalence differs among these subgroups.11, 62 The effect of population stratification on the results of genetic association studies is debated.62, 63, 64, 65, 66 According to systematic reviews of meta-analyses of genetic association studies, differences between racial or ethnic groups only infrequently explain heterogeneity.9, 67

Inadequate assignment of cases and controls may cause misclassification bias. Although there is a possibility that misclassification of cases and controls would weaken the gene–disease association, the results of misclassification bias may be modest unless the trait is common.13, 32

Ioannidis et al.10 conducted a systematic review of 36 meta-analyses including a total of 370 genetic association studies. Statistically significant between-study heterogeneity was observed in 14 meta-analyses. Restricting to meta-analyses with at least 15 studies, 7 of 9 meta-analyses showed significant heterogeneity. In 25 or 26 meta-analyses, the first study showed a more predisposing or protective OR than the subsequent replication studies. Using cumulative meta-analysis plots, the authors depicted how strong associations claimed in the first study regressed toward the null as subsequent replication studies accumulated over time. Similar findings were reported by Lohmueller et al.9 Associations passing predetermined thresholds of statistical significance tend to overestimate the size of the genetic effect, especially when the sample size of the study is small and the threshold is stringent in multiple testing situations.68, 69, 70, 71, 72, 73, 74 Such an upward bias is called the winner's curse phenomenon.9, 69

Genuine differences

Differences in the pattern of LD structure over chromosomal regions of interest across populations are implicated as a cause of between-study heterogeneity in the genetic effects. Zondervan and Cardon75 show that marker allelic OR can vary according to the extent of LD between marker and true disease allele in terms of D′ and according to mismatch between disease allele frequency and marker allele frequency. This issue may be especially pronounced in the GWA settings because the SNPs that most efficiently surrogate the other SNPs in a genomic region with high LD (that is, tag SNPs) rather than putative functional SNPs have been used to increase genome coverage. When the extent of LD between tag SNP and true disease allele varies across studied populations, the observed ORs could vary across studies.

Many common diseases are thought to have a complex etiology involving multiple genetic and environmental factors, including their interactions. Gene–disease associations can be modified when gene–gene or gene–environment interactions exist. If these interactions are not identified and controlled for, the gene–disease associations will be heterogeneous across populations according to the distribution of a genetic variant or the prevalence of a particular environmental exposure. Large-scale consortium-based meta-analysis of individual patient data is needed to account for gene–gene or gene–environment interactions.47

Simulation study

We conducted a simulation study to illustrate (i) the power of Cochran's Q test, (ii) the properties of measures of between-study heterogeneity ($I^2$ and $H_M^2$) and (iii) the type I error rate and the power of meta-analysis for detecting the gene–disease association in the presence of between-study heterogeneity.

We consider a meta-analysis of k case–control association studies to estimate the overall genetic effect ($\theta$; log OR) on disease outcome. The exposure status (AA, Aa and aa) of the subjects included in each case–control study is ascertained in the sampling manner outlined below.70 The values $y \in \{1, 0\}$ are labels encoding case (1) or control (0). Let A denote the susceptibility allele. We assume the dominant model, so the SNP genotype predictor x is coded as 1 for AA or Aa and 0 for aa. Under the assumption of HWE, the frequency of x, written as $f_x$, is calculated from the disease allele frequency $f_A$: $f_1 = 1-(1-f_A)^2$. The logistic regression model for the ith study (i = 1, 2, . . . , k) is as follows:

$$P(y=1 \mid x) = \frac{\exp(\alpha_i + \theta_i x)}{1+\exp(\alpha_i + \theta_i x)},$$

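where $\alpha_i$ is the intercept and $\theta_i$ is the log OR for the ith study. $\theta_i$ is drawn from $N(\theta, \tau^2)$, where $\tau^2$ is the between-study variance. $\alpha_i$ can be calculated by using the equation for the prevalence of the disease, $\pi = \sum_{x} f_x P(y=1 \mid x)$. The genotypes of case and control subjects are generated based on the conditional probabilities of x given y as follows:

$$P(x \mid y=1) = \frac{f_x P(y=1 \mid x)}{\pi}, \qquad P(x \mid y=0) = \frac{f_x \{1 - P(y=1 \mid x)\}}{1-\pi}.$$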

For each study, the genotypes of case–control samples were generated and then the OR and its variance were calculated. Then, the ORs for the k studies were combined by FEM and REM meta-analyses. Cochran's Q test was conducted and $I^2$ and $H_M^2$ were calculated.
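The generation of a single simulated study can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the reported simulations: it assumes a logistic disease model with a dominant coding, solves for the intercept by bisection so that the model prevalence matches the target, and obtains the carrier probabilities in cases and controls by Bayes' rule. All function names, the random seed and the sample sizes (1000 cases and 1000 controls) are hypothetical.

```python
import math
import random

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def solve_intercept(theta_i, f1, prevalence):
    """Bisection for the intercept that reproduces the target prevalence."""
    lo, hi = -30.0, 30.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        model_prev = f1 * expit(mid + theta_i) + (1.0 - f1) * expit(mid)
        if model_prev < prevalence:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def carrier_probabilities(theta_i, f_a, prevalence):
    """P(x=1 | case) and P(x=1 | control) under the dominant model and HWE."""
    f1 = 1.0 - (1.0 - f_a) ** 2
    alpha_i = solve_intercept(theta_i, f1, prevalence)
    risk1, risk0 = expit(alpha_i + theta_i), expit(alpha_i)
    p_case = f1 * risk1 / (f1 * risk1 + (1.0 - f1) * risk0)
    p_control = f1 * (1.0 - risk1) / (f1 * (1.0 - risk1) + (1.0 - f1) * (1.0 - risk0))
    return p_case, p_control

def simulate_study(theta, tau2, f_a, prevalence, n_cases, n_controls, rng):
    """Return (log OR, variance) for one simulated case-control study."""
    theta_i = rng.gauss(theta, math.sqrt(tau2))  # study-specific true log OR
    p_case, p_control = carrier_probabilities(theta_i, f_a, prevalence)
    a = sum(rng.random() < p_case for _ in range(n_cases))
    c = sum(rng.random() < p_control for _ in range(n_controls))
    b, d = n_cases - a, n_controls - c
    # Assumes no empty cell, which is very unlikely at these sample sizes
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d

rng = random.Random(2024)
log_or, variance = simulate_study(math.log(1.4), 0.005, 0.3, 0.01, 1000, 1000, rng)
```

Repeating `simulate_study` k times and passing the results to a pooling routine gives one simulated meta-analysis.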

We considered five simple simulation scenarios of meta-analyses, which are described in Table 1. Scenarios I, II and III were designed to have the same sample size within each study but different numbers of included studies. In scenarios III, IV and V, the numbers of studies differed but the total number of case–control samples included in the meta-analysis was fixed at 20 000. The pairs of scenarios I and V and II and IV were designed to have the same number of studies but to differ in sample size within each study.

Table 1 Description of five simulation scenarios of meta-analysis

We examined 126 parameter combinations for each scenario. The between-study variance ($\tau^2$) varied from 0.0 to 0.02 in increments of 0.001. The true summary OR ($\exp(\theta)$) was set to 1.0, 1.4 or 2.0. The disease allele frequency $f_A$ was set to 0.1 or 0.3. The disease prevalence $\pi$ was fixed at 0.01. The values of $\tau^2$ were based on the values reported by Moonesinghe et al.76 for the 10 confirmed loci in a meta-analysis of three GWA studies of type 2 diabetes.77 Therefore, our simulation should reflect a plausible range of between-study variance. For each scenario and parameter combination, 100 000 simulations were carried out.
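The count of 126 combinations follows from 21 values of the between-study variance, three true ORs and two allele frequencies. A short sketch of how such a grid could be enumerated is shown below; the variable names are hypothetical.

```python
from itertools import product

tau2_values = [round(0.001 * step, 3) for step in range(21)]  # 0.000, 0.001, ..., 0.020
odds_ratios = [1.0, 1.4, 2.0]
allele_freqs = [0.1, 0.3]

# Every (tau^2, true OR, allele frequency) combination examined per scenario
grid = list(product(tau2_values, odds_ratios, allele_freqs))
n_combinations = len(grid)  # 21 * 3 * 2
```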

The empirical power of Cochran's Q test was evaluated as the proportion of the simulation runs that were significant at the level of 0.1 when $\tau^2 > 0$. The top row of Figure 2 shows the power of Cochran's Q test for the five scenarios as a function of $\tau^2$ when the overall OR=1.0 and $f_A$=0.1 or 0.3. For each scenario, the power increased as $\tau^2$ increased. Comparing scenarios I, II and III, the power increased as the number of studies increased. When the total number of case–control samples was fixed (that is, comparing scenarios III, IV and V), the powers were similar, but scenarios with a smaller number of studies showed higher power when $\tau^2$ was small. When the numbers of studies were identical (that is, in the two pairwise comparisons of scenarios I versus V and II versus IV), meta-analyses with larger sample sizes showed higher power for the same $\tau^2$. The powers obtained with $f_A$=0.3 were higher than those with $f_A$=0.1. For most of our parameter settings, the power of Cochran's Q test did not reach 0.8, even though the significance level was set to 0.10.

Figure 2

Behavior of the test for and measures of between-study heterogeneity for the five simulation scenarios as a function of $\tau^2$, with the disease allele frequency $f_A$=0.1 or 0.3 and the overall odds ratio (OR)=1.0. The top row shows the power of Cochran's Q test at the significance level of 0.1. The middle and bottom rows show the means of $I^2$ and $H_M^2$, respectively. The lines of $H_M^2$ for scenarios I, II and III overlap. The description of each simulation scenario is in Table 1.

The means of 100 000 simulated values for the measures of heterogeneity ($I^2$ and $H_M^2$) are shown as a function of $\tau^2$ when the overall OR=1.0 and $f_A$=0.1 or 0.3 (the middle and bottom rows of Figure 2). In practice, $\max\{0, I^2\}$ and $\max\{0, H_M^2\}$ are used to restrict these measures to non-negative values. As in the simulation study of Mittlböck and Heinzl,59 unrestricted values of $I^2$ and $H_M^2$ were used in this study to obtain unbiased distributions for these measures. Both measures increased monotonically as $\tau^2$ increased. $I^2$ and $H_M^2$ increased as the sample size per study increased (scenarios I versus V or II versus IV). The two measures obtained with $f_A$=0.3 were higher than those with $f_A$=0.1. These results indicate that $I^2$ and $H_M^2$ increased as the within-study variance, $v_i$, decreased. Comparing scenarios I, II and III shows an important difference between $I^2$ and $H_M^2$: whereas $I^2$ increased as the number of studies increased, $H_M^2$ did not change (the lines of $H_M^2$ for scenarios I, II and III overlap in the bottom row of Figure 2). This suggests that $H_M^2$ may be a good indicator for comparing the extent of between-study heterogeneity across meta-analyses. Similar results and further discussion are provided by Mittlböck and Heinzl.59 The 95% intervals of simulated $I^2$ and $H_M^2$ were wide, especially when the number of studies was small (Supplementary Figure S1).

The type I error rate in meta-analysis was assessed as the proportion of the simulation runs showing a significant summary OR at the significance level of 0.05 when the null hypothesis was true (that is, the true overall OR=1.0). Figure 3 shows the type I error rates of the five scenarios when $f_A$=0.1 or 0.3. When there was no between-study variance ($\tau^2$=0.0), the type I error rates under FEM were well controlled at 0.05, but REM showed slightly conservative results (type I error rate≈0.04). As $\tau^2$ increased, the type I error rates under FEM rose rapidly, whereas those under REM increased only slightly. The type I error rates under both models for the same $\tau^2$ increased when the sample size per study was large or $f_A$=0.3. We should note that the use of FEM could inflate the type I error rate even when the between-study heterogeneity is too small to be clearly identified by Cochran's Q test and the two measures $I^2$ and $H_M^2$. For example, when $\tau^2$=0.005 and $f_A$=0.3, the type I error rates under FEM for the five scenarios were 8.5–19.2% (Figure 3). For this parameter setting, the powers of Cochran's Q test were 20.6–48.3%, the means of $I^2$ were −51.9 to 20.8% and the means of $H_M^2$ were 0.31–1.25 (Figure 2).

Figure 3

The type I error rate in fixed effects model (FEM) and random effects model (REM) meta-analyses at the significance level of 0.05 for the five scenarios as a function of $\tau^2$, with the disease allele frequency $f_A$=0.1 or 0.3. The top and bottom rows show the type I error rates when applying FEM and REM, respectively. The lines of the type I error rate under FEM for scenarios I, II and III overlap. The description of each simulation scenario is in Table 1.

The power of detecting a gene–disease association was evaluated as the proportion of simulation runs reaching the significance level of $5.7 \times 10^{-7}$, assuming a consortium-based meta-analysis of GWA data sets. As shown in Figure 3, applying FEM meta-analysis to heterogeneous genetic associations could lead to false-positive findings; therefore, we considered only REM when assessing the power of meta-analysis. The top row of Figure 4 shows the results, assuming the dominant model and $f_A$=0.1 or 0.3. When the true overall OR=1.4, the power for each scenario gradually decreased as $\tau^2$ increased. Comparing scenarios III, IV and V, the decreases in the power for the same $\tau^2$ were larger in the scenarios with a large sample size per study. While the values of $v_{FEM}$ for scenarios III, IV and V did not differ, the values of $v_{REM}$ for scenarios III, IV and V varied when between-study heterogeneity was present. For the same $\tau^2$ (>0), the following inequality held: $v_{REM}$ for scenario V > $v_{REM}$ for scenario IV > $v_{REM}$ for scenario III. When $\theta \neq 0$, the mean of the distribution of the Z statistic under REM is $\theta/\sqrt{v_{REM}}$. The power of detecting a gene–disease association of effect size $\theta$ is78

$$\mathrm{Power} = 1 - \Phi\left(C_{\alpha/2} - \frac{\theta}{\sqrt{v_{REM}}}\right) + \Phi\left(-C_{\alpha/2} - \frac{\theta}{\sqrt{v_{REM}}}\right),$$

where $\Phi$ is the cumulative distribution function of the standard normal distribution and $C_{\alpha/2}$ is the upper $\alpha/2$ percentage point of the standard normal distribution. Given the inequality described above, the decrease in the power for the same $\tau^2$ is larger in the scenarios with a large sample size per study when the total sample sizes are equal across scenarios. When the overall OR was set to 2.0, the power decreased little over the simulated range of $\tau^2$. Furthermore, we calculated the mean OR of the simulations passing the genome-wide significance threshold (P-value < $5.7 \times 10^{-7}$). The estimates of the mean OR were upwardly biased, especially in scenarios whose power to detect gene–disease associations was low (the bottom row of Figure 4). On the other hand, if the meta-analyses were sufficiently powered (for example, the true overall OR=2.0), upward biases were not pronounced in the simulated range of $\tau^2$.
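The dependence of power on the REM variance can be illustrated numerically. The sketch below assumes the standard two-sided normal approximation, in which the Z statistic has mean equal to the true log OR divided by the square root of the REM variance and unit variance; the function name and the two variance values are hypothetical and are not taken from the simulation scenarios.

```python
import math
from statistics import NormalDist

def rem_power(theta, v_rem, alpha):
    """Two-sided power of the Z test for a true log OR theta under REM."""
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1.0 - alpha / 2.0)  # upper alpha/2 point
    shift = theta / math.sqrt(v_rem)                   # mean of Z under theta
    return 1.0 - std_normal.cdf(critical - shift) + std_normal.cdf(-critical - shift)

# Same true OR and genome-wide threshold, two hypothetical REM variances
theta = math.log(1.4)
power_small_variance = rem_power(theta, v_rem=0.002, alpha=5.7e-7)
power_large_variance = rem_power(theta, v_rem=0.004, alpha=5.7e-7)
```

Doubling the REM variance in this example reduces the power substantially, which mirrors the effect of a larger between-study variance contribution in meta-analyses with fewer, larger studies.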

Figure 4

Simulated power of random effects model (REM) meta-analyses to detect a gene–disease association at the significance level of $5.7 \times 10^{-7}$ (the top row) and the mean odds ratio (OR) of the simulations passing the threshold (the bottom row) as a function of $\tau^2$, with the disease allele frequency $f_A$=0.1 or 0.3 and the overall OR=1.4 or 2.0. When the overall OR=2.0, the lines of the powers for scenarios II, III and IV overlap. The description of each simulation scenario is in Table 1.

Our simulation suggests that the power of meta-analysis of GWA data sets to detect a small genetic effect would decrease due to between-study heterogeneity ($\tau^2 \le 0.02$). As a result, the discovered gene–disease association could have an inflated effect size (the winner's curse phenomenon). The winner's curse phenomenon can be seen even when the between-study heterogeneity is too small to be clearly identified. Similar results were obtained when different genetic models (that is, recessive models and models additive on the log-odds scale) were examined (data not shown).

Conclusion

We reviewed the process and the methods of meta-analysis of genetic association studies. To conduct and report a transparent meta-analysis, the search strategy, the inclusion or exclusion criteria of studies and the statistical procedures should be fully described. Assessment of HWE and determination of genetic model are methodological issues relevant to meta-analysis of genetic association studies.

In genetic association studies of common diseases, the effect sizes of consistently replicated gene–disease associations were found to be small (OR=1.2–1.5);15 therefore, meta-analysis of GWA data sets is the most important approach to increasing the power to detect such gene–disease associations.35

Our simulation shows that the power of REM meta-analysis of GWA data sets to detect a small genetic effect could decrease due to between-study heterogeneity, and that the mean OR of the simulated meta-analyses that pass the genome-wide significance threshold would then be upwardly biased. Recently, Moonesinghe et al.76 showed that the sample size required in a meta-analysis to detect an overall association with adequate power at a given significance level increases as between-study heterogeneity increases and that, when the between-study heterogeneity exceeds a threshold, the meta-analysis cannot reach the required power regardless of how large the included studies are. At the same time, empirical evaluation of published meta-analyses61 and our simulation study show that the uncertainty of estimated between-study heterogeneity is large unless many studies are combined.

These findings suggest that when a meta-analysis of GWA data sets shows association signals reaching genome-wide significance with small between-study heterogeneity, the result should be cautiously reported and further replication studies by institutions other than GWA teams are required.35 Moreover, when a large number of data sets are available, challenges to explain and reduce the observed between-study heterogeneity may become important.74, 76 The knowledge about the potential causes of between-study heterogeneity may help. Such post-GWA research will enable us to map the causative variant finely79 or to detect polymorphisms associated with clinically important subtypes of diseases.80