Improved Polygenic Prediction with Multi-Ancestry and Multi-Trait GWAS Data

We use new methods and diverse sources of GWAS data to improve the accuracy of GRS when applied to African, East Asian, and South Asian populations.

Written by Elan Bechor, PhD

Overview

In this whitepaper, we describe a method to improve the performance of genetic risk scores (GRS) across multiple ancestries, including European, East Asian, South Asian, and African populations. The method leverages predictions from genetically correlated traits and genome-wide association studies (GWAS) of non-European ancestry. We replicate previous research[1][2] that used pleiotropy to improve prediction accuracy and make further gains by incorporating newly developed methods[3][4] that use ancestrally diverse GWAS data and capitalize on the diversity of linkage disequilibrium across discovery samples. By integrating these two sources of data, we improved the predictive performance of GRS for a wide range of diseases in Europeans: the effect size (log odds ratio per standard deviation of GRS) rose by an inverse-variance weighted average of 23.7% across n = 8 diseases. Consistent with other work[1], predictive performance also improved in non-Europeans, with an average relative increase in effect size of 24.8% for South Asians and 29.6% for Africans. Due to the inclusion of a large East Asian biobank, the improvement was largest among East Asians, where the gain was 53.6%. This approach promises to advance the use of genetic risk prediction in preimplantation genetic testing by providing more accurate and inclusive scores for diverse populations.

Data and Validation Cohorts

The models for a set of 8 diseases were improved by adding new data from several sources: non-European data from the Biobank of Japan[5], Finnish data from FinnGen[6], and genetically correlated traits from within the UK Biobank (discovered by examining the UK Biobank Genetic Correlation browser from the Neale Lab). We compared these improved models to the original models, which followed the standard approach of taking a large GWAS and building a polygenic score with either the PRScs[7] software or pruning and thresholding[8].

We removed samples that failed standard quality control (missing genotypes, genetic sex not matching self-reported sex, or genetic ancestry not matching self-reported ancestry). For the African/Caribbean, South Asian, and East Asian cohorts, we removed any samples that were genetically related (up to third degree) to other samples in the UK Biobank. For the White British samples, we split the dataset into two cohorts:

  • Cohort 1: Samples with no relatives (up to third degree) within the UK Biobank (n=276,471)
  • Cohort 2: Samples with one or more relatives in the UK Biobank, where one relative from each family was selected (n=58,808)
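
The split above can be illustrated with a short sketch. It assumes relatedness is available as a list of sample-ID pairs (a hypothetical input, for example derived from a kinship table) and groups related samples into families with a union-find structure. The rule for picking one relative per family is not stated in this whitepaper; the sketch keeps the smallest ID purely as a placeholder.

```python
from collections import defaultdict


def split_by_relatedness(sample_ids, related_pairs):
    """Split samples into (unrelated samples, one representative per family).

    related_pairs: iterable of (id_a, id_b) pairs judged related up to
    third degree. This input format is an assumption for illustration.
    """
    parent = {s: s for s in sample_ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge every related pair into the same family.
    for a, b in related_pairs:
        if a in parent and b in parent:
            parent[find(a)] = find(b)

    families = defaultdict(list)
    for s in sample_ids:
        families[find(s)].append(s)

    cohort1, cohort2 = [], []
    for members in families.values():
        if len(members) == 1:
            cohort1.append(members[0])
        else:
            # Placeholder rule (assumption): keep the smallest ID.
            cohort2.append(min(members))
    return sorted(cohort1), sorted(cohort2)


samples = ["s1", "s2", "s3", "s4", "s5", "s6"]
pairs = [("s2", "s3"), ("s3", "s4")]  # s2, s3, s4 form one family
cohort1, cohort2 = split_by_relatedness(samples, pairs)
print(cohort1)
print(cohort2)
```

Keeping a single member per family means the validation cohort itself contains no close relatives, even though every member has relatives elsewhere in the biobank.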

For type 2 diabetes and coronary artery disease, we trained GRS models on data that include Cohort 1 and used Cohort 2 for validation. For all other conditions, we tested on Cohort 1, since those models did not use any UK Biobank GWAS data.

Cohort | Number of samples
White British, unrelated (cohort 1) | 276,471
White British, relatives in UK Biobank (cohort 2) | 58,808
East Asian | 1,350
South Asian | 6,433
African and Caribbean | 6,415

Incorporating Genetically Correlated Traits

A number of research papers have shown that predictive performance improves when polygenic score models are combined linearly (multi-PGS), with weights determined by elastic net regularization. For example, including the PGS for schizophrenia as a feature improves the prediction of depression[1]. For two of the eight diseases, breast cancer and atrial fibrillation, we found no strongly genetically correlated traits in the UK Biobank or elsewhere.
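
As a sketch of what such a combination looks like, assuming the standard elastic net formulation for a binary outcome (the symbols K, β, λ, and α are illustrative notation, not taken from the cited papers), the multi-PGS for individual i is a weighted sum of K component scores, and the weights minimize a penalized logistic loss:

```latex
\mathrm{multiPGS}_i = \sum_{k=1}^{K} \beta_k \,\mathrm{PGS}_{ik},
\qquad
\eta_i = \beta_0 + \mathrm{multiPGS}_i,
\\[6pt]
(\hat\beta_0, \hat\beta) = \arg\min_{\beta_0,\,\beta}\;
-\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i \eta_i - \log\big(1 + e^{\eta_i}\big) \Big]
+ \lambda \Big( \alpha \lVert \beta \rVert_1 + \tfrac{1-\alpha}{2} \lVert \beta \rVert_2^2 \Big)
```

Here y_i is the case/control label, λ sets the overall penalty strength, and α mixes the lasso and ridge terms. The lasso part can set the weight of an uninformative component score to exactly zero, while the ridge part stabilizes weights among correlated scores.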

For the three psychiatric conditions (depression, bipolar disorder, schizophrenia), because each is significantly genetically correlated with the other two[9], we built a multi-PGS on all predictors for the three psychiatric disorders.

For type 2 diabetes, class III obesity, and coronary artery disease, we scanned the UK Biobank for the top 25 traits ranked by genetic covariance with each disease and trained GRS models for them using the PRScs software.
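
The selection step can be sketched as a simple ranking. The snippet below assumes the genetic covariances are already available as a mapping from trait name to value (a hypothetical input; the trait names and numbers are invented for illustration) and ranks by absolute value, which is an assumption, since the text only says "ranked by genetic covariance".

```python
def top_correlated_traits(genetic_cov, target, k=25):
    """Return up to k trait names with the largest absolute genetic
    covariance with the target disease, excluding the target itself."""
    candidates = {t: c for t, c in genetic_cov.items() if t != target}
    ranked = sorted(candidates, key=lambda t: abs(candidates[t]), reverse=True)
    return ranked[:k]


# Invented values, not results from this study.
example_cov = {
    "type_2_diabetes": 1.00,
    "bmi": 0.12,
    "hdl_cholesterol": -0.09,
    "systolic_bp": 0.05,
    "height": -0.01,
}
selected = top_correlated_traits(example_cov, "type_2_diabetes", k=2)
print(selected)
```

Ranking by absolute value keeps protective traits (negative covariance) in the candidate pool, since they can be as informative as risk-increasing ones.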

Incorporating East Asian and Finnish GWAS

For all diseases we included data from FinnGen using models trained with PRScs. Additionally, we employed a multi-ancestry approach, PRScsx, to jointly analyze European and East Asian data from the Biobank of Japan. This technique has demonstrated enhanced predictive performance among East Asians and, in some circumstances, improved relative performance for Africans and South Asians. 

Disease | Benchmark PGS / Training Method | Additional Sources of Data
Atrial Fibrillation | Christophersen et al. (2017) [10] / PRScs | Biobank of Japan (BBJ); FinnGen
Schizophrenia | PGC Schizophrenia Wave 3 [11] / Pruning + Thresholding | Genetically correlated traits; BBJ; FinnGen
Depression | Wray et al. (2018) [12] / PRScs | Genetically correlated traits; BBJ; FinnGen
Coronary Artery Disease | Nikpay et al. (2015) [13] / PRScs | Genetically correlated traits; BBJ; FinnGen
Breast Cancer | Michailidou et al. (2015) [14] / PRScs | BBJ; FinnGen
Type 2 Diabetes | Scott et al. (2017) [15] / PRScs | Genetically correlated traits; BBJ; FinnGen
Bipolar Disorder | Stahl et al. (2019) [16] / Pruning + Thresholding | Genetically correlated traits; BBJ; FinnGen
Class III Obesity | Khera et al. (2019) / PRScs | BBJ; FinnGen

Table 2: Description of benchmark PGS and additional data used to train improved models. 

Results for Europeans

For each disease, the collection of trained models was combined into a multi-PGS using logistic regression with elastic net regularization. Performance was evaluated on Cohort 1, except for CAD and type 2 diabetes, which were evaluated on Cohort 2 because their multi-PGS incorporated models trained on Cohort 1. Improvements were strong, with a mean improvement of 28.2% in effect size (log odds ratio per standard deviation) across the 8 diseases. This estimate weights each disease equally, but the error bars are wider for rarer diseases, so we also report the inverse-variance weighted average improvement, which is 23.7% for Europeans. Relative performance increases were strongest in type 2 diabetes and schizophrenia, which can be explained by the large numbers of cases in the additional data and the high SNP heritability of these diseases.
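
To make the combination step concrete, here is a minimal, dependency-free sketch of logistic regression with an elastic net penalty, fitted by proximal gradient descent on simulated data. It is not the pipeline used for the results above: the function names, hyperparameters (`lam`, `alpha`, `lr`, `n_iter`), and the simulated scores are all assumptions chosen for illustration. In practice an established library implementation would normally be used.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(x, t):
    """Proximal operator of the L1 penalty."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def fit_elastic_net_logistic(X, y, lam=0.01, alpha=0.5, lr=0.5, n_iter=300):
    """Fit intercept and weights for standardized PGS columns in X.

    lam is the overall penalty strength and alpha the L1/L2 mix
    (alpha=1 is lasso, alpha=0 is ridge). The intercept is not penalized.
    """
    n, p = len(X), len(X[0])
    b0, w = 0.0, [0.0] * p
    for _ in range(n_iter):
        g0, g = 0.0, [0.0] * p
        for row, label in zip(X, y):
            eta = b0 + sum(wj * xj for wj, xj in zip(w, row))
            resid = sigmoid(eta) - label
            g0 += resid / n
            for j in range(p):
                g[j] += resid * row[j] / n
        b0 -= lr * g0
        for j in range(p):
            # Gradient step on log-loss plus ridge term, then L1 shrinkage.
            step = w[j] - lr * (g[j] + lam * (1.0 - alpha) * w[j])
            w[j] = soft_threshold(step, lr * lam * alpha)
    return b0, w


# Simulated example: column 0 mimics the target-disease PGS, column 1 a
# correlated-trait PGS, column 2 an uninformative PGS.
random.seed(0)
X, y = [], []
for _ in range(1000):
    row = [random.gauss(0.0, 1.0) for _ in range(3)]
    logit = -1.0 + 1.0 * row[0] + 0.5 * row[1]
    X.append(row)
    y.append(1 if random.random() < sigmoid(logit) else 0)

intercept, weights = fit_elastic_net_logistic(X, y)
print(intercept, weights)
```

The fitted weights define the multi-PGS as a weighted sum of the component scores. Weights should be learned on data separate from the evaluation cohort, which is why CAD and type 2 diabetes are evaluated on Cohort 2.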

Figure 1: Improved results of PGS models on White British population in the UK Biobank.

Performance Improvements Successfully Generalized Across Ancestries

Genetic risk score performance improved across all ancestries. The gains were largest in the East Asian population, exceeding those in Europeans, which is consistent with most of the non-European GWAS data originating from the Biobank of Japan.

Population | Relative gain (log odds ratio per standard deviation, weighted by inverse variance)
East Asians (n = 1,350) | +53.6%
South Asians (n = 6,433) | +24.8%
African / Caribbean (n = 6,414) | +29.6%
Europeans (n = 58,808 or 276,471) | +23.7%

Table 3: Improvements in performance across different ancestries, averaged with inverse-variance weighting. An inverse-variance weighted average weights each quantity by the inverse of its variance (that is, by the precision of the estimate), which gives more weight to diseases with larger numbers of cases.
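
The weighting can be sketched in a few lines. The per-disease gains and standard errors below are invented for illustration and are not the study's values; the point is only to show how a precisely estimated gain pulls the weighted average toward itself.

```python
def inverse_variance_weighted_mean(estimates, standard_errors):
    """Return (weighted mean, standard error of the weighted mean),
    weighting each estimate by 1 / SE^2."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, estimates)) / total
    return mean, (1.0 / total) ** 0.5


# Illustrative relative gains in effect size (as fractions) and their SEs.
gains = [0.50, 0.20, 0.30]
ses = [0.20, 0.05, 0.10]

unweighted = sum(gains) / len(gains)
weighted, weighted_se = inverse_variance_weighted_mean(gains, ses)
print(unweighted, weighted, weighted_se)
```

In this toy example the noisy 50% gain barely moves the weighted average, which ends up below the unweighted one. The same direction of difference appears between the 28.2% unweighted and 23.7% weighted figures reported for Europeans.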

For the two most common diseases in the UK Biobank (coronary artery disease and type 2 diabetes), we depict here the odds ratios of the top 10% of PRS versus the bottom 90% in the improved and original models.

Figure 2: Type 2 Diabetes odds ratios comparing the top 10% of GRS to the rest of the cohort in the improved versus old models.
Figure 3: Coronary Artery Disease odds ratios comparing the top 10% of GRS to the rest of the cohort in the improved versus old models.
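
For readers who want to reproduce this kind of comparison on their own data, the sketch below computes an unadjusted odds ratio for the top score fraction versus the rest, with a Woolf (log-scale) 95% confidence interval. Whether the figures above were adjusted for covariates is not stated here, so treat the unadjusted 2x2 calculation as an assumption. The toy data are constructed by hand.

```python
import math


def top_fraction_odds_ratio(scores, is_case, top_fraction=0.10, z=1.96):
    """Odds ratio of being a case in the top score fraction versus the rest.

    Returns (odds_ratio, ci_low, ci_high) using the Woolf interval.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    n_top = max(1, int(round(top_fraction * n)))
    top = set(order[:n_top])

    a = sum(1 for i in top if is_case[i])   # cases in the top group
    b = n_top - a                           # controls in the top group
    c = sum(1 for i in range(n) if i not in top and is_case[i])
    d = n - n_top - c                       # controls in the rest
    if min(a, b, c, d) == 0:
        raise ValueError("a 2x2 cell is empty; odds ratio is undefined")

    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)
    ci_low = math.exp(math.log(odds_ratio) - z * se_log)
    ci_high = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, ci_low, ci_high


# Toy cohort of 100 people: 4 of the 10 highest scorers are cases,
# and 9 of the remaining 90 are cases.
scores = list(range(100))
is_case = [False] * 100
for i in (96, 97, 98, 99):
    is_case[i] = True
for i in range(9):
    is_case[i] = True

odds_ratio, ci_low, ci_high = top_fraction_odds_ratio(scores, is_case)
print(odds_ratio, ci_low, ci_high)
```

With these counts the odds in the top group are 4/6 and in the rest 9/81, giving an odds ratio of 6. The interval is wide because the toy cohort is tiny.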

Discussion

We have evaluated the performance of predictors that incorporate non-European GWAS and genetically correlated traits, showing that performance improves across all ancestries. Relative performance increases were especially high in East Asians because of the joint inference on multi-ancestry data that included the Biobank of Japan. The results demonstrate that the performance of genetic risk scores can be improved by diverse data, and they replicate the finding that summary statistics from large non-European biobanks can help improve equity in genomic medicine.

Citations

  1. Albiñana, C., Zhu, Z., Schork, A. J., Ingason, A., Aschard, H., Brikell, I., ... Vilhjálmsson, B. J. (2022). Multi-PGS enhances polygenic prediction: weighting 937 polygenic scores. medRxiv, 2022.09.14.22279940. https://doi.org/10.1101/2022.09.14.22279940
  2. Truong, B., Hull, L. E., Ruan, Y., Huang, Q. Q., Hornsby, W., Martin, H. C., ... Natarajan, P. (2023, March 23). Integrative polygenic risk score improves the prediction accuracy of complex traits and diseases. medRxiv [Preprint]. 2023.02.21.23286110. https://doi.org/10.1101/2023.02.21.23286110
  3. Ruan, Y., Lin, Y. F., Feng, Y. A., Chen, C. Y., Lam, M., Guo, Z., ... Ge, T. (2022, May). Improving polygenic prediction in ancestrally diverse populations. Nat Genet, 54(5), 573-580. https://doi.org/10.1038/s41588-022-01054-7
  4. Zheng, Z., Liu, S., Sidorenko, J., Yengo, L., Turley, P., Ani, A., ... Zeng, J. (2022). Leveraging functional genomic annotations and genome coverage to improve polygenic prediction of complex traits within and between ancestries. bioRxiv, 2022.10.12.510418. https://doi.org/10.1101/2022.10.12.510418
  5. Nagai, A., Hirata, M., Kamatani, Y., Muto, K., Matsuda, K., Kiyohara, Y., ... Nakamura, Y. (2017, March). Overview of the BioBank Japan Project: Study design and profile. J Epidemiol, 27(3S), S2-S8. https://doi.org/10.1016/j.je.2016.12.005
  6. Kurki, M. I., Karjalainen, J., Palta, P., et al. (2023). FinnGen provides genetic insights from a well-phenotyped isolated population. Nature, 613, 508-518. https://doi.org/10.1038/s41586-022-05473-8
  7. Ge, T., Chen, C. Y., Ni, Y., et al. (2019). Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat Commun, 10, 1776. https://doi.org/10.1038/s41467-019-09718-5
  8. Privé, F., Vilhjálmsson, B. J., Aschard, H., & Blum, M. G. B. (2019). Making the Most of Clumping and Thresholding for Polygenic Scores. American Journal of Human Genetics, https://doi.org/10.1016/j.ajhg.2019.11.001
  9. Abdellaoui, A., Smit, D. J. A., van den Brink, W., Denys, D., & Verweij, K. J. H. (2021, March 1). Genomic relationships across psychiatric disorders including substance use disorders. Drug and Alcohol Dependence, 220, 108535. https://doi.org/10.1016/j.drugalcdep.2021.108535
  10. Christophersen, I. E., Rienstra, M., Roselli, C., Yin, X., Geelhoed, B., Barnard, J., ... Guo, X.; METASTROKE Consortium of the ISGC; Neurology Working Group of the CHARGE Consortium; Dichgans, M., Ingelsson, E., Kooperberg, C., Melander, O., Loos, R. J. F., Laurikka, J., ... Ellinor, P. T.; AFGen Consortium. (2017, June). Large-scale analyses of common and rare variants identify 12 new loci associated with atrial fibrillation. Nat Genet, 49(6), 946-952. https://doi.org/10.1038/ng.3843
  11. Trubetskoy, V., Pardiñas, A. F., Qi, T., Panagiotaropoulou, G., Awasthi, S., Bigdeli, T. B., ... O'Donovan, M. C.; Schizophrenia Working Group of the Psychiatric Genomics Consortium. (2022). Mapping genomic loci implicates genes and synaptic biology in schizophrenia. Nature, 604(7906), 502-508. https://doi.org/10.1038/s41586-022-04434-5
  12. Wray, N. R., Ripke, S., Mattheisen, M., et al. (2018). Genome-wide association analyses identify 44 risk variants and refine the genetic architecture of major depression. Nat Genet, 50, 668-681. https://doi.org/10.1038/s41588-018-0090-3
  13. Nikpay, M., Goel, A., Won, H. H., Hall, L. M., Willenborg, C., Kanoni, S., ... Farrall, M. (2015, October). A comprehensive 1,000 Genomes-based genome-wide association meta-analysis of coronary artery disease. Nat Genet, 47(10), 1121-1130. https://doi.org/10.1038/ng.3396
  14. Michailidou, K., Lindström, S., Dennis, J., Beesley, J., Hui, S., Kar, S., ... Easton, D. F. (2017). Association analysis identifies 65 new breast cancer risk loci. Nature, 551(7678), 92-94. https://doi.org/10.1038/nature24284
  15. Scott, R. A., Scott, L. J., Mägi, R., Marullo, L., Gaulton, K. J., Kaakinen, M., ... McCarthy, M. I.; DIAbetes Genetics Replication And Meta-analysis (DIAGRAM) Consortium. (2017, November). An Expanded Genome-Wide Association Study of Type 2 Diabetes in Europeans. Diabetes, 66(11), 2888-2902. https://doi.org/10.2337/db16-1253
  16. Stahl, E. A., Breen, G., Forstner, A. J., et al. (2019). Genome-wide association study identifies 30 loci associated with bipolar disorder. Nat Genet, 51, 793-803. https://doi.org/10.1038/s41588-019-0397-8

Orchid Health supports open research data initiatives while abiding by the terms of use on all genetic risk models and datasets. PGC data was used in this study only in a research context, to evaluate the potential of the multi-PGS model training technique.

Supplementary Tables

Supplementary Table A: How each disease case is defined in evaluating genetic risk scores in the UK Biobank

Phenotype | ICD-10 Codes | Self-Report Codes | Cases in UK Biobank (White British)
Prostate cancer | C61, D075 | 1044 | 13,806
Type 2 diabetes | E11.1-9 | 1223 | 30,507
Coronary artery disease | I210-4, I219, I220, I221, I228, I232, I233, I235, I236, I238, I249, I252 | 1075 | 22,451
Breast cancer | C5.0-9, D05.0, D059 | 1002 | 18,588
Atrial fibrillation | I48.0-4, I48.9 | 1471, 1483 | 22,472
Schizophrenia | F20.0-9, F21, F23.0-3, F23.8 | 1289 | 1,376
Class III Obesity* | - | -
Depression** | - | -
Bipolar disorder | F31 | 1291 | 1,855

  * Class III Obesity was defined as having a BMI (UK Biobank Field 21001) of 40 kg/m² or above.
  ** Depression was defined for participants in the Mental Health Survey who had researcher-derived “probable recurrent depression (severe)”; controls excluded participants with any depression or bipolar disorder.
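
A case definition of this kind can be sketched as a small lookup. The snippet assumes that a participant counts as a case if either an ICD-10 code or a self-report code matches (the table does not say whether the two sources are combined with "or"), and it matches ICD-10 codes by prefix after removing the dot. The dictionary layout and function names are hypothetical.

```python
def is_case(icd10_codes, self_report_codes, definition):
    """Return True if any ICD-10 code starts with a listed prefix or any
    self-report code is listed (assumed 'either source' rule)."""
    icd_hit = any(
        code.replace(".", "").startswith(prefix)
        for code in icd10_codes
        for prefix in definition["icd10_prefixes"]
    )
    self_hit = any(code in definition["self_report"] for code in self_report_codes)
    return icd_hit or self_hit


def is_class_iii_obesity(bmi):
    """Class III obesity: BMI of 40 kg/m^2 or above (missing BMI is not a case)."""
    return bmi is not None and bmi >= 40.0


# Bipolar disorder row of Supplementary Table A: ICD-10 F31, self-report 1291.
BIPOLAR = {"icd10_prefixes": ("F31",), "self_report": {1291}}

print(is_case(["F31.2"], [], BIPOLAR))
print(is_case(["E11.9"], [1223], BIPOLAR))
print(is_class_iii_obesity(41.2))
```

Ranges such as "E11.1-9" would need to be expanded into explicit prefixes before using this matcher; the sketch deliberately leaves that step out.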

Supplementary Table B1-B10

Number of coronary artery disease cases in test set: 1765 (prevalence of 5.38% in Cohort 2 overall)

Coronary Artery Disease | Odds Ratio (Improved Model) | Case Prevalence at Cutoff (Improved Model) | Odds Ratio (Baseline)
Top 2% | 3.86 (3.12, 4.76) | 19.0% | 3.04 (2.43, 3.82)
Top 5% | 2.89 (2.48, 3.38) | 14.5% | 2.41 (2.05, 2.84)
Top 10% | 2.68 (2.37, 3.02) | 12.9% | 2.28 (2.01, 2.58)

Number of Breast Cancer cases in test set: 6061 (prevalence of 7.45% in Cohort 1 females overall)

Breast Cancer | Odds Ratio (Improved Model) | Case Prevalence at Cutoff (Improved Model) | Odds Ratio (Baseline)
Top 2% | 4.24 (3.76, 4.77) | 26.5% | 3.95 (3.50, 4.45)
Top 5% | 3.34 (3.07, 3.63) | 21.4% | 3.23 (2.97, 3.52)
Top 10% | 3.05 (2.86, 3.26) | 18.7% | 2.83 (2.65, 3.02)

Number of Schizophrenia cases in test set: 476 (prevalence of 0.27% in Cohort 1 overall)

Schizophrenia | Odds Ratio (Improved Model) | Case Prevalence at Cutoff (Improved Model) | Odds Ratio (Baseline)
Top 2% | 4.39 (3.14, 6.13) | 1.37% | 3.15 (2.14, 4.62)
Top 5% | 3.61 (2.81, 4.63) | 1.07% | 2.29 (1.70, 3.07)
Top 10% | 2.85 (2.31, 3.53) | 0.81% | 1.95 (1.53, 2.47)

Number of Type 2 Diabetes cases in test set: 2363 (prevalence of 6.9% in Cohort 2 overall)

Type 2 Diabetes | Odds Ratio (Improved Model) | Case Prevalence at Cutoff (Improved Model) | Odds Ratio (Baseline)
Top 2% | 4.07 (3.36, 4.92) | 25.3% | 2.90 (2.36, 3.57)
Top 5% | 3.48 (3.05, 3.97) | 21.5% | 2.38 (2.06, 2.76)
Top 10% | 3.06 (2.76, 3.40) | 18.5% | 2.21 (1.98, 2.48)

Number of bipolar cases in test set: 640 (prevalence of 0.41% in Cohort 1 overall)

Bipolar Disorder | Odds Ratio (Improved Model) | Case Prevalence at Cutoff (Improved Model) | Odds Ratio (Baseline)
Top 2% | 3.57 (2.61, 4.87) | 1.6% | 3.75 (2.76, 5.09)
Top 5% | 2.69 (2.13, 3.41) | 1.1% | 2.62 (2.06, 3.32)
Top 10% | 2.61 (2.29, 3.31) | 1.06% | 2.46 (2.04, 2.98)

Number of atrial fibrillation cases in test set: 7502

Atrial Fibrillation | Odds Ratio (Improved Model) | Case Prevalence at Cutoff (Improved Model) | Odds Ratio (Baseline)
Top 2% | 3.62 (3.26, 4.02) | 16.6% | 2.91 (2.60, 3.25)
Top 5% | 3.02 (2.80, 3.25) | 11.6% | 2.45 (2.27, 2.65)
Top 10% | 2.67 (2.53, 2.84) | 10.3% | 2.23 (2.10, 2.38)

Number of depression cases in test set: 2415

Depression | Odds Ratio (Improved Model) | Case Prevalence at Cutoff (Improved Model) | Odds Ratio (Baseline)
Top 2% | 2.28 (1.822, 2.85) | 18.1% | 2.02 (1.60, 2.55)
Top 5% | 2.06 (1.77, 2.39) | 7.63% | 1.92 (1.71, 2.16)
Top 10% | 1.92 (1.71, 2.16) | 6.15% | 1.54 (1.37, 1.74)

Number of class III obesity cases in test set: 569 (prevalence of 1.39% in Cohort 2 overall)

Class III Obesity | Odds Ratio (Improved Model) | Case Prevalence at Cutoff (Improved Model) | Odds Ratio (Baseline)
Top 2% | 7.51 (5.76, 9.80) | 11.7% | 6.05 (4.55, 8.04)
Top 5% | 5.75 (4.68, 7.06) | 8.5% | 5.25 (4.26, 6.48)
Top 10% | 5.24 (4.40, 6.25) | 6.9% | 4.22 (3.52, 5.07)
