Improved Polygenic Prediction with Multi-Ancestry and Multi-Trait GWAS Data

Overview

In this whitepaper, we describe a method to improve the performance of genetic risk scores (GRS) across multiple ancestries, including European, East Asian, South Asian, and African populations. The method leverages predictions from genetically correlated traits and genome-wide association studies (GWAS) of non-European ancestries. We replicate previous research^[1][2] that used pleiotropy to improve prediction accuracy and make further gains by incorporating newly developed methods^[3][4] that use ancestrally diverse GWAS data and capitalize on the diversity of linkage disequilibrium across discovery samples. By integrating these two sources of data, we improved the predictive performance of GRS for a wide range of diseases in Europeans: the effect size (log odds ratio per standard deviation of GRS) rose by an inverse-variance weighted average of 23.7% across n = 8 diseases. Consistent with other work^[1], predictive performance also improved in non-Europeans, with an average relative increase in effect size of 24.8% for South Asians and 29.6% for Africans. Due to the inclusion of a large East Asian biobank, the improvement was largest among East Asians, where the gain was 53.6%. This approach promises to advance the use of genetic risk prediction in preimplantation genetic testing by providing more accurate and inclusive scores for diverse populations.

Data and Validation Cohorts

The models for a set of 8 diseases were improved by adding in new data from several sources: non-European data from the Biobank of Japan^[5], Finnish data from FinnGen^[6], and genetically correlated traits from within the UK Biobank (discovered by examining the UK Biobank Genetic Correlation browser from the Neale Lab). These were compared to the original models developed by using a standard approach of taking a large GWAS and creating a polygenic score using the PRScs^[7] software or pruning and thresholding^[8].

We removed samples that failed standard quality control (missing genotypes, genetic sex not matching self-reported sex, or genetic ancestry not matching self-reported ancestry). For the African/Caribbean, South Asian, and East Asian cohorts, we removed any samples that were genetically related (up to third degree) to other samples in the UK Biobank. For the White British samples, we split the dataset into two cohorts:

Cohort 1: Samples with no relatives (up to third degree) within the UK Biobank (n=276,471)

Cohort 2: Samples with one or more relatives in the UK Biobank, where one relative from each family was selected (n=58,808)
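
The cohort split above can be sketched as a grouping problem: samples linked by any relatedness pair form a family, singletons go to Cohort 1, and one representative per family goes to Cohort 2. The sketch below is illustrative only. The function name, the input format (a list of related ID pairs), and the rule of keeping the smallest ID per family are assumptions; the whitepaper does not state how the representative was chosen.

```python
from collections import defaultdict


def split_by_relatedness(sample_ids, related_pairs):
    """Return (cohort1, cohort2) from sample IDs and related ID pairs.

    related_pairs is a hypothetical input: (id_a, id_b) tuples for samples
    related up to third degree, e.g. derived from a kinship table.
    """
    parent = {s: s for s in sample_ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in related_pairs:
        if a in parent and b in parent:
            parent[find(a)] = find(b)

    families = defaultdict(list)
    for s in sample_ids:
        families[find(s)].append(s)

    cohort1, cohort2 = [], []
    for members in families.values():
        if len(members) == 1:
            cohort1.append(members[0])      # no relatives in the dataset
        else:
            cohort2.append(min(members))    # assumed rule: keep smallest ID
    return sorted(cohort1), sorted(cohort2)


# Toy example: s1, s2, s3 form one family; s4 and s5 have no relatives.
samples = ["s1", "s2", "s3", "s4", "s5"]
pairs = [("s1", "s2"), ("s2", "s3")]
cohort1, cohort2 = split_by_relatedness(samples, pairs)
```

Treating relatedness as transitive (s1 and s3 end up in the same family through s2) guarantees that no Cohort 2 sample is related to any Cohort 1 sample, which matters when models are trained on one cohort and tested on the other.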

For type 2 diabetes and coronary artery disease, we trained GRS models that include cohort 1 and tested on cohort 2 as validation. For all other conditions, we tested on cohort 1 since we did not use any UK Biobank GWAS data.

Cohort	Number of samples
White British, unrelated (cohort 1)	276,471
White British, relatives in UK Biobank (cohort 2)	58,808
East Asian	1,350
South Asian	6,433
African and Caribbean	6,415

Table 1: Validation cohorts and sample sizes in the UK Biobank.

Incorporating Genetically Correlated Traits

A number of research papers have shown that predictive performance improves when polygenic score models are combined linearly (multi-PGS), with weights determined by elastic net regularization. For example, including the PGS for schizophrenia as a feature in predicting depression improves the prediction of the latter^[1]. For two of the eight diseases, breast cancer and atrial fibrillation, there were no strongly genetically correlated traits in the UK Biobank or otherwise.

For the three psychiatric conditions (depression, bipolar disorder, schizophrenia), because each is significantly genetically correlated with the other two^[9], we built a multi-PGS on all predictors for the three psychiatric disorders.

For type 2 diabetes, class III obesity, and coronary artery disease, we scanned the UK Biobank for the top 25 traits ranked by genetic covariance with each disease and trained GRS models on them using the PRScs software.
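
To make the multi-PGS idea in this section concrete, the following is a minimal sketch of an elastic-net logistic regression that learns one weight per input PGS. It uses plain proximal gradient descent and only the Python standard library. The function names, the hyperparameters (`lam`, `l1_ratio`, `lr`, `n_iter`), and the toy data are assumptions for illustration, not the pipeline or settings used in this study.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    # Proximal operator of the L1 penalty: shrink toward zero.
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_multi_pgs(X, y, lam=0.01, l1_ratio=0.5, lr=0.1, n_iter=2000):
    """Fit elastic-net logistic regression; returns (intercept, weights).

    X: rows of per-individual PGS values (one column per component PGS).
    y: 0/1 disease labels.
    """
    n, p = len(X), len(X[0])
    intercept, weights = 0.0, [0.0] * p
    for _ in range(n_iter):
        grad_b, grad_w = 0.0, [0.0] * p
        for row, label in zip(X, y):
            z = intercept + sum(w * x for w, x in zip(weights, row))
            err = sigmoid(z) - label
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * row[j] / n
        intercept -= lr * grad_b  # the intercept is not penalized
        for j in range(p):
            # Gradient step on log-loss plus the ridge (L2) part ...
            step = weights[j] - lr * (grad_w[j] + lam * (1 - l1_ratio) * weights[j])
            # ... followed by soft-thresholding for the lasso (L1) part.
            weights[j] = soft_threshold(step, lr * lam * l1_ratio)
    return intercept, weights


# Toy data (not from the study): column 0 tracks disease status,
# column 1 is constructed to carry no information about it.
X = [[-2.0, 1.0], [-1.5, -1.0], [-1.0, -1.0], [-0.5, 1.0],
     [0.5, 1.0], [1.0, -1.0], [1.5, -1.0], [2.0, 1.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
intercept, weights = fit_multi_pgs(X, y)
```

In this toy setting the L1 part of the penalty drives the weight of the uninformative score to exactly zero while the informative score keeps a positive weight. This is the property that makes elastic net convenient when many candidate PGS are supplied and only some carry signal.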

Incorporating East Asian and Finnish GWAS

For all diseases we included data from FinnGen using models trained with PRScs. Additionally, we employed a multi-ancestry approach, PRScsx, to jointly analyze European and East Asian data from the Biobank of Japan. This technique has demonstrated enhanced predictive performance among East Asians and, in some circumstances, improved relative performance for Africans and South Asians.

Disease	Benchmark PGS / Training Method	Additional Sources of Data
Atrial Fibrillation	Christophersen et al. (2017) [10] / PRScs	Biobank of Japan (BBJ); FinnGen
Schizophrenia	PGC Schizophrenia Wave 3 [11] / Pruning + Thresholding	Genetically correlated traits; BBJ, FinnGen
Depression	Wray et al. (2018) [12] / PRScs	Genetically correlated traits; BBJ, FinnGen
Coronary Artery Disease	Nikpay et al. (2015) [13] / PRScs	Genetically correlated traits; BBJ, FinnGen
Breast Cancer	Michailidou et al. (2015) [14] / PRScs	BBJ, FinnGen
Type 2 Diabetes	Scott et al. (2017) [15] / PRScs	Genetically correlated traits; BBJ, FinnGen
Bipolar Disorder	Stahl et al. (2019) [16] / Pruning + Thresholding	Genetically correlated traits; BBJ, FinnGen
Class III Obesity	Khera et al. (2019) / PRScs	BBJ, FinnGen

Table 2: Description of benchmark PGS and additional data used to train improved models.

Results for Europeans

For each disease, the collection of trained models was combined into a multi-PGS with a logistic regression using elastic net regularization. Performance was evaluated on Cohort 1, except for CAD and type 2 diabetes, which were evaluated on Cohort 2 because their multi-PGS incorporated models trained on Cohort 1. Improvements were strong, with a mean improvement of 28.2% in effect size (log odds ratio per standard deviation) across the 8 diseases. This estimate weighs each disease equally, but the error bars are wider for rarer diseases, so we also report the inverse-variance weighted average improvement, which is 23.7% for Europeans. Relative performance increases were strongest in type 2 diabetes and schizophrenia, which can be explained by the large numbers of cases in the additional data and the high SNP heritability of these diseases.


Figure 1: Improved results of PGS models on White British population in the UK Biobank.

Performance Improvements Successfully Generalized Across Ancestries

Genetic risk score performance improved across all ancestries. The gain was largest in East Asians, exceeding the gain in Europeans, which is consistent with most of the non-European GWAS data coming from the Biobank of Japan.

Population	Relative gain (log odds ratio per standard deviation, weighted by inverse variance)
East Asians (n = 1,350)	+53.6%
South Asians (n = 6,433)	+24.8%
African / Caribbean (n = 6,414)	+29.6%
Europeans (n = 58,808 or 276,471)	+23.7%

Table 3: Improvements in performance across ancestries, averaged with inverse-variance weighting. An inverse-variance weighted average weights each estimate by the inverse of its variance, i.e. its precision, which assigns more weight to diseases with larger numbers of cases.
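
As a worked illustration of the weighting described in the caption, the sketch below computes a relative gain in effect size and pools several gains by inverse variance. The numbers are invented for the example and are not results from this study. How the standard error of each per-disease gain was obtained is not stated in this whitepaper, so it is treated here as a given input.

```python
import math


def relative_gain(effect_baseline, effect_improved):
    """Relative increase in effect size (log odds ratio per SD of GRS)."""
    return effect_improved / effect_baseline - 1.0


def inverse_variance_weighted_mean(estimates, standard_errors):
    """Pool estimates with weights 1 / SE^2; returns (mean, SE of mean)."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total = sum(weights)
    mean = sum(w * e for w, e in zip(weights, estimates)) / total
    return mean, math.sqrt(1.0 / total)


# Hypothetical per-disease gains and their standard errors.
gains = [0.10, 0.30]          # +10% (common disease), +30% (rare disease)
ses = [0.10, 0.20]            # the rare disease is estimated less precisely
pooled, pooled_se = inverse_variance_weighted_mean(gains, ses)
simple_mean = sum(gains) / len(gains)
```

With these toy inputs the weights are 100 and 25, so the pooled gain is 14% while the unweighted mean is 20%. The precisely estimated disease dominates, which is why a weighted average can sit below an unweighted one when the largest gains occur in diseases with few cases.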

For the two most common diseases in the UK Biobank (coronary artery disease and type 2 diabetes), we show the odds ratios for the top 10% of GRS versus the remaining 90% under the improved and original models.

Figure 2: Type 2 Diabetes odds ratios comparing the top 10% of GRS to the rest of the cohort in the improved versus original models.

Figure 3: Coronary Artery Disease odds ratios comparing the top 10% of GRS to the rest of the cohort in the improved versus original models.
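
The comparison of a top score fraction against the rest of the cohort can be sketched as a 2x2 table with a Wald confidence interval on the log odds ratio. This is an unadjusted calculation under stated assumptions: the whitepaper does not say whether covariates were included or how intervals were computed, and the function name and toy data are hypothetical.

```python
import math


def top_fraction_odds_ratio(scores, labels, top_fraction=0.10):
    """Compare disease odds in the top score fraction with everyone else.

    Returns (odds_ratio, (ci_low, ci_high), case_prevalence_in_top).
    """
    n = len(scores)
    n_top = max(1, round(n * top_fraction))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:n_top])

    cases_top = sum(labels[i] for i in top)
    controls_top = n_top - cases_top
    cases_rest = sum(labels[i] for i in range(n) if i not in top)
    controls_rest = (n - n_top) - cases_rest
    cells = (cases_top, controls_top, cases_rest, controls_rest)
    if min(cells) == 0:
        raise ValueError("empty cell in the 2x2 table; odds ratio undefined")

    odds_ratio = (cases_top * controls_rest) / (controls_top * cases_rest)
    # Woolf (Wald) standard error of the log odds ratio.
    se_log_or = math.sqrt(sum(1.0 / c for c in cells))
    ci = (math.exp(math.log(odds_ratio) - 1.96 * se_log_or),
          math.exp(math.log(odds_ratio) + 1.96 * se_log_or))
    return odds_ratio, ci, cases_top / n_top


# Toy data (not from the study): 20 individuals with scores 1..20.
scores = list(range(1, 21))
labels = [0] * 20
for idx in (2, 7, 11, 19):    # three cases in the bottom 18, one in the top 2
    labels[idx] = 1
odds_ratio, ci, prevalence = top_fraction_odds_ratio(scores, labels)
```

The third return value corresponds to the "Case Prevalence at Cutoff" column reported in the supplementary tables: the share of individuals above the cutoff who are cases.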

Discussion

We have evaluated the performance of predictors that incorporate non-European GWAS and genetically correlated traits, showing that performance improves across all ancestries. Relative performance increases were especially high in East Asians because of the joint inference on multi-ancestry data that included the Biobank of Japan. The results demonstrate that the performance of genetic risk scores can be improved by diverse data, and they replicate the finding that summary statistics from large non-European biobanks can help improve equity in genomic medicine.

Citations

[1] Albiñana, C., Zhu, Z., Schork, A. J., Ingason, A., Aschard, H., Brikell, I., ... Vilhjálmsson, B. J. (2022). Multi-PGS enhances polygenic prediction: weighting 937 polygenic scores. medRxiv, 2022.09.14.22279940. https://doi.org/10.1101/2022.09.14.22279940
[2] Truong, B., Hull, L. E., Ruan, Y., Huang, Q. Q., Hornsby, W., Martin, H. C., ... Natarajan, P. (2023). Integrative polygenic risk score improves the prediction accuracy of complex traits and diseases. medRxiv, 2023.02.21.23286110. https://doi.org/10.1101/2023.02.21.23286110
[3] Ruan, Y., Lin, Y. F., Feng, Y. A., Chen, C. Y., Lam, M., Guo, Z., ... Ge, T. (2022). Improving polygenic prediction in ancestrally diverse populations. Nat Genet, 54(5), 573-580. https://doi.org/10.1038/s41588-022-01054-7
[4] Zheng, Z., Liu, S., Sidorenko, J., Yengo, L., Turley, P., Ani, A., ... Zeng, J. (2022). Leveraging functional genomic annotations and genome coverage to improve polygenic prediction of complex traits within and between ancestries. bioRxiv, 2022.10.12.510418. https://doi.org/10.1101/2022.10.12.510418
[5] Nagai, A., Hirata, M., Kamatani, Y., Muto, K., Matsuda, K., Kiyohara, Y., ... Nakamura, Y. (2017). Overview of the BioBank Japan Project: Study design and profile. J Epidemiol, 27(3S), S2-S8. https://doi.org/10.1016/j.je.2016.12.005
[6] Kurki, M. I., Karjalainen, J., Palta, P., et al. (2023). FinnGen provides genetic insights from a well-phenotyped isolated population. Nature, 613, 508-518. https://doi.org/10.1038/s41586-022-05473-8
[7] Ge, T., Chen, C. Y., Ni, Y., et al. (2019). Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat Commun, 10, 1776. https://doi.org/10.1038/s41467-019-09718-5
[8] Privé, F., Vilhjálmsson, B. J., Aschard, H., & Blum, M. G. B. (2019). Making the Most of Clumping and Thresholding for Polygenic Scores. Am J Hum Genet, 105(6), 1213-1221. https://doi.org/10.1016/j.ajhg.2019.11.001
[9] Abdellaoui, A., Smit, D. J. A., van den Brink, W., Denys, D., & Verweij, K. J. H. (2021). Genomic relationships across psychiatric disorders including substance use disorders. Drug and Alcohol Dependence, 220, 108535. https://doi.org/10.1016/j.drugalcdep.2021.108535
[10] Christophersen, I. E., Rienstra, M., Roselli, C., Yin, X., Geelhoed, B., Barnard, J., ... Ellinor, P. T.; AFGen Consortium. (2017). Large-scale analyses of common and rare variants identify 12 new loci associated with atrial fibrillation. Nat Genet, 49(6), 946-952. https://doi.org/10.1038/ng.3843
[11] Trubetskoy, V., Pardiñas, A. F., Qi, T., Panagiotaropoulou, G., Awasthi, S., Bigdeli, T. B., ... O'Donovan, M. C.; Schizophrenia Working Group of the Psychiatric Genomics Consortium. (2022). Mapping genomic loci implicates genes and synaptic biology in schizophrenia. Nature, 604(7906), 502-508. https://doi.org/10.1038/s41586-022-04434-5
[12] Wray, N. R., Ripke, S., Mattheisen, M., et al. (2018). Genome-wide association analyses identify 44 risk variants and refine the genetic architecture of major depression. Nat Genet, 50, 668-681. https://doi.org/10.1038/s41588-018-0090-3
[13] Nikpay, M., Goel, A., Won, H. H., Hall, L. M., Willenborg, C., Kanoni, S., ... Farrall, M. (2015). A comprehensive 1,000 Genomes-based genome-wide association meta-analysis of coronary artery disease. Nat Genet, 47(10), 1121-1130. https://doi.org/10.1038/ng.3396
[14] Michailidou, K., Lindström, S., Dennis, J., Beesley, J., Hui, S., Kar, S., ... Easton, D. F. (2017). Association analysis identifies 65 new breast cancer risk loci. Nature, 551(7678), 92-94. https://doi.org/10.1038/nature24284
[15] Scott, R. A., Scott, L. J., Mägi, R., Marullo, L., Gaulton, K. J., Kaakinen, M., ... McCarthy, M. I.; DIAbetes Genetics Replication And Meta-analysis (DIAGRAM) Consortium. (2017). An Expanded Genome-Wide Association Study of Type 2 Diabetes in Europeans. Diabetes, 66(11), 2888-2902. https://doi.org/10.2337/db16-1253
[16] Stahl, E. A., Breen, G., Forstner, A. J., et al. (2019). Genome-wide association study identifies 30 loci associated with bipolar disorder. Nat Genet, 51, 793-803. https://doi.org/10.1038/s41588-019-0397-8

Orchid Health supports open research data initiatives while abiding by the terms of use of all genetic risk models and datasets. PGC data were used in this study only in a research context, to evaluate the potential of the multi-PGS model training technique.

Supplementary Tables

Supplementary Table A: How each disease case is defined in evaluating genetic risk scores in the UK Biobank


Phenotype	ICD-10 Codes	Self-Report Codes	Cases in UK Biobank (White British)
Prostate cancer	C61, D075	1044	13,806
Type 2 diabetes	E11.1-9	1223	30,507
Coronary artery disease	I210-4, I219, I220, I221, I228, I232, I233, I235, I236, I238, I249, I252	1075	22,451
Breast cancer	C50.0-9, D05.0, D05.9	1002	18,588
Atrial fibrillation	I48.0-4, I48.9	1471, 1483	22,472
Schizophrenia	F20.0-9, F21, F23.0-3, F23.8	1289	1,376
Class III Obesity*	-	-
Depression**	-	-
Bipolar disorder	F31	1291	1,855


* Class III Obesity was defined as having a BMI (UK Biobank Field 21001) of 40 kg/m² or above.
** Depression cases were participants in the Mental Health Survey with researcher-derived “probable recurrent depression (severe)”; controls excluded participants with any depression or bipolar disorder.
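
The case definitions in Supplementary Table A can be sketched as simple lookups. The example below is a hedged illustration: the function names are hypothetical, and matching ICD-10 codes by prefix after stripping dots (so that "F31" also matches "F31.2") is an assumption about how the listed codes were applied, not a rule stated in this whitepaper.

```python
def normalize_icd10(code):
    """Strip dots and upper-case so 'F31.2' and 'F312' compare equal."""
    return code.replace(".", "").upper()


def is_case(icd10_codes, self_report_codes, icd10_prefixes, self_report_set):
    """True if any diagnosis matches a prefix or any self-report code matches."""
    prefixes = tuple(normalize_icd10(p) for p in icd10_prefixes)
    has_icd = any(normalize_icd10(c).startswith(prefixes) for c in icd10_codes)
    has_self_report = any(c in self_report_set for c in self_report_codes)
    return has_icd or has_self_report


def is_class_iii_obesity(bmi):
    """Class III obesity: BMI (UK Biobank Field 21001) of 40 kg/m^2 or above."""
    return bmi is not None and bmi >= 40.0


# Bipolar disorder row of Supplementary Table A: ICD-10 F31, self-report 1291.
bipolar_icd = ["F31"]
bipolar_self_report = {1291}
example = is_case(["F31.2"], [], bipolar_icd, bipolar_self_report)
```

Code ranges such as "E11.1-9" would need to be expanded into individual codes before being passed as prefixes; that expansion is left out of this sketch.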

Supplementary Table B1-B10

Number of Coronary Artery Disease cases in test set: 1765 (prevalence of 5.38% in Cohort 1 overall)

Coronary Artery Disease	Odds Ratio (Improved Model)	Case Prevalence at Cutoff (Improved Model)	Odds Ratio (Baseline)
Top 2%	3.86 (3.12, 4.76)	19.0%	3.04 (2.43, 3.82)
Top 5%	2.89 (2.48, 3.38)	14.5%	2.41 (2.05, 2.84)
Top 10%	2.68 (2.37, 3.02)	12.9%	2.28 (2.01, 2.58)


Number of Breast Cancer cases in test set: 6061 (prevalence of 7.45% in Cohort 1 females overall)

Breast Cancer	Odds Ratio (Improved Model)	Case Prevalence at Cutoff (Improved Model)	Odds Ratio (Baseline)
Top 2%	4.24 (3.76, 4.77)	26.5%	3.95 (3.50, 4.45)
Top 5%	3.34 (3.07, 3.63)	21.4%	3.23 (2.97, 3.52)
Top 10%	3.05 (2.86, 3.26)	18.7%	2.83 (2.65, 3.02)


Number of Schizophrenia cases in test set: 476 (prevalence of 0.27% in Cohort 1 overall)

Schizophrenia	Odds Ratio (Improved Model)	Case Prevalence at Cutoff (Improved Model)	Odds Ratio (Baseline)
Top 2%	4.39 (3.14, 6.13)	1.37%	3.15 (2.14, 4.62)
Top 5%	3.61 (2.81, 4.63)	1.07%	2.29 (1.70, 3.07)
Top 10%	2.85 (2.31, 3.53)	0.81%	1.95 (1.53, 2.47)


Number of Type 2 Diabetes cases in test set: 2363 (prevalence of 6.9% in Cohort 2 overall)

Type 2 Diabetes	Odds Ratio (Improved Model)	Case Prevalence at Cutoff (Improved Model)	Odds Ratio (Baseline)
Top 2%	4.07 (3.36, 4.92)	25.3%	2.90 (2.36, 3.57)
Top 5%	3.48 (3.05, 3.97)	21.5%	2.38 (2.06, 2.76)
Top 10%	3.06 (2.76, 3.40)	18.5%	2.21 (1.98, 2.48)


Number of bipolar cases in test set: 640 (prevalence of 0.41% in Cohort 1 overall)

Bipolar	Odds Ratio (Improved Model)	Case Prevalence at Cutoff (Improved Model)	Odds Ratio (Baseline)
Top 2%	3.57 (2.61, 4.87)	1.6%	3.75 (2.76, 5.09)
Top 5%	2.69 (2.13, 3.41)	1.1%	2.62 (2.06, 3.32)
Top 10%	2.61 (2.29, 3.31)	1.06%	2.46 (2.04, 2.98)


Number of atrial fibrillation cases in test set: 7502

Atrial Fibrillation	Odds Ratio (Improved Model)	Case Prevalence at Cutoff (Improved Model)	Odds Ratio (Baseline)
Top 2%	3.62 (3.26, 4.02)	16.6%	2.91 (2.60, 3.25)
Top 5%	3.02 (2.80, 3.25)	11.6%	2.45 (2.27, 2.65)
Top 10%	2.67 (2.53, 2.84)	10.3%	2.23 (2.10, 2.38)


Number of depression cases in test set: 2415

Depression	Odds Ratio (Improved Model)	Case Prevalence at Cutoff (Improved Model)	Odds Ratio (Baseline)
Top 2%	2.28 (1.82, 2.85)	18.1%	2.02 (1.60, 2.55)
Top 5%	2.06 (1.77, 2.39)	7.63%	1.92 (1.71, 2.16)
Top 10%	1.92 (1.71, 2.16)	6.15%	1.54 (1.37, 1.74)


Number of class III obesity cases in test set: 569 (prevalence of 1.39% in Cohort 2 overall)

Class III Obesity	Odds Ratio (Improved Model)	Case Prevalence at Cutoff (Improved Model)	Odds Ratio (Baseline)
Top 2%	7.51 (5.76, 9.80)	11.7%	6.05 (4.55, 8.04)
Top 5%	5.75 (4.68, 7.06)	8.5%	5.25 (4.26, 6.48)
Top 10%	5.24 (4.40, 6.25)	6.9%	4.22 (3.52, 5.07)

Acknowledgements

This research has been conducted using the UK Biobank Resource under Application Number 80545.