Abstract
Polygenic scores quantify the genetic risk associated with a given phenotype and are widely used to predict the risk of complex diseases. There has been recent interest in developing methods to construct polygenic risk scores using summary statistic data. We propose a method to construct polygenic risk scores via penalized regression using summary statistic data and publicly available reference data. Our method is similar to the existing method LassoSum, extending its framework to the Truncated Lasso Penalty (TLP) and the elastic net. We show via simulation and real data application that the TLP improves predictive accuracy compared to the LASSO while imposing additional sparsity where appropriate. To facilitate model selection in the absence of validation data, we propose methods for estimating the model fitting criteria AIC and BIC. These methods approximate the AIC and BIC when a polygenic risk score has been estimated on summary statistic data and no validation data are available. Additionally, we propose the so-called quasi-correlation metric, which quantifies the predictive accuracy of a polygenic risk score applied to out-of-sample data for which we have only summary statistic information. In total, these methods facilitate estimation and model selection of polygenic risk scores on summary statistic data, and the application of these polygenic risk scores to out-of-sample data for which we have only summary statistic information. We demonstrate the utility of these methods by applying them to GWA studies of lipids, height, and lung cancer.
Author summary: Polygenic risk scores use genetic data to predict the genetic risk associated with a given phenotype. Often, due to privacy concerns, genetic data is provided in a limited format called summary statistics. This means that we have limited data with which to estimate polygenic risk scores and cannot apply many standard modelling techniques.
We provide novel methods for the estimation of polygenic risk scores via penalized regression using summary statistics, and make software available to do this estimation. We also provide novel methods for model selection and the assessment of model performance in the summary statistic framework. In total, this enables us to use summary statistic data to estimate polygenic risk scores, select a polygenic risk score from among a set of candidate models, and assess the performance of these models. This allows us to leverage summary statistic data to better understand genetic risk. We establish the usefulness of our novel methods via simulation, and apply them to genetic analyses of height, blood lipid levels, and lung cancer.
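To make the idea of penalized regression on summary statistics concrete, the following is a minimal sketch, not the authors' software. It assumes a LassoSum-style objective of the form beta'R beta - 2 r'beta + 2 lambda ||beta||_1, where r holds SNP-phenotype correlations taken from summary statistics and R is an LD (correlation) matrix estimated from reference data. The function names, the coordinate descent solver, and the toy inputs are all illustrative assumptions. The TLP is shown in its commonly cited form min(|beta|/tau, 1); how the paper optimizes it is not reproduced here.

```python
import numpy as np


def soft_threshold(z, lam):
    """Soft-thresholding operator used in lasso coordinate updates."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)


def tlp_penalty(beta, tau):
    """Truncated Lasso Penalty per coefficient: min(|beta| / tau, 1).

    Grows like the lasso penalty for small coefficients and is capped
    at 1 once |beta| exceeds tau (assumed standard form).
    """
    return np.minimum(np.abs(np.asarray(beta, dtype=float)) / tau, 1.0)


def lasso_from_summary(r, R, lam, n_iter=200, tol=1e-10):
    """Coordinate descent for beta'R beta - 2 r'beta + 2 lam ||beta||_1.

    r   : SNP-phenotype correlations (hypothetical summary statistics)
    R   : LD matrix from a reference panel (hypothetical), nonzero diagonal
    lam : lasso tuning parameter
    """
    r = np.asarray(r, dtype=float)
    R = np.asarray(R, dtype=float)
    beta = np.zeros(len(r))
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(len(r)):
            # Correlation of SNP j with the residual, excluding its own effect.
            partial = r[j] - R[j] @ beta + R[j, j] * beta[j]
            new_value = soft_threshold(partial, lam) / R[j, j]
            max_change = max(max_change, abs(new_value - beta[j]))
            beta[j] = new_value
        if max_change < tol:
            break
    return beta


# Toy inputs: three uncorrelated SNPs, so the solution is soft-thresholding of r.
r_toy = np.array([0.5, -0.2, 0.05])
beta_toy = lasso_from_summary(r_toy, np.eye(3), lam=0.1)
```

With an identity LD matrix the estimate reduces to soft-thresholding each correlation, which gives a quick sanity check. With a real reference panel, R would typically need regularization before use, and the tuning parameters would be chosen by a model selection criterion such as those the abstract describes.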
Suggested Citation
Jack Pattee & Wei Pan, 2020.
"Penalized regression and model selection methods for polygenic scores on summary statistics,"
PLOS Computational Biology, Public Library of Science, vol. 16(10), pages 1-27, October.
Handle: RePEc:plo:pcbi00:1008271
DOI: 10.1371/journal.pcbi.1008271