
Finding causative genes from high-dimensional data: an appraisal of statistical and machine learning approaches

Authors

  • Wang, Chamont

    (Department of Mathematics and Statistics, The College of New Jersey, Ewing, NJ 08628, USA)

  • Gevertz, Jana L.

    (Department of Mathematics and Statistics, The College of New Jersey, Ewing, NJ 08628, USA)

Abstract

Modern biological experiments often involve high-dimensional data with thousands or more variables. A challenging problem is to identify the key variables that are related to a specific disease. Confounding this task is the vast number of statistical methods available for variable selection. For this reason, we set out to develop a framework to investigate the variable selection capability of statistical methods that are commonly applied to analyze high-dimensional biological datasets. Specifically, we designed six simulated cancers (based on benchmark colon and prostate cancer data) where we know precisely which genes cause a dataset to be classified as cancerous or normal – we call these causative genes. We found that not one statistical method tested could identify all the causative genes for all of the simulated cancers, even though increasing the sample size does improve the variable selection capabilities in most cases. Furthermore, certain statistical tools can classify our simulated data with a low error rate, yet the variables being used for classification are not necessarily the causative genes.
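The evaluation idea in the abstract (simulate data in which the causative variables are known, then check whether a selection method recovers them) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' design: the Gaussian expression values, the sum-threshold labeling rule, the t-statistic ranking used as the selection method, and the names `simulate_dataset`, `top_k_by_t_statistic`, and `recovery_rate` are all hypothetical.

```python
import math
import random


def simulate_dataset(n_samples, n_genes, causative, rng):
    """Hypothetical simulated 'cancer': a sample is labeled 1 when the
    summed expression of the causative genes is positive (an assumed rule)."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_samples)]
    y = [1 if sum(row[g] for g in causative) > 0 else 0 for row in X]
    return X, y


def welch_t(a, b):
    """Welch t-statistic for two samples; assumes each has at least 2 values."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))


def top_k_by_t_statistic(X, y, k):
    """Rank genes by |t| between the two classes and return the top k indices."""
    n_genes = len(X[0])
    scores = []
    for g in range(n_genes):
        cases = [row[g] for row, label in zip(X, y) if label == 1]
        controls = [row[g] for row, label in zip(X, y) if label == 0]
        scores.append(abs(welch_t(cases, controls)))
    return sorted(range(n_genes), key=lambda g: -scores[g])[:k]


def recovery_rate(selected, causative):
    """Fraction of the known causative genes that appear in the selection."""
    return len(set(selected) & set(causative)) / len(causative)


if __name__ == "__main__":
    rng = random.Random(0)
    causative = [3, 17, 42]  # arbitrary indices chosen for the illustration
    for n_samples in (40, 400):
        X, y = simulate_dataset(n_samples, 500, causative, rng)
        selected = top_k_by_t_statistic(X, y, k=len(causative))
        print(n_samples, recovery_rate(selected, causative))
```

Because the causative set is fixed in advance, any selection method can be scored the same way by swapping out `top_k_by_t_statistic`; comparing the recovery rate at the two sample sizes mirrors, in a toy setting, the abstract's question about how sample size affects variable selection.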

Suggested Citation

  • Wang, Chamont & Gevertz, Jana L., 2016. "Finding causative genes from high-dimensional data: an appraisal of statistical and machine learning approaches," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 15(4), pages 321-347, August.
  • Handle: RePEc:bpj:sagmbi:v:15:y:2016:i:4:p:321-347:n:4
    DOI: 10.1515/sagmb-2015-0072

    Download full text from publisher

    File URL: https://doi.org/10.1515/sagmb-2015-0072
    Download Restriction: For access to full text, subscription to the journal or payment for the individual article is required.



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Jianqing Fan & Xu Han, 2017. "Estimation of the false discovery proportion with unknown dependence," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 79(4), pages 1143-1164, September.
    2. Qingyun Cai & Hock Peng Chan, 2017. "A Double Application of the Benjamini-Hochberg Procedure for Testing Batched Hypotheses," Methodology and Computing in Applied Probability, Springer, vol. 19(2), pages 429-443, June.
    3. Wang, Kangning & Li, Shaomin, 2021. "Robust distributed modal regression for massive data," Computational Statistics & Data Analysis, Elsevier, vol. 160(C).
    4. Mogliani, Matteo & Simoni, Anna, 2021. "Bayesian MIDAS penalized regressions: Estimation, selection, and prediction," Journal of Econometrics, Elsevier, vol. 222(1), pages 833-860.
    5. Zhaoping Hong & Yuao Hu & Heng Lian, 2013. "Variable selection for high-dimensional varying coefficient partially linear models via nonconcave penalty," Metrika: International Journal for Theoretical and Applied Statistics, Springer, vol. 76(7), pages 887-908, October.
    6. Fang, Xiaolei & Paynabar, Kamran & Gebraeel, Nagi, 2017. "Multistream sensor fusion-based prognostics model for systems with single failure modes," Reliability Engineering and System Safety, Elsevier, vol. 159(C), pages 322-331.
    7. Belli, Edoardo, 2022. "Smoothly adaptively centered ridge estimator," Journal of Multivariate Analysis, Elsevier, vol. 189(C).
    8. Alena Skolkova, 2023. "Instrumental Variable Estimation with Many Instruments Using Elastic-Net IV," CERGE-EI Working Papers wp759, The Center for Economic Research and Graduate Education - Economics Institute, Prague.
    9. Howard D. Bondell & Brian J. Reich, 2009. "Simultaneous Factor Selection and Collapsing Levels in ANOVA," Biometrics, The International Biometric Society, vol. 65(1), pages 169-177, March.
    10. Lan, Wei & Zhong, Ping-Shou & Li, Runze & Wang, Hansheng & Tsai, Chih-Ling, 2016. "Testing a single regression coefficient in high dimensional linear models," Journal of Econometrics, Elsevier, vol. 195(1), pages 154-168.
    11. Cai, Qingyun, 2018. "A scoring criterion for rejection of clustered p-values," Computational Statistics & Data Analysis, Elsevier, vol. 121(C), pages 180-189.
    12. Kangning Wang & Lu Lin, 2019. "Robust and efficient estimator for simultaneous model structure identification and variable selection in generalized partial linear varying coefficient models with longitudinal data," Statistical Papers, Springer, vol. 60(5), pages 1649-1676, October.
    13. Wei Sun & Lexin Li, 2012. "Multiple Loci Mapping via Model-free Variable Selection," Biometrics, The International Biometric Society, vol. 68(1), pages 12-22, March.
    14. Zhao, Weihua & Lian, Heng, 2017. "Quantile index coefficient model with variable selection," Journal of Multivariate Analysis, Elsevier, vol. 154(C), pages 40-58.
    15. Yichao Wu, 2011. "An ordinary differential equation-based solution path algorithm," Journal of Nonparametric Statistics, Taylor & Francis Journals, vol. 23(1), pages 185-199.
    16. Heewon Park & Fumitake Sakaori, 2013. "Lag weighted lasso for time series model," Computational Statistics, Springer, vol. 28(2), pages 493-504, April.
    17. Lian, Heng & Feng, Sanying & Zhao, Kaifeng, 2015. "Parametric and semiparametric reduced-rank regression with flexible sparsity," Journal of Multivariate Analysis, Elsevier, vol. 136(C), pages 163-174.
    18. Anestis Antoniadis & Irène Gijbels & Mila Nikolova, 2011. "Penalized likelihood regression for generalized linear models with non-quadratic penalties," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 63(3), pages 585-615, June.
    19. Sijian Wang & Ji Zhu, 2008. "Variable Selection for Model-Based High-Dimensional Clustering and Its Application to Microarray Data," Biometrics, The International Biometric Society, vol. 64(2), pages 440-448, June.
    20. Tutz, Gerhard & Pößnecker, Wolfgang & Uhlmann, Lorenz, 2015. "Variable selection in general multinomial logit models," Computational Statistics & Data Analysis, Elsevier, vol. 82(C), pages 207-222.
