Printed from https://ideas.repec.org/a/eee/csdana/v55y2011i1p935-943.html

A statistical approach to high-throughput screening of predicted orthologs

Authors

Listed:
  • Min, Jeong Eun
  • Whiteside, Matthew D.
  • Brinkman, Fiona S.L.
  • McNeney, Brad
  • Graham, Jinko

Abstract

Orthologs are genes in different species that have diverged from a common ancestral gene after speciation. In contrast, paralogs are genes that have diverged after a gene duplication event. For many comparative analyses, it is of interest to identify orthologs with similar functions. Such orthologs tend to support species divergence (ssd-orthologs) in the sense that they have diverged only due to speciation, to the same relative degree as their species. However, due to incomplete sequencing or gene loss in a species, predicted orthologs can sometimes be paralogs or other non-ssd-orthologs. To increase the specificity of ssd-ortholog prediction, Fulton et al. [Fulton, D., Li, Y., Laird, M., Horsman, B., Roche, F., Brinkman, F., 2006. Improving the specificity of high-throughput ortholog prediction. BMC Bioinformatics 7 (1), 270] developed Ortholuge, a bioinformatics tool that identifies predicted orthologs with atypical genetic divergence. However, when the initial list of putative orthologs contains a non-negligible number of non-ssd-orthologs, the cut-off values that Ortholuge generates for orthology classification are difficult to interpret and can be too high, leading to decreased specificity of ssd-ortholog prediction. Therefore, we propose a complementary statistical approach to determining cut-off values. A benefit of the proposed approach is that it gives the user an estimated conditional probability that a predicted ortholog pair is unusually diverged. This enables the interpretation and selection of cut-off values based on a direct measure of the relative composition of ssd-orthologs versus non-ssd-orthologs. In a simulation comparison of the two approaches, we find that the statistical approach provides more stable cut-off values and improves the specificity of ssd-ortholog prediction for low-quality data sets of predicted orthologs.
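The idea of reading a cut-off from an estimated conditional probability can be illustrated with a small sketch. The abstract does not state the model used, so everything here is an assumption made for illustration: each predicted ortholog pair is summarized by one divergence statistic `x`, typical pairs and unusually diverged pairs follow two normal distributions with known parameters, and `pi0` is the assumed proportion of typical pairs. The function names `prob_unusual` and `cutoff_for` are hypothetical and are not part of Ortholuge or of the authors' software.

```python
import math


def normal_pdf(x, mu, sd):
    """Density of a normal distribution with mean mu and standard deviation sd."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def prob_unusual(x, pi0, mu0, sd0, mu1, sd1):
    """Conditional probability that a pair with statistic x is unusually diverged.

    Assumed two-component mixture (illustrative, not the paper's model):
    typical pairs ~ N(mu0, sd0) with weight pi0,
    unusually diverged pairs ~ N(mu1, sd1) with weight 1 - pi0.
    """
    f0 = pi0 * normal_pdf(x, mu0, sd0)
    f1 = (1.0 - pi0) * normal_pdf(x, mu1, sd1)
    return f1 / (f0 + f1)


def cutoff_for(target, pi0, mu0, sd0, mu1, sd1, tol=1e-9):
    """Smallest x in [mu0, mu1] whose conditional probability reaches target.

    Requires mu0 < mu1. On that interval the probability is non-decreasing
    in x, so a bisection search is sufficient.
    """
    if not mu0 < mu1:
        raise ValueError("expected mu0 < mu1")
    lo, hi = mu0, mu1
    p_lo = prob_unusual(lo, pi0, mu0, sd0, mu1, sd1)
    p_hi = prob_unusual(hi, pi0, mu0, sd0, mu1, sd1)
    if not p_lo <= target <= p_hi:
        raise ValueError("target probability is not attained between mu0 and mu1")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if prob_unusual(mid, pi0, mu0, sd0, mu1, sd1) >= target:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    # Hypothetical parameters: 90% typical pairs centred at 0, 10% atypical at 4.
    c = cutoff_for(0.5, pi0=0.9, mu0=0.0, sd0=1.0, mu1=4.0, sd1=1.0)
    print(f"cut-off where P(unusual | x) reaches 0.5: {c:.4f}")
```

In this sketch the cut-off has a direct reading: pairs scoring above it are more likely than not to come from the unusually diverged component under the assumed mixture. In practice the mixture parameters would have to be estimated from the data rather than supplied by hand.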

Suggested Citation

  • Min, Jeong Eun & Whiteside, Matthew D. & Brinkman, Fiona S.L. & McNeney, Brad & Graham, Jinko, 2011. "A statistical approach to high-throughput screening of predicted orthologs," Computational Statistics & Data Analysis, Elsevier, vol. 55(1), pages 935-943, January.
  • Handle: RePEc:eee:csdana:v:55:y:2011:i:1:p:935-943

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0167-9473(10)00317-8
    Download Restriction: Full text for ScienceDirect subscribers only.


    References listed on IDEAS

    1. Efron, Bradley, 2004. "Large-Scale Simultaneous Hypothesis Testing: The Choice of a Null Hypothesis," Journal of the American Statistical Association, American Statistical Association, vol. 99, pages 96-104, January.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Pounds Stanley B. & Gao Cuilan L. & Zhang Hui, 2012. "Empirical Bayesian Selection of Hypothesis Testing Procedures for Analysis of Sequence Count Expression Data," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 11(5), pages 1-32, October.
    2. Shigeyuki Matsui & Hisashi Noma, 2011. "Estimating Effect Sizes of Differentially Expressed Genes for Power and Sample-Size Assessments in Microarray Experiments," Biometrics, The International Biometric Society, vol. 67(4), pages 1225-1235, December.
    3. van Wieringen, Wessel N. & Stam, Koen A. & Peeters, Carel F.W. & van de Wiel, Mark A., 2020. "Updating of the Gaussian graphical model through targeted penalized estimation," Journal of Multivariate Analysis, Elsevier, vol. 178(C).
    4. Ian W. McKeague & Min Qian, 2015. "An Adaptive Resampling Test for Detecting the Presence of Significant Predictors," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 110(512), pages 1422-1433, December.
    5. Han, Bing & Dalal, Siddhartha R., 2012. "A Bernstein-type estimator for decreasing density with application to p-value adjustments," Computational Statistics & Data Analysis, Elsevier, vol. 56(2), pages 427-437.
    6. He, Yi & Pan, Wei & Lin, Jizhen, 2006. "Cluster analysis using multivariate normal mixture models to detect differential gene expression with microarray data," Computational Statistics & Data Analysis, Elsevier, vol. 51(2), pages 641-658, November.
    7. Cheng, Cheng, 2009. "Internal validation inferences of significant genomic features in genome-wide screening," Computational Statistics & Data Analysis, Elsevier, vol. 53(3), pages 788-800, January.
    8. Xiang, Qinfang & Edwards, Jode & Gadbury, Gary L., 2006. "Interval estimation in a finite mixture model: Modeling P-values in multiple testing applications," Computational Statistics & Data Analysis, Elsevier, vol. 51(2), pages 570-586, November.
    9. Gordon, Alexander & Chen, Linlin & Glazko, Galina & Yakovlev, Andrei, 2009. "Balancing type one and two errors in multiple testing for differential expression of genes," Computational Statistics & Data Analysis, Elsevier, vol. 53(5), pages 1622-1629, March.
    10. Ruggieri, Eric & Lawrence, Charles E., 2012. "On efficient calculations for Bayesian variable selection," Computational Statistics & Data Analysis, Elsevier, vol. 56(6), pages 1319-1332.
    11. Yi-Hui Zhou & Paul Brooks & Xiaoshan Wang, 2018. "A Two-Stage Hidden Markov Model Design for Biomarker Detection, with Application to Microbiome Research," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 10(1), pages 41-58, April.
    12. Woo, Chi-Keung & Horowitz, Ira & Olson, Arne & Horii, Brian & Baskette, Carmen, 2006. "Efficient frontiers for electricity procurement by an LDC with multiple purchase options," Omega, Elsevier, vol. 34(1), pages 70-80, January.
    13. Bickel David R., 2012. "Empirical Bayes Interval Estimates that are Conditionally Equal to Unadjusted Confidence Intervals or to Default Prior Credibility Intervals," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 11(3), pages 1-34, February.
    14. Lee, Donghwan & Lee, Youngjo, 2016. "Extended likelihood approach to multiple testing with directional error control under a hidden Markov random field model," Journal of Multivariate Analysis, Elsevier, vol. 151(C), pages 1-13.
    15. Sander Greenland, 2005. "Discussion on "Statistical Issues Arising in the Women's Health Initiative"," Biometrics, The International Biometric Society, vol. 61(4), pages 920-921, December.
    16. Leek Jeffrey T & Storey John D., 2011. "The Joint Null Criterion for Multiple Hypothesis Tests," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 10(1), pages 1-22, June.
    17. Lu, Zexian & Chen, Yunxiao & Li, Xiaoou, 2022. "Optimal parallel sequential change detection under generalized performance measures," LSE Research Online Documents on Economics 118348, London School of Economics and Political Science, LSE Library.
    18. Chen, Yunxiao & Lee, Yi-Hsuan & Li, Xiaoou, 2022. "Item pool quality control in educational testing: change point model, compound risk, and sequential detection," LSE Research Online Documents on Economics 112498, London School of Economics and Political Science, LSE Library.
    19. repec:cte:wsrepe:ws133228 is not listed on IDEAS
    20. Lim Johan & Kim Jayoun & Kim Sang-cheol & Yu Donghyeon & Kim Kyunga & Kim Byung Soo, 2012. "Detection of Differentially Expressed Gene Sets in a Partially Paired Microarray Data Set," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 11(3), pages 1-30, February.
    21. Yu, Chang & Zelterman, Daniel, 2017. "A parametric model to estimate the proportion from true null using a distribution for p-values," Computational Statistics & Data Analysis, Elsevier, vol. 114(C), pages 105-118.
