IDEAS home Printed from https://ideas.repec.org/a/plo/pone00/0231446.html
   My bibliography  Save this article

Blind estimation and correction of microarray batch effect

Author

Listed:
  • Sudhir Varma

Abstract

Microarray batch effect (BE) has been the primary bottleneck for large-scale integration of data from multiple experiments. Current BE correction methods either need known batch identities (ComBat) or have the potential to overcorrect, by removing true but unknown biological differences (Surrogate Variable Analysis SVA). It is well known that experimental conditions such as array or reagent batches, PCR amplification or ozone levels can affect the measured expression levels; often the direction of perturbation of the measured expression is the same in different datasets. However, there are no BE correction algorithms that attempt to estimate the individual effects of technical differences and use them to correct expression data. In this manuscript, we show that a set of signatures, each of which is a vector the length of the number of probes, calculated on a reference set of microarray samples can predict much of the batch effect in other validation sets. We present a rationale of selecting a reference set of samples designed to estimate technical differences without removing biological differences. Putting both together, we introduce the Batch Effect Signature Correction (BESC) algorithm that uses the BES calculated on the reference set to efficiently predict and remove BE. Using two independent validation sets, we show that BESC is capable of removing batch effect without removing unknown but true biological differences. Much of the variations due to batch effect is shared between different microarray datasets. That shared information can be used to predict signatures (i.e. directions of perturbation) due to batch effect in new datasets. The correction can be precomputed without using the samples to be corrected (blind), done on each sample individually (single sample) and corrects only known technical effects without removing known or unknown biological differences (conservative). Those three characteristics make it ideal for high-throughput correction of samples for a microarray data repository. We also compare the performance of BESC to three other batch correction methods: SVA, Removing Unwanted Variation (RUV) and Hidden Covariates with Prior (HCP). An R Package besc implementing the algorithm is available from http://explainbio.com.

Suggested Citation

  • Sudhir Varma, 2020. "Blind estimation and correction of microarray batch effect," PLOS ONE, Public Library of Science, vol. 15(4), pages 1-15, April.
  • Handle: RePEc:plo:pone00:0231446
    DOI: 10.1371/journal.pone.0231446
    as

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0231446
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0231446&type=printable
    Download Restriction: no

    File URL: https://libkey.io/10.1371/journal.pone.0231446?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item
    ---><---

    References listed on IDEAS

    as
    1. Marron, J.S. & Todd, Michael J. & Ahn, Jeongyoun, 2007. "Distance-Weighted Discrimination," Journal of the American Statistical Association, American Statistical Association, vol. 102, pages 1267-1271, December.
    2. Jeffrey T Leek & John D Storey, 2007. "Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis," PLOS Genetics, Public Library of Science, vol. 3(9), pages 1-12, September.
    Full references (including those not matched with items on IDEAS)

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Marron, J.S., 2017. "Big Data in context and robustness against heterogeneity," Econometrics and Statistics, Elsevier, vol. 2(C), pages 73-80.
    2. Yugo Nakayama & Kazuyoshi Yata & Makoto Aoshima, 2020. "Bias-corrected support vector machine with Gaussian kernel in high-dimension, low-sample-size settings," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 72(5), pages 1257-1286, October.
    3. repec:jss:jstsof:40:i14 is not listed on IDEAS
    4. Hayley Randall & Andreas Artemiou & Xingye Qiao, 2021. "Sufficient dimension reduction based on distance‐weighted discrimination," Scandinavian Journal of Statistics, Danish Society for Theoretical Statistics;Finnish Statistical Society;Norwegian Statistical Association;Swedish Statistical Association, vol. 48(4), pages 1186-1211, December.
    5. Emanuele Aliverti & Kristian Lum & James E. Johndrow & David B. Dunson, 2021. "Removing the influence of group variables in high‐dimensional predictive modelling," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 184(3), pages 791-811, July.
    6. Seungchul Baek & Yen‐Yi Ho & Yanyuan Ma, 2020. "Using sufficient direction factor model to analyze latent activities associated with breast cancer survival," Biometrics, The International Biometric Society, vol. 76(4), pages 1340-1350, December.
    7. Griffin, Maryclare & Hoff, Peter D., 2019. "Lasso ANOVA decompositions for matrix and tensor data," Computational Statistics & Data Analysis, Elsevier, vol. 137(C), pages 181-194.
    8. Chee Ho H’ng & Shanika L. Amarasinghe & Boya Zhang & Hojin Chang & Xinli Qu & David R. Powell & Alberto Rosello-Diez, 2024. "Compensatory growth and recovery of cartilage cytoarchitecture after transient cell death in fetal mouse limbs," Nature Communications, Nature, vol. 15(1), pages 1-15, December.
    9. Mark Reimers, 2010. "Making Informed Choices about Microarray Data Analysis," PLOS Computational Biology, Public Library of Science, vol. 6(5), pages 1-7, May.
    10. Leek Jeffrey T & Storey John D., 2011. "The Joint Null Criterion for Multiple Hypothesis Tests," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 10(1), pages 1-22, June.
    11. Nicoló Fusi & Oliver Stegle & Neil D Lawrence, 2012. "Joint Modelling of Confounding Factors and Prominent Genetic Regulators Provides Increased Accuracy in Genetical Genomics Studies," PLOS Computational Biology, Public Library of Science, vol. 8(1), pages 1-9, January.
    12. Yuto Hasegawa & Juhyun Kim & Gianluca Ursini & Yan Jouroukhin & Xiaolei Zhu & Yu Miyahara & Feiyi Xiong & Samskruthi Madireddy & Mizuho Obayashi & Beat Lutz & Akira Sawa & Solange P. Brown & Mikhail V, 2023. "Microglial cannabinoid receptor type 1 mediates social memory deficits in mice produced by adolescent THC exposure and 16p11.2 duplication," Nature Communications, Nature, vol. 14(1), pages 1-19, December.
    13. Anil K. Ghosh & Munmun Biswas, 2016. "Distribution-free high-dimensional two-sample tests based on discriminating hyperplanes," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 25(3), pages 525-547, September.
    14. Friguet, Chloé & Causeur, David, 2011. "Estimation of the proportion of true null hypotheses in high-dimensional data under dependence," Computational Statistics & Data Analysis, Elsevier, vol. 55(9), pages 2665-2676, September.
    15. Xuemeng Zhou & Tsz Wing Sam & Ah Young Lee & Danny Leung, 2021. "Mouse strain-specific polymorphic provirus functions as cis-regulatory element leading to epigenomic and transcriptomic variations," Nature Communications, Nature, vol. 12(1), pages 1-18, December.
    16. Michael W Nagle & Jeanne C Latourelle & Adam Labadorf & Alexandra Dumitriu & Tiffany C Hadzi & Thomas G Beach & Richard H Myers, 2016. "The 4p16.3 Parkinson Disease Risk Locus Is Associated with GAK Expression and Genes Involved with the Synaptic Vesicle Membrane," PLOS ONE, Public Library of Science, vol. 11(8), pages 1-14, August.
    17. Oliver Stegle & Leopold Parts & Richard Durbin & John Winn, 2010. "A Bayesian Framework to Account for Complex Non-Genetic Factors in Gene Expression Levels Greatly Increases Power in eQTL Studies," PLOS Computational Biology, Public Library of Science, vol. 6(5), pages 1-11, May.
    18. Jung, Sungkyu, 2018. "Continuum directions for supervised dimension reduction," Computational Statistics & Data Analysis, Elsevier, vol. 125(C), pages 27-43.
    19. Bolivar-Cime, A. & Marron, J.S., 2013. "Comparison of binary discrimination methods for high dimension low sample size data," Journal of Multivariate Analysis, Elsevier, vol. 115(C), pages 108-121.
    20. Gao Wang & Abhishek Sarkar & Peter Carbonetto & Matthew Stephens, 2020. "A simple new approach to variable selection in regression, with application to genetic fine mapping," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 82(5), pages 1273-1300, December.
    21. Kaido Lepik & Tarmo Annilo & Viktorija Kukuškina & eQTLGen Consortium & Kai Kisand & Zoltán Kutalik & Pärt Peterson & Hedi Peterson, 2017. "C-reactive protein upregulates the whole blood expression of CD59 - an integrative analysis," PLOS Computational Biology, Public Library of Science, vol. 13(9), pages 1-20, September.

    More about this item

    Statistics

    Access and download statistics

    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:plo:pone00:0231446. See general information about how to correct material in RePEc.

    If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.

    If CitEc recognized a bibliographic reference but did not link an item in RePEc to it, you can help with this form .

    If you know of missing items citing this one, you can help us creating those links by adding the relevant references in the same way as above, for each refering item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: plosone (email available below). General contact details of provider: https://journals.plos.org/plosone/ .

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.