
ChIPulate: A comprehensive ChIP-seq simulation pipeline

Authors

  • Vishaka Datta
  • Sridhar Hannenhalli
  • Rahul Siddharthan

Abstract

ChIP-seq (Chromatin Immunoprecipitation followed by sequencing) is a high-throughput technique to identify genomic regions that are bound in vivo by a particular protein, e.g., a transcription factor (TF). Biological factors, such as chromatin state, indirect and cooperative binding, as well as experimental factors, such as antibody quality, cross-linking, and PCR biases, are known to affect the outcome of ChIP-seq experiments. However, the relative impact of these factors on inferences made from ChIP-seq data is not entirely clear. Here, via a detailed ChIP-seq simulation pipeline, ChIPulate, we assess the impact of various biological and experimental sources of variation on several outcomes of a ChIP-seq experiment, viz., the recoverability of the TF binding motif, accuracy of TF-DNA binding detection, the sensitivity of inferred TF-DNA binding strength, and number of replicates needed to confidently infer binding strength. We find that the TF motif can be recovered despite poor and non-uniform extraction and PCR amplification efficiencies. The recovery of the motif is, however, affected to a larger extent by the fraction of sites that are either cooperatively or indirectly bound. Importantly, our simulations reveal that the number of ChIP-seq replicates needed to accurately measure in vivo occupancy at high-affinity sites is larger than the recommended community standards. Our results establish statistical limits on the accuracy of inferences of protein-DNA binding from ChIP-seq and suggest that increasing the mean extraction efficiency, rather than amplification efficiency, would better improve sensitivity. The source code and instructions for running ChIPulate can be found at https://github.com/vishakad/chipulate.

Author summary

DNA-binding proteins perform many key roles in biology, such as transcriptional regulation of gene expression and chromatin modification. ChIP-seq (Chromatin immunoprecipitation followed by high-throughput sequencing) is a widely used experimental technique to identify DNA-binding sites of specific proteins of interest, within cells, genome-wide. DNA fragments from genomic regions that are bound by a protein of interest, often a transcription factor (TF), are selectively extracted using specific antibodies, amplified using PCR, and sequenced. The sequences are mapped to the reference genome. Regions where many sequences map, called “peaks”, are used to infer the location of TF-bound loci, in vivo occupancy at those loci, and the sequence pattern (motif) to which the TF shows a binding affinity. But measurements of TF occupancy and motif inference are vulnerable to several biological and experimental sources of variation that are poorly understood and difficult to assess directly. Here, we simulate key steps of the ChIP-seq protocol with the aim of estimating the relative effects of various sources of variation on motif inference and binding affinity estimations. Besides providing specific insights and recommendations, we provide a general framework to simulate sequence reads in a ChIP-seq experiment, which should considerably aid in the development of software aimed at analyzing ChIP-seq data.
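
The protocol steps named in the author summary (binding, antibody extraction, PCR amplification, sequencing) can be illustrated with a toy simulation. The following is a minimal sketch under stated assumptions and is not the ChIPulate implementation: the function name `simulate_chip_reads`, all of its parameters, the two-state occupancy model, and the binomial models for extraction and PCR are hypothetical choices made here for illustration only.

```python
import numpy as np

def simulate_chip_reads(energies, mu, extraction_eff, pcr_eff,
                        n_cycles, n_cells, depth, rng):
    """Toy ChIP-seq read counts per site (illustrative assumptions only)."""
    energies = np.asarray(energies, dtype=float)

    # Assumption: two-state occupancy model; energies and mu are in units
    # of k_B*T, and lower energy means stronger binding.
    occupancy = 1.0 / (1.0 + np.exp(energies - mu))

    # Binding: number of cells (out of n_cells) in which each site is bound.
    bound = rng.binomial(n_cells, occupancy)

    # Extraction: each bound fragment is pulled down with probability
    # extraction_eff (a scalar, or one value per site).
    extracted = rng.binomial(bound, extraction_eff)

    # PCR: in every cycle, each fragment is duplicated with probability
    # pcr_eff (a scalar, or one value per site).
    amplified = extracted.copy()
    for _ in range(n_cycles):
        amplified = amplified + rng.binomial(amplified, pcr_eff)

    # Sequencing: draw `depth` reads in proportion to amplified fragments.
    total = amplified.sum()
    if total == 0:
        return np.zeros_like(amplified)
    return rng.multinomial(depth, amplified / total)

# Example: three hypothetical sites, from strong (0.0) to weak (10.0).
rng = np.random.default_rng(0)
reads = simulate_chip_reads([0.0, 2.0, 10.0], mu=3.0, extraction_eff=0.5,
                            pcr_eff=0.8, n_cycles=5, n_cells=10000,
                            depth=5000, rng=rng)
print(reads)
```

Passing per-site arrays for `extraction_eff` or `pcr_eff` gives a rough way to explore non-uniform efficiencies of the kind discussed in the abstract, although a toy model like this omits chromatin state, cooperative and indirect binding, and control samples.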

Suggested Citation

  • Vishaka Datta & Sridhar Hannenhalli & Rahul Siddharthan, 2019. "ChIPulate: A comprehensive ChIP-seq simulation pipeline," PLOS Computational Biology, Public Library of Science, vol. 15(3), pages 1-32, March.
  • Handle: RePEc:plo:pcbi00:1006921
    DOI: 10.1371/journal.pcbi.1006921

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006921
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1006921&type=printable
    Download Restriction: no


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Dai, Hongsheng & Bao, Yanchun & Bao, Mingtang, 2013. "Maximum likelihood estimate for the dispersion parameter of the negative binomial distribution," Statistics & Probability Letters, Elsevier, vol. 83(1), pages 21-27.
    2. Dongjun Chung & Dan Park & Kevin Myers & Jeffrey Grass & Patricia Kiley & Robert Landick & Sündüz Keleş, 2013. "dPeak: High Resolution Identification of Transcription Factor Binding Sites from PET and SET ChIP-Seq Data," PLOS Computational Biology, Public Library of Science, vol. 9(10), pages 1-13, October.
    3. Guannan Sun & Rajini Srinivasan & Camila Lopez-Anido & Holly A Hung & John Svaren & Sündüz Keleş, 2014. "In Silico Pooling of ChIP-seq Control Experiments," PLOS ONE, Public Library of Science, vol. 9(11), pages 1-9, November.
    4. Claudia Coronnello & Ryan Hartmaier & Arshi Arora & Luai Huleihel & Kusum V Pandit & Abha S Bais & Michael Butterworth & Naftali Kaminski & Gary D Stormo & Steffi Oesterreich & Panayiotis V Benos, 2012. "Novel Modeling of Combinatorial miRNA Targeting Identifies SNP with Potential Role in Bone Density," PLOS Computational Biology, Public Library of Science, vol. 8(12), pages 1-13, December.
    5. Shuxiang Ruan & Gary D Stormo, 2017. "Inherent limitations of probabilistic models for protein-DNA binding specificity," PLOS Computational Biology, Public Library of Science, vol. 13(7), pages 1-15, July.
    6. Vivoda, Vlado, 2012. "Japan’s energy security predicament post-Fukushima," Energy Policy, Elsevier, vol. 46(C), pages 135-143.
