Printed from https://ideas.repec.org/a/plo/pcbi00/1008770.html

Principles for data analysis workflows

Authors

  • Sara Stoudt
  • Váleri N Vásquez
  • Ciera C Martinez

Abstract

A systematic and reproducible “workflow”—the process that moves a scientific investigation from raw data to coherent research question to insightful contribution—should be a fundamental part of academic data-intensive research practice. In this paper, we elaborate basic principles of a reproducible data analysis workflow by defining 3 phases: the Explore, Refine, and Produce Phases. Each phase is roughly centered around the audience to whom research decisions, methodologies, and results are being immediately communicated. Importantly, each phase can also give rise to a number of research products beyond traditional academic publications. Where relevant, we draw analogies between design principles and established practice in software development. The guidance provided here is not intended to be a strict rulebook; rather, the suggestions for practices and tools to advance reproducible, sound data-intensive analysis may furnish support for both students new to research and current researchers who are new to data-intensive work.

Suggested Citation

  • Sara Stoudt & Váleri N Vásquez & Ciera C Martinez, 2021. "Principles for data analysis workflows," PLOS Computational Biology, Public Library of Science, vol. 17(3), pages 1-26, March.
  • Handle: RePEc:plo:pcbi00:1008770
    DOI: 10.1371/journal.pcbi.1008770

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008770
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1008770&type=printable
    Download Restriction: no

    File URL: https://libkey.io/10.1371/journal.pcbi.1008770?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item

    References listed on IDEAS

    1. Victoria Stodden & Jennifer Seiler & Zhaokun Ma, 2018. "An empirical analysis of journal policy effectiveness for computational reproducibility," Proceedings of the National Academy of Sciences, Proceedings of the National Academy of Sciences, vol. 115(11), pages 2584-2589, March.
    2. Santiago Schnell, 2015. "Ten Simple Rules for a Computational Biologist’s Laboratory Notebook," PLOS Computational Biology, Public Library of Science, vol. 11(9), pages 1-5, September.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Antonio Páez, 2021. "Open spatial sciences: an introduction," Journal of Geographical Systems, Springer, vol. 23(4), pages 467-476, October.
    2. Rat für Sozial- und Wirtschaftsdaten RatSWD (ed.), 2023. "Erhebung und Nutzung unstrukturierter Daten in den Sozial-, Verhaltens- und Wirtschaftswissenschaften," RatSWD Output Series, German Data Forum (RatSWD), volume 7, number 7-2de.
    3. Emily A Lescak & Kate M O’Neill & Giovanna M Collu & Subhamoy Das, 2019. "Ten simple rules for providing a meaningful research experience to high school students," PLOS Computational Biology, Public Library of Science, vol. 15(4), pages 1-7, April.
    4. Nikolas I Krieger & Adam T Perzynski & Jarrod E Dalton, 2019. "Facilitating reproducible project management and manuscript development in team science: The projects R package," PLOS ONE, Public Library of Science, vol. 14(7), pages 1-9, July.
    5. Lars Ole Schwen & Sabrina Rueschenbaum, 2018. "Ten quick tips for getting the most scientific value out of numerical data," PLOS Computational Biology, Public Library of Science, vol. 14(10), pages 1-21, October.
    6. Schweinsberg, Martin & Feldman, Michael & Staub, Nicola & van den Akker, Olmo R. & van Aert, Robbie C.M. & van Assen, Marcel A.L.M. & Liu, Yang & Althoff, Tim & Heer, Jeffrey & Kale, Alex & Mohamed, Z, 2021. "Same data, different conclusions: Radical dispersion in empirical results when independent analysts operationalize and test the same hypothesis," Organizational Behavior and Human Decision Processes, Elsevier, vol. 165(C), pages 228-249.
    7. Felix Holzmeister & Magnus Johannesson & Robert Böhm & Anna Dreber & Jürgen Huber & Michael Kirchler, 2023. "Heterogeneity in effect size estimates: Empirical evidence and practical implications," Working Papers 2023-17, Faculty of Economics and Statistics, Universität Innsbruck.
    8. Thu-Mai Christian & Amanda Gooch & Todd Vision & Elizabeth Hull, 2020. "Journal data policies: Exploring how the understanding of editors and authors corresponds to the policies themselves," PLOS ONE, Public Library of Science, vol. 15(3), pages 1-15, March.
    9. Shawn J Leroux, 2019. "On the prevalence of uninformative parameters in statistical models applying model selection in applied ecology," PLOS ONE, Public Library of Science, vol. 14(2), pages 1-12, February.
    10. Vlaeminck, Sven, 2021. "Dawning of a New Age? Economics Journals’ Data Policies on the Test Bench," EconStor Open Access Articles and Book Chapters, ZBW - Leibniz Information Centre for Economics, vol. 31(1), pages 1-29.
    11. Ruiz-Benito, Paloma & Vacchiano, Giorgio & Lines, Emily R. & Reyer, Christopher P.O. & Ratcliffe, Sophia & Morin, Xavier & Hartig, Florian & Mäkelä, Annikki & Yousefpour, Rasoul & Chaves, Jimena E. & , 2020. "Available and missing data to model impact of climate change on European forests," Ecological Modelling, Elsevier, vol. 416(C).
    12. Anindya Roy Chowdhury & Gouri Gargate, 2024. "Intellectual Property Management in Academic and Research Organizations: The Role of a Laboratory Notebook," Vikalpa: The Journal for Decision Makers, , vol. 49(1), pages 45-66, March.
    13. Christophe Hurlin & Christophe Pérignon, 2020. "Reproducibility Certification in Economics Research," Working Papers hal-02896404, HAL.
    14. Heidi Seibold & Severin Czerny & Siona Decke & Roman Dieterle & Thomas Eder & Steffen Fohr & Nico Hahn & Rabea Hartmann & Christoph Heindl & Philipp Kopper & Dario Lepke & Verena Loidl & Maximilian Ma, 2021. "A computational reproducibility study of PLOS ONE articles featuring longitudinal data analyses," PLOS ONE, Public Library of Science, vol. 16(6), pages 1-15, June.
    15. Jessica L Couture & Rachael E Blake & Gavin McDonald & Colette L Ward, 2018. "A funder-imposed data publication requirement seldom inspired data sharing," PLOS ONE, Public Library of Science, vol. 13(7), pages 1-13, July.
    16. Ana Trisovic & Katherine Mika & Ceilyn Boyd & Sebastian Feger & Mercè Crosas, 2021. "Repository Approaches to Improving the Quality of Shared Data and Code," Data, MDPI, vol. 6(2), pages 1-12, February.
