IDEAS (RePEc) record: https://ideas.repec.org/a/plo/pcbi00/1006337.html

Crowdsourcing image analysis for plant phenomics to generate ground truth data for machine learning

Authors
  • Naihui Zhou
  • Zachary D Siegel
  • Scott Zarecor
  • Nigel Lee
  • Darwin A Campbell
  • Carson M Andorf
  • Dan Nettleton
  • Carolyn J Lawrence-Dill
  • Baskar Ganapathysubramanian
  • Jonathan W Kelly
  • Iddo Friedberg

Abstract

The accuracy of machine learning tasks critically depends on high-quality ground truth data. Producing good ground truth data typically involves trained professionals, which can be costly in time, effort, and money. Here we explore the use of crowdsourcing to generate a large volume of good-quality training data. We study an image analysis task involving the segmentation of corn tassels from images taken in a field setting, and investigate accuracy, speed, and other quality metrics when the task is performed by students for academic credit, Amazon MTurk workers, and Master Amazon MTurk workers. We conclude that the Amazon MTurk and Master MTurk workers perform significantly better than the for-credit students, with no significant difference between the two MTurk worker types. Furthermore, the quality of the segmentation produced by Amazon MTurk workers rivals that of an expert. We provide best practices for assessing the quality of ground truth data and for comparing data quality produced by different sources. We conclude that properly managed crowdsourcing can be used to establish large volumes of viable ground truth data at low cost and high quality, especially in the context of high-throughput plant phenotyping. We also provide several metrics for assessing the quality of the generated datasets.

Author summary: Food security is a growing global concern. Farmers, plant breeders, and geneticists are hastening to address the challenges presented to agriculture by climate change, dwindling arable land, and population growth. Scientists in the field of plant phenomics are using satellite and drone images to understand how crops respond to a changing environment and to combine genetics and environmental measures to maximize crop growth efficiency. However, the terabytes of image data require new computational methods to extract useful information. Machine learning algorithms are effective in recognizing select parts of images, but they require high-quality data curated by people to train them, a process that can be laborious and costly. We examined how well crowdsourcing works in providing training data for plant phenomics, specifically, segmenting a corn tassel (the male flower of the corn plant) from the often-cluttered images of a cornfield. We provided images to students and to Amazon MTurkers, the latter being an on-demand workforce brokered by Amazon.com and paid on a task-by-task basis. We report on best practices in crowdsourcing image labeling for phenomics, and compare the different groups on measures such as fatigue and accuracy over time. We find that crowdsourcing is a good way of generating quality labeled data, rivaling that of experts.
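
A note on the quality metrics mentioned above: a common, simple way to compare a crowdsourced segmentation against an expert reference is the intersection-over-union (IoU) of the two annotated regions. The sketch below is a minimal Python illustration of that idea, assuming rectangular tassel annotations given as (x_min, y_min, x_max, y_max) pixel boxes; the function and the example boxes are hypothetical and are not the specific metric or code used in the paper.

    # Minimal sketch (illustrative, not from the paper): score a crowdsourced
    # rectangular tassel annotation against an expert reference box using
    # intersection-over-union (IoU). Boxes are (x_min, y_min, x_max, y_max).

    def iou(box_a, box_b):
        """Intersection-over-union of two axis-aligned boxes."""
        ax1, ay1, ax2, ay2 = box_a
        bx1, by1, bx2, by2 = box_b

        # Overlap rectangle; width and height clamp to zero when the boxes do not intersect.
        inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
        inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
        inter = inter_w * inter_h

        area_a = (ax2 - ax1) * (ay2 - ay1)
        area_b = (bx2 - bx1) * (by2 - by1)
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    # Hypothetical usage: mean agreement of one worker group with the expert box.
    expert_box = (120, 80, 210, 190)
    worker_boxes = [(118, 78, 205, 188), (130, 90, 220, 200)]
    scores = [iou(expert_box, b) for b in worker_boxes]
    print("mean IoU vs. expert: %.3f" % (sum(scores) / len(scores)))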

Suggested Citation

  • Naihui Zhou & Zachary D Siegel & Scott Zarecor & Nigel Lee & Darwin A Campbell & Carson M Andorf & Dan Nettleton & Carolyn J Lawrence-Dill & Baskar Ganapathysubramanian & Jonathan W Kelly & Iddo Friedberg, 2018. "Crowdsourcing image analysis for plant phenomics to generate ground truth data for machine learning," PLOS Computational Biology, Public Library of Science, vol. 14(7), pages 1-16, July.
  • Handle: RePEc:plo:pcbi00:1006337
    DOI: 10.1371/journal.pcbi.1006337

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006337
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1006337&type=printable
    Download Restriction: no

    File URL: https://libkey.io/10.1371/journal.pcbi.1006337?utm_source=ideas
    LibKey link: if access is restricted and your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item

    References listed on IDEAS

    1. Seth Cooper & Firas Khatib & Adrien Treuille & Janos Barbero & Jeehyung Lee & Michael Beenen & Andrew Leaver-Fay & David Baker & Zoran Popović & Foldit players, 2010. "Predicting protein structures with a multiplayer online game," Nature, Nature, vol. 466(7307), pages 756-760, August.
    2. Alexander Kawrykow & Gary Roumanis & Alfred Kam & Daniel Kwak & Clarence Leung & Chu Wu & Eleyine Zarour & Phylo players & Luis Sarmenta & Mathieu Blanchette & Jérôme Waldispühl, 2012. "Phylo: A Citizen Science Approach for Improving Multiple Sequence Alignment," PLOS ONE, Public Library of Science, vol. 7(3), pages 1-9, March.

    Citations

    Citations are extracted by the CitEc Project; subscribe to its RSS feed for this item.


    Cited by:

    1. Mansoureh Maadi & Hadi Akbarzadeh Khorshidi & Uwe Aickelin, 2021. "A Review on Human–AI Interaction in Machine Learning and Insights for Medical Applications," IJERPH, MDPI, vol. 18(4), pages 1-27, February.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Christoph Safferling & Aaron Lowen, 2011. "Economics in the Kingdom of Loathing: Analysis of Virtual Market Data," Working Paper Series of the Department of Economics, University of Konstanz 2011-30, Department of Economics, University of Konstanz.
    2. Prpić, John & Shukla, Prashant P. & Kietzmann, Jan H. & McCarthy, Ian P., 2015. "How to work a crowd: Developing crowd capital through crowdsourcing," Business Horizons, Elsevier, vol. 58(1), pages 77-85.
    3. Kovacs, Attila, 2018. "Gender Differences in Equity Crowdfunding," OSF Preprints 5pcmb, Center for Open Science.
    4. Matthew Staffelbach & Peter Sempolinski & Tracy Kijewski-Correa & Douglas Thain & Daniel Wei & Ahsan Kareem & Gregory Madey, 2015. "Lessons Learned from Crowdsourcing Complex Engineering Tasks," PLOS ONE, Public Library of Science, vol. 10(9), pages 1-19, September.
    5. Spartaco Albertarelli & Piero Fraternali & Sergio Herrera & Mark Melenhorst & Jasminko Novak & Chiara Pasini & Andrea-Emilio Rizzoli & Cristina Rottondi, 2018. "A Survey on the Design of Gamified Systems for Energy and Water Sustainability," Games, MDPI, vol. 9(3), pages 1-34, June.
    6. Robert Swain & Alex Berger & Josh Bongard & Paul Hines, 2015. "Participation and Contribution in Crowdsourced Surveys," PLOS ONE, Public Library of Science, vol. 10(4), pages 1-21, April.
    7. Franzoni, Chiara & Sauermann, Henry, 2014. "Crowd science: The organization of scientific research in open collaborative projects," Research Policy, Elsevier, vol. 43(1), pages 1-20.
    8. Sam Mavandadi & Stoyan Dimitrov & Steve Feng & Frank Yu & Uzair Sikora & Oguzhan Yaglidere & Swati Padmanabhan & Karin Nielsen & Aydogan Ozcan, 2012. "Distributed Medical Image Analysis and Diagnosis through Crowd-Sourced Games: A Malaria Case Study," PLOS ONE, Public Library of Science, vol. 7(5), pages 1-8, May.
    9. Sherwani, Y & Ahmed, M & Muntasir, M & El-Hilly, A & Iqbal, S & Siddiqui, S & Al-Fagih, Z & Usmani, O & Eisingerich, AB, 2015. "Examining the role of gamification and use of mHealth apps in the context of smoking cessation: A review of extant knowledge and outlook," Working Papers 25458, Imperial College, London, Imperial College Business School.
    10. Joanna Chataway & Sarah Parks & Elta Smith, 2017. "How Will Open Science Impact on University-Industry Collaboration?," Foresight and STI Governance (Foresight-Russia till No. 3/2015), National Research University Higher School of Economics, vol. 11(2), pages 44-53.
    11. Ayat Abourashed & Laura Doornekamp & Santi Escartin & Constantianus J. M. Koenraadt & Maarten Schrama & Marlies Wagener & Frederic Bartumeus & Eric C. M. van Gorp, 2021. "The Potential Role of School Citizen Science Programs in Infectious Disease Surveillance: A Critical Review," IJERPH, MDPI, vol. 18(13), pages 1-18, June.
    12. Jennifer Lewis Priestley & Robert J. McGrath, 2019. "The Evolution of Data Science: A New Mode of Knowledge Production," International Journal of Knowledge Management (IJKM), IGI Global, vol. 15(2), pages 97-109, April.
    13. Vito D’Orazio & Michael Kenwick & Matthew Lane & Glenn Palmer & David Reitter, 2016. "Crowdsourcing the Measurement of Interstate Conflict," PLOS ONE, Public Library of Science, vol. 11(6), pages 1-21, June.
    14. Yury Kryvasheyeu & Haohui Chen & Esteban Moro & Pascal Van Hentenryck & Manuel Cebrian, 2015. "Performance of Social Network Sensors during Hurricane Sandy," PLOS ONE, Public Library of Science, vol. 10(2), pages 1-19, February.
    15. Barbara Strobl & Simon Etter & Ilja van Meerveld & Jan Seibert, 2019. "The CrowdWater game: A playful way to improve the accuracy of crowdsourced water level class data," PLOS ONE, Public Library of Science, vol. 14(9), pages 1-23, September.
    16. Prpić, John, 2017. "How To Work A Crowd: Developing Crowd Capital Through Crowdsourcing," SocArXiv jer9k, Center for Open Science.
    17. Andrei P. Kirilenko & Travis Desell & Hany Kim & Svetlana Stepchenkova, 2017. "Crowdsourcing Analysis of Twitter Data on Climate Change: Paid Workers vs. Volunteers," Sustainability, MDPI, vol. 9(11), pages 1-15, November.
    18. Siluo Yang & Dietmar Wolfram & Feifei Wang, 2017. "The relationship between the author byline and contribution lists: a comparison of three general medical journals," Scientometrics, Springer;Akadémiai Kiadó, vol. 110(3), pages 1273-1296, March.
    19. Maryam Lotfian & Jens Ingensand & Maria Antonia Brovelli, 2021. "The Partnership of Citizen Science and Machine Learning: Benefits, Risks, and Future Challenges for Engagement, Data Collection, and Data Quality," Sustainability, MDPI, vol. 13(14), pages 1-19, July.
    20. Jonathan R Karr & Alex H Williams & Jeremy D Zucker & Andreas Raue & Bernhard Steiert & Jens Timmer & Clemens Kreutz & DREAM8 Parameter Estimation Challenge Consortium & Simon Wilkinson & Brandon A Al, 2015. "Summary of the DREAM8 Parameter Estimation Challenge: Toward Parameter Identification for Whole-Cell Models," PLOS Computational Biology, Public Library of Science, vol. 11(5), pages 1-21, May.


    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:plo:pcbi00:1006337. See general information about how to correct material in RePEc.

    If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows you to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.

    If CitEc recognized a bibliographic reference but did not link an item in RePEc to it, you can help with this form.

    If you know of missing items citing this one, you can help us create those links by adding the relevant references in the same way as above, for each referring item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: ploscompbiol (email available below). General contact details of provider: https://journals.plos.org/ploscompbiol/.

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.