
Is it time to stop sweeping data cleaning under the carpet? A novel algorithm for outlier management in growth data

Authors

  • Charlotte S C Woolley
  • Ian G Handel
  • B Mark Bronsvoort
  • Jeffrey J Schoenebeck
  • Dylan N Clements

Abstract

All data are prone to error and require cleaning prior to analysis. An important example is longitudinal growth data, for which there are no universally agreed standard methods for identifying and removing implausible values, and many existing methods have limitations that restrict their use across different domains. A decision-making algorithm was designed that modifies or deletes growth measurements based on a combination of pre-defined cut-offs and logic rules. Five data cleaning methods for growth were tested with and without the addition of the algorithm and applied to five different longitudinal growth datasets: four uncleaned canine weight or height datasets and one pre-cleaned human weight dataset with randomly simulated errors. Prior to the addition of the algorithm, data cleaning based on non-linear mixed effects models was the most effective in all datasets and had on average a minimum of 26.00% higher sensitivity and 0.12% higher specificity than other methods. Data cleaning methods using the algorithm had improved data preservation and were capable of correcting simulated errors according to the gold standard, that is, returning a value to its original state prior to error simulation. The algorithm improved the performance of all data cleaning methods and increased the average sensitivity and specificity of the non-linear mixed effects model method by 7.68% and 0.42% respectively. Using non-linear mixed effects models combined with the algorithm to clean data allows individual growth trajectories to vary from the population by using repeated longitudinal measurements, identifies consecutive errors or those within the first data entry, avoids the requirement for a minimum number of data entries, preserves data where possible by correcting errors rather than deleting them, and removes duplications intelligently. This algorithm is broadly applicable to cleaning anthropometric data in different mammalian species and could be adapted for use in a range of other domains.
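
The abstract compares cleaning methods by sensitivity and specificity against simulated errors. As a minimal sketch of how those two figures are conventionally computed (not the authors' code), the Python below assumes each measurement carries two boolean labels: whether it is a simulated error and whether the cleaning method flagged it. The function name sensitivity_specificity and the toy data are hypothetical and used only for illustration.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    is_error: Sequence[bool], is_flagged: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) as percentages.

    is_error   -- True where a measurement is a known (simulated) error
    is_flagged -- True where the cleaning method flagged the measurement
    """
    if len(is_error) != len(is_flagged):
        raise ValueError("inputs must have the same length")

    pairs = list(zip(is_error, is_flagged))
    tp = sum(e and f for e, f in pairs)          # errors that were flagged
    fn = sum(e and not f for e, f in pairs)      # errors that were missed
    tn = sum(not e and not f for e, f in pairs)  # valid values left alone
    fp = sum(not e and f for e, f in pairs)      # valid values wrongly flagged

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one error and one valid measurement")

    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical toy data: 10 measurements, 3 of them simulated errors.
is_error = [False, True, False, False, True, False, False, True, False, False]
is_flagged = [False, True, False, True, True, False, False, False, False, False]

sens, spec = sensitivity_specificity(is_error, is_flagged)
print(f"sensitivity = {sens:.2f}%, specificity = {spec:.2f}%")
```

In this reading, sensitivity is the share of true errors that a method detects and specificity is the share of valid measurements it leaves untouched, so a method that deletes aggressively can score high on the first while losing ground on the second.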

Suggested Citation

  • Charlotte S C Woolley & Ian G Handel & B Mark Bronsvoort & Jeffrey J Schoenebeck & Dylan N Clements, 2020. "Is it time to stop sweeping data cleaning under the carpet? A novel algorithm for outlier management in growth data," PLOS ONE, Public Library of Science, vol. 15(1), pages 1-21, January.
  • Handle: RePEc:plo:pone00:0228154
    DOI: 10.1371/journal.pone.0228154

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0228154
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0228154&type=printable
    Download Restriction: no



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Prabhsimran Singh & Yogesh K. Dwivedi & Karanjeet Singh Kahlon & Ravinder Singh Sawhney & Ali Abdallah Alalwan & Nripendra P. Rana, 0. "Smart Monitoring and Controlling of Government Policies Using Social Media and Cloud Computing," Information Systems Frontiers, Springer, vol. 0, pages 1-23.
    2. Dawid Gondek & Ke Ning & George B Ploubidis & Bilal Nasim & Alissa Goodman, 2018. "The impact of health on economic and social outcomes in the United Kingdom: A scoping literature review," PLOS ONE, Public Library of Science, vol. 13(12), pages 1-21, December.
    3. Rössler, E. & Sokolov, A.P. & Eiermann, P. & Warschewske, U., 1993. "Dynamical phase transition in simple supercooled liquids and polymers - an NMR approach," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 201(1), pages 237-256.
    4. Fischer, E.W., 1993. "Light scattering and dielectric studies on glass forming liquids," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 201(1), pages 183-206.
    5. Ziwen Sun & Ka Yan Lai & Simon Bell & Iain Scott & Xiaomeng Zhang, 2019. "Exploring the Associations of Walking Behavior with Neighborhood Environments by Different Life Stages: A Cross-Sectional Study in a Smaller Chinese City," IJERPH, MDPI, vol. 17(1), pages 1-16, December.
    6. Fiona Kigen & Marike Venter de Villiers, 2024. "Decoding the Symphony of Satisfaction, Commitment and Trust as Predictors of Customer Loyalty in Demarketing Situations," International Review of Management and Marketing, Econjournals, vol. 14(5), pages 235-249, September.
    7. Wahnström, Göran & Lewis, Laurent J., 1993. "Molecular dynamics simulation of a molecular glass at intermediate times," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 201(1), pages 150-156.
    8. Frank C Mng'ong'o & Joseph J Sambali & Eustachkius Sabas & Justine Rubanga & Jaka Magoma & Alex J Ntamatungiro & Elizabeth L Turner & Daniel Nyogea & Jeroen H J Ensink & Sarah J Moore, 2011. "Repellent Plants Provide Affordable Natural Screening to Prevent Mosquito House Entry in Tropical Rural Settings—Results from a Pilot Efficacy Study," PLOS ONE, Public Library of Science, vol. 6(10), pages 1-11, October.
    9. Lara Lusa & Marianne Huebner, 2021. "Organizing and Analyzing Data from the SHARE Study with an Application to Age and Sex Differences in Depressive Symptoms," IJERPH, MDPI, vol. 18(18), pages 1-20, September.
    10. T. J. Cole, 2022. "A celebration of Harvey Goldstein’s lifetime contributions: Harvey Goldstein and his time at the Institute of Child Health," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 185(3), pages 748-752, July.
    11. Pasichnyi, Oleksii & Wallin, Jörgen & Levihn, Fabian & Shahrokni, Hossein & Kordas, Olga, 2019. "Energy performance certificates — New opportunities for data-enabled urban energy policy instruments?," Energy Policy, Elsevier, vol. 127(C), pages 486-499.
    12. Mariano Sana & Alexander A. Weinreb, 2008. "Insiders, Outsiders, and the Editing of Inconsistent Survey Data," Sociological Methods & Research, vol. 36(4), pages 515-541, May.
    13. Margherita E. Ghiselli & Idongesit Nta Wilson & Brian Kaplan & Ndadilnasiya Endie Waziri & Adamu Sule & Halimatu Bolatito Ayanleke & Faruk Namalam & Shehu Ahmad Tambuwal & Nuruddeen Aliyu & Umar Kadi, 2019. "Comparison of Micro-Census Results for Magarya Ward, Wurno Local Government Area of Sokoto State, Nigeria, with Other Sources of Denominator Data," Data, MDPI, vol. 4(1), pages 1-19, January.
    14. Barry Dewitt & Baruch Fischhoff & Alexander L. Davis & Stephen B. Broomell & Mark S. Roberts & Janel Hanmer, 2019. "Exclusion Criteria as Measurements I: Identifying Invalid Responses," Medical Decision Making, vol. 39(6), pages 693-703, August.
    15. Balan Rathakrishnan & Soon Singh Bikar Singh & Azizi Yahaya, 2022. "Perceived Social Support, Coping Strategies and Psychological Distress among University Students during the COVID-19 Pandemic: An Exploration Study for Social Sustainability in Sabah, Malaysia," Sustainability, MDPI, vol. 14(6), pages 1-13, March.
    16. James Steele & Matthew Wade & Robert J. Copeland & Stuart Stokes & Rachel Stokes & Steven Mann, 2021. "The National ReferAll Database: An Open Dataset of Exercise Referral Schemes Across the UK," IJERPH, MDPI, vol. 18(9), pages 1-17, April.
    17. Furqan Alam & Ahmed Almaghthawi & Iyad Katib & Aiiad Albeshri & Rashid Mehmood, 2021. "iResponse: An AI and IoT-Enabled Framework for Autonomous COVID-19 Pandemic Management," Sustainability, MDPI, vol. 13(7), pages 1-52, March.
    18. Prabhsimran Singh & Yogesh K. Dwivedi & Karanjeet Singh Kahlon & Ravinder Singh Sawhney & Ali Abdallah Alalwan & Nripendra P. Rana, 2020. "Smart Monitoring and Controlling of Government Policies Using Social Media and Cloud Computing," Information Systems Frontiers, Springer, vol. 22(2), pages 315-337, April.
    19. Ziwen Sun & Iain Scott & Simon Bell & Xiaomeng Zhang & Lan Wang, 2021. "Time Distances to Residential Food Amenities and Daily Walking Duration: A Cross-Sectional Study in Two Low Tier Chinese Cities," IJERPH, MDPI, vol. 18(2), pages 1-15, January.
