
The Effects of Data Sampling with Deep Learning and Highly Imbalanced Big Data

Authors

  • Justin M. Johnson (Florida Atlantic University)
  • Taghi M. Khoshgoftaar (Florida Atlantic University)

Abstract

Training predictive models with class-imbalanced data has proven to be a difficult task. This problem is well studied, but the era of big data is producing more extreme levels of imbalance that are increasingly difficult to model. We use three data sets of varying complexity to evaluate data sampling strategies for treating high class imbalance with deep neural networks and big data. Sampling rates are varied to create training distributions with positive class sizes from 0.025%–90%. The area under the receiver operating characteristics curve is used to compare performance, and thresholding is used to maximize class performance. Random over-sampling (ROS) consistently outperforms under-sampling (RUS) and baseline methods. The majority class proves susceptible to misrepresentation when using RUS, and results suggest that each data set is uniquely sensitive to imbalance and sample size. The hybrid ROS-RUS maximizes performance and efficiency, and is our preferred method for treating high imbalance within big data problems.
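
The three sampling strategies named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `pos_rate` and `neg_keep` parameters, and the choice to return shuffled row indices are all assumptions made for illustration. Labels are assumed to be binary, with 1 marking the positive (minority) class.

```python
import numpy as np


def _split(y):
    """Return index arrays for the negative (0) and positive (1) class."""
    y = np.asarray(y)
    return np.flatnonzero(y == 0), np.flatnonzero(y == 1)


def random_under_sample(y, pos_rate, rng):
    """RUS: keep every positive, drop negatives until positives make up pos_rate."""
    neg, pos = _split(y)
    n_neg = int(round(len(pos) * (1.0 - pos_rate) / pos_rate))
    if n_neg > len(neg):
        raise ValueError("pos_rate is below the original rate; RUS cannot reach it")
    keep_neg = rng.choice(neg, size=n_neg, replace=False)
    return rng.permutation(np.concatenate([keep_neg, pos]))


def random_over_sample(y, pos_rate, rng):
    """ROS: keep every sample, duplicate positives until they make up pos_rate."""
    neg, pos = _split(y)
    n_pos = int(round(len(neg) * pos_rate / (1.0 - pos_rate)))
    if n_pos < len(pos):
        raise ValueError("pos_rate is below the original rate; ROS cannot reach it")
    extra = rng.choice(pos, size=n_pos - len(pos), replace=True)
    return rng.permutation(np.concatenate([neg, pos, extra]))


def hybrid_ros_rus(y, pos_rate, neg_keep, rng):
    """ROS-RUS: keep a fraction neg_keep of negatives, then over-sample positives."""
    neg, pos = _split(y)
    n_neg = int(round(len(neg) * neg_keep))
    keep_neg = rng.choice(neg, size=n_neg, replace=False)
    n_pos = int(round(n_neg * pos_rate / (1.0 - pos_rate)))
    # Duplicate positives only if more are needed than already exist.
    extra = rng.choice(pos, size=max(n_pos - len(pos), 0), replace=True)
    return rng.permutation(np.concatenate([keep_neg, pos, extra]))


# Hypothetical toy labels: 100,000 rows with 0.1% positives.
rng = np.random.default_rng(0)
y = np.zeros(100_000, dtype=int)
y[:100] = 1

for name, idx in [
    ("RUS", random_under_sample(y, 0.5, rng)),
    ("ROS", random_over_sample(y, 0.5, rng)),
    ("ROS-RUS", hybrid_ros_rus(y, 0.5, 0.1, rng)),
]:
    print(name, "rows:", len(idx), "positive rate:", y[idx].mean())
```

Each function returns indices into the training set, so a feature matrix `X` (hypothetical here) would be resampled as `X[idx], y[idx]`. The sketch also shows the trade-off the abstract describes: RUS shrinks the training set by discarding most of the majority class, ROS keeps every majority sample at the cost of a much larger training set, and the hybrid lands in between.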

Suggested Citation

  • Justin M. Johnson & Taghi M. Khoshgoftaar, 2020. "The Effects of Data Sampling with Deep Learning and Highly Imbalanced Big Data," Information Systems Frontiers, Springer, vol. 22(5), pages 1113-1131, October.
  • Handle: RePEc:spr:infosf:v::y::i::d:10.1007_s10796-020-10022-7
    DOI: 10.1007/s10796-020-10022-7

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s10796-020-10022-7
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.



    Citations


    Cited by:

    1. Lydia Bouzar-Benlabiod & Stuart H. Rubin, 2020. "Heuristic Acquisition for Data Science," Information Systems Frontiers, Springer, vol. 22(5), pages 1001-1007, October.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Justin M. Johnson & Taghi M. Khoshgoftaar, 2020. "The Effects of Data Sampling with Deep Learning and Highly Imbalanced Big Data," Information Systems Frontiers, Springer, vol. 22(5), pages 1113-1131, October.
    2. Bram Klievink & Bart-Jan Romijn & Scott Cunningham & Hans Bruijn, 2017. "Big data in the public sector: Uncertainties and readiness," Information Systems Frontiers, Springer, vol. 19(2), pages 267-283, April.
    3. Venugopal Gopalakrishna-Remani & Robert Paul Jones & Kerri M. Camp, 2019. "Levels of EMR Adoption in U.S. Hospitals: An Empirical Examination of Absorptive Capacity, Institutional Pressures, Top Management Beliefs, and Participation," Information Systems Frontiers, Springer, vol. 21(6), pages 1325-1344, December.
    4. Saba Bashir & Usman Qamar & Farhan Hassan Khan, 2018. "WebMAC: A web based clinical expert system," Information Systems Frontiers, Springer, vol. 20(5), pages 1135-1151, October.
    5. Ashish Gupta & Amit Deokar & Lakshmi Iyer & Ramesh Sharda & Dave Schrader, 2018. "Big Data & Analytics for Societal Impact: Recent Research and Trends," Information Systems Frontiers, Springer, vol. 20(2), pages 185-194, April.
    6. Yogita Khatri & Sandeep Kumar Singh, 2023. "An effective feature selection based cross-project defect prediction model for software quality improvement," International Journal of System Assurance Engineering and Management, Springer;The Society for Reliability, Engineering Quality and Operations Management (SREQOM),India, and Division of Operation and Maintenance, Lulea University of Technology, Sweden, vol. 14(1), pages 154-172, March.
    7. Qizhi Tao & Yizhe Dong & Ziming Lin, 2017. "Who can get money? Evidence from the Chinese peer-to-peer lending platform," Information Systems Frontiers, Springer, vol. 19(3), pages 425-441, June.
    8. Yiğit Kazançoğlu & Muhittin Sağnak & Çisem Lafcı & Sunil Luthra & Anil Kumar & Caner Taçoğlu, 2021. "Big Data-Enabled Solutions Framework to Overcoming the Barriers to Circular Economy Initiatives in Healthcare Sector," IJERPH, MDPI, vol. 18(14), pages 1-21, July.
    9. Thouraya Bouabana-Tebibel & Stuart H. Rubin & Lydia Bouzar-Benlabiod, 2019. "Guest Editorial: Recent Trends in Reuse and Integration," Information Systems Frontiers, Springer, vol. 21(1), pages 1-3, February.
    10. Saba Bashir & Usman Qamar & Farhan Hassan Khan, 0. "WebMAC: A web based clinical expert system," Information Systems Frontiers, Springer, vol. 0, pages 1-17.
    11. Prabhsimran Singh & Surleen Kaur & Abdullah M. Baabdullah & Yogesh K. Dwivedi & Sandeep Sharma & Ravinder Singh Sawhney & Ronnie Das, 2023. "Is #SDG13 Trending Online? Insights from Climate Change Discussions on Twitter," Information Systems Frontiers, Springer, vol. 25(1), pages 199-219, February.
    12. Qizhi Tao & Yizhe Dong & Ziming Lin, 0. "Who can get money? Evidence from the Chinese peer-to-peer lending platform," Information Systems Frontiers, Springer, vol. 0, pages 1-17.
    13. Yogesh K. Dwivedi & Marijn Janssen & Emma L. Slade & Nripendra P. Rana & Vishanth Weerakkody & Jeremy Millard & Jan Hidders & Dhoya Snijders, 2017. "Driving innovation through big open linked data (BOLD): Exploring antecedents using interpretive structural modelling," Information Systems Frontiers, Springer, vol. 19(2), pages 197-212, April.
    14. Chengcui Zhang & Elisa Bertino & Bhavani Thuraisingham & James Joshi, 2014. "Guest editorial: Information reuse, integration, and reusable systems," Information Systems Frontiers, Springer, vol. 16(5), pages 749-752, November.
    15. Hsu-Hua Ho & Jien-Jou Lin & Jia-Qiao Gong & Tzu-Yi Yu, 2022. "An Empirical Study for Senior Citizens Using a Customized Medical Informatics System for Dementia Diagnosis and Analysis," Sustainability, MDPI, vol. 14(15), pages 1-22, July.
    16. Yogesh K. Dwivedi & Marijn Janssen & Emma L. Slade & Nripendra P. Rana & Vishanth Weerakkody & Jeremy Millard & Jan Hidders & Dhoya Snijders, 0. "Driving innovation through big open linked data (BOLD): Exploring antecedents using interpretive structural modelling," Information Systems Frontiers, Springer, vol. 0, pages 1-16.
    17. Bram Klievink & Bart-Jan Romijn & Scott Cunningham & Hans Bruijn, 0. "Big data in the public sector: Uncertainties and readiness," Information Systems Frontiers, Springer, vol. 0, pages 1-17.
    18. Lei, Yunliang, 2024. "Enhancing environmental management through big data: spatial analysis of urban ecological governance and big data development," LSE Research Online Documents on Economics 122571, London School of Economics and Political Science, LSE Library.
    19. Mamoun T. Mardini & Zbigniew W. Raś, 2022. "Discovering Primary Medical Procedures and their Associations with Other Procedures in HCUP Data," Information Systems Frontiers, Springer, vol. 24(1), pages 133-147, February.
    20. Bag, Surajit & Dhamija, Pavitra & Singh, Rajesh Kumar & Rahman, Muhammad Sabbir & Sreedharan, V. Raja, 2023. "Big data analytics and artificial intelligence technologies based collaborative platform empowering absorptive capacity in health care supply chain: An empirical study," Journal of Business Research, Elsevier, vol. 154(C).
