Printed from https://ideas.repec.org/a/inm/orisre/v32y2021i2p462-480.html

Correcting Misclassification Bias in Regression Models with Variables Generated via Data Mining

Author

Listed:
  • Mengke Qiao

    (International Institute of Finance, School of Management, University of Science and Technology of China, Hefei 230026, China)

  • Ke-Wei Huang

    (Department of Information Systems and Analytics, National University of Singapore, Singapore 117417)

Abstract

As a result of advances in data mining, more and more empirical studies in the social sciences apply classification algorithms to construct independent or dependent variables for further analysis via standard regression methods. In the classification phase of these studies, researchers need to subjectively choose a classification performance metric for optimization in the standard procedure. No matter which performance metric is chosen, the constructed variable still includes classification error because those variables cannot be classified perfectly. The misclassification of constructed variables will lead to inconsistent regression coefficient estimates in the following phase, which has been documented as a problem of measurement error in the econometrics literature. Pioneering discussions of the estimation inconsistency caused by misclassification in these studies already exist. Our study attempts to investigate systematically the theoretical foundation of this problem when a newly constructed variable is used as the independent or dependent variable in linear and nonlinear regressions. Our theoretical analysis shows that consistent regression estimators can be recovered in all models studied in this paper. The main implication of our theoretical result is that researchers do not need to tune the classification algorithm to minimize the inconsistency of estimated regression coefficients because the inconsistency can be corrected by theoretical formulas, even when the classification accuracy is poor. Instead, we propose that a classification algorithm should be tuned to minimize the standard error of the focal regression coefficient derived based on the corrected formula. As a result, researchers can derive a consistent and most precise estimator in all models studied in this paper.
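
The inconsistency described in the abstract can be illustrated with a small simulation. The sketch below is not the estimator derived in the paper. It is a textbook-style moment correction for one special case: a linear regression on a single binary regressor whose misclassification is unrelated to the regression error. In that case the naive OLS slope converges to the true slope multiplied by (PPV + NPV - 1), where PPV and NPV are the positive and negative predictive values of the classifier. All function names, parameter values, and the labeled validation subsample are assumptions made for illustration.

```python
import numpy as np


def simulate(n, beta0, beta1, prevalence, fpr, fnr, rng):
    """Draw a true binary regressor x, a misclassified copy w, and outcome y."""
    x = rng.random(n) < prevalence
    # Flip x with probability fnr when x is 1 and with probability fpr when x is 0.
    flip = np.where(x, rng.random(n) < fnr, rng.random(n) < fpr)
    w = np.where(flip, ~x, x)
    y = beta0 + beta1 * x + rng.normal(0.0, 1.0, n)
    return x.astype(float), w.astype(float), y


def ols_slope(w, y):
    """Slope of a simple regression of y on w: cov(w, y) / var(w)."""
    wc = w - w.mean()
    return float(wc @ (y - y.mean()) / (wc @ wc))


def corrected_slope(w, y, x_val, w_val):
    """Divide the naive slope by (PPV + NPV - 1), estimated on labeled data."""
    ppv = x_val[w_val == 1].mean()        # P(x = 1 | w = 1)
    npv = 1.0 - x_val[w_val == 0].mean()  # P(x = 0 | w = 0)
    return ols_slope(w, y) / (ppv + npv - 1.0)


rng = np.random.default_rng(0)
true_beta1 = 2.0
x, w, y = simulate(200_000, 1.0, true_beta1, 0.3, 0.10, 0.20, rng)

# Assumption: true labels are known only for a random validation subsample.
n_val = 5_000
naive = ols_slope(w, y)
corrected = corrected_slope(w, y, x[:n_val], w[:n_val])

print(f"true slope      : {true_beta1:.3f}")
print(f"naive slope     : {naive:.3f}")
print(f"corrected slope : {corrected:.3f}")
```

With these hypothetical error rates the naive slope is attenuated toward zero, while the corrected slope should land near the true value even though the classifier is far from perfect. The correction relies on the validation subsample being a random draw from the same population as the regression sample.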

Suggested Citation

  • Mengke Qiao & Ke-Wei Huang, 2021. "Correcting Misclassification Bias in Regression Models with Variables Generated via Data Mining," Information Systems Research, INFORMS, vol. 32(2), pages 462-480, June.
  • Handle: RePEc:inm:orisre:v:32:y:2021:i:2:p:462-480
    DOI: 10.1287/isre.2020.0977

    Download full text from publisher

    File URL: http://dx.doi.org/10.1287/isre.2020.0977
    Download Restriction: no


    References listed on IDEAS

    1. Aigner, Dennis J., 1973. "Regression with a binary independent variable subject to errors of observation," Journal of Econometrics, Elsevier, vol. 1(1), pages 49-59, March.
    2. Rohit Aggarwal & Ram Gopal & Alok Gupta & Harpreet Singh, 2012. "Putting Money Where the Mouths Are: The Relation Between Venture Financing and Electronic Word-of-Mouth," Information Systems Research, INFORMS, vol. 23(3-part-2), pages 976-992, September.
    3. Hausman, J. A. & Abrevaya, Jason & Scott-Morton, F. M., 1998. "Misclassification of the dependent variable in a discrete-response setting," Journal of Econometrics, Elsevier, vol. 87(2), pages 239-269, September.
    4. Param Vir Singh & Nachiketa Sahoo & Tridas Mukhopadhyay, 2014. "How to Attract and Retain Readers in Enterprise Blogging?," Information Systems Research, INFORMS, vol. 25(1), pages 35-52, March.
    5. Helmut Küchenhoff & Samuel M. Mwalili & Emmanuel Lesaffre, 2006. "A General Method for Dealing with Misclassification in Regression: The Misclassification SIMEX," Biometrics, The International Biometric Society, vol. 62(1), pages 85-96, March.
    6. Jerry Hausman, 2001. "Mismeasured Variables in Econometric Analysis: Problems from the Right and Problems from the Left," Journal of Economic Perspectives, American Economic Association, vol. 15(4), pages 57-67, Fall.
    7. Bin Gu & Prabhudev Konana & Rajagopal Raghunathan & Hsuanwei Michelle Chen, 2014. "Research Note —The Allure of Homophily in Social Media: Evidence from Investor Responses on Virtual Communities," Information Systems Research, INFORMS, vol. 25(3), pages 604-617, September.
    8. Tawei Wang & Karthik N. Kannan & Jackie Rees Ulmer, 2013. "The Association Between the Disclosure and the Realization of Information Security Risk Factors," Information Systems Research, INFORMS, vol. 24(2), pages 201-218, June.
    9. Anindya Ghose & Panagiotis G. Ipeirotis & Beibei Li, 2012. "Designing Ranking Systems for Hotels on Travel Search Engines by Mining User-Generated and Crowdsourced Content," Marketing Science, INFORMS, vol. 31(3), pages 493-520, May.
    10. Antonio Moreno & Christian Terwiesch, 2014. "Doing Business with Strangers: Reputation in Online Service Marketplaces," Information Systems Research, INFORMS, vol. 25(4), pages 865-886, December.
    11. AIGNER, Dennis J., 1973. "Regression with a binary independent variable subject to errors of observation," LIDAM Reprints CORE 130, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE).
    12. Mochen Yang & Gediminas Adomavicius & Gordon Burtch & Yuqing Ren, 2018. "Mind the Gap: Accounting for Measurement Error and Misclassification in Variables Generated via Data Mining," Information Systems Research, INFORMS, vol. 29(1), pages 4-24, March.
    13. Bound, John & Brown, Charles & Duncan, Greg J & Rodgers, Willard L, 1994. "Evidence on the Validity of Cross-Sectional and Longitudinal Labor Market Data," Journal of Labor Economics, University of Chicago Press, vol. 12(3), pages 345-368, July.
    14. Balakrishnan, Ramji & Qiu, Xin Ying & Srinivasan, Padmini, 2010. "On the predictive ability of narrative disclosures in annual reports," European Journal of Operational Research, Elsevier, vol. 202(3), pages 789-801, May.

    Citations

    Cited by:

    1. Yi Yang & Kunpeng Zhang & Yangyang Fan, 2023. "sDTM: A Supervised Bayesian Deep Topic Model for Text Analytics," Information Systems Research, INFORMS, vol. 34(1), pages 137-156, March.
    2. Milan Miric & Nan Jia & Kenneth G. Huang, 2023. "Using supervised machine learning for large‐scale classification in management research: The case for identifying artificial intelligence patents," Strategic Management Journal, Wiley Blackwell, vol. 44(2), pages 491-519, February.
    3. Gordon Burtch & Edward McFowland III & Mochen Yang & Gediminas Adomavicius, 2023. "EnsembleIV: Creating Instrumental Variables from Ensemble Learners for Robust Statistical Inference," Papers 2303.02820, arXiv.org, revised Dec 2024.
    4. Hyelim Oh & Khim-Yong Goh & Tuan Q. Phan, 2023. "Are You What You Tweet? The Impact of Sentiment on Digital News Consumption and Social Media Sharing," Information Systems Research, INFORMS, vol. 34(1), pages 111-136, March.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Mochen Yang & Edward McFowland & Gordon Burtch & Gediminas Adomavicius, 2022. "Achieving Reliable Causal Inference with Data-Mined Variables: A Random Forest Approach to the Measurement Error Problem," INFORMS Journal on Data Science, INFORMS, vol. 1(2), pages 138-155, October.
    2. Mochen Yang & Gediminas Adomavicius & Gordon Burtch & Yuqing Ren, 2018. "Mind the Gap: Accounting for Measurement Error and Misclassification in Variables Generated via Data Mining," Information Systems Research, INFORMS, vol. 29(1), pages 4-24, March.
    3. Adele Bergin, 2015. "Employer Changes and Wage Changes: Estimation with Measurement Error in a Binary Variable," LABOUR, CEIS, vol. 29(2), pages 194-223, June.
    4. Leah K. Lakdawala & David Simon, 2016. "The Intergenerational Consequences of Tobacco Policy," Working papers 2016-27, University of Connecticut, Department of Economics.
    5. Adele Bergin, 2013. "Job Changes and Wage Changes: Estimation with Measurement Error in a Binary Variable," Economics Department Working Paper Series n240-13.pdf, Department of Economics, National University of Ireland - Maynooth.
    6. Nguimkeu, Pierre & Denteh, Augustine & Tchernis, Rusty, 2019. "On the estimation of treatment effects with endogenous misreporting," Journal of Econometrics, Elsevier, vol. 208(2), pages 487-506.
    7. Marianne Page, 2006. "Father's Education and Children's Human Capital: Evidence from the World War II GI Bill," Working Papers 84, University of California, Davis, Department of Economics.
    8. Kyung Min Kang & Robert A. Moffitt, 2019. "The Effect of SNAP and School Food Programs on Food Security, Diet Quality, and Food Spending: Sensitivity to Program Reporting Error," Southern Economic Journal, John Wiley & Sons, vol. 86(1), pages 156-201, July.
    9. Craig Gundersen & Brent Kreider, 2008. "Food Stamps and Food Insecurity: What Can Be Learned in the Presence of Nonclassical Measurement Error?," Journal of Human Resources, University of Wisconsin Press, vol. 43(2), pages 352-382.
    10. Zhang, Han, 2021. "How Using Machine Learning Classification as a Variable in Regression Leads to Attenuation Bias and What to Do About It," SocArXiv 453jk, Center for Open Science.
    11. Dominik Gutt & Jürgen Neumann & Steffen Zimmermann & Dennis Kundisch & Jianqing Chen, 2018. "Design of Review Systems - A Strategic Instrument to shape Online Review Behavior and Economic Outcomes," Working Papers Dissertations 42, Paderborn University, Faculty of Business Administration and Economics.
    12. Frazis, Harley & Loewenstein, Mark A., 2003. "Estimating linear regressions with mismeasured, possibly endogenous, binary explanatory variables," Journal of Econometrics, Elsevier, vol. 117(1), pages 151-178, November.
    13. Wossen, Tesfamicheal & Abay, Kibrom A. & Abdoulaye, Tahirou, 2022. "Misperceiving and misreporting input quality: Implications for input use and productivity," Journal of Development Economics, Elsevier, vol. 157(C).
    14. Augustine Denteh & Désiré Kédagni, 2022. "Misclassification in Difference-in-differences Models," Papers 2207.11890, arXiv.org, revised Jul 2022.
    15. Philip Oreopoulos & Marianne E. Page, 2006. "The Intergenerational Effects of Compulsory Schooling," Journal of Labor Economics, University of Chicago Press, vol. 24(4), pages 729-760, October.
    16. Dan A. Black & Lars Skipper & Jeffrey A. Smith & Jeffrey Andrew Smith, 2023. "Firm Training," CESifo Working Paper Series 10268, CESifo.
    17. Cascio, Elizabeth U., 2005. "School Progression and the Grade Distribution of Students: Evidence from the Current Population Survey," IZA Discussion Papers 1747, Institute of Labor Economics (IZA).
    18. Lorenzo Almada & Ian McCarthy & Rusty Tchernis, 2016. "What Can We Learn about the Effects of Food Stamps on Obesity in the Presence of Misreporting?," American Journal of Agricultural Economics, Agricultural and Applied Economics Association, vol. 98(4), pages 997-1017.
    19. Winter, Joachim, 0000. "Bracketing effects in categorized survey questions and the measurement of economic quantities," Sonderforschungsbereich 504 Publications 02-35, Sonderforschungsbereich 504, Universität Mannheim.
    20. Christian vom Lehn & Cache Ellsworth & Zachary Kroff, 2022. "Reconciling Occupational Mobility in the Current Population Survey," Journal of Labor Economics, University of Chicago Press, vol. 40(4), pages 1005-1051.
