IDEAS home Printed from https://ideas.repec.org/p/arx/papers/2106.10341.html

Scalable Econometrics on Big Data -- The Logistic Regression on Spark

Author

Listed:
  • Aurélien Ouattara
  • Matthieu Bulté
  • Wan-Ju Lin
  • Philipp Scholl
  • Benedikt Veit
  • Christos Ziakas
  • Florian Felice
  • Julien Virlogeux
  • George Dikos

Abstract

Extra-large datasets are becoming increasingly accessible, and computing tools designed to handle huge amounts of data efficiently are rapidly becoming widely available. However, conventional statistical and econometric tools still struggle with such large datasets. This paper examines econometrics on big datasets, focusing specifically on logistic regression on Spark. We review the robustness of the functions available in Spark for fitting logistic regression and introduce a package that we developed in PySpark, which returns the statistical summary of the logistic regression needed for statistical inference.
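
To make "statistical summary" concrete, the following is a minimal sketch of the quantities such a summary typically reports for a logistic regression: coefficients, standard errors, z-statistics and p-values. It is not the authors' package and does not use Spark. It is a single-machine illustration in plain Python for a model with an intercept and one covariate, fitted by Newton-Raphson. The function names `score_and_information` and `logit_summary` and the toy data are hypothetical, chosen only for this example.

```python
import math

def score_and_information(b0, b1, x, y):
    """Gradient of the log-likelihood and Fisher information X'WX at (b0, b1)."""
    g0 = g1 = 0.0
    h00 = h01 = h11 = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))  # fitted probability
        w = p * (1.0 - p)                             # Bernoulli variance
        g0 += yi - p
        g1 += (yi - p) * xi
        h00 += w
        h01 += w * xi
        h11 += w * xi * xi
    return (g0, g1), (h00, h01, h11)

def logit_summary(x, y, max_iter=25, tol=1e-10):
    """Fit P(y=1|x) = sigmoid(b0 + b1*x) and return inference quantities."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        (g0, g1), (h00, h01, h11) = score_and_information(b0, b1, x, y)
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + I(beta)^{-1} * score(beta)
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    # Covariance estimate: inverse Fisher information at the final estimate.
    _, (h00, h01, h11) = score_and_information(b0, b1, x, y)
    det = h00 * h11 - h01 * h01
    std_err = (math.sqrt(h11 / det), math.sqrt(h00 / det))
    z = (b0 / std_err[0], b1 / std_err[1])
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = tuple(math.erfc(abs(zi) / math.sqrt(2.0)) for zi in z)
    return {"coef": (b0, b1), "std_err": std_err, "z": z, "p_value": p_value}

# Toy data (made up for illustration; not separable, so the MLE exists).
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 0, 1, 0, 1, 1, 0, 1]

summary = logit_summary(x, y)
for name, values in summary.items():
    print(name, values)
```

On a cluster, the same sums (the score and the information matrix) would be accumulated across partitions rather than in a Python loop; how the paper's package organizes that computation is not stated on this page.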

Suggested Citation

  • Aurélien Ouattara & Matthieu Bulté & Wan-Ju Lin & Philipp Scholl & Benedikt Veit & Christos Ziakas & Florian Felice & Julien Virlogeux & George Dikos, 2021. "Scalable Econometrics on Big Data -- The Logistic Regression on Spark," Papers 2106.10341, arXiv.org.
  • Handle: RePEc:arx:papers:2106.10341

    Download full text from publisher

    File URL: http://arxiv.org/pdf/2106.10341
    File Function: Latest version
    Download Restriction: no

    References listed on IDEAS

    1. Bluhm, Benjamin & Cutura, Jannic, 2020. "Econometrics at scale: Spark up big data in economics," SAFE Working Paper Series 266, Leibniz Institute for Financial Research SAFE.
    2. Liran Einav & Jonathan Levin, 2014. "The Data Revolution and Economic Analysis," Innovation Policy and the Economy, University of Chicago Press, vol. 14(1), pages 1-24.
    3. A. Belloni & V. Chernozhukov & L. Wang, 2011. "Square-root lasso: pivotal recovery of sparse signals via conic programming," Biometrika, Biometrika Trust, vol. 98(4), pages 791-806.
    4. Hal Varian, 2018. "Artificial Intelligence, Economics, and Industrial Organization," NBER Chapters, in: The Economics of Artificial Intelligence: An Agenda, pages 399-419, National Bureau of Economic Research, Inc.
    5. Jeffrey M Wooldridge, 2010. "Econometric Analysis of Cross Section and Panel Data," MIT Press Books, The MIT Press, edition 2, volume 1, number 0262232588, April.
    6. Fernández-Villaverde, Jesús & Zarruk Valencia, David, 2018. "A Practical Guide to Parallelization in Economics," CEPR Discussion Papers 12890, C.E.P.R. Discussion Papers.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Alexandre Belloni & Victor Chernozhukov & Ivan Fernandez-Val & Christian Hansen, 2013. "Program evaluation with high-dimensional data," CeMMAP working papers CWP77/13, Centre for Microdata Methods and Practice, Institute for Fiscal Studies.
    2. A. Belloni & V. Chernozhukov & I. Fernández‐Val & C. Hansen, 2017. "Program Evaluation and Causal Inference With High‐Dimensional Data," Econometrica, Econometric Society, vol. 85, pages 233-298, January.
    3. Laurent Bergé, 2018. "Efficient estimation of maximum likelihood models with multiple fixed-effects: the R package FENmlm," DEM Discussion Paper Series 18-13, Department of Economics at the University of Luxembourg.
    4. Guo, Zijian & Kang, Hyunseung & Cai, T. Tony & Small, Dylan S., 2018. "Testing endogeneity with high dimensional covariates," Journal of Econometrics, Elsevier, vol. 207(1), pages 175-187.
    5. Xi Chen & Ye Luo & Martin Spindler, 2019. "Adaptive Discrete Smoothing for High-Dimensional and Nonlinear Panel Data," Papers 1912.12867, arXiv.org, revised Jan 2020.
    6. Bluhm, Benjamin & Cutura, Jannic, 2020. "Econometrics at scale: Spark up big data in economics," SAFE Working Paper Series 266, Leibniz Institute for Financial Research SAFE.
    7. Kaiser, Ulrich & Kuhn, Johan M., 2020. "The value of publicly available, textual and non-textual information for startup performance prediction," Journal of Business Venturing Insights, Elsevier, vol. 14(C).
    8. Wagner Piazza Gaglianone & João Victor Issler, 2014. "Microfounded Forecasting," Working Papers Series 372, Central Bank of Brazil, Research Department.
    9. Averi Chakrabarti & Karen A Grépin & Stéphane Helleringer, 2019. "The impact of supplementary immunization activities on routine vaccination coverage: An instrumental variable analysis in five low-income countries," PLOS ONE, Public Library of Science, vol. 14(2), pages 1-11, February.
    10. Harold Alderman & John Hoddinott & Bill Kinsey, 2006. "Long term consequences of early childhood malnutrition," Oxford Economic Papers, Oxford University Press, vol. 58(3), pages 450-474, July.
    11. Huh, Yesol & Kim, You Suk, 2023. "Cheapest-to-deliver pricing, optimal MBS securitization, and welfare implications," Journal of Financial Economics, Elsevier, vol. 150(1), pages 68-93.
    12. Ji Yan & Sally Brocksen, 2013. "Adolescent risk perception, substance use, and educational attainment," Journal of Risk Research, Taylor & Francis Journals, vol. 16(8), pages 1037-1055, September.
    13. Sènakpon Fidèle A. Dedehouanou & Luca Tiberti & Hilaire G. Houeninvo & Djohodo Inès Monwanou, 2019. "Working while studying: Employment premium or penalty for youth in Benin?," Working Papers PMMA 2019-03, PEP-PMMA.
    14. Mengyuan Zhou, 2022. "Does the Source of Inheritance Matter in Bequest Attitudes? Evidence from Japan," Journal of Family and Economic Issues, Springer, vol. 43(4), pages 867-887, December.
    15. Sandra Müllbacher & Wolfgang Nagl, 2017. "Labour supply in Austria: an assessment of recent developments and the effects of a tax reform," Empirica, Springer;Austrian Institute for Economic Research;Austrian Economic Association, vol. 44(3), pages 465-486, August.
    16. Campbell, Randall C. & Nagel, Gregory L., 2016. "Private information and limitations of Heckman's estimator in banking and corporate finance research," Journal of Empirical Finance, Elsevier, vol. 37(C), pages 186-195.
    17. Giuliani, Elisa & Martinelli, Arianna & Rabellotti, Roberta, 2016. "Is Co-Invention Expediting Technological Catch Up? A Study of Collaboration between Emerging Country Firms and EU Inventors," World Development, Elsevier, vol. 77(C), pages 192-205.
    18. Maurice Mutisya & Moses W. Ngware & Caroline W. Kabiru & Ngianga-bakwin Kandala, 2016. "The effect of education on household food security in two informal urban settlements in Kenya: a longitudinal analysis," Food Security: The Science, Sociology and Economics of Food Production and Access to Food, Springer;The International Society for Plant Pathology, vol. 8(4), pages 743-756, August.
    19. Ilona Babenko & Benjamin Bennett & John M Bizjak & Jeffrey L Coles & Jason J Sandvik, 2023. "Clawback Provisions and Firm Risk," The Review of Corporate Finance Studies, Society for Financial Studies, vol. 12(2), pages 191-239.
    20. Alexandre Belloni & Victor Chernozhukov & Denis Chetverikov & Christian Hansen & Kengo Kato, 2018. "High-dimensional econometrics and regularized GMM," CeMMAP working papers CWP35/18, Centre for Microdata Methods and Practice, Institute for Fiscal Studies.
