
Data Integration by Combining Big Data and Survey Sample Data for Finite Population Inference

Author

Listed:
  • Jae‐Kwang Kim
  • Siu‐Ming Tam

Abstract

The statistical challenges in using big data for making valid statistical inference in the finite population have been well documented in the literature. These challenges are due primarily to statistical bias arising from under‐coverage of the population of interest by the big data source and from measurement errors in the variables available in the data set. By stratifying the population into a big data stratum and a missing data stratum, we can estimate the missing data stratum by using a fully responding probability sample, and hence the population as a whole by using a data integration estimator. By expressing the data integration estimator as a regression estimator, we can handle measurement errors in the variables in big data and also in the probability sample. We also propose a fully nonparametric classification method for identifying the overlapping units and develop a bias‐corrected data integration estimator under misclassification errors. Finally, we develop a two‐step regression data integration estimator to deal with measurement errors in the probability sample. An advantage of the approach advocated in this paper is that we do not have to make unrealistic missing‐at‐random assumptions for the methods to work. The proposed method is applied to a real data example using 2015–2016 Australian Agricultural Census data.
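
The stratification idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the simplest setting: no measurement error, perfect identification of which sample units also appear in the big data source, and a simple random sample with equal design weights. Under those assumptions the estimated population total is the observed big data total plus the weighted total of sample units that fall outside the big data source. The function name, the toy population and the inclusion probabilities (0.8 and 0.3) are all hypothetical and chosen only for illustration.

```python
import random


def data_integration_total(big_data_y, sample):
    """Estimate a population total under the assumptions stated above.

    big_data_y: y-values of every unit in the big data stratum.
    sample: list of (y, design_weight, in_big_data) tuples from a
            probability sample (hypothetical layout for this sketch).
    """
    # Big data stratum: observed in full, so its total is known exactly.
    big_total = sum(big_data_y)
    # Missing data stratum: weighted total of sample units NOT in big data.
    missing_total = sum(w * y for y, w, in_b in sample if not in_b)
    return big_total + missing_total


# Toy finite population in which inclusion in the big data source depends
# on y itself, so the big data are not missing at random.
random.seed(0)
N = 10_000
population = []
for _ in range(N):
    y = random.gauss(50, 10)
    in_b = random.random() < (0.8 if y > 50 else 0.3)
    population.append((y, in_b))

true_total = sum(y for y, _ in population)
big_data_y = [y for y, in_b in population if in_b]

# Simple random sample without replacement; every unit gets weight N / n.
n = 1_000
weight = N / n
sample = [(y, weight, in_b) for y, in_b in random.sample(population, n)]

estimate = data_integration_total(big_data_y, sample)
# Naive alternative: scale up the big data mean as if it were representative.
naive = N * sum(big_data_y) / len(big_data_y)

print(f"true total:             {true_total:.0f}")
print(f"data integration total: {estimate:.0f}")
print(f"naive big data total:   {naive:.0f}")
```

Because the sample is used only to estimate the stratum that the big data source misses, the sketch never models why units are missing from the big data, which is the point the abstract makes about avoiding missing‐at‐random assumptions. The regression form, the misclassification correction and the two‐step estimator described in the abstract are not covered here.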

Suggested Citation

  • Jae‐Kwang Kim & Siu‐Ming Tam, 2021. "Data Integration by Combining Big Data and Survey Sample Data for Finite Population Inference," International Statistical Review, International Statistical Institute, vol. 89(2), pages 382-401, August.
  • Handle: RePEc:bla:istatr:v:89:y:2021:i:2:p:382-401
    DOI: 10.1111/insr.12434

    Citations

    Cited by:

    1. Medous, Estelle & Goga, Camelia & Ruiz-Gazen, Anne & Beaumont, Jean-François & Dessertaine, Alain & Puech, Pauline, 2022. "QR Prediction for Statistical Data Integration," TSE Working Papers 22-1344, Toulouse School of Economics (TSE).
    2. Chien-Min Huang & F. Jay Breidt, 2023. "A dual-frame approach for estimation with respondent-driven samples," METRON, Springer;Sapienza Università di Roma, vol. 81(1), pages 65-81, April.
    3. Ieva Burakauskaitė & Andrius Čiginas, 2023. "An Approach to Integrating a Non-Probability Sample in the Population Census," Mathematics, MDPI, vol. 11(8), pages 1-14, April.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Lothian Jack & Holmberg Anders & Seyb Allyson, 2019. "An Evolutionary Schema for Using “it-is-what-it-is” Data in Official Statistics," Journal of Official Statistics, Sciendo, vol. 35(1), pages 137-165, March.
    2. Serena Pattaro & Nick Bailey & Chris Dibben, 2020. "Using Linked Longitudinal Administrative Data to Identify Social Disadvantage," Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, Springer, vol. 147(3), pages 865-895, February.
    3. David J. Hand, 2018. "Statistical challenges of administrative and transaction data," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 181(3), pages 555-605, June.
    4. Peter G. M. van der Heijden & Maarten Cruyff & Paul A. Smith & Christine Bycroft & Patrick Graham & Nathaniel Matheson‐Dunning, 2022. "Multiple system estimation using covariates having missing values and measurement error: Estimating the size of the Māori population in New Zealand," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 185(1), pages 156-177, January.
    5. Bakker Bart F.M. & Heijden Peter G.M. van der & Scholtus Sander, 2015. "Preface," Journal of Official Statistics, Sciendo, vol. 31(3), pages 349-355, September.
    6. Fulvia Cerroni & Grazia Di Bella & Lorena Galiè, 2014. "Evaluating administrative data quality as input of the statistical production process," Rivista di statistica ufficiale, ISTAT - Italian National Institute of Statistics - (Rome, ITALY), vol. 16(1-2), pages 117-146.
    7. Jonas F. Schenkel & Li‐Chun Zhang, 2022. "Adjusting misclassification using a second classifier with an external validation sample," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 185(4), pages 1882-1902, October.
    8. Fabrizio Antolini & Laura Grassini, 2020. "Methodological problems in the economic measurement of tourism: the need for new sources of information," Quality & Quantity: International Journal of Methodology, Springer, vol. 54(5), pages 1769-1780, December.
    9. Elżbieta Gołata, 2016. "Shift In Methodology And Population Census Quality," Statistics in Transition New Series, Polish Statistical Association, vol. 17(4), pages 631-658, December.
    10. Denis Devaud & Yves Tillé, 2019. "Deville and Särndal’s calibration: revisiting a 25-years-old successful optimization problem," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 28(4), pages 1033-1065, December.
    11. Stephanie Coffey, PhD. & Jaya Damineni & John Eltinge, PhD. & Anup Mathur, PhD. & Kayla Varela & Allison Zotti, 2023. "Some Open Questions on Multiple-Source Extensions of Adaptive-Survey Design Concepts and Methods," Working Papers 23-03, Center for Economic Studies, U.S. Census Bureau.
    12. Li-Chun Zhang & Ib Thomsen & Øyvin Kleven, 2013. "On the Use of Auxiliary and Paradata for Dealing With Non-sampling Errors in Household Surveys," International Statistical Review, International Statistical Institute, vol. 81(2), pages 270-288, August.
    13. Gelein, Brigitte & Haziza, David & Causeur, David, 2014. "Preserving relationships between variables with MIVQUE based imputation for missing survey data," Journal of Multivariate Analysis, Elsevier, vol. 131(C), pages 197-208.
    14. Paul Allin & David J. Hand, 2017. "New statistics for old?—measuring the wellbeing of the UK," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 180(1), pages 3-43, January.
    15. Ton de Waal & Arnout van Delden & Sander Scholtus, 2020. "Multi‐source Statistics: Basic Situations and Methods," International Statistical Review, International Statistical Institute, vol. 88(1), pages 203-228, April.
    16. Elżbieta Gołata, 2015. "Sae Education Challenges To Academics And Nsi," Statistics in Transition New Series, Polish Statistical Association, vol. 16(4), pages 611-630, December.
    17. Yingli Pan & Wen Cai & Zhan Liu, 2022. "Inference for non-probability samples under high-dimensional covariate-adjusted superpopulation model," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 31(4), pages 955-979, October.
    18. J. N. K. Rao, 2021. "On Making Valid Inferences by Integrating Data from Surveys and Other Sources," Sankhya B: The Indian Journal of Statistics, Springer;Indian Statistical Institute, vol. 83(1), pages 242-272, May.
    19. Xiaojun Mao & Zhonglei Wang & Shu Yang, 2023. "Matrix completion under complex survey sampling," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 75(3), pages 463-492, June.
    20. James Jackson & Robin Mitra & Brian Francis & Iain Dove, 2022. "Using saturated count models for user‐friendly synthesis of large confidential administrative databases," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 185(4), pages 1613-1643, October.
