Printed from https://ideas.repec.org/a/vrs/offsta/v38y2022i2p485-508n8.html

Improving the Output Quality of Official Statistics Based on Machine Learning Algorithms

Author

Listed:
  • Meertens Q.A.

    (Statistics Netherlands, Henri Faasdreef 312, 2492 JP The Hague, the Netherlands)

  • Diks C.G.H.

    (University of Amsterdam, Center for Nonlinear Dynamics in Economics and Finance, Roetersstraat 11, 1018 WB Amsterdam, the Netherlands)

  • van den Herik H.J.
  • Takes F.W.

    (Leiden University, Niels Bohrweg 1, 2333 CA Leiden, the Netherlands)

Abstract

National statistical institutes are currently investigating how to improve the output quality of official statistics based on machine learning algorithms. A key issue is concept drift, that is, a change over time in the joint distribution of the independent variables and a dependent (categorical) variable. Under concept drift, a statistical model requires regular updating to prevent it from becoming biased. However, updating a model requires additional data, which are not always available. An alternative is to reduce the bias by means of bias correction methods. In the article, we focus on estimating the proportion (base rate) of a category of interest and we compare two popular bias correction methods: the misclassification estimator and the calibration estimator. For prior probability shift (a specific type of concept drift), we investigate the two methods analytically as well as numerically. Our analytical results are expressions for the bias and variance of both methods. As a numerical result, we present a decision boundary for the relative performance of the two methods. Our results provide a better understanding of the effect of prior probability shift on output quality. Consequently, we may recommend a novel approach to using machine learning algorithms in the context of official statistics.
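
To make the two bias correction methods named in the abstract concrete, the following is a minimal sketch for a binary classifier. The abstract does not give formulas, so the definitions below are an assumption: they are the textbook forms of the two estimators, with the misclassification estimator inverting the classifier's error rates and the calibration estimator reweighting predictions by label probabilities measured on a labelled test set. All function names and numbers are hypothetical, and the sketch works with population quantities only, so it ignores the sampling variance that the article analyzes.

```python
# Sketch only: textbook binary forms of the two estimators, not taken
# from the abstract. All names and numbers are hypothetical.

def predicted_positive_rate(base_rate: float, tpr: float, tnr: float) -> float:
    """Share of items the classifier labels positive, given the true base rate."""
    return base_rate * tpr + (1.0 - base_rate) * (1.0 - tnr)


def misclassification_estimator(pred_pos_rate: float, tpr: float, tnr: float) -> float:
    """Invert pred_pos_rate = a * tpr + (1 - a) * (1 - tnr) for the base rate a."""
    denom = tpr + tnr - 1.0
    if abs(denom) < 1e-12:
        raise ValueError("classifier is no better than chance; cannot invert")
    return (pred_pos_rate + tnr - 1.0) / denom


def calibration_estimator(pred_pos_rate: float, ppv: float, fomr: float) -> float:
    """Weight predictions by P(y=1 | predicted 1) = ppv and P(y=1 | predicted 0) = fomr."""
    return ppv * pred_pos_rate + fomr * (1.0 - pred_pos_rate)


# Assumed classifier quality, taken to be stable under prior probability shift.
tpr, tnr = 0.9, 0.8

# Labelled test set with base rate 0.5: derive the calibration probabilities.
test_base_rate = 0.5
test_pred_pos = predicted_positive_rate(test_base_rate, tpr, tnr)
ppv = test_base_rate * tpr / test_pred_pos
fomr = test_base_rate * (1.0 - tpr) / (1.0 - test_pred_pos)

# Target population whose base rate has shifted to 0.2.
target_base_rate = 0.2
target_pred_pos = predicted_positive_rate(target_base_rate, tpr, tnr)

estimates = {
    "uncorrected": target_pred_pos,
    "misclassification": misclassification_estimator(target_pred_pos, tpr, tnr),
    "calibration": calibration_estimator(target_pred_pos, ppv, fomr),
}

for name, value in estimates.items():
    print(f"{name}: {value:.4f}")
```

In this noise-free setting the misclassification estimator recovers the shifted base rate of 0.2, while the calibration estimator stays anchored toward the test-set base rate. With finite test sets both estimators also have sampling variance, which is the trade-off the abstract refers to when it mentions bias and variance expressions and a decision boundary.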

Suggested Citation

  • Meertens Q.A. & Diks C.G.H. & van den Herik H.J. & Takes F.W., 2022. "Improving the Output Quality of Official Statistics Based on Machine Learning Algorithms," Journal of Official Statistics, Sciendo, vol. 38(2), pages 485-508, June.
  • Handle: RePEc:vrs:offsta:v:38:y:2022:i:2:p:485-508:n:8
    DOI: 10.2478/jos-2022-0023

    Download full text from publisher

    File URL: https://doi.org/10.2478/jos-2022-0023
    Download Restriction: no
