
Estimating the class prior for positive and unlabelled data via logistic regression

Authors

Listed:
  • Małgorzata Łazęcka

    (Polish Academy of Sciences
    Warsaw University of Technology)

  • Jan Mielniczuk

    (Polish Academy of Sciences
    Warsaw University of Technology)

  • Paweł Teisseyre

    (Polish Academy of Sciences
    Warsaw University of Technology)

Abstract

In this paper, we revisit the problem of estimating the class prior probability from positive and unlabelled data gathered in a single-sample scenario. The task is important because, in the positive-unlabelled setting, a classifier can be learned successfully if the class prior is available. We show that, without additional assumptions, the class prior probability is not identifiable, and thus the existing non-parametric estimators are in general necessarily biased. The magnitude of their bias is also investigated. The problem becomes identifiable when the probabilistic structure satisfies mild semi-parametric assumptions. Consequently, we propose a method based on a logistic fit and a concave minorization of its (non-concave) log-likelihood. Experiments conducted on artificial and benchmark datasets, as well as on the large clinical database MIMIC, indicate that the estimation errors of the proposed method are usually lower than those of its competitors and that the method is robust against departures from the logistic setting.
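
To make the setting in the abstract concrete, the following is a minimal sketch of single-sample positive-unlabelled data under a logistic model. It is not the authors' estimator and does not implement the concave minorization. It assumes, as the abstract does not state explicitly, that every positive case is labelled with the same probability c (the "selected completely at random" assumption), so that P(s = 1 | x) = c * sigmoid(beta0 + beta1 * x). All names (simulate_pu, pu_log_likelihood, beta0, beta1, c) and all parameter values are hypothetical choices for illustration.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def simulate_pu(n, beta0, beta1, c, seed=0):
    """Draw (x, y, s): y is the true class (hidden in practice), s the observed label."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 1 if rng.random() < sigmoid(beta0 + beta1 * x) else 0
        # Assumption: each positive is labelled independently with probability c;
        # negatives are never labelled.
        s = 1 if (y == 1 and rng.random() < c) else 0
        data.append((x, y, s))
    return data


def pu_log_likelihood(data, beta0, beta1, c):
    """Log-likelihood of the observed labels s under P(s=1|x) = c * sigmoid(beta0 + beta1*x)."""
    total = 0.0
    for x, _y, s in data:
        p = c * sigmoid(beta0 + beta1 * x)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total += math.log(p) if s == 1 else math.log(1.0 - p)
    return total


# Illustrative values only: class prior is about 0.5, label frequency c = 0.5.
data = simulate_pu(n=20000, beta0=0.0, beta1=1.0, c=0.5, seed=0)

true_prior = sum(y for _, y, _ in data) / len(data)      # unobservable in practice
labelled_share = sum(s for _, _, s in data) / len(data)  # observable, close to c * prior

ll_true = pu_log_likelihood(data, 0.0, 1.0, 0.5)
# Same coefficients, but c = 1, i.e. treating every unlabelled case as a negative.
ll_c1 = pu_log_likelihood(data, 0.0, 1.0, 1.0)

print("true class prior:", round(true_prior, 3))
print("share of labelled cases:", round(labelled_share, 3))
print("log-likelihood at true (beta, c) exceeds c = 1:", ll_true > ll_c1)
```

In this sketch the share of labelled cases approximates c times the class prior rather than the prior itself, which is why the prior cannot be read off the labels directly. Estimating (beta0, beta1, c) jointly would require maximising pu_log_likelihood, which is the non-concave problem the abstract refers to.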

Suggested Citation

  • Małgorzata Łazęcka & Jan Mielniczuk & Paweł Teisseyre, 2021. "Estimating the class prior for positive and unlabelled data via logistic regression," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 15(4), pages 1039-1068, December.
  • Handle: RePEc:spr:advdac:v:15:y:2021:i:4:d:10.1007_s11634-021-00444-9
    DOI: 10.1007/s11634-021-00444-9

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11634-021-00444-9
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.




    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Brian Erard, 2022. "Modeling Qualitative Outcomes by Supplementing Participant Data with General Population Data: A New and More Versatile Approach," Journal of Econometric Methods, De Gruyter, vol. 11(1), pages 35-53, January.
    2. Wenkai Li & Yuanchi Liu & Ziyue Liu & Zhen Gao & Huabing Huang & Weijun Huang, 2022. "A Positive-Unlabeled Learning Algorithm for Urban Flood Susceptibility Modeling," Land, MDPI, vol. 11(11), pages 1-17, November.
    3. Robert M. Dorazio, 2012. "Predicting the Geographic Distribution of a Species from Presence-Only Data Subject to Detection Errors," Biometrics, The International Biometric Society, vol. 68(4), pages 1303-1312, December.
    4. Sung Jae Jun & Sokbae Lee, 2024. "Causal Inference Under Outcome-Based Sampling with Monotonicity Assumptions," Journal of Business & Economic Statistics, Taylor & Francis Journals, vol. 42(3), pages 998-1009, July.
    5. Masahiro Kato & Shota Yasui, 2020. "Learning Classifiers under Delayed Feedback with a Time Window Assumption," Papers 2009.13092, arXiv.org, revised Jun 2022.
    6. Esmeralda A. Ramalho & Richard Smith, 2003. "Discrete choice non-response," CeMMAP working papers 07/03, Institute for Fiscal Studies.
    7. Erard, Brian & Langetieg, Patrick & Payne, Mark & Plumley, Alan, 2020. "Ghosts in the Income Tax Machinery," MPRA Paper 100036, University Library of Munich, Germany.
    8. Amanda Coston & Edward H. Kennedy, 2022. "The role of the geometric mean in case-control studies," Papers 2207.09016, arXiv.org.
    9. Schwemmer, Philipp & Güpner, Franziska & Adler, Sven & Klingbeil, Knut & Garthe, Stefan, 2016. "Modelling small-scale foraging habitat use in breeding Eurasian oystercatchers (Haematopus ostralegus) in relation to prey distribution and environmental predictors," Ecological Modelling, Elsevier, vol. 320(C), pages 322-333.
    10. Ashton, John & Burnett, Tim & Diaz-Rainey, Ivan & Ormosi, Peter, 2021. "Known unknowns: How much financial misconduct is detected and deterred?," Journal of International Financial Markets, Institutions and Money, Elsevier, vol. 74(C).
    11. Vincenzo Caponi & Miana Plesca, 2014. "Empirical characteristics of legal and illegal immigrants in the USA," Journal of Population Economics, Springer;European Society for Population Economics, vol. 27(4), pages 923-960, October.
    12. Sung Jae Jun & Sokbae (Simon) Lee, 2020. "Causal inference in case-control studies," CeMMAP working papers CWP19/20, Centre for Microdata Methods and Practice, Institute for Fiscal Studies.
    13. Lee, Kangbok & Joo, Sunghoon & Baik, Hyeoncheol & Han, Sumin & In, Joonhwan, 2020. "Unbalanced data, type II error, and nonlinearity in predicting M&A failure," Journal of Business Research, Elsevier, vol. 109(C), pages 271-287.
    14. Saupe, E.E. & Barve, V. & Myers, C.E. & Soberón, J. & Barve, N. & Hensz, C.M. & Peterson, A.T. & Owens, H.L. & Lira-Noriega, A., 2012. "Variation in niche and distribution model performance: The need for a priori assessment of key causal factors," Ecological Modelling, Elsevier, vol. 237, pages 11-22.
    15. Becker, Bo & Cronqvist, Henrik & Fahlenbrach, Rüdiger, 2011. "Estimating the Effects of Large Shareholders Using a Geographic Instrument," Journal of Financial and Quantitative Analysis, Cambridge University Press, vol. 46(4), pages 907-942, August.
    16. Adam M. Kleinbaum & Toby E. Stuart & Michael L. Tushman, 2013. "Discretion Within Constraint: Homophily and Structure in a Formal Organization," Organization Science, INFORMS, vol. 24(5), pages 1316-1336, October.
    17. Herkt, K. Matthias B. & Barnikel, Günter & Skidmore, Andrew K. & Fahr, Jakob, 2016. "A high-resolution model of bat diversity and endemism for continental Africa," Ecological Modelling, Elsevier, vol. 320(C), pages 9-28.
    18. Gill Ward & Trevor Hastie & Simon Barry & Jane Elith & John R. Leathwick, 2009. "Presence-Only Data and the EM Algorithm," Biometrics, The International Biometric Society, vol. 65(2), pages 554-563, June.
    19. Ramalho, Esmeralda A., 2007. "Binary models with misclassification in the variable of interest and nonignorable nonresponse," Economics Letters, Elsevier, vol. 96(1), pages 70-76, July.
    20. Esmeralda A. Ramalho & Richard J. Smith, 2013. "Discrete Choice Non-Response," The Review of Economic Studies, Review of Economic Studies Ltd, vol. 80(1), pages 343-364.
