Printed from https://ideas.repec.org/a/eee/csdana/v162y2021ics0167947321001018.html

Fast multivariate empirical cumulative distribution function with connection to kernel density estimation

Author

Listed:
  • Langrené, Nicolas
  • Warin, Xavier

Abstract

The problem of efficiently computing empirical cumulative distribution functions (ECDFs) on large multivariate datasets is revisited. Computing an ECDF at one evaluation point requires O(N) operations on a dataset of N data points. A direct evaluation of the ECDF at N evaluation points therefore requires O(N^2) operations, which is prohibitive for large-scale problems. Two fast and exact methods are proposed and compared. The first is based on fast summation in lexicographical order, has O(N log N) complexity, and requires the evaluation points to lie on a regular grid. The second is based on the divide-and-conquer principle, has O(N log(N)^{(d−1)∨1}) complexity, and requires the evaluation points to coincide with the input points. Both algorithms are described in detail for the general d-dimensional case, and numerical experiments validate their speed and accuracy. In addition, a direct connection between cumulative distribution functions and kernel density estimation (KDE) is established for a large class of kernels. This connection paves the way for fast exact algorithms for multivariate kernel density estimation and kernel regression. Numerical tests with the Laplacian kernel validate the speed and accuracy of the proposed algorithms. A broad range of large-scale multivariate problems in density estimation, cumulative distribution estimation, survival function estimation and regression can benefit from the proposed numerical methods.
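
To make the cost argument concrete, the following sketch contrasts a direct ECDF evaluation with a grid-based one. It is not the authors' fast-summation algorithm: it is a simple illustration that bins the data onto a rectangular grid and takes cumulative sums along each axis. The function names `ecdf_naive` and `ecdf_on_grid` are hypothetical, and the grid axes are assumed to be sorted in ascending order.

```python
import numpy as np


def ecdf_naive(data, points):
    """Direct evaluation: O(N) work per evaluation point.

    data:   array of shape (N, d), the sample.
    points: array of shape (M, d), the evaluation points.
    Returns F(z) = (1/N) * #{i : x_i <= z componentwise} for each z.
    """
    below = (data[None, :, :] <= points[:, None, :]).all(axis=2)
    return below.mean(axis=1)


def ecdf_on_grid(data, grid_axes):
    """ECDF at every node of a rectangular grid (illustrative sketch).

    grid_axes: list of d sorted 1-D arrays, one per dimension (assumption).
    Returns an array whose shape is the grid shape.
    """
    n, d = data.shape
    shape = tuple(len(ax) for ax in grid_axes)

    # In each dimension, index of the first grid node >= the data coordinate.
    idx = [np.searchsorted(ax, data[:, k], side="left")
           for k, ax in enumerate(grid_axes)]

    # A point lying beyond the last node in any dimension is never counted.
    keep = np.all([i < m for i, m in zip(idx, shape)], axis=0)

    # Histogram of the kept points over the grid cells.
    counts = np.zeros(shape)
    np.add.at(counts, tuple(i[keep] for i in idx), 1.0)

    # Cumulative sums along every axis turn cell counts into ECDF counts.
    for k in range(d):
        counts = np.cumsum(counts, axis=k)
    return counts / n
```

Under these assumptions the grid version costs one binary search per data coordinate plus one pass over the grid per dimension, whereas the direct version compares every data point with every evaluation point.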

Suggested Citation

  • Langrené, Nicolas & Warin, Xavier, 2021. "Fast multivariate empirical cumulative distribution function with connection to kernel density estimation," Computational Statistics & Data Analysis, Elsevier, vol. 162(C).
  • Handle: RePEc:eee:csdana:v:162:y:2021:i:c:s0167947321001018
    DOI: 10.1016/j.csda.2021.107267

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0167947321001018
    Download Restriction: Full text for ScienceDirect subscribers only.


    References listed on IDEAS

    1. Rong Liu & Lijian Yang, 2008. "Kernel estimation of multivariate cumulative distribution function," Journal of Nonparametric Statistics, Taylor & Francis Journals, vol. 20(8), pages 661-677.
    2. Justel, Ana & Peña, Daniel & Zamar, Rubén, 1997. "A multivariate Kolmogorov-Smirnov test of goodness of fit," Statistics & Probability Letters, Elsevier, vol. 35(3), pages 251-259, October.
    3. Zhang, Xibin & King, Maxwell L. & Hyndman, Rob J., 2006. "A Bayesian approach to bandwidth selection for multivariate kernel density estimation," Computational Statistics & Data Analysis, Elsevier, vol. 50(11), pages 3009-3031, July.
    4. repec:dau:papers:123456789/4273 is not listed on IDEAS
    5. Goldfeld, Stephen M. & Quandt, Richard E., 1981. "Econometric modelling with non-normal disturbances," Journal of Econometrics, Elsevier, vol. 17(2), pages 141-155, November.
    6. Chiu, Sung Nok & Liu, Kwong Ip, 2009. "Generalized Cramér-von Mises goodness-of-fit tests for multivariate distributions," Computational Statistics & Data Analysis, Elsevier, vol. 53(11), pages 3817-3834, September.
    7. Marron, J. S. & Nolan, D., 1988. "Canonical kernels for density estimation," Statistics & Probability Letters, Elsevier, vol. 7(3), pages 195-199, December.
    8. Schmid, Friedrich & Schmidt, Rafael, 2007. "Multivariate extensions of Spearman's rho and related statistics," Statistics & Probability Letters, Elsevier, vol. 77(4), pages 407-416, February.
    9. Duong, Tarn, 2015. "Spherically symmetric multivariate beta family kernels," Statistics & Probability Letters, Elsevier, vol. 104(C), pages 141-145.
    10. Duc Devroye & J. Beirlant & R. Cao & R. Fraiman & P. Hall & M. Jones & Gábor Lugosi & E. Mammen & J. Marron & C. Sánchez-Sellero & J. Uña & F. Udina & L. Devroye, 1997. "Universal smoothing factor selection in density estimation: theory and practice," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 6(2), pages 223-320, December.
    11. Politis, Dimitris N. & Romano, Joseph P., 1999. "Multivariate Density Estimation with General Flat-Top Kernels of Infinite Order," Journal of Multivariate Analysis, Elsevier, vol. 68(1), pages 1-25, January.
    12. Nguyen, Truc T. & Chen, John T., 2009. "A connection between the double gamma model and Laplace sample mean," Statistics & Probability Letters, Elsevier, vol. 79(10), pages 1305-1310, May.
    13. Hadri, Kaddour, 1996. "A note on Sargan densities," Journal of Econometrics, Elsevier, vol. 71(1-2), pages 285-290.
    14. Ingrid K. Glad & Nils Lid Hjort & Nikolai G. Ushakov, 2003. "Correction of Density Estimators that are not Densities," Scandinavian Journal of Statistics, Danish Society for Theoretical Statistics;Finnish Statistical Society;Norwegian Statistical Association;Swedish Statistical Association, vol. 30(2), pages 415-427, June.
    15. Missiakoulis, Spyros, 1983. "Sargan densities which one?," Journal of Econometrics, Elsevier, vol. 23(2), pages 223-233, October.
    16. David Lee & Harry Joe, 2018. "Efficient computation of multivariate empirical distribution functions at the observed values," Computational Statistics, Springer, vol. 33(3), pages 1413-1428, September.
    17. Tse, Y. K., 1987. "A note on Sargan densities," Journal of Econometrics, Elsevier, vol. 34(3), pages 349-354, March.
    18. Hansen, Bruce E., 2005. "Exact Mean Integrated Squared Error Of Higher Order Kernel Estimators," Econometric Theory, Cambridge University Press, vol. 21(6), pages 1031-1057, December.

    Citations



    Cited by:

    1. Gunsilius, Florian F., 2023. "A condition for the identification of multivariate models with binary instruments," Journal of Econometrics, Elsevier, vol. 235(1), pages 220-238.
    2. Hernández-Maldonado, Victor Miguel & Erdely, Arturo & Díaz-Viera, Martín & Rios, Leonardo, 2024. "Fast procedure to compute empirical and Bernstein copulas," Applied Mathematics and Computation, Elsevier, vol. 477(C).

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Hadri, Kaddour, 1996. "A note on Sargan densities," Journal of Econometrics, Elsevier, vol. 71(1-2), pages 285-290.
    2. Hu, Shuowen & Poskitt, D.S. & Zhang, Xibin, 2012. "Bayesian adaptive bandwidth kernel density estimation of irregular multivariate distributions," Computational Statistics & Data Analysis, Elsevier, vol. 56(3), pages 732-740.
    3. Nils-Bastian Heidenreich & Anja Schindler & Stefan Sperlich, 2013. "Bandwidth selection for kernel density estimation: a review of fully automatic selectors," AStA Advances in Statistical Analysis, Springer;German Statistical Society, vol. 97(4), pages 403-433, October.
    4. Matthew D. Baird, 2014. "Cross Validation Bandwidth Selection for Derivatives of Multidimensional Densities," Working Papers WR-1060, RAND Corporation.
    5. Catalina Bolance & Montserrat Guillen & David Pitt, 2014. "Non-parametric Models for Univariate Claim Severity Distributions - an approach using R," Working Papers 2014-01, Universitat de Barcelona, UB Riskcenter.
    6. Horváth, Lajos & Rice, Gregory & Whipple, Stephen, 2016. "Adaptive bandwidth selection in the long run covariance estimator of functional time series," Computational Statistics & Data Analysis, Elsevier, vol. 100(C), pages 676-693.
    7. SCHAFGANS, Marcia M.A. & ZINDE-WALSH, Victoria, 2007. "Robust Average Derivative Estimation," Cahiers de recherche 12-2007, Centre interuniversitaire de recherche en économie quantitative, CIREQ.
    8. Bissantz, Nicolai & Holzmann, Hajo, 2007. "Statistical inference for inverse problems," Technical Reports 2007,40, Technische Universität Dortmund, Sonderforschungsbereich 475: Komplexitätsreduktion in multivariaten Datenstrukturen.
    9. Henderson, Daniel J. & Parmeter, Christopher F., 2012. "Normal reference bandwidths for the general order, multivariate kernel density derivative estimator," Statistics & Probability Letters, Elsevier, vol. 82(12), pages 2198-2205.
    10. Madeleine Cule & Richard Samworth & Michael Stewart, 2010. "Maximum likelihood estimation of a multi‐dimensional log‐concave density," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 72(5), pages 545-607, November.
    11. Camelia Minoiu & Sanjay Reddy, 2014. "Kernel density estimation on grouped data: the case of poverty assessment," The Journal of Economic Inequality, Springer;Society for the Study of Economic Inequality, vol. 12(2), pages 163-189, June.
    12. Song Li & Mervyn J. Silvapulle & Param Silvapulle & Xibin Zhang, 2015. "Bayesian Approaches to Nonparametric Estimation of Densities on the Unit Interval," Econometric Reviews, Taylor & Francis Journals, vol. 34(3), pages 394-412, March.
    13. Mukhopadhyay, Subhadeep & Ghosh, Anil K., 2011. "Bayesian multiscale smoothing in supervised and semi-supervised kernel discriminant analysis," Computational Statistics & Data Analysis, Elsevier, vol. 55(7), pages 2344-2353, July.
    14. Jeffrey Racine, 2015. "Mixed data kernel copulas," Empirical Economics, Springer, vol. 48(1), pages 37-59, February.
    15. Chen, Le-Yu & Lee, Sokbae, 2019. "Breaking the curse of dimensionality in conditional moment inequalities for discrete choice models," Journal of Econometrics, Elsevier, vol. 210(2), pages 482-497.
    16. Grothe, Oliver & Schnieders, Julius & Segers, Johan, 2013. "Measuring Association and Dependence Between Random Vectors," LIDAM Discussion Papers ISBA 2013026, Université catholique de Louvain, Institute of Statistics, Biostatistics and Actuarial Sciences (ISBA).
    17. Lan Xue & Jing Wang, 2010. "Distribution function estimation by constrained polynomial spline regression," Journal of Nonparametric Statistics, Taylor & Francis Journals, vol. 22(4), pages 443-457.
    18. Hassan Doosti & Peter Hall, 2016. "Making a non-parametric density estimator more attractive, and more accurate, by data perturbation," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 78(2), pages 445-462, March.
    19. Saralees Nadarajah, 2009. "Laplace random variables with application to price indices," AStA Advances in Statistical Analysis, Springer;German Statistical Society, vol. 93(3), pages 345-369, September.
    20. Squalli, Jay, 2017. "Renewable energy, coal as a baseload power source, and greenhouse gas emissions: Evidence from U.S. state-level data," Energy, Elsevier, vol. 127(C), pages 479-488.

    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:eee:csdana:v:162:y:2021:i:c:s0167947321001018. See general information about how to correct material in RePEc.


    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: Catherine Liu (email available below). General contact details of provider: http://www.elsevier.com/locate/csda .
