Clustering based on Kolmogorov–Smirnov statistic with application to bank card transaction data

Authors

  • Yingqiu Zhu
  • Qiong Deng
  • Danyang Huang
  • Bingyi Jing
  • Bo Zhang

Abstract

Rapid developments in third‐party online payment platforms now make it possible to record massive bank card transaction data. Clustering on such transaction data is of great importance for the analysis of merchant behaviours. However, traditional methods based on generated features inevitably lead to much loss of information. To make better use of bank card transaction data, this study investigates the possibility of using the empirical cumulative distribution of transaction amounts. As the distance between two merchants can be measured using the two‐sample Kolmogorov–Smirnov test statistic, we propose the Kolmogorov–Smirnov K‐means clustering approach based on this distance measure. An approximation step is conducted to ensure the feasibility of the proposed method even for large‐scale transaction data, and the associated theoretical properties are investigated. Both simulations and an empirical study demonstrate that our method outperforms feature‐based methods and is computationally efficient for large‐scale data sets.
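The distance described in the abstract can be illustrated with a short sketch. The two-sample Kolmogorov–Smirnov statistic is the largest absolute gap between two empirical distribution functions, D = sup_x |F_1(x) - F_2(x)|. The code below is a minimal illustration under stated assumptions, not the authors' implementation: it evaluates each merchant's empirical CDF on a shared grid of amounts, takes the pointwise mean of member CDFs as a cluster centroid, and uses a farthest-point rule to pick starting centroids. The function names, the grid, the centroid rule and the initialisation are all hypothetical choices made for this example, and the grid evaluation is not meant to reproduce the paper's approximation step.

```python
from bisect import bisect_right


def ecdf_on_grid(sample, grid):
    """Empirical CDF of `sample`, evaluated at each point of `grid`."""
    xs = sorted(sample)
    return [bisect_right(xs, g) / len(xs) for g in grid]


def ks_distance(f, g):
    """Largest absolute gap between two CDFs evaluated on the same grid."""
    return max(abs(a - b) for a, b in zip(f, g))


def farthest_point_init(curves, k):
    """Assumed initialisation: start from the first curve, then repeatedly
    add the curve farthest (in KS distance) from the centroids chosen so far."""
    chosen = [0]
    gap = [ks_distance(c, curves[0]) for c in curves]
    while len(chosen) < k:
        nxt = max(range(len(curves)), key=gap.__getitem__)
        chosen.append(nxt)
        gap = [min(d, ks_distance(c, curves[nxt])) for d, c in zip(gap, curves)]
    return [list(curves[i]) for i in chosen]


def ks_kmeans(samples, k, grid, n_iter=50):
    """K-means-style clustering of samples using the KS distance.

    samples: one list of transaction amounts per merchant.
    Returns (labels, centroids); each centroid is a CDF on `grid`.
    """
    curves = [ecdf_on_grid(s, grid) for s in samples]
    centroids = farthest_point_init(curves, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest centroid in KS distance.
        new_labels = [
            min(range(k), key=lambda j: ks_distance(c, centroids[j]))
            for c in curves
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step (assumption): centroid = pointwise mean of member CDFs.
        for j in range(k):
            members = [c for c, lab in zip(curves, labels) if lab == j]
            if members:  # keep the previous centroid if a cluster empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

When the grid contains every pooled observation, `ks_distance` equals the exact two-sample statistic. A coarser grid gives a lower bound on it while keeping the cost per comparison fixed, which is one plausible way to keep such a procedure tractable for large samples.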

Suggested Citation

  • Yingqiu Zhu & Qiong Deng & Danyang Huang & Bingyi Jing & Bo Zhang, 2021. "Clustering based on Kolmogorov–Smirnov statistic with application to bank card transaction data," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 70(3), pages 558-578, June.
  • Handle: RePEc:bla:jorssc:v:70:y:2021:i:3:p:558-578
    DOI: 10.1111/rssc.12471

    Download full text from publisher

    File URL: https://doi.org/10.1111/rssc.12471
    Download Restriction: no

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Danijel Bratina & Armand Faganel, 2023. "Using Supervised Machine Learning Methods for RFM Segmentation: A Casino Direct Marketing Communication Case," Tržište/Market, Faculty of Economics and Business, University of Zagreb, vol. 35(1), pages 7-22.
    2. Philippe Baecke & Dirk Van Den Poel, 2010. "Improving Purchasing Behavior Predictions By Data Augmentation With Situational Variables," International Journal of Information Technology & Decision Making (IJITDM), World Scientific Publishing Co. Pte. Ltd., vol. 9(06), pages 853-872.
    3. Hache, Emmanuel & Leboullenger, Déborah & Mignon, Valérie, 2017. "Beyond average energy consumption in the French residential housing market: A household classification approach," Energy Policy, Elsevier, vol. 107(C), pages 82-95.
    4. Thiemo Fetzer & Samuel Marden, 2017. "Take What You Can: Property Rights, Contestability and Conflict," Economic Journal, Royal Economic Society, vol. 127(601), pages 757-783, May.
    5. Daniel Agness & Travis Baseler & Sylvain Chassang & Pascaline Dupas & Erik Snowberg, 2022. "Valuing the Time of the Self-Employed," Working Papers 2022-2, Princeton University. Economics Department.
    6. Khanh Duong, 2024. "Is meritocracy just? New evidence from Boolean analysis and Machine learning," Journal of Computational Social Science, Springer, vol. 7(2), pages 1795-1821, October.
    7. Batool, Fatima & Hennig, Christian, 2021. "Clustering with the Average Silhouette Width," Computational Statistics & Data Analysis, Elsevier, vol. 158(C).
    8. Chen, Yanhong & Liu, Luning & Zheng, Dequan & Li, Bin, 2023. "Estimating travellers’ value when purchasing auxiliary services in the airline industry based on the RFM model," Journal of Retailing and Consumer Services, Elsevier, vol. 74(C).
    9. Nicoleta Serban & Huijing Jiang, 2012. "Multilevel Functional Clustering Analysis," Biometrics, The International Biometric Society, vol. 68(3), pages 805-814, September.
    10. Jie Sun & Jie Li & Hamido Fujita & Wenguo Ai, 2023. "Multiclass financial distress prediction based on one‐versus‐one decomposition integrated with improved decision‐directed acyclic graph," Journal of Forecasting, John Wiley & Sons, Ltd., vol. 42(5), pages 1167-1186, August.
    11. I. Albarrán & P. Alonso-González & J. M. Marin, 2017. "Some criticism to a general model in Solvency II: an explanation from a clustering point of view," Empirical Economics, Springer, vol. 52(4), pages 1289-1308, June.
    12. Orietta Nicolis & Jean Paul Maidana & Fabian Contreras & Danilo Leal, 2024. "Analyzing the Impact of COVID-19 on Economic Sustainability: A Clustering Approach," Sustainability, MDPI, vol. 16(4), pages 1-30, February.
    13. Li, Pai-Ling & Chiou, Jeng-Min, 2011. "Identifying cluster number for subspace projected functional data clustering," Computational Statistics & Data Analysis, Elsevier, vol. 55(6), pages 2090-2103, June.
    14. Yaeji Lim & Hee-Seok Oh & Ying Kuen Cheung, 2019. "Multiscale Clustering for Functional Data," Journal of Classification, Springer;The Classification Society, vol. 36(2), pages 368-391, July.
    15. Durango-Cohen, Elizabeth J., 2013. "Modeling contribution behavior in fundraising: Segmentation analysis for a public broadcasting station," European Journal of Operational Research, Elsevier, vol. 227(3), pages 538-551.
    16. Yana Melnykov & Marcus Perry, 2024. "On Robust Change Point Detection and Estimation in Multisubject Studies," Sankhya A: The Indian Journal of Statistics, Springer;Indian Statistical Institute, vol. 86(2), pages 827-879, August.
    17. Forzani, Liliana & Gieco, Antonella & Tolmasky, Carlos, 2017. "Likelihood ratio test for partial sphericity in high and ultra-high dimensions," Journal of Multivariate Analysis, Elsevier, vol. 159(C), pages 18-38.
    18. Yujia Li & Xiangrui Zeng & Chien‐Wei Lin & George C. Tseng, 2022. "Simultaneous estimation of cluster number and feature sparsity in high‐dimensional cluster analysis," Biometrics, The International Biometric Society, vol. 78(2), pages 574-585, June.
    19. Vojtech Blazek & Michal Petruzela & Tomas Vantuch & Zdenek Slanina & Stanislav Mišák & Wojciech Walendziuk, 2020. "The Estimation of the Influence of Household Appliances on the Power Quality in a Microgrid System," Energies, MDPI, vol. 13(17), pages 1-21, August.
    20. YongSeog Kim & W. Nick Street & Gary J. Russell & Filippo Menczer, 2005. "Customer Targeting: A Neural Network Approach Guided by Genetic Algorithms," Management Science, INFORMS, vol. 51(2), pages 264-276, February.
