
Beyond Zipf’s law: Exploring the discrete generalized beta distribution in open-source repositories

Author

Listed:
  • Nowak, Przemysław
  • Santolini, Marc
  • Singh, Chakresh
  • Siudem, Grzegorz
  • Tupikina, Liubov

Abstract

Rank-size distributions, such as Zipf’s Law, have been instrumental in providing insights into the emergence of hierarchies across diverse systems, from linguistic corpora to urban structures. However, the application of Zipf’s Law reveals limitations, particularly in its focus on distribution tails, sometimes overlooking a large proportion of the data which might play a pivotal role in system dynamics. Yet, fitting rank-size distributions other than a straight line on the log–log scale requires caution. In this study, we re-evaluate the utility of rank-size distributions by contrasting the traditional Zipf’s Law with the Discrete Generalized Beta Distribution (DGBD). We show the need for cautious fitting techniques for rank distributions, including the use of binning to prevent overfitting to data tails. Through both analytical derivation and empirical validation on commit data of open-source repositories, we show that the DGBD consistently improves over the Zipf distribution for concave rank distributions of large datasets (N≥100). This approach contributes to the advancement of methodologies for analyzing hierarchical systems.
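
The contrast drawn in the abstract can be illustrated with a minimal sketch. It assumes the commonly used DGBD rank-size form f(r) = A (N + 1 - r)^b / r^a, of which Zipf's Law is the b = 0 special case, and fits both by ordinary least squares on the log scale. The function names, the log-spaced rank selection used as a stand-in for binning, and the synthetic data are all hypothetical choices made for illustration; they are not the fitting procedure of the paper.

```python
import numpy as np


def dgbd(r, A, a, b, N):
    """Assumed DGBD rank-size form: f(r) = A * (N + 1 - r)**b / r**a."""
    return A * (N + 1 - r) ** b / r ** a


def fit_loglinear(sizes, use_dgbd=True, n_bins=None):
    """Least-squares fit on the log scale; returns (A, a, b).

    With use_dgbd=False the term in b is dropped, which gives a Zipf fit.
    n_bins (hypothetical option) keeps only log-spaced ranks so that the
    many points in the tail do not dominate the fit.
    """
    sizes = np.sort(np.asarray(sizes, dtype=float))[::-1]  # rank 1 = largest
    N = sizes.size
    ranks = np.arange(1, N + 1)
    if n_bins is not None:
        picked = np.round(np.logspace(0, np.log10(N), n_bins)).astype(int)
        idx = np.unique(picked) - 1  # zero-based positions of kept ranks
        ranks, sizes = ranks[idx], sizes[idx]
    # log f = log A - a * log r + b * log(N + 1 - r) is linear in (log A, a, b)
    cols = [np.ones(ranks.size), -np.log(ranks)]
    if use_dgbd:
        cols.append(np.log(N + 1 - ranks))
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, np.log(sizes), rcond=None)
    A, a = float(np.exp(coef[0])), float(coef[1])
    b = float(coef[2]) if use_dgbd else 0.0
    return A, a, b


# Synthetic, noiseless example data (illustrative values only).
N = 200
r = np.arange(1, N + 1)
synthetic = dgbd(r, A=1000.0, a=0.8, b=0.5, N=N)

A_hat, a_hat, b_hat = fit_loglinear(synthetic)                # DGBD fit
A_zipf, a_zipf, b_zipf = fit_loglinear(synthetic, use_dgbd=False)  # Zipf fit
A_bin, a_bin, b_bin = fit_loglinear(synthetic, n_bins=30)     # DGBD on log-spaced ranks
```

Fitting on the log scale keeps the example short because the model becomes linear in its parameters; a maximum-likelihood or nonlinear least-squares fit would be an equally reasonable choice and may behave differently on noisy data.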

Suggested Citation

  • Nowak, Przemysław & Santolini, Marc & Singh, Chakresh & Siudem, Grzegorz & Tupikina, Liubov, 2024. "Beyond Zipf’s law: Exploring the discrete generalized beta distribution in open-source repositories," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 649(C).
  • Handle: RePEc:eee:phsmap:v:649:y:2024:i:c:s0378437124004369
    DOI: 10.1016/j.physa.2024.129927

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0378437124004369
    Download Restriction: Full text for ScienceDirect subscribers only. The journal offers the option of making the article available online on ScienceDirect for a fee of $3,000.


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Mrowinski, Maciej J. & Gagolewski, Marek & Siudem, Grzegorz, 2022. "Accidentality in journal citation patterns," Journal of Informetrics, Elsevier, vol. 16(4).
    2. Ghosh, Abhik & Mallick, Olivia & Chattopadhay, Souvik & Basu, Banasri, 2022. "Strata-based quantification of distributional uncertainty in socio-economic indicators: A comparative study of Indian states," Socio-Economic Planning Sciences, Elsevier, vol. 81(C).
    3. Cena, Anna & Gagolewski, Marek & Siudem, Grzegorz & Żogała-Siudem, Barbara, 2022. "Validating citation models by proxy indices," Journal of Informetrics, Elsevier, vol. 16(2).
    4. Marcel Ausloos & Roy Cerqueti, 2016. "Studies on Regional Wealth Inequalities: the case of Italy," Papers 1602.05356, arXiv.org.
    5. Cerqueti, Roy & Lupi, Claudio & Pietrovito, Filomena & Pozzolo, Alberto Franco, 2022. "Rank–size distributions for banks: A cross-country analysis," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 585(C).
    6. Roy Cerqueti & Eleonora Cutrini, 2021. "A Framework for Modelling Economic Regional Location Processes Under Uncertainty," Journal of Quantitative Economics, Springer;The Indian Econometric Society (TIES), vol. 19(4), pages 703-725, December.
    7. Pankaj Bajracharya & Selima Sultana, 2020. "Rank-size Distribution of Cities and Municipalities in Bangladesh," Sustainability, MDPI, vol. 12(11), pages 1-26, June.
    8. Alvarez-Martínez, R. & Cocho, G. & Rodríguez, R.F. & Martínez-Mekler, G., 2014. "Birth and death master equation for the evolution of complex networks," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 402(C), pages 198-208.
    9. Biró, Tamás S. & Telcs, András & Józsa, Máté & Néda, Zoltán, 2023. "Gintropic scaling of scientometric indexes," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 618(C).
    10. Elio Roca-Flores & Gerardo G. Naumis, 2021. "Assessing statistical hurricane risks: nonlinear regression and time-window analysis of North Atlantic annual accumulated cyclonic energy rank profile," Natural Hazards: Journal of the International Society for the Prevention and Mitigation of Natural Hazards, Springer;International Society for the Prevention and Mitigation of Natural Hazards, vol. 108(3), pages 2455-2465, September.
    11. Bertoli-Barsotti, Lucio & Gagolewski, Marek & Siudem, Grzegorz & Żogała-Siudem, Barbara, 2024. "Gini-stable Lorenz curves and their relation to the generalised Pareto distribution," Journal of Informetrics, Elsevier, vol. 18(2).
    12. Siudem, Grzegorz & Nowak, Przemysław & Gagolewski, Marek, 2022. "Power laws, the Price model, and the Pareto type-2 distribution," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 606(C).
    13. Żogała-Siudem, Barbara & Cena, Anna & Siudem, Grzegorz & Gagolewski, Marek, 2023. "Interpretable reparameterisations of citation models," Journal of Informetrics, Elsevier, vol. 17(1).
    14. Jyotirmoy Sarkar, 2018. "Will P-Value Triumph over Abuses and Attacks?," Biostatistics and Biometrics Open Access Journal, Juniper Publishers Inc., vol. 7(4), pages 66-71, July.
    15. de Camargo Fiorini, Paula & Roman Pais Seles, Bruno Michel & Chiappetta Jabbour, Charbel Jose & Barberio Mariano, Enzo & de Sousa Jabbour, Ana Beatriz Lopes, 2018. "Management theory and big data literature: From a review to a research agenda," International Journal of Information Management, Elsevier, vol. 43(C), pages 112-129.
    16. Ilda Inácio & José Velhinho, 2022. "Comments on Mathematical Aspects of the Biró–Néda Model," Mathematics, MDPI, vol. 10(4), pages 1-10, February.
    17. Lukas Schneider & Johannes Scholten & Bulcsú Sándor & Claudius Gros, 2021. "Charting closed-loop collective cultural decisions: from book best sellers and music downloads to Twitter hashtags and Reddit comments," The European Physical Journal B: Condensed Matter and Complex Systems, Springer;EDP Sciences, vol. 94(8), pages 1-13, August.
    18. Patrick Erik Bradley & Martin Behnisch, 2019. "Heavy-tailed distributions for building stock data," Environment and Planning B, vol. 46(7), pages 1281-1296, September.
    19. Calderín-Ojeda, Enrique, 2016. "The distribution of all French communes: A composite parametric approach," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 450(C), pages 385-394.
    20. Arthur Matsuo Yamashita Rios de Sousa & Hideki Takayasu & Misako Takayasu, 2017. "Detection of statistical asymmetries in non-stationary sign time series: Analysis of foreign exchange data," PLOS ONE, Public Library of Science, vol. 12(5), pages 1-18, May.
