Printed from https://ideas.repec.org/a/plo/pone00/0238835.html

Analyzing the fine structure of distributions

Author

Listed:
  • Michael C Thrun
  • Tino Gehlert
  • Alfred Ultsch

Abstract

One aim of data mining is the identification of interesting structures in data. For better analytical results, the basic properties of an empirical distribution, such as skewness and possible clipping, i.e. hard limits in value ranges, need to be assessed. Of particular interest is the question of whether the data originate from one process or contain subsets related to different states of the data-producing process. Data visualization tools should deliver a clear picture of the univariate probability density function (PDF) for each feature. Visualization tools for PDFs typically use kernel density estimates and include the classical histogram as well as modern tools such as ridgeline plots, bean plots and violin plots. If density estimation parameters remain in a default setting, conventional methods pose several problems when visualizing the PDF of uniform, multimodal and skewed distributions and of distributions with clipped data. For that reason, a new visualization tool called the mirrored density plot (MD plot), which is specifically designed to discover interesting structures in continuous features, is proposed. The MD plot does not require adjusting any parameters of density estimation, which may make it particularly compelling to non-experts. The visualization tools in question are evaluated against statistical tests with regard to typical challenges of explorative distribution analysis. The results of the evaluation are presented using bimodal Gaussian and skewed distributions and several features with already published PDFs. In an exploratory data analysis of 12 features describing quarterly financial statements, where statistical testing poses great difficulty, only the MD plots can identify the structure of their PDFs. In sum, the MD plot outperforms the above-mentioned methods.
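
The problem the abstract describes for clipped data under default density estimation settings can be illustrated with a minimal sketch. It assumes NumPy and SciPy, uses synthetic data, and is not the MD plot or the authors' evaluation: it only measures how much probability mass a Gaussian kernel density estimate with its default bandwidth places outside hard limits that the data never exceed. The function name `leaked_mass` and the limits of -1 and 1 are hypothetical choices made for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde


def leaked_mass(sample, lower, upper):
    """Probability mass a default Gaussian KDE places outside [lower, upper]."""
    kde = gaussian_kde(sample)  # default bandwidth (Scott's rule), no tuning
    inside = kde.integrate_box_1d(lower, upper)
    return 1.0 - inside


# Synthetic data (an assumption for illustration): a standard normal sample
# whose values are clipped to hard limits at -1 and 1.
rng = np.random.default_rng(0)
raw = rng.normal(loc=0.0, scale=1.0, size=2000)
clipped = np.clip(raw, -1.0, 1.0)  # values beyond the limits pile up on them

# No observation lies outside [-1, 1], yet the smoothed estimate does.
leak = leaked_mass(clipped, -1.0, 1.0)
print(f"estimated mass outside the clipping limits: {leak:.3f}")
```

A density-based plot drawn from such an estimate would extend beyond the limits, so the clipping is blurred rather than shown as a hard edge.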

Suggested Citation

  • Michael C Thrun & Tino Gehlert & Alfred Ultsch, 2020. "Analyzing the fine structure of distributions," PLOS ONE, Public Library of Science, vol. 15(10), pages 1-20, October.
  • Handle: RePEc:plo:pone00:0238835
    DOI: 10.1371/journal.pone.0238835

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0238835
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0238835&type=printable
    Download Restriction: no


    References listed on IDEAS

    1. Glenn Milligan & Martha Cooper, 1988. "A study of standardization of variables in cluster analysis," Journal of Classification, Springer;The Classification Society, vol. 5(2), pages 181-204, September.
    2. Jeff Alstott & Ed Bullmore & Dietmar Plenz, 2014. "powerlaw: A Python Package for Analysis of Heavy-Tailed Distributions," PLOS ONE, Public Library of Science, vol. 9(1), pages 1-11, January.
    3. Racine, Jeffrey S., 2008. "Nonparametric Econometrics: A Primer," Foundations and Trends(R) in Econometrics, now publishers, vol. 3(1), pages 1-88, March.
    4. Levy, Moshe & Solomon, Sorin, 1997. "New evidence for the power-law distribution of wealth," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 242(1), pages 90-94.
    5. Ferreira, Jose T.A.S. & Steel, Mark F.J., 2006. "A Constructive Representation of Univariate Skewed Distributions," Journal of the American Statistical Association, American Statistical Association, vol. 101, pages 823-829, June.
    6. Kampstra, Peter, 2008. "Beanplot: A Boxplot Alternative for Visual Comparison of Distributions," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 28(c01).

    Citations

    Cited by:

    1. Michael C. Thrun & Alfred Ultsch, 2021. "Using Projection-Based Clustering to Find Distance- and Density-Based Clusters in High-Dimensional Data," Journal of Classification, Springer;The Classification Society, vol. 38(2), pages 280-312, July.
    2. Marian Lux & Stefanie Rinderle-Ma, 2023. "DDCAL: Evenly Distributing Data into Low Variance Clusters Based on Iterative Feature Scaling," Journal of Classification, Springer;The Classification Society, vol. 40(1), pages 106-144, April.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Vijverberg, Wim P. & Hasebe, Takuya, 2015. "GTL Regression: A Linear Model with Skewed and Thick-Tailed Disturbances," IZA Discussion Papers 8898, Institute of Labor Economics (IZA).
    2. Brzezinski, Michal, 2014. "Do wealth distributions follow power laws? Evidence from ‘rich lists’," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 406(C), pages 155-162.
    3. Roberto Martino & Phu Nguyen-Van, 2014. "Labour market regulation and fiscal parameters: A structural model for European regions," Working Papers of BETA 2014-19, Bureau d'Economie Théorique et Appliquée, UDS, Strasbourg.
    4. Rubio, F.J. & Steel, M.F.J., 2011. "Inference for grouped data with a truncated skew-Laplace distribution," Computational Statistics & Data Analysis, Elsevier, vol. 55(12), pages 3218-3231, December.
    5. George Halkos & Roman Matousek & Nickolaos Tzeremes, 2016. "Pre-evaluating technical efficiency gains from possible mergers and acquisitions: evidence from Japanese regional banks," Review of Quantitative Finance and Accounting, Springer, vol. 46(1), pages 47-77, January.
    6. Giuseppe RICCIARDO LAMONICA, 2002. "La funzionalita' nelle zone omogenee delle Marche," Working Papers 165, Universita' Politecnica delle Marche (I), Dipartimento di Scienze Economiche e Sociali.
    7. Roberto Rocci & Stefano Antonio Gattone & Roberto Di Mari, 2018. "A data driven equivariant approach to constrained Gaussian mixture modeling," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 12(2), pages 235-260, June.
    8. Rik Chakraborti & Gavin Roberts, 2023. "How price-gouging regulation undermined COVID-19 mitigation: county-level evidence of unintended consequences," Public Choice, Springer, vol. 196(1), pages 51-83, July.
    9. Dawid Majcherek & Marzenna Anna Weresa & Christina Ciecierski, 2020. "Understanding Regional Risk Factors for Cancer: A Cluster Analysis of Lifestyle, Environment and Socio-Economic Status in Poland," Sustainability, MDPI, vol. 12(21), pages 1-15, October.
    10. Sumeet Kumar & Binxuan Huang & Ramon Alfonso Villa Cox & Kathleen M. Carley, 2021. "An anatomical comparison of fake-news and trusted-news sharing pattern on Twitter," Computational and Mathematical Organization Theory, Springer, vol. 27(2), pages 109-133, June.
    11. Ricardo Lopez-Ruiz & Elyas Shivanian & Jose-Luis Lopez, 2013. "Random Market Models with an H-Theorem," Papers 1307.2169, arXiv.org, revised Jul 2014.
    12. Mustafa Koroglu & Yiguo Sun, 2016. "Functional-Coefficient Spatial Durbin Models with Nonparametric Spatial Weights: An Application to Economic Growth," Econometrics, MDPI, vol. 4(1), pages 1-16, February.
    13. Don Harding, 2010. "Applying shape and phase restrictions in generalized dynamic categorical models of the business cycle," NCER Working Paper Series 58, National Centre for Econometric Research.
    14. E. Samanidou & E. Zschischang & D. Stauffer & T. Lux, 2001. "Microscopic Models of Financial Markets," Papers cond-mat/0110354, arXiv.org.
    15. Rutten, Philip & Lees, Michael H. & Klous, Sander & Sloot, Peter M.A., 2021. "Intermittent and persistent movement patterns of dance event visitors in large sporting venues," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 563(C).
    16. Jovanovic, Franck & Schinckus, Christophe, 2016. "Breaking down the barriers between econophysics and financial economics," International Review of Financial Analysis, Elsevier, vol. 47(C), pages 256-266.
    17. Čížek, Pavel & Koo, Chao Hui, 2021. "Jump-preserving varying-coefficient models for nonlinear time series," Econometrics and Statistics, Elsevier, vol. 19(C), pages 58-96.
    18. Rama Cont & Jean-Philippe Bouchaud, 1997. "Herd behavior and aggregate fluctuations in financial markets," Science & Finance (CFM) working paper archive 500028, Science & Finance, Capital Fund Management.
    19. Wang, Yuanjun & You, Shibing, 2016. "An alternative method for modeling the size distribution of top wealth," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 457(C), pages 443-453.
    20. Khalilzadeh, Jalayer, 2022. "It is a small world, or is it? A look into two decades of tourism system," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 606(C).
