IDEAS home Printed from https://ideas.repec.org/a/gam/jmathe/v10y2022i15p2671-d874971.html

PCDM and PCDM4MP: New Pairwise Correlation-Based Data Mining Tools for Parallel Processing of Large Tabular Datasets

Author

Listed:
  • Daniel Homocianu

    (Department of Accounting, Business Information Systems, and Statistics, Faculty of Economics and Business Administration, Alexandru Ioan Cuza University, 700505 Jassy, Romania)

  • Dinu Airinei

    (Department of Accounting, Business Information Systems, and Statistics, Faculty of Economics and Business Administration, Alexandru Ioan Cuza University, 700505 Jassy, Romania)

Abstract

The paper describes PCDM and PCDM4MP, two new tools and commands for exploring large datasets. They select variables by the absolute value of Pearson’s pairwise correlation coefficient between a chosen response variable and every other variable in the dataset. For each pair, they also report the corresponding significance and the number of non-null intersecting observations, and both the source and the output are record-oriented. Optionally, by passing threshold values for these three measures as parameters of PCDM, any user can select the most correlated variables on the criteria of high magnitude, significance, and support. The syntax is simple, and the tools show the exploration progress in real time. PCDM4MP can additionally trigger different instances of Stata, each working on a distinct class of variables from the same dataset, where the classes result from simple name filtering (first letter). This multi-processing (MP) version overcomes the parallelization limitations of the existing parallel module. It does so by using vertical instead of horizontal partitions of large flat datasets; by dynamically generating the task pattern, tasks, and logs within a single execution of this second command; and by using the existing qsub module to allocate the tasks automatically and continuously to logical processors, thereby emulating a cluster environment with fewer resources. Any user can also perform further selections based on the results printed in the console. The paper contains examples of using these tools on large datasets such as the World Values Survey, relying on a simple variable naming practice. The article includes many recorded simulations and presents performance results, which depend on the resources and hardware configurations used: cloud vs. on-premises, large vs. small amounts of RAM and processing cores, and in-memory vs. traditional storage.
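
The selection logic summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' Stata implementation: the function names (`pairwise_stats`, `screen`, `partition_by_first_letter`) and the threshold parameter names are hypothetical, missing values are assumed to be encoded as `None`, and the significance is approximated with the Fisher z-transform rather than the t-distribution, so p-values will differ somewhat from what a statistics package reports.

```python
import math


def pairwise_stats(xs, ys):
    """Return (|r|, approximate p-value, n) using only rows where both values exist."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 4:
        return None  # the Fisher z approximation below needs n > 3
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return None  # a constant column has no defined correlation
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    r = max(-1.0, min(1.0, sxy / math.sqrt(sxx * syy)))  # clamp rounding noise
    if abs(r) == 1.0:
        p = 0.0
    else:
        # Assumption: normal approximation via Fisher's z, not an exact t-test.
        z = math.atanh(r) * math.sqrt(n - 3)
        p = math.erfc(abs(z) / math.sqrt(2))
    return abs(r), p, n


def screen(data, response, min_abs_r=0.0, max_p=1.0, min_n=0):
    """Keep variables passing the magnitude, significance, and support thresholds."""
    rows = []
    for name, values in data.items():
        if name == response:
            continue
        stats = pairwise_stats(data[response], values)
        if stats is None:
            continue
        abs_r, p, n = stats
        if abs_r >= min_abs_r and p <= max_p and n >= min_n:
            rows.append((name, abs_r, p, n))
    # Strongest correlations first; ties broken by variable name.
    return sorted(rows, key=lambda row: (-row[1], row[0]))


def partition_by_first_letter(names):
    """Group variable names into vertical partitions keyed by their first letter."""
    groups = {}
    for name in names:
        groups.setdefault(name[0].lower(), []).append(name)
    return groups


# Hypothetical toy dataset: one list per variable, None marks a missing value.
example = {
    "y": [1, 2, 3, 4, 5, 6],
    "a1": [2, 4, 6, 8, 10, 12],
    "b1": [6, 5, 4, None, 2, 1],
    "b2": [1, -1, 1, -1, 1, -1],
}
for row in screen(example, "y", min_abs_r=0.5, max_p=0.05, min_n=5):
    print(row)
print(partition_by_first_letter([k for k in example if k != "y"]))
```

Because each first-letter group is screened against the same response variable independently, the groups can be handed to separate workers and their results concatenated afterwards, which is the idea behind the vertical partitioning the abstract describes.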

Suggested Citation

  • Daniel Homocianu & Dinu Airinei, 2022. "PCDM and PCDM4MP: New Pairwise Correlation-Based Data Mining Tools for Parallel Processing of Large Tabular Datasets," Mathematics, MDPI, vol. 10(15), pages 1-27, July.
  • Handle: RePEc:gam:jmathe:v:10:y:2022:i:15:p:2671-:d:874971

    Download full text from publisher

    File URL: https://www.mdpi.com/2227-7390/10/15/2671/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2227-7390/10/15/2671/
    Download Restriction: no


    Citations



    Cited by:

    1. Daniel Homocianu, 2023. "Exploring the Predictors of Co-Nationals’ Preference over Immigrants in Accessing Jobs—Evidence from World Values Survey," Mathematics, MDPI, vol. 11(3), pages 1-29, February.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Daniel Homocianu, 2024. "Life Satisfaction: Insights from the World Values Survey," Societies, MDPI, vol. 14(7), pages 1-41, July.
    2. Daniel Homocianu, 2023. "Exploring the Predictors of Co-Nationals’ Preference over Immigrants in Accessing Jobs—Evidence from World Values Survey," Mathematics, MDPI, vol. 11(3), pages 1-29, February.
    3. Daniel Homocianu & Octavian Dospinescu & Napoleon-Alexandru Sireteanu, 2022. "Exploring the Influences of Job Satisfaction for Europeans Aged 50 + from Ex-communist vs. Non-communist Countries," Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, Springer, vol. 159(1), pages 235-279, January.
    4. Aurelian-Petruș Plopeanu & Daniel Homocianu & Nelu Florea & Ovidiu-Aurel Ghiuță & Dinu Airinei, 2019. "Comparative Patterns of Migration Intentions: Evidence from Eastern European Students in Economics from Romania and Republic of Moldova," Sustainability, MDPI, vol. 11(18), pages 1-21, September.
    5. Jingfeng Zhao & Fan Sun, 2023. "Study on the Influence Mechanism and Adjustment Path of Climate Risk on China’s High-Quality Economic Development," Sustainability, MDPI, vol. 15(12), pages 1-19, June.
    6. Irene Mosca & Alan Barrett, 2016. "The impact of adult child emigration on the mental health of older parents," Journal of Population Economics, Springer;European Society for Population Economics, vol. 29(3), pages 687-719, July.
    7. Srdelić, Leonarda & Dávila-Fernández, Marwil J., 2024. "International trade and economic growth in Croatia," Structural Change and Economic Dynamics, Elsevier, vol. 68(C), pages 240-258.
    8. Julia Estefania‐Flores & Davide Furceri & Siddharth Kothari & Jonathan D. Ostry, 2023. "Worse than you think: Public debt forecast errors in advanced and developing economies," Journal of Forecasting, John Wiley & Sons, Ltd., vol. 42(3), pages 685-714, April.
    9. Nelson, Kelly P. & Parton, Lee C. & Brown, Zachary S., 2022. "Biofuels policy and innovation impacts: Evidence from biofuels and agricultural patent indicators," Energy Policy, Elsevier, vol. 162(C).
    10. Yun Qiu & Xi Chen & Wei Shi, 2020. "Impacts of social and economic factors on the transmission of coronavirus disease 2019 (COVID-19) in China," Journal of Population Economics, Springer;European Society for Population Economics, vol. 33(4), pages 1127-1172, October.
    11. Biewen, Martin & Erhardt, Pascal, 2024. "Using Post-Regularization Distribution Regression to Measure the Effects of a Minimum Wage on Hourly Wages, Hours Worked and Monthly Earnings," IZA Discussion Papers 16894, Institute of Labor Economics (IZA).
    12. Hagen, Tobias, 2013. "Impact of national financial regulation on macroeconomic and fiscal performance after the 2007 financial shock: Econometric analyses based on cross-country data," Working Paper Series 02, Frankfurt University of Applied Sciences, Faculty of Business and Law.
    13. Mattia Filomena & Matteo Picchio, 2023. "Retirement and health outcomes in a meta‐analytical framework," Journal of Economic Surveys, Wiley Blackwell, vol. 37(4), pages 1120-1155, September.
    14. Dimos, Christos & Pugh, Geoff & Hisarciklilar, Mehtap & Talam, Ema & Jackson, Ian, 2022. "The relative effectiveness of R&D tax credits and R&D subsidies: A comparative meta-regression analysis," Technovation, Elsevier, vol. 115(C).
    15. Eduardo Correia & Rodrigo Calili & José Francisco Pessanha & Maria Fatima Almeida, 2023. "Definition of Regulatory Targets for Electricity Non-Technical Losses: Proposition of an Automatic Model-Selection Technique for Panel Data Regressions," Energies, MDPI, vol. 16(6), pages 1-22, March.
    16. Suah, Jing Lian, 2020. "Uncertainty and Exchange Rates: Global Dynamics (Well, I Don't Quite Know Anymore)," MPRA Paper 109087, University Library of Munich, Germany.
    17. Kanga, Désiré & Soumaré, Issouf & Amenounvé, Edoh, 2023. "Can corporate financing through the stock market create systemic risk? Evidence from the BRVM securities market," Emerging Markets Review, Elsevier, vol. 55(C).
    18. Anya Topiwala & Kulveer Mankia & Steven Bell & Alastair Webb & Klaus P. Ebmeier & Isobel Howard & Chaoyue Wang & Fidel Alfaro-Almagro & Karla Miller & Stephen Burgess & Stephen Smith & Thomas E. Nicho, 2023. "Association of gout with brain reserve and vulnerability to neurodegenerative disease," Nature Communications, Nature, vol. 14(1), pages 1-9, December.
    19. Demena, B.A., 2021. "Effectiveness of export promotion programmes," ISS Working Papers - General Series 688, International Institute of Social Studies of Erasmus University Rotterdam (ISS), The Hague.
    20. Gorodnichenko, Yuriy & Pham, Tho & Talavera, Oleksandr, 2021. "Conference presentations and academic publishing," Economic Modelling, Elsevier, vol. 95(C), pages 228-254.
