IDEAS home Printed from https://ideas.repec.org/a/gam/jmathe/v10y2022i15p2671-d874971.html

PCDM and PCDM4MP: New Pairwise Correlation-Based Data Mining Tools for Parallel Processing of Large Tabular Datasets

Author

Listed:
  • Daniel Homocianu

    (Department of Accounting, Business Information Systems, and Statistics, Faculty of Economics and Business Administration, Alexandru Ioan Cuza University, 700505 Jassy, Romania)

  • Dinu Airinei

    (Department of Accounting, Business Information Systems, and Statistics, Faculty of Economics and Business Administration, Alexandru Ioan Cuza University, 700505 Jassy, Romania)

Abstract

The paper describes PCDM and PCDM4MP, new tools and commands for exploring large datasets. They select variables according to the absolute values of Pearson’s pairwise correlation coefficients between a chosen response variable and every other variable in the dataset. For each pair, they also report the corresponding significance and the number of non-null intersecting observations, and both the source and the output are record-oriented. Optionally, by passing threshold values for these three measures as parameters of PCDM, a user can select the most correlated variables based on high magnitude, significance, and support criteria. The syntax is simple, and the tools show the exploration progress in real time. PCDM4MP can also launch several instances of Stata, each processing a distinct class of variables from the same dataset, with the classes obtained by simple name filtering (first letter). This multi-processing (MP) version overcomes the parallelization limitations of the existing parallel module. It does so by using vertical instead of horizontal partitions of large flat datasets; by dynamically generating the task pattern, tasks, and logs within a single execution of this second command; and by using the existing qsub module to automatically and continuously allocate the tasks to logical processors, thereby emulating a cluster environment with fewer resources. Users can perform further selections based on the results printed in the console. The paper contains examples of using these tools on large datasets, such as the one belonging to the World Values Survey, that follow a simple variable naming practice. The article includes many recorded simulations and presents performance results, which depend on the resources and hardware configurations used: cloud vs. on-premises, large vs. small amounts of RAM and numbers of processing cores, and in-memory vs. traditional storage.
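
The selection and partitioning ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' Stata implementation. All function names (`pairwise_stats`, `select`, `partition_by_first_letter`, `select_parallel`) are hypothetical, the significance is an assumed normal approximation based on the Fisher z-transform rather than an exact test, and a thread pool merely stands in for the separate Stata instances that PCDM4MP launches.

```python
import math
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def pairwise_stats(xs, ys):
    """Return (r, p, n) over the rows where both values are present, or None."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 4:  # the Fisher approximation below needs n > 3
        return None
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if sxx == 0 or syy == 0:  # constant column: correlation undefined
        return None
    r = max(-1.0, min(1.0, sxy / math.sqrt(sxx * syy)))
    if abs(r) == 1.0:
        p = 0.0
    else:
        # Assumption: two-sided p-value from the Fisher z-transform,
        # an approximation chosen to keep the sketch stdlib-only.
        z = math.atanh(r) * math.sqrt(n - 3)
        p = math.erfc(abs(z) / math.sqrt(2))
    return r, p, n


def select(data, response, names, min_abs_r=0.0, max_p=1.0, min_n=0):
    """Keep variables passing the magnitude, significance, and support thresholds."""
    rows = []
    for name in names:
        if name == response:
            continue
        stats = pairwise_stats(data[response], data[name])
        if stats is None:
            continue
        r, p, n = stats
        if abs(r) >= min_abs_r and p <= max_p and n >= min_n:
            rows.append((name, r, p, n))
    return sorted(rows, key=lambda row: -abs(row[1]))


def partition_by_first_letter(names):
    """Vertical partition: group variable (column) names by their first letter."""
    groups = defaultdict(list)
    for name in names:
        groups[name[0].upper()].append(name)
    return dict(groups)


def select_parallel(data, response, **thresholds):
    """Process each first-letter class as an independent task, then merge."""
    groups = partition_by_first_letter(n for n in data if n != response)
    with ThreadPoolExecutor() as pool:
        parts = pool.map(
            lambda names: select(data, response, names, **thresholds),
            groups.values(),
        )
        merged = [row for part in parts for row in part]
    return sorted(merged, key=lambda row: -abs(row[1]))
```

Here `data` is assumed to be a dictionary mapping each variable name to a list of values, with `None` marking a missing observation. A call such as `select_parallel(data, "Y", min_abs_r=0.9, max_p=0.05, min_n=6)` would return `(name, r, p, n)` tuples sorted by decreasing absolute correlation. Because each first-letter class touches only its own columns plus the response, the tasks share no state and can be allocated to workers in any order.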

Suggested Citation

  • Daniel Homocianu & Dinu Airinei, 2022. "PCDM and PCDM4MP: New Pairwise Correlation-Based Data Mining Tools for Parallel Processing of Large Tabular Datasets," Mathematics, MDPI, vol. 10(15), pages 1-27, July.
  • Handle: RePEc:gam:jmathe:v:10:y:2022:i:15:p:2671-:d:874971

    Download full text from publisher

    File URL: https://www.mdpi.com/2227-7390/10/15/2671/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2227-7390/10/15/2671/
    Download Restriction: no

    References listed on IDEAS

    1. Oliver KOPF & Daniel HOMOCIANU, 2016. "The Business Intelligence Based Business Process Management Challenge," Informatica Economica, Academy of Economic Studies - Bucharest, Romania, vol. 20(1), pages 7-19.
    2. Ali Reza Sadeghi & Yasaman Bahadori, 2021. "Urban Sustainability and Climate Issues: The Effect of Physical Parameters of Streetscape on the Thermal Comfort in Urban Public Spaces; Case Study: Karimkhan-e-Zand Street, Shiraz, Iran," Sustainability, MDPI, vol. 13(19), pages 1-23, September.
    3. Giampiero Giacomello & Damiano Martinelli, 2021. "Crystal Clear: Investigating Databases for Research, the Case of Drone Strikes," Data, MDPI, vol. 6(12), pages 1-18, November.
    4. Seng Boon Lim & Jalaluddin Abdul Malek & Tan Yigitcanlar, 2021. "Post-Materialist Values of Smart City Societies: International Comparison of Public Values for Good Enough Governance," Future Internet, MDPI, vol. 13(8), pages 1-13, August.
    5. Dinu AIRINEI & Daniel HOMOCIANU, 2009. "The Geographical Dimension Of Dss Applications," Analele Stiintifice ale Universitatii "Alexandru Ioan Cuza" din Iasi - Stiinte Economice (1954-2015), Alexandru Ioan Cuza University, Faculty of Economics and Business Administration, vol. 56, pages 637-642, November.
    6. Marcus R. Munafò & George Davey Smith, 2018. "Robust research needs many lines of evidence," Nature, Nature, vol. 553(7689), pages 399-401, January.
    7. Achim Ahrens & Christian B. Hansen & Mark E. Schaffer, 2020. "lassopack: Model selection and prediction with regularized regression in Stata," Stata Journal, StataCorp LP, vol. 20(1), pages 176-235, March.
    8. Maria Cristina Sierras-Davo & Manuel Lillo-Crespo & Patricia Verdu & Aimilia Karapostoli, 2021. "Transforming the Future Healthcare Workforce across Europe through Improvement Science Training: A Qualitative Approach," IJERPH, MDPI, vol. 18(3), pages 1-8, February.
    9. Manuela Ortega-Gil & Antonio Mata García & Chaima ElHichou-Ahmed, 2021. "The Effect of Ageing, Gender and Environmental Problems in Subjective Well-Being," Land, MDPI, vol. 10(12), pages 1-14, November.
    10. Rania S. Miniesy & Mariam AbdelKarim, 2021. "Generalized Trust and Economic Growth: The Nexus in MENA Countries," Economies, MDPI, vol. 9(1), pages 1-22, March.
    11. Fakih, Ali & Makdissi, Paul & Marrouch, Walid & Tabri, Rami V. & Yazbeck, Myra, 2022. "A stochastic dominance test under survey nonresponse with an application to comparing trust levels in Lebanese public institutions," Journal of Econometrics, Elsevier, vol. 228(2), pages 342-358.
    12. Keiichi Hayashi & Lizzida P. Llorca & Iris D. Bugayong & Nurwulan Agustiani & Ailon Oliver V. Capistrano, 2021. "Evaluating the Predictive Accuracy of the Weather-Rice-Nutrient Integrated Decision Support System (WeRise) to Improve Rainfed Rice Productivity in Southeast Asia," Agriculture, MDPI, vol. 11(4), pages 1-13, April.
    13. Matthias Schonlau, 2005. "Boosted regression (boosting): An introductory tutorial and a Stata plugin," Stata Journal, StataCorp LP, vol. 5(3), pages 330-354, September.
    14. Alexander Zlotnik & Victor Abraira, 2015. "A general-purpose nomogram generator for predictive logistic regression models," Stata Journal, StataCorp LP, vol. 15(2), pages 537-546, June.
    15. Giuseppe De Luca & Jan R. Magnus, 2011. "Bayesian model averaging and weighted-average least squares: Equivariance, stability, and numerical issues," Stata Journal, StataCorp LP, vol. 11(4), pages 518-544, December.
    16. Daniel Homocianu & Aurelian-Petruș Plopeanu & Rodica Ianole-Calin, 2021. "A Robust Approach for Identifying the Major Components of the Bribery Tolerance Index," Mathematics, MDPI, vol. 9(13), pages 1-20, July.
    17. Kingston Rajiah & Shreeta Sivarasa & Mari Kannan Maharajan, 2021. "Impact of Pharmacists’ Interventions and Patients’ Decision on Health Outcomes in Terms of Medication Adherence and Quality Use of Medicines among Patients Attending Community Pharmacies: A Systematic," IJERPH, MDPI, vol. 18(9), pages 1-14, April.

    Citations



    Cited by:

    1. Daniel Homocianu, 2023. "Exploring the Predictors of Co-Nationals’ Preference over Immigrants in Accessing Jobs—Evidence from World Values Survey," Mathematics, MDPI, vol. 11(3), pages 1-29, February.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Daniel Homocianu, 2023. "Exploring the Predictors of Co-Nationals’ Preference over Immigrants in Accessing Jobs—Evidence from World Values Survey," Mathematics, MDPI, vol. 11(3), pages 1-29, February.
    2. Daniel Homocianu & Octavian Dospinescu & Napoleon-Alexandru Sireteanu, 2022. "Exploring the Influences of Job Satisfaction for Europeans Aged 50 + from Ex-communist vs. Non-communist Countries," Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, Springer, vol. 159(1), pages 235-279, January.
    3. Aurelian-Petruș Plopeanu & Daniel Homocianu & Nelu Florea & Ovidiu-Aurel Ghiuță & Dinu Airinei, 2019. "Comparative Patterns of Migration Intentions: Evidence from Eastern European Students in Economics from Romania and Republic of Moldova," Sustainability, MDPI, vol. 11(18), pages 1-21, September.
    4. Irene Mosca & Alan Barrett, 2016. "The impact of adult child emigration on the mental health of older parents," Journal of Population Economics, Springer;European Society for Population Economics, vol. 29(3), pages 687-719, July.
    5. Biewen, Martin & Erhardt, Pascal, 2024. "Using Post-Regularization Distribution Regression to Measure the Effects of a Minimum Wage on Hourly Wages, Hours Worked and Monthly Earnings," IZA Discussion Papers 16894, Institute of Labor Economics (IZA).
    6. Hagen, Tobias, 2013. "Impact of national financial regulation on macroeconomic and fiscal performance after the 2007 financial stock: Econometric analyses based on cross-country data," Working Paper Series 02, Frankfurt University of Applied Sciences, Faculty of Business and Law.
    7. Dimos, Christos & Pugh, Geoff & Hisarciklilar, Mehtap & Talam, Ema & Jackson, Ian, 2022. "The relative effectiveness of R&D tax credits and R&D subsidies: A comparative meta-regression analysis," Technovation, Elsevier, vol. 115(C).
    8. Mattia Filomena & Matteo Picchio, 2023. "Retirement and health outcomes in a meta‐analytical framework," Journal of Economic Surveys, Wiley Blackwell, vol. 37(4), pages 1120-1155, September.
    9. Kanga, Désiré & Soumaré, Issouf & Amenounvé, Edoh, 2023. "Can corporate financing through the stock market create systemic risk? Evidence from the BRVM securities market," Emerging Markets Review, Elsevier, vol. 55(C).
    10. Demena, B.A., 2021. "Effectiveness of export promotion programmes," ISS Working Papers - General Series 688, International Institute of Social Studies of Erasmus University Rotterdam (ISS), The Hague.
    11. Gorodnichenko, Yuriy & Pham, Tho & Talavera, Oleksandr, 2021. "Conference presentations and academic publishing," Economic Modelling, Elsevier, vol. 95(C), pages 228-254.
    12. Liu Yang & Yuanqing Wang & Yujun Lian & Zhongming Guo & Yuanyuan Liu & Zhouhao Wu & Tieyue Zhang, 2022. "Key Factors, Planning Strategy and Policy for Low-Carbon Transport Development in Developing Cities of China," IJERPH, MDPI, vol. 19(21), pages 1-14, October.
    13. Madhan Balasubramanian & Stephanie Short, 2021. "The Future Health Workforce: Integrated Solutions and Models of Care," IJERPH, MDPI, vol. 18(6), pages 1-4, March.
    14. Mark F. J. Steel, 2020. "Model Averaging and Its Use in Economics," Journal of Economic Literature, American Economic Association, vol. 58(3), pages 644-719, September.
    15. Nelson, Kelly P. & Parton, Lee C. & Brown, Zachary S., 2022. "Biofuels policy and innovation impacts: Evidence from biofuels and agricultural patent indicators," Energy Policy, Elsevier, vol. 162(C).
    16. Paul Makdissi & Walid Marrouch & Myra Yazbeck, 2022. "Monitoring Poverty in a Data Deprived Environment: The Case of Lebanon," Working Papers 2022-014, Human Capital and Economic Opportunity Working Group.
    17. Santos, Luca J. & Oliveira, Alessandro V.M. & Aldrighi, Dante Mendes, 2021. "Testing the differentiated impact of the COVID-19 pandemic on air travel demand considering social inclusion," Journal of Air Transport Management, Elsevier, vol. 94(C).
    18. Mathonnat, Clément & Williams, Benjamin, 2020. "Does more finance mean more inequality in times of crisis?," Economic Systems, Elsevier, vol. 44(4).
    19. Hagen, Tobias, 2013. "The impact of national financial regulation on macroeconomic and fiscal performance after the 2007 financial shock: Econometric analyses based on cross-country data," Economics - The Open-Access, Open-Assessment E-Journal (2007-2020), Kiel Institute for the World Economy (IfW Kiel), vol. 7, pages 1-44.
    20. Achim Ahrens & Sean Lyons, 2021. "Do rising rents lead to longer commutes? A gravity model of commuting flows in Ireland," Urban Studies, Urban Studies Journal Limited, vol. 58(2), pages 264-279, February.
