IDEAS home Printed from https://ideas.repec.org/a/gam/jmathe/v10y2022i24p4744-d1002989.html

Privacy Protection Practice for Data Mining with Multiple Data Sources: An Example with Data Clustering

Authors
  • Pauline O’Shaughnessy

    (School of Mathematics and Applied Statistics, University of Wollongong, Wollongong, NSW 2522, Australia. These authors contributed equally to this work.)

  • Yan-Xia Lin

    (School of Mathematics and Applied Statistics, University of Wollongong, Wollongong, NSW 2522, Australia. These authors contributed equally to this work.)

Abstract

In the age of data, data mining provides feasible tools for handling large datasets that combine data from multiple sources. However, there is limited research on retrieving statistical information when the data are confidential and cannot be shared directly. In this paper, we address this problem and propose a framework for analysing data from multiple sources without revealing the true values. The proposed framework has three steps. First, data custodians individually mask their data before publishing; second, the collection of masked data is used to reconstruct the density function of the original dataset, from which resampled values are generated; last, existing data mining techniques are applied directly to the resampled data. The framework relies on reconstructing the original density function from noise-masked data using the moment-based density estimation method. Simulation studies show that the proposed framework performs well: analysis results from the resampled data are comparable to those from the original data when the density of the original data is estimated well. The framework is demonstrated in a data clustering analysis of a real-life Australian soybean dataset. Results from the k-means algorithm with two and three fitted clusters show that cluster analysis using resampled data can closely replicate that of the original data.
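
The three steps described in the abstract can be illustrated with a minimal, one-dimensional Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the masking is taken to be multiplicative noise drawn from Uniform(0.5, 1.5), the original values are assumed to be pre-scaled to [-1, 1], a truncated Legendre series stands in for the moment-based density estimator, and the helper names (mask, estimate_density, resample, kmeans_1d) and the two simulated sources are hypothetical.

```python
import random

def legendre_polys(n):
    """Power-basis coefficients (low to high degree) of Legendre P_0..P_n."""
    polys = [[1.0], [0.0, 1.0]]
    for k in range(1, n):
        x_pk = [0.0] + polys[k]              # coefficients of x * P_k
        pk_1 = polys[k - 1] + [0.0, 0.0]     # P_{k-1}, padded to equal length
        # Recurrence: (k+1) P_{k+1} = (2k+1) x P_k - k P_{k-1}
        polys.append([((2 * k + 1) * s - k * p) / (k + 1)
                      for s, p in zip(x_pk, pk_1)])
    return polys[:n + 1]

def mask(values, rng, lo=0.5, hi=1.5):
    """Step 1 (custodian side): y = x * c with c ~ Uniform(lo, hi) (assumed)."""
    return [x * rng.uniform(lo, hi) for x in values]

def noise_moment(k, lo=0.5, hi=1.5):
    """E[C^k] for C ~ Uniform(lo, hi); the noise law is assumed public."""
    return (hi ** (k + 1) - lo ** (k + 1)) / ((k + 1) * (hi - lo))

def estimate_density(masked, order=6):
    """Step 2a: moments of X from masked data, then a Legendre-series density."""
    n = len(masked)
    # E[X^k] = E[Y^k] / E[C^k], since X and C are independent.
    moments = [sum(y ** k for y in masked) / n / noise_moment(k)
               for k in range(order + 1)]
    polys = legendre_polys(order)
    # Series coefficient: lambda_k = (2k+1)/2 * E[P_k(X)]
    lam = [(2 * k + 1) / 2 * sum(c * m for c, m in zip(p, moments))
           for k, p in enumerate(polys)]

    def density(x):
        return sum(l * sum(c * x ** j for j, c in enumerate(p))
                   for l, p in zip(lam, polys))
    return density

def resample(density, size, rng, cells=400):
    """Step 2b: draw values from the estimated density on a grid over [-1, 1]."""
    width = 2.0 / cells
    mids = [-1.0 + (i + 0.5) * width for i in range(cells)]
    weights = [max(density(x), 0.0) for x in mids]   # clip negative ripples
    picks = rng.choices(mids, weights=weights, k=size)
    return [m + rng.uniform(-width / 2, width / 2) for m in picks]

def kmeans_1d(values, k, iters=25):
    """Step 3: plain Lloyd's k-means in one dimension, quantile initialisation."""
    s = sorted(values)
    centers = [s[(2 * i + 1) * len(s) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return sorted(centers)

rng = random.Random(0)

def source(center, size):
    """Hypothetical confidential data held by one custodian, clipped to [-1, 1]."""
    return [max(-1.0, min(1.0, rng.gauss(center, 0.15))) for _ in range(size)]

site_a, site_b = source(-0.5, 2000), source(0.5, 2000)
original = site_a + site_b                      # never shared in practice
masked = mask(site_a, rng) + mask(site_b, rng)  # only masked values are pooled
synthetic = resample(estimate_density(masked), len(masked), rng)

centers_original = kmeans_1d(original, 2)
centers_resampled = kmeans_1d(synthetic, 2)
print("original :", centers_original)
print("resampled:", centers_resampled)
```

In this toy setting the analyst only ever sees the pooled masked values and the noise distribution, yet the cluster centres from the resampled data should land close to those from the original data. The series order is kept low here because high-order moments recovered from masked data are noisy; how well the clustering is replicated depends on how well the density is estimated, as the abstract notes.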

Suggested Citation

  • Pauline O’Shaughnessy & Yan-Xia Lin, 2022. "Privacy Protection Practice for Data Mining with Multiple Data Sources: An Example with Data Clustering," Mathematics, MDPI, vol. 10(24), pages 1-13, December.
  • Handle: RePEc:gam:jmathe:v:10:y:2022:i:24:p:4744-:d:1002989

    Download full text from publisher

    File URL: https://www.mdpi.com/2227-7390/10/24/4744/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2227-7390/10/24/4744/
    Download Restriction: no
