Authors
- Tiezheng Nie
(School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China)
- Hanyu Mao
(School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China)
- Aolin Liu
(School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China)
- Xuliang Wang
(School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China)
- Derong Shen
(School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China)
- Yue Kou
(School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China)
Abstract
Column semantic-type detection is a crucial task for data integration and schema matching, particularly when dealing with large volumes of unlabeled tabular data. Existing methods often rely on supervised learning models, which require extensive labeled data. In this paper, we propose SNMatch, an unsupervised approach based on a Siamese network for detecting column semantic types without labeled training data. The novelty of SNMatch lies in its ability to generate the semantic embeddings of columns by considering both format and semantic features and clustering them into semantic types. Unlike traditional methods, which typically rely on keyword matching or supervised classification, SNMatch leverages unsupervised learning to tackle the challenges of column semantic-type detection in massive datasets with limited labeled examples. We demonstrate that SNMatch significantly outperforms current state-of-the-art techniques in terms of clustering accuracy, especially in handling complex and nested semantic types. Extensive experiments on the MACST and VizNet-Manyeyes datasets validate its effectiveness, achieving superior performance in column semantic-type detection compared to methods such as TF-IDF, FastText, and BERT. The proposed method shows great promise for practical applications in data integration, data cleaning, and automated schema mapping, particularly in scenarios where labeled data are scarce or unavailable. Furthermore, our work builds upon recent advances in neural network-based embeddings and unsupervised learning, contributing to the growing body of research in automatic schema matching and tabular data understanding.
Suggested Citation
Tiezheng Nie & Hanyu Mao & Aolin Liu & Xuliang Wang & Derong Shen & Yue Kou, 2025.
"SNMatch: An Unsupervised Method for Column Semantic-Type Detection Based on Siamese Network,"
Mathematics, MDPI, vol. 13(4), pages 1-15, February.
Handle:
RePEc:gam:jmathe:v:13:y:2025:i:4:p:607-:d:1589900