Printed from https://ideas.repec.org/a/bla/jinfst/v67y2016i1p106-133.html

Descriptive document clustering via discriminant learning in a co-embedded space of multilevel similarities

Author

Listed:
  • Tingting Mu
  • John Y. Goulermas
  • Ioannis Korkontzelos
  • Sophia Ananiadou

Abstract

Descriptive document clustering aims to discover clusters of semantically interrelated documents, together with meaningful labels that summarize the content of each cluster. In this work, we propose a novel descriptive clustering framework, referred to as CEDL. It relies on the formulation and generation of 2 types of heterogeneous objects, documents and candidate phrases, using multilevel similarity information. CEDL is composed of 5 main processing stages. First, it simultaneously maps the documents and candidate phrases into a common co-embedded space that preserves higher-order, neighbor-based proximities between the combined sets of documents and phrases. Then, it discovers an approximate cluster structure of documents in the common space. The third stage extracts promising topic phrases by constructing a discriminant model in which documents, along with their cluster memberships, are used as training instances. Subsequently, the final cluster labels are selected from the topic phrases by a ranking scheme that combines multiple scores based on the extracted co-embedding information and the discriminant output. The final stage polishes the initial clusters to reduce noise and accommodate the multitopic nature of documents. The effectiveness and competitiveness of CEDL are demonstrated qualitatively and quantitatively with experiments using document databases from different application fields.
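As a rough illustration of the first four stages, here is a minimal Python sketch. It is not the CEDL method: it swaps in a truncated SVD for the co-embedding, plain k-means for the cluster discovery, and a mean-difference score for the discriminant model, and it omits the final polishing stage. The function names (`co_embed`, `kmeans`, `label_clusters`), the toy count matrix, and the phrase strings are all hypothetical and chosen only for illustration.

```python
import numpy as np


def co_embed(X, dim):
    """Place documents (rows of X) and phrases (columns of X) in one space.

    Assumption: a truncated SVD stands in for the paper's co-embedding.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    scale = np.sqrt(s[:dim])
    doc_points = U[:, :dim] * scale
    phrase_points = Vt[:dim].T * scale
    return doc_points, phrase_points


def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    for _ in range(1, k):
        # Distance from every point to its nearest already-chosen center.
        nearest = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centers], axis=0
        )
        centers.append(points[np.argmax(nearest)])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels


def label_clusters(X, labels, phrases):
    """Pick, per cluster, the phrase that best separates it from the rest.

    Assumption: a difference of mean phrase counts stands in for the
    discriminant model and the multi-score ranking.
    """
    names = {}
    for c in np.unique(labels):
        inside = X[labels == c].mean(axis=0)
        outside = X[labels != c].mean(axis=0)
        names[int(c)] = phrases[int(np.argmax(inside - outside))]
    return names


# Hypothetical toy data: 4 documents x 4 candidate phrases (occurrence counts).
phrases = ["gene expression", "protein binding", "bank risk", "credit market"]
X = np.array(
    [
        [3, 1, 0, 0],
        [2, 2, 0, 0],
        [0, 0, 2, 1],
        [0, 0, 1, 3],
    ],
    dtype=float,
)

doc_points, phrase_points = co_embed(X, dim=2)  # stage 1 (stand-in)
doc_labels = kmeans(doc_points, k=2)            # stage 2 (stand-in)
names = label_clusters(X, doc_labels, phrases)  # stages 3-4 (stand-in)
print(doc_labels.tolist(), names)
```

On this toy matrix the first two documents share no phrases with the last two, so the sketch should separate them into two clusters and name each cluster after its most distinctive phrase. The phrase coordinates are returned only to show that both object types live in the same space; a fuller sketch could fold phrase-to-cluster proximity into the ranking score.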

Suggested Citation

  • Tingting Mu & John Y. Goulermas & Ioannis Korkontzelos & Sophia Ananiadou, 2016. "Descriptive document clustering via discriminant learning in a co-embedded space of multilevel similarities," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 67(1), pages 106-133, January.
  • Handle: RePEc:bla:jinfst:v:67:y:2016:i:1:p:106-133

    Download full text from publisher

    File URL: http://hdl.handle.net/10.1002/asi.23374
    Download Restriction: Access to full text is restricted to subscribers.

    Citations



    Cited by:

    1. Yueyang Zhao & Lei Cui, 2023. "Fusion Matrix–Based Text Similarity Measures for Clustering of Retrieval Results," Scientometrics, Springer;Akadémiai Kiadó, vol. 128(2), pages 1163-1186, February.
    2. Bianchi, Nicola & Carretta, Alessandro & Farina, Vincenzo & Fiordelisi, Franco, 2021. "Does espoused risk culture pay? Evidence from European banks," Journal of Banking & Finance, Elsevier, vol. 122(C).
