
Finite State Automata on Multi-Word Units for Efficient Text-Mining

Author

Listed:
  • Alberto Postiglione

    (Department of Business Science and Management & Innovation Systems, University of Salerno, Via San Giovanni Paolo II, 84084 Fisciano, Italy)

Abstract

Text mining is crucial for analyzing unstructured and semi-structured textual documents. This paper introduces a fast and precise text mining method that uses a finite automaton to extract knowledge domains. The method relies on multi-word units (such as "credit card") rather than simple words, because multi-word units are predominantly monosemic, limited in number, and distinctive, which makes them efficient markers of specific semantic areas. It identifies the multi-word units listed in terminological ontologies, where each multi-word unit is associated with a sub-domain of the ontology's knowledge. The algorithm, designed to handle very long multi-word units composed of a variable number of simple words, integrates user-selected ontologies into a single finite automaton during a fast pre-processing step. At runtime, the automaton reads the input text character by character and efficiently locates multi-word units even when they overlap. The approach is efficient for both short and long documents and requires no prior training. Ontologies can be updated without additional computational costs. An early system prototype, tested on 100 short and medium-length documents, recognized the knowledge domains of over 90% of the texts analyzed. The author suggests that this method could be a valuable semantic-based technique for extracting knowledge domains from unstructured documents.
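
The abstract does not give the construction of the automaton, so the following is only a rough illustration of the general idea, not the paper's algorithm. It swaps in a standard Aho-Corasick-style automaton: several ontologies are merged into one automaton in a pre-processing step, and the text is then read character by character so that overlapping multi-word units are all reported. The names `ONTOLOGIES`, `build_automaton`, and `find_units`, the sample terms, the lowercasing, and the word-boundary check are all assumptions made for this sketch.

```python
from collections import Counter, deque

# Hypothetical sample ontologies: domain name -> multi-word units.
ONTOLOGIES = {
    "finance": ["credit card", "credit card fraud"],
    "security": ["card fraud detection"],
}


def build_automaton(ontologies):
    """Pre-processing: merge all ontologies into one automaton."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link for each state
    out = [[]]    # (unit, domain) pairs recognized on reaching each state
    for domain, units in ontologies.items():
        for unit in units:
            state = 0
            for ch in unit.lower():
                if ch not in goto[state]:
                    goto.append({})
                    fail.append(0)
                    out.append([])
                    goto[state][ch] = len(goto) - 1
                state = goto[state][ch]
            out[state].append((unit, domain))
    # Breadth-first pass: failure links let the scan fall back to the
    # longest suffix that is still a prefix of some unit.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_units(text, automaton):
    """Runtime: read the text one character at a time, report all units."""
    goto, fail, out = automaton
    lowered = text.lower()
    state = 0
    matches = []
    for pos, ch in enumerate(lowered):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for unit, domain in out[state]:
            start = pos - len(unit) + 1
            # Assumed rule: a unit must not start or end inside a word.
            starts_clean = start == 0 or not lowered[start - 1].isalnum()
            ends_clean = pos + 1 == len(lowered) or not lowered[pos + 1].isalnum()
            if starts_clean and ends_clean:
                matches.append((start, unit, domain))
    return matches


if __name__ == "__main__":
    automaton = build_automaton(ONTOLOGIES)
    found = find_units("Banks invest in credit card fraud detection.", automaton)
    print(found)
    # A simple count of domains per text, as one possible scoring rule.
    print(Counter(domain for _, _, domain in found))
```

In this sketch, updating an ontology means rebuilding the automaton, and the scan itself never backtracks in the text. How the published method handles updates and scores domains is not stated in the abstract.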

Suggested Citation

  • Alberto Postiglione, 2024. "Finite State Automata on Multi-Word Units for Efficient Text-Mining," Mathematics, MDPI, vol. 12(4), pages 1-20, February.
  • Handle: RePEc:gam:jmathe:v:12:y:2024:i:4:p:506-:d:1334608

    Download full text from publisher

    File URL: https://www.mdpi.com/2227-7390/12/4/506/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2227-7390/12/4/506/
    Download Restriction: no
