IDEAS home Printed from https://ideas.repec.org/a/eee/phsmap/v391y2012i18p4406-4419.html

Structure–semantics interplay in complex networks and its effects on the predictability of similarity in texts

Author

Listed:
  • Amancio, Diego R.
  • Oliveira Jr., Osvaldo N.
  • Costa, Luciano da F.

Abstract

The classification of texts has become a major endeavor with so much electronic material available, for it is an essential task in several applications, including search engines and information retrieval. There are different ways to define similarity for grouping similar texts into clusters, as the concept of similarity may depend on the purpose of the task. For instance, in topic extraction similar texts are those within the same semantic field, whereas in author recognition stylistic features should be considered. In this study, we introduce ways to classify texts employing concepts of complex networks, which may be able to capture syntactic, semantic and even pragmatic features. The interplay between various metrics of the complex networks is analyzed with three applications, namely identification of machine translation (MT) systems, evaluation of the quality of machine-translated texts, and authorship recognition. We shall show that topological features of the networks representing texts can enhance the ability to identify MT systems in particular cases. For evaluating the quality of MT texts, on the other hand, high correlation was obtained with methods capable of capturing the semantics. This was expected because the gold standards used are themselves based on word co-occurrence. Notwithstanding, the Katz similarity, which involves both semantics and structure in the comparison of texts, achieved the highest correlation with the NIST measurement, indicating that in some cases the combination of both approaches can improve the ability to quantify quality in MT. In authorship recognition, again the topological features were relevant in some contexts, though for the books and authors analyzed good results were obtained with semantic features as well.
Because hybrid approaches encompassing semantic and topological features have not been extensively used, we believe that the methodology proposed here may be useful to enhance text classification considerably, as it combines well-established strategies.
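
As a rough illustration of the kind of measure the abstract mentions, the sketch below computes a truncated Katz index between two words of a word co-occurrence network. It is not the authors' implementation: the network construction (linking consecutive words), the function names `cooccurrence_graph` and `katz_index`, the damping value `beta`, and the truncation at `max_len` are all assumptions made here for illustration. The abstract does not describe how the paper applies Katz similarity to compare whole texts.

```python
from collections import defaultdict


def cooccurrence_graph(tokens):
    """Build an undirected word adjacency network (assumed construction):
    each pair of consecutive, distinct words is linked by an edge."""
    neighbors = defaultdict(set)
    for left, right in zip(tokens, tokens[1:]):
        if left != right:
            neighbors[left].add(right)
            neighbors[right].add(left)
    return neighbors


def katz_index(neighbors, source, target, beta=0.05, max_len=6):
    """Truncated Katz index between two nodes:
    sum over l = 1..max_len of beta**l * (number of walks of length l)."""
    # walks[node] = number of walks of the current length from source to node
    walks = {source: 1}
    score = 0.0
    for length in range(1, max_len + 1):
        next_walks = defaultdict(int)
        for node, count in walks.items():
            for nb in neighbors.get(node, ()):
                next_walks[nb] += count
        walks = next_walks
        score += (beta ** length) * walks.get(target, 0)
    return score


if __name__ == "__main__":
    graph = cooccurrence_graph("the cat sat on the mat".split())
    print(katz_index(graph, "the", "cat", beta=0.1, max_len=3))
    print(katz_index(graph, "mat", "sat", beta=0.1, max_len=3))
```

The index weights short walks more heavily than long ones, so directly adjacent words score higher than words connected only through intermediaries. The full Katz index is the infinite sum, which converges only when `beta` is smaller than the reciprocal of the largest eigenvalue of the adjacency matrix; the truncation used here sidesteps that condition at the cost of ignoring longer walks.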

Suggested Citation

  • Amancio, Diego R. & Oliveira Jr., Osvaldo N. & Costa, Luciano da F., 2012. "Structure–semantics interplay in complex networks and its effects on the predictability of similarity in texts," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 391(18), pages 4406-4419.
  • Handle: RePEc:eee:phsmap:v:391:y:2012:i:18:p:4406-4419
    DOI: 10.1016/j.physa.2012.04.011

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0378437112003044
    Download Restriction: Full text for ScienceDirect subscribers only. The journal offers the option of making the article available online on ScienceDirect for a fee of $3,000.


    Citations

    Cited by:

    1. de Arruda, Henrique F. & Marinho, Vanessa Q. & Lima, Thales S. & Amancio, Diego R. & Costa, Luciano da F., 2018. "An image analysis approach to text analytics based on complex networks," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 510(C), pages 110-120.
    2. Bian, Tian & Hu, Jiantao & Deng, Yong, 2017. "Identifying influential nodes in complex networks based on AHP," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 479(C), pages 422-436.
    3. Yin, Likang & Deng, Yong, 2018. "Toward uncertainty of weighted networks: An entropy-based model," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 508(C), pages 176-186.
    4. Tohalino, Jorge V. & Amancio, Diego R., 2018. "Extractive multi-document summarization using multilayer networks," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 503(C), pages 526-539.
    5. Nguyen Minh Tien & Cyril Labbé, 2018. "Detecting automatically generated sentences with grammatical structure similarity," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(2), pages 1247-1271, August.
    6. Akimushkin, Camilo & Amancio, Diego R. & Oliveira, Osvaldo N., 2018. "On the role of words in the network structure of texts: Application to authorship attribution," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 495(C), pages 49-58.
    7. Woon Peng Goh & Kang-Kwong Luke & Siew Ann Cheong, 2018. "Functional shortcuts in language co-occurrence networks," PLOS ONE, Public Library of Science, vol. 13(9), pages 1-18, September.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Diego Raphael Amancio, 2015. "Comparing the topological properties of real and artificially generated scientific manuscripts," Scientometrics, Springer;Akadémiai Kiadó, vol. 105(3), pages 1763-1779, December.
    2. Diego R Amancio, 2015. "Probing the Topological Properties of Complex Networks Modeling Short Written Texts," PLOS ONE, Public Library of Science, vol. 10(2), pages 1-17, February.
    3. Liu, Yanyan & Li, Keping & Yan, Dongyang & Gu, Shuang, 2022. "A network-based CNN model to identify the hidden information in text data," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 590(C).
    4. Cui, Xue-Mei & Yoon, Chang No & Youn, Hyejin & Lee, Sang Hoon & Jung, Jean S. & Han, Seung Kee, 2017. "Dynamic burstiness of word-occurrence and network modularity in textbook systems," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 487(C), pages 103-110.
    5. Tohalino, Jorge V. & Amancio, Diego R., 2018. "Extractive multi-document summarization using multilayer networks," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 503(C), pages 526-539.
    6. Silva, Filipi N. & Amancio, Diego R. & Bardosova, Maria & Costa, Luciano da F. & Oliveira, Osvaldo N., 2016. "Using network science and text analytics to produce surveys in a scientific topic," Journal of Informetrics, Elsevier, vol. 10(2), pages 487-502.
    7. Amancio, Diego R. & Nunes, Maria G.V. & Oliveira, Osvaldo N. & Costa, Luciano da F., 2012. "Extractive summarization using complex networks and syntactic dependency," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 391(4), pages 1855-1864.
    8. D. R. Amancio & M. G. V. Nunes & O. N. Oliveira & L. F. Costa, 2012. "Using complex networks concepts to assess approaches for citations in scientific papers," Scientometrics, Springer;Akadémiai Kiadó, vol. 91(3), pages 827-842, June.
    9. Liang, Wei & Shi, Yuming & Huang, Qiuling, 2014. "Modeling the Chinese language as an evolving network," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 393(C), pages 268-276.
    10. Gandal, Neil, 2001. "The dynamics of competition in the internet search engine market," International Journal of Industrial Organization, Elsevier, vol. 19(7), pages 1103-1117, July.
    11. Chong Myung Park & Angelica Rodriguez & Jazmin Rubi Flete Gomez & Isahiah Erilus & Hayoung Kim Donnelly & Yanling Dai & Alexandra Oliver-Davila & Paul Trunfio & Cecilia Nardi & Kimberly A. S. Howard &, 2021. "Embedding Life Design in Future Readiness Efforts to Promote Collective Impact and Economically Sustainable Communities: Conceptual Frameworks and Case Example," Sustainability, MDPI, vol. 13(23), pages 1-17, November.
    12. Wei, Daijun & Deng, Xinyang & Zhang, Xiaoge & Deng, Yong & Mahadevan, Sankaran, 2013. "Identifying influential nodes in weighted networks based on evidence theory," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 392(10), pages 2564-2575.
    13. Letchford, Adrian & Preis, Tobias & Moat, Helen Susannah, 2016. "The advantage of simple paper abstracts," Journal of Informetrics, Elsevier, vol. 10(1), pages 1-8.
    14. Jiang, Jingchi & Zheng, Jichuan & Zhao, Chao & Su, Jia & Guan, Yi & Yu, Qiubin, 2016. "Clinical-decision support based on medical literature: A complex network approach," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 459(C), pages 42-54.
    15. Nicolas Jullien, 2012. "What We Know About Wikipedia: A Review of the Literature Analyzing the Project(s)," Post-Print hal-00857208, HAL.
    16. Eric T. Bradlow & David C. Schmittlein, 2000. "The Little Engines That Could: Modeling the Performance of World Wide Web Search Engines," Marketing Science, INFORMS, vol. 19(1), pages 43-62, June.
    17. Judit Bar-Ilan, 2001. "Data collection methods on the Web for infometric purposes — A review and analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 50(1), pages 7-32, January.
    18. Antal Bosch & Toine Bogers & Maurice Kunder, 2016. "Estimating search engine index size variability: a 9-year longitudinal study," Scientometrics, Springer;Akadémiai Kiadó, vol. 107(2), pages 839-856, May.
    19. Wang, Yanhui & Bi, Lifeng & Lin, Shuai & Li, Man & Shi, Hao, 2017. "A complex network-based importance measure for mechatronics systems," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 466(C), pages 180-198.
    20. Adilson Vital & Diego R. Amancio, 2022. "A comparative analysis of local similarity metrics and machine learning approaches: application to link prediction in author citation networks," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(10), pages 6011-6028, October.
