Printed from https://ideas.repec.org/a/gam/jdataj/v3y2018i4p66-d190245.html

Similar Text Fragments Extraction for Identifying Common Wikipedia Communities

Author

Listed:
  • Svitlana Petrasova

    (Department of Intelligent Computer Systems, National Technical University “Kharkiv Polytechnic Institute”, 61002 Kharkiv, Ukraine)

  • Nina Khairova

    (Department of Intelligent Computer Systems, National Technical University “Kharkiv Polytechnic Institute”, 61002 Kharkiv, Ukraine)

  • Włodzimierz Lewoniewski

    (Department of Information Systems, Poznan University of Economics and Business, 61-875 Poznan, Poland)

  • Orken Mamyrbayev

    (Institute of Information and Computational Technologies, Almaty 050010, Kazakhstan)

  • Kuralay Mukhsina

    (Department of Informatics, Al-Farabi Kazakh National University, Almaty 050040, Kazakhstan)

Abstract

Extracting similar text fragments from weakly formalized data is a task in natural language processing and intelligent data analysis, used to automatically identify connected knowledge fields. To search for such common communities in Wikipedia, we propose adding a stage that applies a logical-algebraic model for extracting similar collocations. Using the Stanford Part-Of-Speech tagger and the Stanford Universal Dependencies parser, we identify the grammatical characteristics of collocation words. Using WordNet synsets, we select their synonyms. Our dataset includes Wikipedia articles from different portals and projects. The experimental results show the frequencies of synonymous text fragments in Wikipedia articles that form common information spaces. The number of high-frequency synonymous collocations can indicate key common, up-to-date Wikipedia communities.
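
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' logical-algebraic model: it assumes collocations have already been tagged and parsed into (relation, head, dependent) triples, and it uses a small hand-written synonym table (TOY_SYNSETS, hypothetical) in place of WordNet synsets. All names, tags, and sample data below are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy synonym sets standing in for WordNet synsets.
TOY_SYNSETS = [
    {"big", "large"},
    {"car", "automobile"},
    {"quick", "fast"},
]

def synonymous(a, b):
    """Treat two words as synonyms if they are equal or share a toy synset."""
    return a == b or any(a in s and b in s for s in TOY_SYNSETS)

def similar(c1, c2):
    """Compare collocations given as (relation, (word, pos), (word, pos)).

    They count as similar when the dependency relation and POS tags match
    and the words in corresponding positions are synonyms.
    """
    rel1, (h1, hp1), (d1, dp1) = c1
    rel2, (h2, hp2), (d2, dp2) = c2
    same_grammar = rel1 == rel2 and hp1 == hp2 and dp1 == dp2
    return same_grammar and synonymous(h1, h2) and synonymous(d1, d2)

def count_shared(articles):
    """Count similar collocation pairs for every pair of articles."""
    counts = Counter()
    for (n1, cs1), (n2, cs2) in combinations(sorted(articles.items()), 2):
        counts[(n1, n2)] = sum(similar(a, b) for a in cs1 for b in cs2)
    return counts

# Invented sample input: article name -> pre-parsed collocations.
SAMPLE = {
    "A": [("amod", ("car", "NOUN"), ("big", "ADJ"))],
    "B": [
        ("amod", ("automobile", "NOUN"), ("large", "ADJ")),
        ("amod", ("train", "NOUN"), ("fast", "ADJ")),
    ],
    "C": [("amod", ("train", "NOUN"), ("quick", "ADJ"))],
}

if __name__ == "__main__":
    for pair, n in count_shared(SAMPLE).items():
        print(pair, n)
```

In this sketch, a higher count for a pair of articles would suggest that they share more synonymous collocations, which is the kind of signal the abstract proposes as an indicator of common communities.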

Suggested Citation

  • Svitlana Petrasova & Nina Khairova & Włodzimierz Lewoniewski & Orken Mamyrbayev & Kuralay Mukhsina, 2018. "Similar Text Fragments Extraction for Identifying Common Wikipedia Communities," Data, MDPI, vol. 3(4), pages 1-9, December.
  • Handle: RePEc:gam:jdataj:v:3:y:2018:i:4:p:66-:d:190245

    Download full text from publisher

    File URL: https://www.mdpi.com/2306-5729/3/4/66/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2306-5729/3/4/66/
    Download Restriction: no

    References listed on IDEAS

    1. Kevin W. Boyack & Henry Small & Richard Klavans, 2013. "Improving the accuracy of co‐citation clustering using full text," Journal of the American Society for Information Science and Technology, Association for Information Science & Technology, vol. 64(9), pages 1759-1767, September.
    2. Jiann-wien Hsu & Ding-wei Huang, 2011. "Correlation between impact and collaboration," Scientometrics, Springer;Akadémiai Kiadó, vol. 86(2), pages 317-324, February.
    3. David Guy Brizan & Kevin Gallagher & Arnab Jahangir & Theodore Brown, 2016. "Predicting citation patterns: defining and determining influence," Scientometrics, Springer;Akadémiai Kiadó, vol. 108(1), pages 183-200, July.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Dangzhi Zhao & Andreas Strotmann, 2020. "Telescopic and panoramic views of library and information science research 2011–2018: a comparison of four weighting schemes for author co-citation analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 124(1), pages 255-270, July.
    2. Kun Sun & Haitao Liu & Wenxin Xiong, 2021. "The evolutionary pattern of language in scientific writings: A case study of Philosophical Transactions of Royal Society (1665–1869)," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(2), pages 1695-1724, February.
    3. Mengyu Yu & Mazie Krehbiel & Samantha Thompson & Tatjana Miljkovic, 2020. "An exploration of gender gap using advanced data science tools: actuarial research community," Scientometrics, Springer;Akadémiai Kiadó, vol. 123(2), pages 767-789, May.
    4. Yun, Jinhyuk, 2022. "Generalization of bibliographic coupling and co-citation using the node split network," Journal of Informetrics, Elsevier, vol. 16(2).
    5. Ruhao Zhang & Junpeng Yuan, 2022. "Enhanced author bibliographic coupling analysis using semantic and syntactic citation information," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(12), pages 7681-7706, December.
    6. Riaz Ahmad & Muhammad Tanvir Afzal, 2018. "CAD: an algorithm for citation-anchors detection in research papers," Scientometrics, Springer;Akadémiai Kiadó, vol. 117(3), pages 1405-1423, December.
    7. Kamal Sanguri & Atanu Bhuyan & Sabyasachi Patra, 2020. "A semantic similarity adjusted document co-citation analysis: a case of tourism supply chain," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(1), pages 233-269, October.
    8. Dangzhi Zhao & Andreas Strotmann, 2020. "Deep and narrow impact: introducing location filtered citation counting," Scientometrics, Springer;Akadémiai Kiadó, vol. 122(1), pages 503-517, January.
    9. Raja Habib & Muhammad Tanvir Afzal, 2019. "Sections-based bibliographic coupling for research paper recommendation," Scientometrics, Springer;Akadémiai Kiadó, vol. 119(2), pages 643-656, May.
    10. Li Zhang & Ming Liu & Bo Wang & Bo Lang & Peng Yang, 2021. "Discovering communities based on mention distance," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(3), pages 1945-1967, March.
    11. Takahiro Kawamura & Katsutaro Watanabe & Naoya Matsumoto & Shusaku Egami & Mari Jibu, 2018. "Funding map using paragraph embedding based on semantic diversity," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(2), pages 941-958, August.
    12. Rey-Long Liu, 2017. "A new bibliographic coupling measure with descriptive capability," Scientometrics, Springer;Akadémiai Kiadó, vol. 110(2), pages 915-935, February.
    13. Yves Gingras & Mahdi Khelfaoui, 2018. "Assessing the effect of the United States’ “citation advantage” on other countries’ scientific impact as measured in the Web of Science (WoS) database," Scientometrics, Springer;Akadémiai Kiadó, vol. 114(2), pages 517-532, February.
    14. Small, Henry & Tseng, Hung & Patek, Mike, 2017. "Discovering discoveries: Identifying biomedical discoveries using citation contexts," Journal of Informetrics, Elsevier, vol. 11(1), pages 46-62.
    15. Hanna-Mari Puuska & Reetta Muhonen & Yrjö Leino, 2014. "International and domestic co-publishing and their citation impact in different disciplines," Scientometrics, Springer;Akadémiai Kiadó, vol. 98(2), pages 823-839, February.
    16. Kim, Ha Jin & Jeong, Yoo Kyung & Song, Min, 2016. "Content- and proximity-based author co-citation analysis using citation sentences," Journal of Informetrics, Elsevier, vol. 10(4), pages 954-966.
    17. Michel Zitt, 2015. "Meso-level retrieval: IR-bibliometrics interplay and hybrid citation-words methods in scientific fields delineation," Scientometrics, Springer;Akadémiai Kiadó, vol. 102(3), pages 2223-2245, March.
    18. Vanclay, Jerome K., 2013. "Factors affecting citation rates in environmental science," Journal of Informetrics, Elsevier, vol. 7(2), pages 265-271.
    19. Dewan F. Wahid & Elkafi Hassini, 2022. "A Literature Review on Correlation Clustering: Cross-disciplinary Taxonomy with Bibliometric Analysis," SN Operations Research Forum, Springer, vol. 3(3), pages 1-42, September.
    20. Shenghui Wang & Rob Koopman, 2017. "Clustering articles based on semantic similarity," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(2), pages 1017-1031, May.
