
Transformer-Based Abstractive Summarization for Reddit and Twitter: Single Posts vs. Comment Pools in Three Languages

Author

Listed:
  • Ivan S. Blekanov

    (School of Mathematics and Computer Science, Yan’an University, Yan’an 716000, China
    Faculty of Applied Mathematics and Control Processes, St. Petersburg State University, 199034 St. Petersburg, Russia)

  • Nikita Tarasov

    (Faculty of Applied Mathematics and Control Processes, St. Petersburg State University, 199034 St. Petersburg, Russia)

  • Svetlana S. Bodrunova

    (School of Journalism and Mass Communications, St. Petersburg State University, 199034 St. Petersburg, Russia)

Abstract

Abstractive summarization is a technique for extracting condensed meanings from long texts, with a variety of potential practical applications. Nonetheless, today’s abstractive summarization research is largely limited to testing the models on various types of data, which brings only marginal improvements and does not lead to wide practical adoption of the method. In particular, abstractive summarization is not used for social media research, where it would be very useful for opinion and topic mining, given the complications that social media data create for other methods of textual analysis. Of all social media, Reddit is most frequently used for testing new neural models of text summarization on large-scale datasets in English, without further testing on smaller real-world data in other languages or from other platforms. Moreover, for social media, summarizing pools of texts (one-author posts, comment threads, discussion cascades, etc.) may bring crucial results relevant for social studies, which have not yet been tested. However, the existing methods of abstractive summarization are not fine-tuned for social media data and have hardly ever been applied to data from platforms beyond Reddit, to comments, or to non-English user texts. We address these research gaps by fine-tuning the newest Transformer-based neural network models, LongFormer and T5, testing them against BART on real-world data from Reddit, with improvements of up to 2%. Then, we apply the best model (fine-tuned T5) to pools of comments from Reddit and assess the similarity of post and comment summarizations. Further, to overcome the 500-token limitation of T5 for analyzing social media pools, which are usually bigger, we apply LongFormer Large and T5 Large to pools of tweets from a large-scale discussion on the Charlie Hebdo massacre in three languages and show that pool summarizations may be used for detecting micro-shifts in the agendas of networked discussions.
Our results show, however, that additional training is clearly needed for German and French, as the results for these languages are unsatisfactory, and that more fine-tuning is needed even for English Twitter data. Thus, we show that a ‘one-for-all’ neural-network summarization model is still out of reach, while fine-tuning for platform affordances works well. We also show that fine-tuned T5 works best for small-scale social media data, while LongFormer is helpful for larger-scale pool summarizations.
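The 500-token input limit of T5 mentioned in the abstract is the practical obstacle to summarizing whole comment pools. As a minimal illustration (not the authors' method), a pool of comments might be greedily packed into chunks that fit under such a limit before each chunk is passed to a summarizer; here tokens are approximated by whitespace splitting, whereas in practice the model's own subword tokenizer would decide the count:

```python
def chunk_pool(comments, max_tokens=500):
    """Greedily pack a pool of comments into chunks whose approximate
    token count (whitespace words) stays within a model's input limit.
    A single comment longer than max_tokens still forms its own chunk."""
    chunks, current, count = [], [], 0
    for text in comments:
        n = len(text.split())
        # Flush the current chunk if adding this comment would overflow it.
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(text)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

pool = ["short comment one", "another reply " * 300, "final remark"]
chunks = chunk_pool(pool, max_tokens=500)  # three chunks: the long reply overflows on its own
```

Each resulting chunk could then be fed to a fine-tuned summarizer; the paper's own approach for oversized pools is different — it switches to LongFormer Large and T5 Large, which accept longer inputs.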

Suggested Citation

  • Ivan S. Blekanov & Nikita Tarasov & Svetlana S. Bodrunova, 2022. "Transformer-Based Abstractive Summarization for Reddit and Twitter: Single Posts vs. Comment Pools in Three Languages," Future Internet, MDPI, vol. 14(3), pages 1-25, February.
  • Handle: RePEc:gam:jftint:v:14:y:2022:i:3:p:69-:d:756534

    Download full text from publisher

    File URL: https://www.mdpi.com/1999-5903/14/3/69/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/1999-5903/14/3/69/
    Download Restriction: no

    References listed on IDEAS

    1. Svetlana S. Bodrunova & Andrey V. Orekhov & Ivan S. Blekanov & Nikolay S. Lyudkevich & Nikita A. Tarasov, 2020. "Topic Detection Based on Sentence Embeddings and Agglomerative Clustering with Markov Moment," Future Internet, MDPI, vol. 12(9), pages 1-17, August.
    Full references (including those not matched with items on IDEAS)

    Citations

Citations are extracted by the CitEc Project; subscribe to its RSS feed for this item.


    Cited by:

    1. Svetlana S. Bodrunova, 2022. "Editorial for the Special Issue “Selected Papers from the 9th Annual Conference ‘Comparative Media Studies in Today’s World’ (CMSTW’2021)”," Future Internet, MDPI, vol. 14(11), pages 1-3, November.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Svetlana S. Bodrunova, 2022. "Editorial for the Special Issue “Selected Papers from the 9th Annual Conference ‘Comparative Media Studies in Today’s World’ (CMSTW’2021)”," Future Internet, MDPI, vol. 14(11), pages 1-3, November.
    2. Andrey V. Orekhov, 2021. "Quasi-Deterministic Processes with Monotonic Trajectories and Unsupervised Machine Learning," Mathematics, MDPI, vol. 9(18), pages 1-26, September.
    3. Ivan Blekanov & Svetlana S. Bodrunova & Askar Akhmetov, 2021. "Detection of Hidden Communities in Twitter Discussions of Varying Volumes," Future Internet, MDPI, vol. 13(11), pages 1-17, November.

    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:gam:jftint:v:14:y:2022:i:3:p:69-:d:756534. See general information about how to correct material in RePEc.

If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows you to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.

If CitEc recognized a bibliographic reference but did not link an item in RePEc to it, you can help with this form.

If you know of missing items citing this one, you can help us create those links by adding the relevant references in the same way as above, for each referring item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: MDPI Indexing Manager (email available below). General contact details of provider: https://www.mdpi.com .

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.