
A tm Plug-In for Distributed Text Mining in R

Authors

  • Theußl, Stefan
  • Feinerer, Ingo
  • Hornik, Kurt

Abstract

R has gained explicit text mining support with the tm package, enabling statisticians to answer many interesting research questions via statistical analysis or modeling of (text) corpora. However, we typically face two challenges when analyzing large corpora: (1) the amount of data that can be processed on a single machine is usually limited by the available main memory (i.e., RAM), and (2) the more data to be analyzed, the higher the need for efficient procedures for calculating valuable results. Fortunately, adequate programming models like MapReduce facilitate parallelization of text mining tasks and allow for processing data sets beyond what would fit into memory by using a distributed file system possibly spanning several machines, e.g., in a cluster of workstations. In this paper, we present a plug-in package for tm, called tm.plugin.dc, that implements a distributed corpus class which can take advantage of the Hadoop MapReduce library for large-scale text mining tasks. We show on the basis of an application in culturomics that we can efficiently handle data sets of significant size.

Suggested Citation

  • Theußl, Stefan & Feinerer, Ingo & Hornik, Kurt, 2012. "A tm Plug-In for Distributed Text Mining in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 51(i05).
  • Handle: RePEc:jss:jstsof:v:051:i05
    DOI: http://hdl.handle.net/10.18637/jss.v051.i05

    Download full text from publisher

    File URL: https://www.jstatsoft.org/index.php/jss/article/view/v051i05/v51i05.pdf
    Download Restriction: no

    File URL: https://www.jstatsoft.org/index.php/jss/article/downloadSuppFile/v051i05/tm.plugin.dc_0.2-4.tar.gz
    Download Restriction: no

    File URL: https://www.jstatsoft.org/index.php/jss/article/downloadSuppFile/v051i05/v51i05.R
    Download Restriction: no

    File URL: https://www.jstatsoft.org/index.php/jss/article/downloadSuppFile/v051i05/v51i05-data.tar.bz2
    Download Restriction: no



    Citations


    Cited by:

    1. Wittek, Peter & Gao, Shi Chao & Lim, Ik Soo & Zhao, Li, 2017. "somoclu: An Efficient Parallel Library for Self-Organizing Maps," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 78(i09).
    2. Lukas Borke & Wolfgang K. Härdle, 2016. "Q3-D3-Lsa," SFB 649 Discussion Papers SFB649DP2016-049, Sonderforschungsbereich 649, Humboldt University, Berlin, Germany.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Müller, Karsten, 2020. "German forecasters' narratives: How informative are German business cycle forecast reports?," Working Papers 23, German Research Foundation's Priority Programme 1859 "Experience and Expectation. Historical Foundations of Economic Behaviour", Humboldt University Berlin.
    2. Goedde-Menke, Michael & Langer, Thomas & Pfingsten, Andreas, 2014. "Impact of the financial crisis on bank run risk – Danger of the days after," Journal of Banking & Finance, Elsevier, vol. 40(C), pages 522-533.
    3. David E. Allen & Michael McAleer & Abhay K. Singh, 2019. "Daily market news sentiment and stock prices," Applied Economics, Taylor & Francis Journals, vol. 51(30), pages 3212-3235, June.
    4. Yan Luo & Linying Zhou, 2020. "Textual tone in corporate financial disclosures: a survey of the literature," International Journal of Disclosure and Governance, Palgrave Macmillan, vol. 17(2), pages 101-110, September.
    5. Jiao Ji & Oleksandr Talavera & Shuxing Yin, 2018. "The Hidden Information Content: Evidence from the Tone of Independent Director Reports," Working Papers 2018-28, Swansea University, School of Management.
    6. Lixiang Wang & Wendi Hou & Yupei Liu, 2023. "How do co‐shareholding networks affect negative media coverage? Evidence from China," Accounting and Finance, Accounting and Finance Association of Australia and New Zealand, vol. 63(4), pages 4221-4249, December.
    7. Kamaladdin Fataliyev & Aneesh Chivukula & Mukesh Prasad & Wei Liu, 2021. "Stock Market Analysis with Text Data: A Review," Papers 2106.12985, arXiv.org, revised Jul 2021.
    8. Bennani, Hamza, 2018. "Media coverage and ECB policy-making: Evidence from an augmented Taylor rule," Journal of Macroeconomics, Elsevier, vol. 57(C), pages 26-38.
    9. Christopher N. Avery & Judith A. Chevalier & Richard J. Zeckhauser, 2016. "The "CAPS" Prediction System and Stock Market Returns," Review of Finance, European Finance Association, vol. 20(4), pages 1363-1381.
    10. Keval Amin & Erica Harris, 2022. "The Effect of Investor Sentiment on Nonprofit Donations," Journal of Business Ethics, Springer, vol. 175(2), pages 427-450, January.
    11. Feng, Xunan & Johansson, Anders C., 2019. "News or Noise? The Information Content of Social Media in China," Stockholm School of Economics Asia Working Paper Series 2019-52, Stockholm School of Economics, Stockholm China Economic Research Institute.
    12. King, Timothy & Srivastav, Abhishek & Williams, Jonathan, 2016. "What's in an education? Implications of CEO education for bank performance," Journal of Corporate Finance, Elsevier, vol. 37(C), pages 287-308.
    13. Kirtac, Kemal & Germano, Guido, 2024. "Sentiment trading with large language models," Finance Research Letters, Elsevier, vol. 62(PB).
    14. André Betzer & Jan Philipp Harries, 2022. "How online discussion board activity affects stock trading: the case of GameStop," Financial Markets and Portfolio Management, Springer;Swiss Society for Financial Market Research, vol. 36(4), pages 443-472, December.
    15. Dirk Ulbricht & Konstantin A. Kholodilin & Tobias Thomas, 2017. "Do Media Data Help to Predict German Industrial Production?," Journal of Forecasting, John Wiley & Sons, Ltd., vol. 36(5), pages 483-496, August.
    16. Sapkota, Niranjan, 2022. "News-based sentiment and bitcoin volatility," International Review of Financial Analysis, Elsevier, vol. 82(C).
    17. Boniface Yemba & Yi Duan & Nabaneeta Biswas, 2023. "Government spending news and stock price index," Economics Bulletin, AccessEcon, vol. 43(4), pages 1816-1841.
    18. David Bholat & Stephen Hans & Pedro Santos & Cheryl Schonhardt-Bailey, 2015. "Text mining for central banks," Handbooks, Centre for Central Banking Studies, Bank of England, number 33, April.
    19. Bijl, Laurens & Kringhaug, Glenn & Molnár, Peter & Sandvik, Eirik, 2016. "Google searches and stock returns," International Review of Financial Analysis, Elsevier, vol. 45(C), pages 150-156.
    20. Chen, Cathy Yi-Hsuan & Fengler, Matthias R. & Härdle, Wolfgang Karl & Liu, Yanchu, 2022. "Media-expressed tone, option characteristics, and stock return predictability," Journal of Economic Dynamics and Control, Elsevier, vol. 134(C).
