Introduction to stream: An Extensible Framework for Data Stream Clustering Research with R

Authors

  • Hahsler, Michael
  • Bolaños, Matthew
  • Forrest, John

Abstract

In recent years, data streams have become an increasingly important area of research for the computer science, database and statistics communities. Data streams are ordered and potentially unbounded sequences of data points created by a typically non-stationary data generating process. Common data mining tasks associated with data streams include clustering, classification and frequent pattern mining. New algorithms for these types of data are proposed regularly and it is important to evaluate them thoroughly under standardized conditions. In this paper we introduce stream, a research tool that includes modeling and simulating data streams as well as an extensible framework for implementing, interfacing and experimenting with algorithms for various data stream mining tasks. The main advantage of stream is that it seamlessly integrates with the large existing infrastructure provided by R. In addition to data handling, plotting and easy scripting capabilities, R also provides many existing algorithms and enables users to interface code written in many programming languages popular among data mining researchers (e.g., C/C++, Java and Python). In this paper we describe the architecture of stream and focus on its use for data stream clustering research. stream was implemented with extensibility in mind and will be extended in the future to cover additional data stream mining tasks like classification and frequent pattern mining.

Suggested Citation

  • Hahsler, Michael & Bolaños, Matthew & Forrest, John, 2017. "Introduction to stream: An Extensible Framework for Data Stream Clustering Research with R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 76(i14).
  • Handle: RePEc:jss:jstsof:v:076:i14
    DOI: http://hdl.handle.net/10.18637/jss.v076.i14

    Download full text from publisher

    File URL: https://www.jstatsoft.org/index.php/jss/article/view/v076i14/v76i14.pdf
    Download Restriction: no

    File URL: https://www.jstatsoft.org/index.php/jss/article/downloadSuppFile/v076i14/stream_1.2-4.tar.gz
    Download Restriction: no

    File URL: https://www.jstatsoft.org/index.php/jss/article/downloadSuppFile/v076i14/v76i14.R
    Download Restriction: no

    File URL: https://www.jstatsoft.org/index.php/jss/article/downloadSuppFile/v076i14/kddcup.data.gz
    Download Restriction: no

    References listed on IDEAS

    1. Xie, Yihui, 2013. "animation: An R Package for Creating Animations and Demonstrating Statistical Methods," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 53(i01).
    2. Hahsler, Michael & Dunham, Margaret H., 2010. "rEMM: Extensible Markov Model for Data Stream Clustering in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 35(i05).
    3. Hornik, Kurt, 2005. "A CLUE for CLUster Ensembles," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 14(i12).

    Citations

    Cited by:

    1. Krzysztof Gajowniczek & Marcin Bator & Tomasz Ząbkowski & Arkadiusz Orłowski & Chu Kiong Loo, 2020. "Simulation Study on the Electricity Data Streams Time Series Clustering," Energies, MDPI, vol. 13(4), pages 1-25, February.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Marcin Pełka, 2012. "Ensemble approach for clustering of interval-valued symbolic data," Statistics in Transition new series, Główny Urząd Statystyczny (Polska), vol. 13(2), pages 335-342, June.
    2. Wu, Han-Ming, 2011. "On biological validity indices for soft clustering algorithms for gene expression data," Computational Statistics & Data Analysis, Elsevier, vol. 55(5), pages 1969-1979, May.
    3. Hornik, Kurt & Grün, Bettina, 2014. "movMF: An R Package for Fitting Mixtures of von Mises-Fisher Distributions," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 58(i10).
    4. Wei, Kun & Zhang, Youxin & Luo, Yi, 2018. "Variance-mediated multifractal analysis of group participation in chasing a single dangerous prey," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 503(C), pages 1275-1287.
    5. Luis Lorenzo & Javier Arroyo, 2022. "Analysis of the cryptocurrency market using different prototype-based clustering techniques," Financial Innovation, Springer;Southwestern University of Finance and Economics, vol. 8(1), pages 1-46, December.
    6. Varma, Jayanth R. & Virmani, Vineet, 2017. "Shiny Alternative for Finance in the Classroom," IIMA Working Papers WP 2017-03-05, Indian Institute of Management Ahmedabad, Research and Publication Department.
    7. Boztug, Yasemin & Reutterer, Thomas, 2008. "A combined approach for segment-specific market basket analysis," European Journal of Operational Research, Elsevier, vol. 187(1), pages 294-312, May.
    8. Hornik, Kurt & Feinerer, Ingo & Kober, Martin & Buchta, Christian, 2012. "Spherical k-Means Clustering," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 50(i10).
    9. Fionn Murtagh, 2009. "The Remarkable Simplicity of Very High Dimensional Data: Application of Model-Based Clustering," Journal of Classification, Springer;The Classification Society, vol. 26(3), pages 249-277, December.
    10. Brock, Guy & Pihur, Vasyl & Datta, Susmita & Datta, Somnath, 2008. "clValid: An R Package for Cluster Validation," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 25(i04).
    11. Pełka Marcin, 2018. "Analysis of Innovations in the European Union Via Ensemble Symbolic Density Clustering," Econometrics. Advances in Applied Data Analysis, Sciendo, vol. 22(3), pages 84-98, September.
    12. Apostolos Bozikas & Georgios Pitselis, 2018. "An Empirical Study on Stochastic Mortality Modelling under the Age-Period-Cohort Framework: The Case of Greece with Applications to Insurance Pricing," Risks, MDPI, vol. 6(2), pages 1-34, April.
    13. Fišar, Miloš & Greiner, Ben & Huber, Christoph & Katok, Elena & Ozkes, Ali & Management Science Reproducibility Collaboration, 2023. "Reproducibility in Management Science," Department for Strategy and Innovation Working Paper Series 03/2023, WU Vienna University of Economics and Business.
    14. Pełka Marcin, 2019. "Analysis of Happiness in EU Countries Using the Multi-Model Classification based on Models of Symbolic Data," Econometrics. Advances in Applied Data Analysis, Sciendo, vol. 23(3), pages 15-25, September.
    15. Axel Strauß & François Guilhaumon & Roger Daniel Randrianiaina & Katharina C Wollenberg Valero & Miguel Vences & Julian Glos, 2016. "Opposing Patterns of Seasonal Change in Functional and Phylogenetic Diversity of Tadpole Assemblages," PLOS ONE, Public Library of Science, vol. 11(3), pages 1-18, March.
    16. Meyer, Sebastian & Held, Leonhard & Höhle, Michael, 2017. "Spatio-Temporal Analysis of Epidemic Phenomena Using the R Package surveillance," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 77(i11).
    17. Thomas Reutterer & Kurt Hornik & Nicolas March & Kathrin Gruber, 2017. "A data mining framework for targeted category promotions," Journal of Business Economics, Springer, vol. 87(3), pages 337-358, April.
    18. Juan José Fernández-Durán & María Mercedes Gregorio-Domínguez, 2021. "Consumer Segmentation Based on Use Patterns," Journal of Classification, Springer;The Classification Society, vol. 38(1), pages 72-88, April.
    19. repec:jss:jstsof:25:i04 is not listed on IDEAS
    20. repec:jss:jstsof:25:i05 is not listed on IDEAS
    21. Pełka Marcin, 2019. "Assessment of the Development of the European Oecd Countries with the Application of Linear Ordering and Ensemble Clustering of Symbolic Data," Folia Oeconomica Stetinensia, Sciendo, vol. 19(2), pages 117-133, December.
    22. repec:hum:wpaper:sfb649dp2006-006 is not listed on IDEAS
    23. Linda Vidman & David Källberg & Patrik Rydén, 2019. "Cluster analysis on high dimensional RNA-seq data with applications to cancer research - An evaluation study," PLOS ONE, Public Library of Science, vol. 14(12), pages 1-21, December.
