Printed from https://ideas.repec.org/a/gam/jftint/v14y2022i7p211-d864174.html

Zero-Inflated Patent Data Analysis Using Generating Synthetic Samples

Author

Listed:
  • Daiho Uhm

    (Department of Mathematics, University of Arkansas—Fort Smith, Fort Smith, AR 72913, USA)

  • Sunghae Jun

    (Department of Big Data and Statistics, Cheongju University, Chungbuk 28503, Korea)

Abstract

With the expansion of the internet, we encounter many types of big data, such as web documents and sensor data. Compared with traditional small data such as experimental samples, big data offer more opportunities to find hidden and novel patterns through analysis with statistics and machine learning algorithms. However, the growing use of big data also brings problems. One of them is the zero-inflated problem in structured data preprocessed from big data: when text is converted to keyword counts, most count values are zero because a specific word occurs in only some documents. Patent data are especially affected because most of them take the form of text documents. To address this problem, we propose generating synthetic samples using statistical inference and a tree structure. We verify the performance and validity of the proposed method using patent documents and simulation data. In this paper, we focus on patent keyword analysis as a form of text big data analysis, where the zero-inflated problem arises just as it does in other text data.
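
The zero-inflated problem described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code or data: the four toy documents and all variable names are hypothetical, chosen only to show why a document-keyword count matrix is mostly zeros when each word appears in only a few documents.

```python
from collections import Counter

# Toy stand-ins for patent abstracts (hypothetical, not the paper's data).
docs = [
    "battery electrode lithium cell",
    "wireless antenna signal network",
    "lithium battery charging circuit",
    "image sensor pixel signal",
]

# Vocabulary: every distinct word, sorted for a stable column order.
vocab = sorted({word for doc in docs for word in doc.split()})

# Document-keyword matrix: one row per document, one count per keyword.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

# Share of cells that are zero, i.e. keywords absent from a document.
n_cells = len(docs) * len(vocab)
n_zeros = sum(row.count(0) for row in matrix)
zero_fraction = n_zeros / n_cells

print(f"{len(docs)} documents x {len(vocab)} keywords, "
      f"{zero_fraction:.0%} of counts are zero")
```

Even in this tiny example most cells are zero. With a real patent corpus the vocabulary grows much faster than the number of words per document, so the share of zeros would be expected to be higher still.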

Suggested Citation

  • Daiho Uhm & Sunghae Jun, 2022. "Zero-Inflated Patent Data Analysis Using Generating Synthetic Samples," Future Internet, MDPI, vol. 14(7), pages 1-11, July.
  • Handle: RePEc:gam:jftint:v:14:y:2022:i:7:p:211-:d:864174

    Download full text from publisher

    File URL: https://www.mdpi.com/1999-5903/14/7/211/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/1999-5903/14/7/211/
    Download Restriction: no

    References listed on IDEAS

    1. Cameron,A. Colin & Trivedi,Pravin K., 2013. "Regression Analysis of Count Data," Cambridge Books, Cambridge University Press, number 9781107667273, September.
    2. Hilbe,Joseph M., 2014. "Modeling Count Data," Cambridge Books, Cambridge University Press, number 9781107611252, October.
    3. Vernic, Raluca, 2000. "A Multivariate Generalization of the Generalized Poisson Distribution," ASTIN Bulletin, Cambridge University Press, vol. 30(1), pages 57-67, May.
    4. Joshua Snoke & Gillian M. Raab & Beata Nowok & Chris Dibben & Aleksandra Slavkovic, 2018. "General and specific utility measures for synthetic data," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 181(3), pages 663-688, June.
    5. Hei Chia Wang & Yung Chang Chi & Ping Lun Hsin, 2018. "Constructing Patent Maps Using Text Mining to Sustainably Detect Potential Technological Opportunities," Sustainability, MDPI, vol. 10(10), pages 1-18, October.
    6. Cindy Xin Feng, 2021. "A comparison of zero-inflated and hurdle models for modeling zero-inflated count data," Journal of Statistical Distributions and Applications, Springer, vol. 8(1), pages 1-19, December.
    7. Feinerer, Ingo & Hornik, Kurt & Meyer, David, 2008. "Text Mining Infrastructure in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 25(i05).
    8. Nowok, Beata & Raab, Gillian M. & Dibben, Chris, 2016. "synthpop: Bespoke Creation of Synthetic Data in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 74(i11).

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Luiz Paulo Fávero & Joseph F. Hair & Rafael de Freitas Souza & Matheus Albergaria & Talles V. Brugni, 2021. "Zero-Inflated Generalized Linear Mixed Models: A Better Way to Understand Data Relationships," Mathematics, MDPI, vol. 9(10), pages 1-28, May.
    2. Moritz Berger & Gerhard Tutz, 2021. "Transition models for count data: a flexible alternative to fixed distribution models," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 30(4), pages 1259-1283, October.
    3. de Rezende, Rafael & Egert, Katharina & Marin, Ignacio & Thompson, Guilherme, 2022. "A white-boxed ISSM approach to estimate uncertainty distributions of Walmart sales," International Journal of Forecasting, Elsevier, vol. 38(4), pages 1460-1467.
    4. James Jackson & Robin Mitra & Brian Francis & Iain Dove, 2022. "Using saturated count models for user‐friendly synthesis of large confidential administrative databases," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 185(4), pages 1613-1643, October.
    5. Sinclair, Michael & Ghermandi, Andrea & Signorello, Giovanni & Giuffrida, Laura & De Salvo, Maria, 2022. "Valuing Recreation in Italy's Protected Areas Using Spatial Big Data," Ecological Economics, Elsevier, vol. 200(C).
    6. Brian Fogarty & David Kimball & Lea Kosnik, 2016. "The Media, Voter Fraud, and the 2012 Elections," Working Papers 1012, University of Missouri-St. Louis, Department of Economics.
    7. Tomáš Katrňák & Barbora Hubatková, 2022. "Does educational expansion decrease suicide rates in European countries? The compositional effect in educational stratification of suicides," Quality & Quantity: International Journal of Methodology, Springer, vol. 56(3), pages 923-947, June.
    8. Andrés García-Echalar & Tomás Rau, 2020. "The Effects of Increasing Penalties in Drunk Driving Laws—Evidence from Chile," IJERPH, MDPI, vol. 17(21), pages 1-16, November.
    9. Sangsung Park & Sunghae Jun, 2017. "Technology Analysis of Global Smart Light Emitting Diode (LED) Development Using Patent Data," Sustainability, MDPI, vol. 9(8), pages 1-15, August.
    10. Lall, Ashish, 2018. "Delays in the New York City metroplex," Transportation Research Part A: Policy and Practice, Elsevier, vol. 114(PA), pages 139-153.
    11. Sunghae Jun, 2018. "Bayesian Count Data Modeling for Finding Technological Sustainability," Sustainability, MDPI, vol. 10(9), pages 1-12, September.
    12. Mutz, Rüdiger & Daniel, Hans-Dieter, 2018. "The bibliometric quotient (BQ), or how to measure a researcher’s performance capacity: A Bayesian Poisson Rasch model," Journal of Informetrics, Elsevier, vol. 12(4), pages 1282-1295.
    13. Chiara Bocci & Laura Grassini & Emilia Rocco, 2021. "A multiple inflated negative binomial hurdle regression model: analysis of the Italians’ tourism behaviour during the Great Recession," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 30(4), pages 1109-1133, October.
    14. Smith, David M. & Faddy, Malcolm J., 2016. "Mean and Variance Modeling of Under- and Overdispersed Count Data," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 69(i06).
    15. Mihaela COVRIG & Dumitru BADEA, 2017. "Some Generalized Linear Models for the Estimation of the Mean Frequency of Claims in Motor Insurance," ECONOMIC COMPUTATION AND ECONOMIC CYBERNETICS STUDIES AND RESEARCH, Faculty of Economic Cybernetics, Statistics and Informatics, vol. 51(4), pages 91-107.
    16. Fu, Xiaolan & Fu, Xiaoqing (Maggie) & Ghauri, Pervez & Hou, Jun, 2022. "International collaboration and innovation: Evidence from a leading Chinese multinational enterprise," Journal of World Business, Elsevier, vol. 57(4).
    17. Jun-You Lin, 2021. "Collaboration exploitation and exploration: does a proactive search strategy matter?," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(10), pages 8295-8329, October.
    18. Brendan P. M. McCabe & Christopher L. Skeels, 2020. "Distributions You Can Count On …But What’s the Point?," Econometrics, MDPI, vol. 8(1), pages 1-36, March.
    19. Riccardo (Jack) Lucchetti & Luca Pedini, 2020. "ParMA: Parallelised Bayesian Model Averaging for Generalised Linear Models," Working Papers 2020:28, Department of Economics, University of Venice "Ca' Foscari".
    20. Furman, Edward & Landsman, Zinoviy, 2010. "Multivariate Tweedie distributions and some related capital-at-risk analyses," Insurance: Mathematics and Economics, Elsevier, vol. 46(2), pages 351-361, April.
