Zero-Inflated Patent Data Analysis Using Generating Synthetic Samples

Author

Daiho Uhm
(Department of Mathematics, University of Arkansas—Fort Smith, Fort Smith, AR 72913, USA)
Sunghae Jun
(Department of Big Data and Statistics, Cheongju University, Chungbuk 28503, Korea)

Abstract

With the expansion of the internet, we encounter various types of big data, such as web documents and sensor data. Compared to traditional small data such as experimental samples, big data offer more opportunities to find hidden and novel patterns through analysis with statistics and machine learning algorithms. However, as the use of big data increases, problems also arise. One of them is the zero-inflated problem in structured data preprocessed from big data: when documents are converted to keyword counts, most count values are zeros because a specific word is found in only some documents. Patent data are especially affected by this problem, since most of them are in the form of text documents. In this paper, we focus on patent keyword analysis as a form of text big data analysis, where we encounter the zero-inflated problem just as with other text data. To solve this problem, we propose generating synthetic samples using statistical inference and a tree structure. Using patent documents and simulation data, we verify the performance and validity of our proposed method.
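
As a minimal illustration of the zero-inflated problem described in the abstract, the following Python sketch builds a document-keyword count matrix from a small, hypothetical corpus and measures the share of zero cells. The corpus, the function names `document_term_matrix` and `zero_fraction`, and the whitespace tokenization are assumptions made for illustration; this is not the authors' preprocessing pipeline or their synthetic-sample method.

```python
from collections import Counter

# Hypothetical toy corpus standing in for patent abstracts (not from the paper).
docs = [
    "battery electrode coating method",
    "wireless sensor network protocol",
    "battery charging circuit control",
    "image sensor pixel array",
]


def document_term_matrix(documents):
    """Return (sorted vocabulary, one row of keyword counts per document)."""
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing keywords count as 0
        rows.append([counts[word] for word in vocab])
    return vocab, rows


def zero_fraction(rows):
    """Share of cells in the count matrix that are exactly zero."""
    cells = [value for row in rows for value in row]
    return sum(1 for value in cells if value == 0) / len(cells)


vocab, matrix = document_term_matrix(docs)
print(f"{len(vocab)} keywords, {zero_fraction(matrix):.2f} of cells are zero")
```

Even with four short documents, each keyword appears in only one or two of them, so most cells in the matrix are zero. With thousands of documents and keywords the proportion of zeros typically grows further, which is the situation the abstract refers to.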

Suggested Citation

Daiho Uhm & Sunghae Jun, 2022. "Zero-Inflated Patent Data Analysis Using Generating Synthetic Samples," Future Internet, MDPI, vol. 14(7), pages 1-11, July.

Handle: RePEc:gam:jftint:v:14:y:2022:i:7:p:211-:d:864174

Most related items

These are the items that most often cite the same works as this one and are cited by the same works as this one.

Luiz Paulo Fávero & Joseph F. Hair & Rafael de Freitas Souza & Matheus Albergaria & Talles V. Brugni, 2021. "Zero-Inflated Generalized Linear Mixed Models: A Better Way to Understand Data Relationships," Mathematics, MDPI, vol. 9(10), pages 1-28, May.
Moritz Berger & Gerhard Tutz, 2021. "Transition models for count data: a flexible alternative to fixed distribution models," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 30(4), pages 1259-1283, October.
de Rezende, Rafael & Egert, Katharina & Marin, Ignacio & Thompson, Guilherme, 2022. "A white-boxed ISSM approach to estimate uncertainty distributions of Walmart sales," International Journal of Forecasting, Elsevier, vol. 38(4), pages 1460-1467.
James Jackson & Robin Mitra & Brian Francis & Iain Dove, 2022. "Using saturated count models for user‐friendly synthesis of large confidential administrative databases," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 185(4), pages 1613-1643, October.
Sinclair, Michael & Ghermandi, Andrea & Signorello, Giovanni & Giuffrida, Laura & De Salvo, Maria, 2022. "Valuing Recreation in Italy's Protected Areas Using Spatial Big Data," Ecological Economics, Elsevier, vol. 200(C).
Brian Fogarty & David Kimball & Lea Kosnik, 2016. "The Media, Voter Fraud, and the 2012 Elections," Working Papers 1012, University of Missouri-St. Louis, Department of Economics.
Tomáš Katrňák & Barbora Hubatková, 2022. "Does educational expansion decrease suicide rates in European countries? The compositional effect in educational stratification of suicides," Quality & Quantity: International Journal of Methodology, Springer, vol. 56(3), pages 923-947, June.
Andrés García-Echalar & Tomás Rau, 2020. "The Effects of Increasing Penalties in Drunk Driving Laws—Evidence from Chile," IJERPH, MDPI, vol. 17(21), pages 1-16, November.
Sangsung Park & Sunghae Jun, 2017. "Technology Analysis of Global Smart Light Emitting Diode (LED) Development Using Patent Data," Sustainability, MDPI, vol. 9(8), pages 1-15, August.
Lall, Ashish, 2018. "Delays in the New York City metroplex," Transportation Research Part A: Policy and Practice, Elsevier, vol. 114(PA), pages 139-153.
Sunghae Jun, 2018. "Bayesian Count Data Modeling for Finding Technological Sustainability," Sustainability, MDPI, vol. 10(9), pages 1-12, September.
Mutz, Rüdiger & Daniel, Hans-Dieter, 2018. "The bibliometric quotient (BQ), or how to measure a researcher’s performance capacity: A Bayesian Poisson Rasch model," Journal of Informetrics, Elsevier, vol. 12(4), pages 1282-1295.
Chiara Bocci & Laura Grassini & Emilia Rocco, 2021. "A multiple inflated negative binomial hurdle regression model: analysis of the Italians’ tourism behaviour during the Great Recession," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 30(4), pages 1109-1133, October.
Smith, David M. & Faddy, Malcolm J., 2016. "Mean and Variance Modeling of Under- and Overdispersed Count Data," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 69(i06).
Li, Lisa & Shalaby, Amer, 2024. "Navigating the transit network: Understanding riders’ information seeking behavior using trip planning data," Transportation Research Part A: Policy and Practice, Elsevier, vol. 185(C).
Mihaela COVRIG & Dumitru BADEA, 2017. "Some Generalized Linear Models for the Estimation of the Mean Frequency of Claims in Motor Insurance," ECONOMIC COMPUTATION AND ECONOMIC CYBERNETICS STUDIES AND RESEARCH, Faculty of Economic Cybernetics, Statistics and Informatics, vol. 51(4), pages 91-107.
Fu, Xiaolan & Fu, Xiaoqing (Maggie) & Ghauri, Pervez & Hou, Jun, 2022. "International collaboration and innovation: Evidence from a leading Chinese multinational enterprise," Journal of World Business, Elsevier, vol. 57(4).
Jun-You Lin, 2021. "Collaboration exploitation and exploration: does a proactive search strategy matter?," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(10), pages 8295-8329, October.
Brendan P. M. McCabe & Christopher L. Skeels, 2020. "Distributions You Can Count On …But What’s the Point?," Econometrics, MDPI, vol. 8(1), pages 1-36, March.
Riccardo (Jack) Lucchetti & Luca Pedini, 2020. "ParMA: Parallelised Bayesian Model Averaging for Generalised Linear Models," Working Papers 2020:28, Department of Economics, University of Venice "Ca' Foscari".

More about this item

Keywords

zero-inflated data; synthetic sample; patent analysis; count data; classification and regression trees