Author
Listed:
- Valerio Sterzi
(BSE - Bordeaux sciences économiques - UB - Université de Bordeaux - CNRS - Centre National de la Recherche Scientifique - INRAE - Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement)
- G.S. Ascione
(BSE - Bordeaux sciences économiques - UB - Université de Bordeaux - CNRS - Centre National de la Recherche Scientifique - INRAE - Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement)
Abstract
The problem of disambiguation of company names poses a significant challenge in extracting useful information from patents. This issue biases research outcomes as it mostly underesti mates the number of patents attributed to companies, particularly multinational corporations which file patents under a plethora of names, including alternate spellings of the same en tity and, eventually, companies' subsidiaries.To date, addressing these challenges has relied on labor-intensive dictionary based or string matching approaches, leaving the problem of patents' assignee harmonization on large datasets mostly unresolved. To bridge this gap, this paper describes the Terrorizer algorithm, a text-based algorithm that leverages natural language processing (NLP), network theory, and rule-based techniques to harmonize the vari ants of company names recorded as patent assignees. In particular, the algorithm follows the tripartite structure of its antecedents, namely parsing, matching and filtering stage, adding an original "knowledge augmentation" phase which is used to enrich the information available on each assignee name. We use Terrorizer on a set of 325'917 companies' names who are as signees of patents granted by the USPTO from 2005 to 2022. The performance of Terrorizer is evaluated on four gold standard datasets. This validation step shows us two main things: the first is that the performance of Terrorizer is similar over different kind of datasets, prov ing that our algorithm generalizes well. Second, when comparing its performance with the one of the algorithm currently used in PatentsView for the same task (Monath et al., 2021), it achieves a higher F1 score. Finally, we use the Tree-structured Parzen Estimator (TPE) optimization algorithm for the hyperparameters' tuning. Our final result is a reduction in the initial set of names of over 42%.
Suggested Citation
Valerio Sterzi & G.S. Ascione, 2024.
"Presenting Terrorizer : an algorithm for consolidating company names in patent assignees,"
Working Papers
hal-04525318, HAL.
Handle:
RePEc:hal:wpaper:hal-04525318
Note: View the original document on HAL open archive server: https://hal.science/hal-04525318
Download full text from publisher
Corrections
All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:hal:wpaper:hal-04525318. See general information about how to correct material in RePEc.
If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.
We have no bibliographic references for this item. You can help adding them by using this form .
If you know of missing items citing this one, you can help us creating those links by adding the relevant references in the same way as above, for each refering item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.
For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: CCSD (email available below). General contact details of provider: https://hal.archives-ouvertes.fr/ .
Please note that corrections may take a couple of weeks to filter through
the various RePEc services.