
An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing

Author

Listed:
  • Gonçalo Carnaz

    (Informatics Department, University of Évora, 7002-554 Évora, Portugal)

  • Mário Antunes

    (Computer Science and Communication Research Centre (CIIC), School of Technology and Management, Polytechnic of Leiria, 2411-901 Leiria, Portugal;
    INESC TEC, CRACS, 4200-465 Porto, Portugal)

  • Vitor Beires Nogueira

    (Informatics Department, University of Évora, 7002-554 Évora, Portugal)

Abstract

Criminal investigation collects and analyzes the facts related to a crime, from which investigators can deduce evidence to be used in court. It is a multidisciplinary applied science that includes interviews, interrogations, evidence collection, preservation of the chain of custody, and other investigative methods and techniques. These techniques produce both digital and paper documents that have to be carefully analyzed to identify correlations and interactions among suspects, places, license plates, and other entities mentioned in the investigation. Computerized processing of these documents helps the criminal investigation, as it allows the automatic identification of entities and their relations, some of which are difficult to identify manually. A wide set of dedicated tools exists, but they share a major limitation: they are unable to process criminal reports in the Portuguese language, because no annotated corpus exists for that purpose. This paper presents an annotated corpus composed of a collection of anonymized crime-related documents extracted from official and open sources. The dataset was produced as the result of an exploratory initiative to collect crime-related data from websites and conditioned-access police reports. The dataset was evaluated, and a mean precision of 0.808, recall of 0.722, and F1-score of 0.733 were obtained for the classification of the annotated named entities present in the crime-related documents. This corpus can be employed to benchmark Machine Learning (ML) and Natural Language Processing (NLP) methods and tools that detect and correlate entities in the documents. Example tasks are sentence detection, named-entity recognition, and identification of terms related to the criminal domain.
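
The abstract reports mean precision, recall, and F1-score but does not say how the means were taken. One reading, offered here only as an assumption, is that each metric is computed per entity type and then averaged (macro-averaging). That reading would fit the figures: the harmonic mean of 0.808 and 0.722 is about 0.763, not 0.733, and an average of per-class F1 scores need not equal the F1 of the averaged precision and recall. The minimal sketch below illustrates this with hypothetical toy labels (`PERSON`, `PLACE`, `PLATE`); the label names, the data, and the helper functions are illustrative and are not taken from the paper or its corpus.

```python
from collections import Counter


def per_class_scores(gold, pred):
    """Return {label: (precision, recall, f1)} for two parallel label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for label g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was missed
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_average(scores):
    """Unweighted mean of precision, recall, and F1 over all labels."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# Hypothetical toy annotations, not data from the corpus.
gold = ["PERSON", "PERSON", "PLACE", "PLACE", "PLATE", "PLATE"]
pred = ["PERSON", "PLACE", "PLACE", "PLACE", "PLATE", "PERSON"]

scores = per_class_scores(gold, pred)
macro_p, macro_r, macro_f1 = macro_average(scores)

# F1 computed from the averaged precision and recall, for comparison.
f1_of_means = 2 * macro_p * macro_r / (macro_p + macro_r)

print(f"macro precision = {macro_p:.3f}")
print(f"macro recall    = {macro_r:.3f}")
print(f"macro F1        = {macro_f1:.3f}")
print(f"F1 of the means = {f1_of_means:.3f}")
```

On this toy input the macro F1 (about 0.656) is lower than the F1 computed from the macro precision and recall (about 0.693), the same kind of gap seen in the reported figures.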

Suggested Citation

  • Gonçalo Carnaz & Mário Antunes & Vitor Beires Nogueira, 2021. "An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing," Data, MDPI, vol. 6(7), pages 1-11, June.
  • Handle: RePEc:gam:jdataj:v:6:y:2021:i:7:p:71-:d:582700

    Download full text from publisher

    File URL: https://www.mdpi.com/2306-5729/6/7/71/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2306-5729/6/7/71/
    Download Restriction: no

    Citations


    Cited by:

    1. Akira A. de Moura Galvão Uematsu & Anarosa A. F. Brandão, 2023. "eMailMe: A Method to Build Datasets of Corporate Emails in Portuguese," Data, MDPI, vol. 8(8), pages 1-12, July.
