
Analysis of machine learning methods in the task of searching duplicates in the software code

Author

Listed:
  • Tetiana Kaliuzhna

    (National Technical University of Ukraine «Igor Sikorsky Kyiv Polytechnic Institute»)

  • Yevhenii Kubiuk

    (National Technical University of Ukraine «Igor Sikorsky Kyiv Polytechnic Institute»)

Abstract

The object of the study is Python code analyzed with machine learning methods to identify clones. The work examines machine learning methods for finding clones in program code and implements a decision tree model for this task. It also analyzes existing machine learning approaches to detecting duplicates in program code. The comparison identified the advantages and disadvantages of each algorithm, and the results are summarized in comparison tables. The analysis showed that the decision tree method gives the best result in the clone detection task and is the best choice in terms of both accuracy and implementation. The result of the work is a model that classifies cloned and non-cloned code with an accuracy of more than 99 % on an automatically generated dataset in a minimal amount of time. The system leaves several open questions for future research, which are listed in the work. The proposed model can be developed further in the following directions: recognition of clones rewritten from one programming language to another; detection of vulnerabilities in the code; improvement of model performance by creating more universal datasets. The prospects of the work lie in training a decision tree model for accurate and fast detection of code clones, which could be widely used for plagiarism detection in both educational institutions and IT companies.
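
The following is a minimal illustrative sketch of the general idea, not the authors' model. It assumes a single hypothetical feature (the overlap of AST node types between two Python snippets) and fits a depth-1 decision tree (a decision stump) in place of the full decision tree described in the abstract. The names `structural_similarity`, `fit_stump`, and `is_clone`, and the toy snippets, are illustrative assumptions.

```python
import ast
from collections import Counter


def node_type_counts(source: str) -> Counter:
    """Count AST node types, ignoring identifier names and literal values."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))


def structural_similarity(a: str, b: str) -> float:
    """Multiset Jaccard similarity of AST node-type counts, in [0, 1]."""
    ca, cb = node_type_counts(a), node_type_counts(b)
    intersection = sum((ca & cb).values())
    union = sum((ca | cb).values())
    return intersection / union if union else 1.0


def fit_stump(features, labels):
    """Fit a depth-1 decision tree: choose the threshold with fewest training errors."""
    best_threshold, best_errors = 0.0, len(labels) + 1
    values = sorted(set(features))
    # Candidate thresholds are midpoints between consecutive feature values.
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        errors = sum((f > threshold) != y for f, y in zip(features, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def is_clone(a: str, b: str, threshold: float) -> bool:
    """Predict 'clone' when the structural similarity exceeds the learned threshold."""
    return structural_similarity(a, b) > threshold


# Toy, hand-written training pairs (an assumption; the paper uses a generated dataset).
original = "def add(a, b):\n    return a + b\n"
renamed = "def plus(x, y):\n    return x + y\n"  # same structure, renamed identifiers
different = "def show(items):\n    for item in items:\n        print(item)\n"

pairs = [(original, renamed, True), (original, different, False)]
features = [structural_similarity(a, b) for a, b, _ in pairs]
labels = [label for _, _, label in pairs]
threshold = fit_stump(features, labels)

print(is_clone(original, renamed, threshold))
print(is_clone(original, different, threshold))
```

A real decision tree would split on several features at several depths. The stump is used here only to keep the sketch dependency-free and short.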

Suggested Citation

  • Tetiana Kaliuzhna & Yevhenii Kubiuk, 2022. "Analysis of machine learning methods in the task of searching duplicates in the software code," Technology audit and production reserves, PC TECHNOLOGY CENTER, vol. 4(2(66)), pages 6-13, August.
  • Handle: RePEc:baq:taprar:v:4:y:2022:i:2:p:6-13
    DOI: 10.15587/2706-5448.2022.263235

    Download full text from publisher

    File URL: https://journals.uran.ua/tarp/article/view/263235/260162
    Download Restriction: no

