
A Hybrid Distance Measure for Clustering Expressed Sequence Tags Originating from the Same Gene Family

Author

Listed:
  • Keng-Hoong Ng
  • Chin-Kuan Ho
  • Somnuk Phon-Amnuaisuk

Abstract

Background: Clustering is a key step in the processing of Expressed Sequence Tags (ESTs). The primary goal of clustering is to put ESTs from the same transcript of a single gene into a unique cluster. Recent EST clustering algorithms mostly adopt alignment-free distance measures, which tend to yield acceptable clustering accuracy in reasonable computational time. Although these clustering methods work satisfactorily on a majority of EST datasets, they share a common weakness: they are prone to deliver unsatisfactory clustering results when dealing with ESTs from genes that belong to the same family. The root cause is that the distance measures they apply are not sensitive enough to separate these closely related genes.

Methodology/Principal Findings: We propose a hybrid distance measure that combines global and local features extracted from ESTs, with the aim of addressing the clustering problem faced by ESTs derived from the same gene family. The clustering process is implemented using the DBSCAN algorithm. We test the hybrid distance measure on ten EST datasets and compare the clustering results with those of two alignment-free EST clustering tools, wcd and PEACE. The results indicate that the proposed hybrid distance measure achieves relatively better clustering accuracy than both tools.

Conclusions/Significance: The clustering results support the effectiveness of the proposed hybrid distance measure in solving the clustering problem for ESTs that originate from the same gene family. The improved clustering accuracy on the experimental datasets supports the claim that the hybrid distance measure is sensitive enough to solve this clustering problem.
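
The abstract does not say which global and local features are used or how they are combined, so the following Python sketch is only a rough illustration of the general idea, not the authors' method. It assumes a k-mer frequency profile as the global feature, the longest shared substring as the local feature, and an equal-weight sum as the combination. The function names, the weight `w`, the k-mer size `k`, and the DBSCAN parameters `eps` and `min_pts` are all hypothetical choices made for this example.

```python
from collections import Counter
from difflib import SequenceMatcher
from math import sqrt


def kmer_profile(seq, k=3):
    """Relative k-mer frequencies (assumes len(seq) >= k)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def global_distance(a, b, k=3):
    """Assumed global feature: Euclidean distance between k-mer profiles."""
    pa, pb = kmer_profile(a, k), kmer_profile(b, k)
    return sqrt(sum((pa.get(x, 0.0) - pb.get(x, 0.0)) ** 2
                    for x in set(pa) | set(pb)))


def local_distance(a, b):
    """Assumed local feature: 1 - (longest common substring / shorter length)."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return 1.0 - match.size / min(len(a), len(b))


def hybrid_distance(a, b, w=0.5, k=3):
    """Hypothetical combination: weighted sum of the two distances."""
    return w * global_distance(a, b, k) + (1 - w) * local_distance(a, b)


def dbscan(dist, eps, min_pts):
    """Minimal DBSCAN over a precomputed distance matrix; -1 marks noise."""
    n = len(dist)
    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for p in range(n):
        if labels[p] is not None:
            continue
        neighbors = [q for q in range(n) if dist[p][q] <= eps]
        if len(neighbors) < min_pts:
            labels[p] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[p] = cluster
        queue = [q for q in neighbors if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster  # noise reached from a core point
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbors = [r for r in range(n) if dist[q][r] <= eps]
            if len(q_neighbors) >= min_pts:
                queue.extend(q_neighbors)  # q is a core point: keep expanding
    return labels


# Toy sequences invented for illustration; they are not real ESTs.
ests = ["ACGTACGTTGCA", "CGTACGTTGCAA", "TTTTGGGGCCCC", "TTTGGGGCCCCA"]
dist = [[hybrid_distance(a, b) for b in ests] for a in ests]
print(dbscan(dist, eps=0.6, min_pts=2))
```

In this sketch the global term ranges from 0 to the square root of 2 and the local term from 0 to 1, so the weight `w` and the threshold `eps` would need tuning on real data. A library implementation such as scikit-learn's `DBSCAN` with `metric="precomputed"` could replace the hand-written `dbscan` function.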

Suggested Citation

  • Keng-Hoong Ng & Chin-Kuan Ho & Somnuk Phon-Amnuaisuk, 2012. "A Hybrid Distance Measure for Clustering Expressed Sequence Tags Originating from the Same Gene Family," PLOS ONE, Public Library of Science, vol. 7(10), pages 1-14, October.
  • Handle: RePEc:plo:pone00:0047216
    DOI: 10.1371/journal.pone.0047216

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0047216
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0047216&type=printable
    Download Restriction: no

    File URL: https://libkey.io/10.1371/journal.pone.0047216?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Qian, Kun & Luan, Yihui, 2017. "Weighted measures based on maximizing deviation for alignment-free sequence comparison," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 481(C), pages 235-242.
    2. DE CNUDDE, Sofie & MARTENS, David & EVGENIOU, Theodoros & PROVOST, Foster, 2017. "A benchmarking study of classification techniques for behavioral data," Working Papers 2017005, University of Antwerp, Faculty of Business and Economics.
    3. Harshita Patel & Dharmendra Singh Rajput & G Thippa Reddy & Celestine Iwendi & Ali Kashif Bashir & Ohyun Jo, 2020. "A review on classification of imbalanced data for wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 16(4), pages 15501477209, April.
    4. Qi Liu & Gengzhong Feng & Nengmin Wang & Giri Kumar Tayi, 2018. "A multi-objective model for discovering high-quality knowledge based on data quality and prior knowledge," Information Systems Frontiers, Springer, vol. 20(2), pages 401-416, April.
    5. Liao, Jui-Jung & Shih, Ching-Hui & Chen, Tai-Feng & Hsu, Ming-Fu, 2014. "An ensemble-based model for two-class imbalanced financial problem," Economic Modelling, Elsevier, vol. 37(C), pages 175-183.
    6. Vilém Novák & Soheyla Mirshahi, 2021. "On the Similarity and Dependence of Time Series," Mathematics, MDPI, vol. 9(5), pages 1-14, March.
    7. Riesgo García, María Victoria & Krzemień, Alicja & Manzanedo del Campo, Miguel Ángel & Escanciano García-Miranda, Carmen & Sánchez Lasheras, Fernando, 2018. "Rare earth elements price forecasting by means of transgenic time series developed with ARIMA models," Resources Policy, Elsevier, vol. 59(C), pages 95-102.
    8. Pancheng Wang & Shasha Li & Haifang Zhou & Jintao Tang & Ting Wang, 2019. "Cited text spans identification with an improved balanced ensemble model," Scientometrics, Springer;Akadémiai Kiadó, vol. 120(3), pages 1111-1145, September.
    9. Peter B. Gilbert & Chunyuan Wu & David V. Jobes, 2008. "Genome Scanning Tests for Comparing Amino Acid Sequences Between Groups," Biometrics, The International Biometric Society, vol. 64(1), pages 198-207, March.
    10. Ionuţ ŢĂRANU, 2016. "Data mining in healthcare: decision making and precision," Database Systems Journal, Academy of Economic Studies - Bucharest, Romania, vol. 6(4), pages 33-40, May.
    11. repec:jss:jstsof:35:i05 is not listed on IDEAS
    12. Li, Hailin, 2017. "Distance measure with improved lower bound for multivariate time series," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 468(C), pages 622-637.
    13. Qi Liu & Gengzhong Feng & Nengmin Wang & Giri Kumar Tayi, 0. "A multi-objective model for discovering high-quality knowledge based on data quality and prior knowledge," Information Systems Frontiers, Springer, vol. 0, pages 1-16.
    14. Hady Suryono & Heri Kuswanto & Nur Iriawan, 2022. "Two-Phase Stratified Random Forest for Paddy Growth Phase Classification: A Case of Imbalanced Data," Sustainability, MDPI, vol. 14(22), pages 1-13, November.
    15. Yan Li & Manoj Thomas & Kweku-Muata Osei-Bryson & Jason Levy, 2016. "Problem Formulation in Knowledge Discovery via Data Analytics (KDDA) for Environmental Risk Management," IJERPH, MDPI, vol. 13(12), pages 1-17, December.
    16. Neda Abdelhamid & Arun Padmavathy & David Peebles & Fadi Thabtah & Daymond Goulder-Horobin, 2020. "Data Imbalance in Autism Pre-Diagnosis Classification Systems: An Experimental Study," Journal of Information & Knowledge Management (JIKM), World Scientific Publishing Co. Pte. Ltd., vol. 19(01), pages 1-16, March.
    17. Marcello D’Agostino & Valentino Dardanoni, 2009. "What’s so special about Euclidean distance?," Social Choice and Welfare, Springer;The Society for Social Choice and Welfare, vol. 33(2), pages 211-233, August.
