
Quantitative evaluation of web metrics for automatic genre classification of web pages

Authors
  • Ruchika Malhotra

    (Delhi Technological University)

  • Anjali Sharma

    (Dr K S Krishnan Marg)

Abstract

One way to help a web search engine respond quickly and relevantly is to assign a genre class to each web page. Web genre classification distinguishes between pages by features such as functionality, style, presentation layout, form and meta-content rather than by content. In this work, nineteen web metrics are identified according to the lexical, structural and functionality attributes of the web page rather than its topic. The study determines which of these attribute groups (lexical, structural and functionality), or which combinations of them, are significant for developing a web genre classification model. We also investigate the best web genre prediction model using parametric (Logistic Regression), non-parametric (Decision Tree) and ensemble (Bagging, Boosting) machine learning algorithms. We built forty-two genre classification models to classify web pages into the Movie, TV or Music genre using sample data extracted from Pixel Awards nominated and award-winning websites. The area under the curve analysis of these forty-two models shows that the ensemble algorithms provide better performance. The remaining models perform acceptably only when the lexical and structural attributes are fed in combination. Functionality metrics were found to considerably degrade performance, irrespective of the algorithm used. Overall, the results indicate that machine learning models can predict web genre, provided the input metrics are chosen appropriately.
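
As a rough illustration of the evaluation described in the abstract, the sketch below enumerates the non-empty combinations of the three attribute groups and computes an area under the ROC curve from scores. It is not the authors' code: the abstract does not say how the forty-two models are composed or which tools were used, so the function names (`group_combinations`, `auc`) and the toy labels and scores are assumptions made purely for illustration.

```python
from itertools import combinations

# The three attribute groups named in the abstract.
GROUPS = ("lexical", "structural", "functionality")


def group_combinations(groups=GROUPS):
    """Return every non-empty combination of attribute groups (hypothetical helper)."""
    return [
        combo
        for size in range(1, len(groups) + 1)
        for combo in combinations(groups, size)
    ]


def auc(labels, scores):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative).

    Computed as the fraction of positive/negative pairs in which the positive
    example receives the higher score, with ties counted as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    for combo in group_combinations():
        print(" + ".join(combo))
    # Toy labels and scores, invented for illustration only.
    print(auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))
```

In a real comparison, each combination would select the corresponding metric columns, a classifier would be trained on them, and its held-out scores would be passed to an AUC routine such as the one above.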

Suggested Citation

  • Ruchika Malhotra & Anjali Sharma, 2017. "Quantitative evaluation of web metrics for automatic genre classification of web pages," International Journal of System Assurance Engineering and Management, Springer;The Society for Reliability, Engineering Quality and Operations Management (SREQOM),India, and Division of Operation and Maintenance, Lulea University of Technology, Sweden, vol. 8(2), pages 1567-1579, November.
  • Handle: RePEc:spr:ijsaem:v:8:y:2017:i:2:d:10.1007_s13198-017-0629-1
    DOI: 10.1007/s13198-017-0629-1

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s13198-017-0629-1
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Jacques Savoy & Olena Zubaryeva, 2012. "Simple and efficient classification scheme based on specific vocabulary," Computational Management Science, Springer, vol. 9(3), pages 401-415, August.
    2. Hanan Al-Mofareji & Mahmoud Kamel & Mohamed Y. Dahab, 2017. "WeDoCWT: A New Method for Web Document Clustering Using Discrete Wavelet Transforms," Journal of Information & Knowledge Management (JIKM), World Scientific Publishing Co. Pte. Ltd., vol. 16(01), pages 1-19, March.
    3. Rutherford, Brian A., 2013. "A genre-theoretic approach to financial reporting research," The British Accounting Review, Elsevier, vol. 45(4), pages 297-310.
