Authors
- Vassilis Plachouras
(Institute for the Management of Information Systems, Athena Research and Innovation Center, Artemidos 6 & Epidavrou, Maroussi 15125, Greece)
- Florent Carpentier
(Internet Memory Foundation, 45 ter rue de la Révolution, 93100 Montreuil, France)
- Muhammad Faheem
(CNRS LTCI, Institut Mines-Télécom, Télécom ParisTech, 46 rue Barrault, 75634 Paris Cedex 13, France)
- Julien Masanès
(Internet Memory Foundation, 45 ter rue de la Révolution, 93100 Montreuil, France)
- Thomas Risse
(Research Center, University of Hannover, Appelstr. 9a, 30167 Hannover, Germany)
- Pierre Senellart
(CNRS LTCI, Institut Mines-Télécom, Télécom ParisTech, 46 rue Barrault, 75634 Paris Cedex 13, France)
- Patrick Siehndel
(Research Center, University of Hannover, Appelstr. 9a, 30167 Hannover, Germany)
- Yannis Stavrakas
(Institute for the Management of Information Systems, Athena Research and Innovation Center, Artemidos 6 & Epidavrou, Maroussi 15125, Greece)
Abstract
The World Wide Web is the largest information repository available today. However, this information is highly volatile, and Web archiving is essential to preserve it for the future. Existing approaches to Web archiving rely on simple definitions of the scope of Web pages to crawl and are limited to basic interactions with Web servers. The aim of the ARCOMEM project is to overcome these limitations and to provide flexible, adaptive and intelligent content acquisition, relying on social media to create topical Web archives. In this article, we focus on ARCOMEM’s crawling architecture. We introduce the overall architecture and describe its modules, such as the online analysis module, which computes a priority for the Web pages to be crawled, and the Application-Aware Helper, which takes into account the type of Web sites and applications to extract structure from crawled content. We also describe a large-scale distributed crawler that has been developed, as well as the modifications we have implemented to adapt Heritrix, an open-source crawler, to the needs of the project. Our experimental results from real crawls show that ARCOMEM’s crawling architecture is effective in acquiring focused information about a topic and in leveraging information from social media.
Suggested Citation
Vassilis Plachouras & Florent Carpentier & Muhammad Faheem & Julien Masanès & Thomas Risse & Pierre Senellart & Patrick Siehndel & Yannis Stavrakas, 2014.
"ARCOMEM Crawling Architecture,"
Future Internet, MDPI, vol. 6(3), pages 518-541, August.
Handle:
RePEc:gam:jftint:v:6:y:2014:i:3:p:518-541:d:39354