Fast and Scalable Pattern Mining for Media-Type Focused Crawling
|dc.identifier.citation||Jürgen Umbrich, Marcel Karnstedt, Andreas Harth, "Fast and Scalable Pattern Mining for Media-Type Focused Crawling", KDML 2009: Knowledge Discovery, Data Mining, and Machine Learning, in conjunction with LWA 2009, 2009.||en|
|dc.description.abstract||Search engines targeting content other than hypertext documents require a crawler that discovers resources identifying files of certain media types. Naive crawling approaches do not guarantee a sufficient supply of new URIs (Uniform Resource Identifiers) to visit; effective and scalable mechanisms for discovering and crawling targeted resources are needed. One promising approach is to use data mining techniques to identify the media type of a resource without the need for downloading the content of the resource. The idea is to use a learning approach on features derived from patterns occurring in the resource identifier. We present a focused crawler as a use case for fast and scalable data mining and discuss classification and pattern mining techniques suited for selecting resources satisfying specified media types. We show that we can process an average of 17,000 URIs/second and still detect the media type of resources with a precision of more than 80% and a recall of over 65% for all media types.||en|
|dc.title||Fast and Scalable Pattern Mining for Media-Type Focused Crawling||en|
|dc.contributor.funder||Science Foundation Ireland||en|
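The abstract describes predicting a resource's media type from patterns in its URI, without downloading the content. The following is a minimal illustrative sketch of that general idea, not the authors' implementation: it assumes a simple tokenisation of the URI and a multinomial naive Bayes classifier with add-one smoothing. The names `uri_tokens`, `UriMediaTypeClassifier`, and the training URIs are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def uri_tokens(uri):
    """Split a URI into lowercase alphanumeric tokens (assumed feature scheme)."""
    return [t for t in re.split(r"[^a-z0-9]+", uri.lower()) if t]


class UriMediaTypeClassifier:
    """Multinomial naive Bayes over URI tokens (illustrative, not from the paper)."""

    def __init__(self):
        self.class_counts = Counter()             # labelled URIs seen per media type
        self.token_counts = defaultdict(Counter)  # token frequencies per media type
        self.vocab = set()                        # all tokens seen during training

    def train(self, labelled_uris):
        for uri, media_type in labelled_uris:
            self.class_counts[media_type] += 1
            for tok in uri_tokens(uri):
                self.token_counts[media_type][tok] += 1
                self.vocab.add(tok)

    def predict(self, uri):
        total = sum(self.class_counts.values())
        tokens = uri_tokens(uri)
        best, best_score = None, -math.inf
        for media_type, n in self.class_counts.items():
            counts = self.token_counts[media_type]
            # Add-one smoothing so unseen tokens do not zero out a class.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best, best_score = media_type, score
        return best


# Hypothetical training data: the label would normally come from the
# Content-Type header of resources that were already crawled.
TRAINING = [
    ("http://example.org/media/song01.mp3", "audio"),
    ("http://example.org/podcast/episode2.mp3", "audio"),
    ("http://example.org/video/clip.avi", "video"),
    ("http://example.org/movies/trailer.avi", "video"),
]

clf = UriMediaTypeClassifier()
clf.train(TRAINING)
print(clf.predict("http://example.net/files/track.mp3"))
print(clf.predict("http://example.net/watch/intro.avi"))
```

A focused crawler could use such a prediction to prioritise URIs likely to point at the targeted media types. The throughput, precision, and recall figures quoted in the abstract refer to the authors' system, not to this sketch.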
This item is available under the Attribution-NonCommercial-NoDerivs 3.0 Ireland license. No item may be reproduced for commercial purposes. Please refer to the publisher's URL where this is made available, or to notes contained in the item itself. Other terms may apply.