
dc.contributor.author: Umbrich, Jürgen [en]
dc.contributor.author: Karnstedt, Marcel [en]
dc.contributor.author: Harth, Andreas [en]
dc.identifier.citation: Jürgen Umbrich, Marcel Karnstedt, Andreas Harth "Fast and Scalable Pattern Mining for Media-Type Focused Crawling", KDML 2009: Knowledge Discovery, Data Mining, and Machine Learning, in conjunction with LWA 2009, 2009. [en]
dc.description.abstract: Search engines targeting content other than hypertext documents require a crawler that discovers resources identifying files of certain media types. Naive crawling approaches do not guarantee a sufficient supply of new URIs (Uniform Resource Identifiers) to visit; effective and scalable mechanisms for discovering and crawling targeted resources are needed. One promising approach is to use data mining techniques to identify the media type of a resource without the need for downloading the content of the resource. The idea is to use a learning approach on features derived from patterns occurring in the resource identifier. We present a focused crawler as a use case for fast and scalable data mining and discuss classification and pattern mining techniques suited for selecting resources satisfying specified media types. We show that we can process an average of 17,000 URIs/second and still detect the media type of resources with a precision of more than 80% and a recall of over 65% for all media types. [en]
dc.rights: Attribution-NonCommercial-NoDerivs 3.0 Ireland
dc.subject: Data mining [en]
dc.title: Fast and Scalable Pattern Mining for Media-Type Focused Crawling [en]
dc.type: Workshop paper [en]
dc.contributor.funder: Science Foundation Ireland [en]



Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 Ireland