ARAN - Access to Research at NUI Galway

Fast and Scalable Pattern Mining for Media-Type Focused Crawling

dc.contributor.author Umbrich, Jürgen en
dc.contributor.author Karnstedt, Marcel en
dc.contributor.author Harth, Andreas en
dc.date.accessioned 2010-05-24T13:50:26Z en
dc.date.available 2010-05-24T13:50:26Z en
dc.date.issued 2009 en
dc.identifier.citation Jürgen Umbrich, Marcel Karnstedt, Andreas Harth "Fast and Scalable Pattern Mining for Media-Type Focused Crawling", KDML 2009: Knowledge Discovery, Data Mining, and Machine Learning, in conjunction with LWA 2009, 2009. en
dc.identifier.uri en
dc.description.abstract Search engines targeting content other than hypertext documents require a crawler that discovers resources identifying files of certain media types. Naive crawling approaches do not guarantee a sufficient supply of new URIs (Uniform Resource Identifiers) to visit; effective and scalable mechanisms for discovering and crawling targeted resources are needed. One promising approach is to use data mining techniques to identify the media type of a resource without the need for downloading the content of the resource. The idea is to use a learning approach on features derived from patterns occurring in the resource identifier. We present a focused crawler as a use case for fast and scalable data mining and discuss classification and pattern mining techniques suited for selecting resources satisfying specified media types. We show that we can process an average of 17,000 URIs/second and still detect the media type of resources with a precision of more than 80% and a recall of over 65% for all media types. en
dc.format application/pdf en
dc.language.iso en en
dc.subject Data mining en
dc.subject DERI en
dc.title Fast and Scalable Pattern Mining for Media-Type Focused Crawling en
dc.type Workshop paper en
dc.description.peer-reviewed peer-reviewed en
dc.contributor.funder Clique en
dc.contributor.funder Science Foundation Ireland en
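
The abstract's core idea, predicting a media type from patterns in the resource identifier without downloading the resource, can be illustrated with a minimal sketch. This is not the authors' implementation: the tokenization rule, the naive Bayes classifier, the class and function names, and the example URIs are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def uri_tokens(uri):
    """Split a URI into lowercase alphanumeric tokens (an assumed feature set)."""
    return [t for t in re.split(r"[^a-z0-9]+", uri.lower()) if t]


class UriMediaTypeClassifier:
    """Multinomial naive Bayes over URI tokens, with add-one smoothing."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # media type -> token frequencies
        self.class_counts = Counter()             # media type -> number of training URIs
        self.vocab = set()                        # all tokens seen during training

    def train(self, uri, media_type):
        tokens = uri_tokens(uri)
        self.class_counts[media_type] += 1
        self.token_counts[media_type].update(tokens)
        self.vocab.update(tokens)

    def predict(self, uri):
        """Return the most probable media type using only the URI string."""
        tokens = uri_tokens(uri)
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for media_type, n in self.class_counts.items():
            score = math.log(n / total)
            denom = sum(self.token_counts[media_type].values()) + len(self.vocab)
            for t in tokens:
                score += math.log((self.token_counts[media_type][t] + 1) / denom)
            if score > best_score:
                best, best_score = media_type, score
        return best


# Hypothetical training data: URIs whose media type is already known.
clf = UriMediaTypeClassifier()
for uri, media_type in [
    ("http://example.org/media/podcast/episode1.mp3", "audio"),
    ("http://example.org/audio/song.ogg", "audio"),
    ("http://example.org/video/clip.avi", "video"),
    ("http://example.org/media/movies/trailer.mpg", "video"),
    ("http://example.org/images/photo.jpg", "image"),
    ("http://example.org/gallery/picture.png", "image"),
]:
    clf.train(uri, media_type)

# Classify an unseen URI without fetching it.
predicted = clf.predict("http://example.com/audio/talk.mp3")
```

In a focused crawler, a prediction like this could be used to prioritize which newly discovered URIs to visit; how the paper's system actually ranks or filters URIs is not stated in this record.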
