Fast and Scalable Pattern Mining for Media-Type Focused Crawling

Umbrich, Jürgen; Karnstedt, Marcel; Harth, Andreas

Date

2009

Recommended Citation

Jürgen Umbrich, Marcel Karnstedt, Andreas Harth, "Fast and Scalable Pattern Mining for Media-Type Focused Crawling", KDML 2009: Knowledge Discovery, Data Mining, and Machine Learning, in conjunction with LWA 2009, 2009.

Abstract

Search engines targeting content other than hypertext documents require a crawler that discovers resources identifying files of certain media types. Naive crawling approaches do not guarantee a sufficient supply of new URIs (Uniform Resource Identifiers) to visit; effective and scalable mechanisms for discovering and crawling targeted resources are needed. One promising approach is to use data mining techniques to identify the media type of a resource without the need for downloading the content of the resource. The idea is to use a learning approach on features derived from patterns occurring in the resource identifier. We present a focused crawler as a use case for fast and scalable data mining and discuss classification and pattern mining techniques suited for selecting resources satisfying specified media types. We show that we can process an average of 17,000 URIs/second and still detect the media type of resources with a precision of more than 80% and a recall of over 65% for all media types.
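The core idea of the abstract, predicting a media type from the resource identifier alone, can be illustrated with a minimal sketch. This is not the authors' implementation: the tokenisation scheme, the choice of a multinomial naive Bayes classifier with add-one smoothing, the class name NaiveBayesUriClassifier, the helper uri_tokens, the example.org URIs, and the labels "audio", "video", and "feed" are all assumptions made for illustration. The abstract does not state which features or classifiers the paper uses.

```python
import math
import re
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def uri_tokens(uri):
    """Split a URI into lowercase tokens (hypothetical feature scheme).

    Tokens come from the host, path, and query; the file extension, if any,
    is added as a separate 'ext:' feature.
    """
    parts = urlsplit(uri)
    text = " ".join([parts.netloc, parts.path, parts.query]).lower()
    tokens = [t for t in re.split(r"[^a-z0-9]+", text) if t]
    last_segment = parts.path.rsplit("/", 1)[-1]
    if "." in last_segment:
        tokens.append("ext:" + last_segment.rsplit(".", 1)[-1].lower())
    return tokens


class NaiveBayesUriClassifier:
    """Multinomial naive Bayes over URI tokens with add-one smoothing."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token frequencies
        self.class_counts = Counter()             # label -> number of URIs
        self.vocabulary = set()

    def train(self, labelled_uris):
        for uri, media_type in labelled_uris:
            tokens = uri_tokens(uri)
            self.class_counts[media_type] += 1
            self.token_counts[media_type].update(tokens)
            self.vocabulary.update(tokens)

    def predict(self, uri):
        """Return the most likely label without downloading the resource."""
        tokens = uri_tokens(uri)
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            score = math.log(n / total)
            denom = sum(self.token_counts[label].values()) + len(self.vocabulary)
            for tok in tokens:
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up training data; a real crawler would learn from URIs whose
# media type was observed when the resource was fetched.
TRAINING = [
    ("http://example.org/music/track01.mp3", "audio"),
    ("http://example.org/podcasts/episode-7.mp3", "audio"),
    ("http://example.org/videos/clip.mp4", "video"),
    ("http://media.example.org/watch/video/intro.mp4", "video"),
    ("http://example.org/feeds/news.rss", "feed"),
    ("http://example.org/blog/feed/atom.xml", "feed"),
]

classifier = NaiveBayesUriClassifier()
classifier.train(TRAINING)
print(classifier.predict("http://example.net/audio/song.mp3"))
```

In a focused crawler, a prediction of this kind could be used to prioritise or filter the URI queue, so that only identifiers likely to point at the targeted media types are fetched. Because classification involves only string operations on the URI, it is cheap compared with downloading content, which is the property the abstract's throughput figure relies on.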

URI

http://hdl.handle.net/10379/1121

Collections

Data Science Institute (Workshop Papers)

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 Ireland