
dc.contributor.author: Umbrich, Jürgen [en]
dc.contributor.author: Harth, Andreas [en]
dc.contributor.author: Hogan, Aidan [en]
dc.contributor.author: Decker, Stefan [en]
dc.identifier.citation: Jürgen Umbrich, Andreas Harth, Aidan Hogan, Stefan Decker "Four Heuristics to Guide Structured Content Crawling", 8th International Conference on Web Engineering (Short Paper), 2008. [en]
dc.description.abstract: Search engines focusing on particular media types face difficulties in discovering suitable URIs on the Web. Since the engines are only interested in a small fraction of the Web, a crawler should use heuristics to concentrate on that fraction. To devise such a heuristic, we postulate four hypotheses based on RFCs and W3C recommendations to find cues for certain content types. Tests on a corpus of 22m files (793GB content size) containing 630m URIs show that for the content types text, image, and application, the recommendations are mostly being followed, while results for audio and video are much less consistent. Our findings and recommendations can be implemented as heuristics for efficient discovery of structured content on the Web on top of existing crawlers. [en]
dc.rights: Attribution-NonCommercial-NoDerivs 3.0 Ireland
dc.subject.lcsh: Uniform Resource Identifiers [en]
dc.subject.lcsh: Search engines [en]
dc.title: Four Heuristics to Guide Structured Content Crawling [en]
dc.type: Conference Paper [en]
dc.local.publisherstatement: ©2008 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE. [en]
dc.contributor.funder: Science Foundation Ireland [en]

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 Ireland