µRaptor: A DOM-based system with appetite for hCard elements

Date
2014
Author
Muñoz, Emir
Costabello, Luca
Vandenbussche, Pierre-Yves
Recommended Citation
Muñoz, Emir, Costabello, Luca, & Vandenbussche, Pierre-Yves. (2014). µRaptor: a DOM-based system with appetite for hCard elements. Paper presented at the Proceedings of the Second International Conference on Linked Data for Information Extraction - Volume 1267, Riva del Garda, Italy.
Abstract
This paper describes µRaptor, a DOM-based method for extracting hCard microformats from HTML pages that have been stripped of microformat markup. µRaptor extracts DOM sub-trees, converts them into rules, and applies those rules to extract hCard microformats. In addition, we use co-occurring CSS classes to improve overall precision. Results on the training data show 0.96 precision and 0.83 F1 measure when considering only the most common tree patterns. Furthermore, we propose adopting additional constraint rules on the values of hCard elements to further improve the extraction.
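
The pipeline summarized in the abstract (extract DOM sub-trees, turn them into rules, apply the rules to pages without markup) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `signature`, `property_paths`, `learn_rules`, and `extract`, the reduced `HCARD_PROPS` set, and the tag-only matching scheme are all assumptions made for illustration, and the CSS co-occurrence and value-constraint steps are omitted.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: a small subset of hCard property class names, for illustration only.
HCARD_PROPS = {"fn", "org", "tel", "email", "adr", "url"}


class Node:
    def __init__(self, tag, classes=()):
        self.tag = tag
        self.classes = set(classes)
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a minimal DOM-like tree (assumes well-formed input, no void tags)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        node = Node(tag, classes)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        self.stack[-1].text += data.strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def signature(node):
    """Tag-only shape of a sub-tree, e.g. 'div(span,span)'."""
    if not node.children:
        return node.tag
    return node.tag + "(" + ",".join(signature(c) for c in node.children) + ")"


def property_paths(node, path=()):
    """Maps child-index paths inside a sub-tree to the hCard property found there."""
    found = {}
    for prop in node.classes & HCARD_PROPS:
        found[path] = prop
    for i, child in enumerate(node.children):
        found.update(property_paths(child, path + (i,)))
    return found


def learn_rules(annotated_pages, top_k=1):
    """Keeps only the top_k most common sub-tree shapes rooted at a 'vcard' element."""
    counts, paths = Counter(), {}
    for html in annotated_pages:
        for node in walk(parse(html)):
            if "vcard" in node.classes:
                sig = signature(node)
                counts[sig] += 1
                paths.setdefault(sig, property_paths(node))
    return {sig: paths[sig] for sig, _ in counts.most_common(top_k)}


def extract(html, rules):
    """Applies learned shapes to a page that carries no microformat classes."""
    cards = []
    for node in walk(parse(html)):
        rule = rules.get(signature(node))
        if rule is None:
            continue
        card = {}
        for path, prop in rule.items():
            target = node
            for i in path:
                target = target.children[i]
            card[prop] = target.text
        cards.append(card)
    return cards


# Made-up training pages (with markup) and one stripped test page.
train = [
    '<div class="vcard"><span class="fn">Ann Lee</span><span class="tel">555-0100</span></div>',
    '<div class="vcard"><span class="fn">Raj Rao</span><span class="tel">555-0101</span></div>',
    '<p class="vcard"><a class="fn">Bob Ito</a></p>',
]
rules = learn_rules(train, top_k=1)
cards = extract("<div><span>Cara Diaz</span><span>555-0199</span></div>", rules)
print(cards)
```

Matching on tag shape alone will also fire on unrelated sub-trees that happen to share the shape, which is presumably where signals such as co-occurring CSS classes and constraints on element values become useful for precision.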