Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora
The Web contains a vast amount of information on an abundance of topics, much of which is encoded as structured data indexed by local databases. However, these databases are rarely interconnected and information reuse across sites is limited. Semantic Web standards offer a possible solution in the form of an agreed-upon data model and set of syntaxes, as well as metalanguages for publishing schema-level information, offering a highly interoperable means of publishing and interlinking structured data on the Web. Thanks to the Linked Data community, an unprecedented lode of such data has now been published on the Web -- by individuals, academia, communities, corporations and governmental organisations alike -- on a medley of often overlapping topics. This new publishing paradigm has opened up a range of new and interesting research topics with respect to how this emergent "Web of Data" can be harnessed and exploited by consumers. Indeed, although Semantic Web standards theoretically enable a high level of interoperability, heterogeneity still poses a significant obstacle when consuming this information: in particular, publishers may describe analogous information using different terminology, or may assign different identifiers to the same referents. Consumers must also overcome the classical challenges of processing Web data sourced from multitudinous and unvetted providers: primarily, scalability and noise. In this thesis, we look at tackling the problem of heterogeneity with respect to consuming large-scale corpora of Linked Data aggregated from millions of sources on the Web. To this end, we design bespoke algorithms -- in particular, based on the Semantic Web standards and traditional Information Retrieval techniques -- which leverage the declarative schemata (a.k.a. terminology) and various statistical measures to help smooth out the heterogeneity of such Linked Data corpora in a scalable and robust manner.
All of our methods are distributed over a cluster of commodity hardware, which typically allows for enhancing performance and/or scale by adding more machines. We first present a distributed crawler for collecting a generic Linked Data corpus from millions of sources; we perform an open crawl to acquire an evaluation corpus for our thesis, consisting of 1.118 billion facts of information collected from 3.985 million individual documents hosted by 783 different domains. Thereafter, we present our distributed algorithm for performing a links-based analysis of the data-sources (documents) comprising the corpus, where the resultant ranks are used in subsequent chapters as an indication of the importance and trustworthiness of the information they contain. Next, we look at custom techniques for performing rule-based materialisation, leveraging RDFS and OWL semantics to infer new information, often using mappings -- provided by the publishers themselves -- to translate between different terminologies. Thereafter, we present a formal framework for incorporating metainformation -- relating to trust, provenance and data-quality -- into this inferencing procedure; in particular, we derive and track ranking values for facts based on the sources they originate from, later using them to repair identified noise (logical inconsistencies) in the data. Finally, we look at two methods for consolidating coreferent identifiers in the corpus, and we present an approach for discovering and repairing incorrect coreference through analysis of inconsistencies. Throughout the thesis, we empirically demonstrate our methods against our real-world Linked Data corpus, and on a cluster of nine machines.
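To make the idea of rule-based materialisation concrete, the following is a minimal, single-machine sketch and not the distributed algorithm developed in the thesis. It assumes triples are plain Python tuples of prefixed-name strings and applies only two standard RDFS entailment rules (rdfs7 for rdfs:subPropertyOf and rdfs9 for rdfs:subClassOf) until no new facts appear. The function name `materialise` and all `ex:` terms are hypothetical and chosen purely for illustration.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
SUBPROP = "rdfs:subPropertyOf"


def materialise(triples):
    """Naive fixpoint over two RDFS rules (rdfs7 and rdfs9).

    Illustrative only: a hypothetical helper, not the thesis's method.
    """
    closure = set(triples)
    while True:
        # Collect the schema-level (terminological) pairs seen so far.
        sub_class = {(s, o) for s, p, o in closure if p == SUBCLASS}
        sub_prop = {(s, o) for s, p, o in closure if p == SUBPROP}
        inferred = set()
        for s, p, o in closure:
            # rdfs9: (x rdf:type C), (C rdfs:subClassOf D) => (x rdf:type D)
            if p == RDF_TYPE:
                inferred.update((s, RDF_TYPE, d) for c, d in sub_class if c == o)
            # rdfs7: (x p y), (p rdfs:subPropertyOf q) => (x q y)
            inferred.update((s, q, o) for p1, q in sub_prop if p1 == p)
        if inferred <= closure:
            # Nothing new was derived, so the fixpoint has been reached.
            return closure
        closure |= inferred


# Hypothetical input: one publisher's terms mapped onto another vocabulary.
data = {
    ("ex:wrote", SUBPROP, "dc:creatorOf"),
    ("ex:Student", SUBCLASS, "ex:Person"),
    ("ex:Person", SUBCLASS, "ex:Agent"),
    ("ex:alice", RDF_TYPE, "ex:Student"),
    ("ex:alice", "ex:wrote", "ex:thesis1"),
}
result = materialise(data)
```

Here the sub-property mapping lets a consumer who only queries for `dc:creatorOf` still find the fact published with `ex:wrote`, which is the kind of terminology translation the abstract describes. A scalable version would avoid rescanning the full closure on every pass; this sketch favours brevity over efficiency.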