Distributed dataflow processing of large RDF graphs
As part of the big data world, RDF, the graph-based data model of the Semantic Web, is growing in use. Consequently, the size of available RDF data is increasing and massive datasets are becoming commonplace. Nevertheless, when analysing large RDF datasets, users are left mainly with two options: using SPARQL, the main query language for RDF, or using an existing big data language that is not specific to RDF. This thesis argues that each of these two approaches has its own limitations. SPARQL is costly to compute, and complex analyses can be hard to express in a purely declarative SPARQL query. On the other hand, using existing big data languages designed for tabular data commonly results in verbose, unreadable, and sometimes inefficient scripts. This dissertation therefore sets out to define a dataflow language specifically designed to process large RDF data on top of distributed platforms. In developing a dataflow language, this dissertation discusses three components: (i) the data query language, (ii) the underlying data model, and (iii) the physical arrangement of the underlying distributed data. On data models, we introduce RDF.co, a data model in which the value of each expression is a pair of a binding and a graph. In contrast to the SPARQL algebra, RDF.co is fully composable. We provide a formal definition of the syntax and semantics of the data model, characterise its expressivity in comparison to SPARQL, and present a number of its unique algebraic properties. The algebraic properties of RDF.co amount to a unique study of the relations between triple patterns. These properties, when interpreted as rewriting rules, provide the theoretical foundation needed to apply cost-based query optimisation. We present rules for eliminating, inserting, and pushing down triple patterns. On the physical arrangement of the underlying distributed data, this dissertation focuses on graph partitioning. We define an RDF graph partitioning approach using pattern matching.
This use of pattern matching allows query answering over partitioned graphs to be mapped to the well-studied problem of view-based query answering. Our experiments show that using pattern matching to guide graph partitioning makes it possible to leverage knowledge that might be available about the data or the task at hand to improve query answering time. On data query languages, this dissertation describes SYRql, a dataflow language for large-scale processing of RDF data. We describe an implementation of SYRql on top of the MapReduce platform. The SYRql implementation utilises the underlying data model and the partitioning of the data. Our experiments show that SYRql performance for query answering is comparable to that of other well-established big data languages such as IBM Jaql and Apache Pig Latin.
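
The idea of partitioning an RDF graph by pattern matching can be illustrated with a minimal sketch. This is not the partitioning algorithm defined in the thesis; the `matches` and `partition` helpers, the use of `None` as a wildcard, and the sample data are all assumptions made for illustration. Each named pattern behaves like a simple view over the graph: triples that match it are placed in the partition of that name, and triples that match no pattern fall into a `default` partition.

```python
from collections import defaultdict

# A triple is (subject, predicate, object).
# A pattern has the same shape, with None acting as a wildcard (assumption).
def matches(pattern, triple):
    return all(p is None or p == t for p, t in zip(pattern, triple))

def partition(triples, patterns):
    """Place each triple in every partition whose pattern it matches.

    Triples that match no pattern go to a 'default' partition.
    """
    parts = defaultdict(list)
    for triple in triples:
        hit = False
        for name, pattern in patterns.items():
            if matches(pattern, triple):
                parts[name].append(triple)
                hit = True
        if not hit:
            parts["default"].append(triple)
    return dict(parts)

# Hypothetical sample data.
triples = [
    ("alice", "knows", "bob"),
    ("alice", "name", "Alice"),
    ("bob", "knows", "carol"),
    ("bob", "age", "42"),
]
patterns = {
    "social": (None, "knows", None),
    "names": (None, "name", None),
}
parts = partition(triples, patterns)
```

Under these assumptions, a query that only asks for `knows` relations could be answered from the `social` partition alone, which is the intuition behind treating partitions as views.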