Distributed dataflow processing of large RDF graphs

Date: 2017-05-29
Author: Maali, Fadi
Abstract
As part of the big data world, RDF, the graph-based data model of the Semantic
Web, is growing in use. Consequently, the size of available RDF data is increasing
and massive datasets are becoming commonplace. Nevertheless, when analysing
large RDF datasets, users are left mainly with two options: using SPARQL, the
main query language for RDF, or using an existing non-RDF-specific big data
language. This thesis argues that each of these two approaches has its own limitations.
SPARQL is costly to evaluate, and complex analyses can be hard to
express in a purely declarative SPARQL query. On the other hand, using existing
big data languages designed for tabular data commonly results in verbose, unreadable,
and sometimes inefficient scripts. This dissertation therefore sets out to
define a dataflow language specifically designed to process large RDF data on
top of distributed platforms.
In developing a dataflow language, this dissertation discusses three components:
(i) the data query language, (ii) the underlying data model, and (iii) the physical
arrangement of the underlying distributed data.
On data models, we introduce RDF.co, a data model in which the value of each
expression is a pair of a binding and a graph. In contrast to the SPARQL
algebra, RDF.co is fully composable. We provide a formal definition of the syntax
and semantics of the data model, characterise its expressivity in comparison
to SPARQL, and present a number of its unique algebraic properties. These
properties constitute a unique study of the relations between triple patterns.
When interpreted as rewriting rules, they provide the theoretical
foundation needed to apply cost-based query optimisation. We present rules for
triple pattern elimination, insertion, and push-down.
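To illustrate the idea of an expression whose value pairs bindings with a graph, the following is a minimal Python sketch. It is not the formal RDF.co definition given in the thesis; the function name `match_pattern`, the `?`-prefixed variable convention, and the representation of triples as string tuples are all assumptions made for this example.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation: a triple is (subject, predicate, object),
# and a pattern term starting with "?" is a variable.
Triple = Tuple[str, str, str]
Binding = Dict[str, str]


def is_var(term: str) -> bool:
    """Return True if the pattern term is a variable such as '?s'."""
    return term.startswith("?")


def match_pattern(graph: Set[Triple], pattern: Triple) -> Tuple[List[Binding], Set[Triple]]:
    """Match one triple pattern against a graph.

    Returns a pair: the variable bindings of every match, and the
    subgraph made of the triples that matched.
    """
    bindings: List[Binding] = []
    matched: Set[Triple] = set()
    for triple in sorted(graph):  # sorted only to make the result order stable
        binding: Binding = {}
        ok = True
        for p_term, t_term in zip(pattern, triple):
            if is_var(p_term):
                # A variable used twice must bind to the same value.
                if binding.get(p_term, t_term) != t_term:
                    ok = False
                    break
                binding[p_term] = t_term
            elif p_term != t_term:
                ok = False
                break
        if ok:
            bindings.append(binding)
            matched.add(triple)
    return bindings, matched
```

Because the result carries a graph as well as bindings, it can be fed back into `match_pattern` as the input graph of a further pattern, which is the kind of composability the paragraph above refers to.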
On the physical arrangement of the underlying distributed data, this dissertation
focuses on graph partitioning. We define an RDF graph partitioning approach based
on pattern matching. This use of pattern matching allows query answering
over partitioned graphs to be mapped to the well-studied problem of view-based
query answering. Our experiments show that using pattern matching to guide graph
partitioning makes it possible to leverage knowledge that might be available about
the data or the task at hand to reduce query answering time.
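As a rough illustration of pattern-guided partitioning, the sketch below assigns each triple to the partition of the first pattern it matches. It is a simplification under stated assumptions, not the partitioning approach defined in the thesis: patterns are single triples with `None` as a wildcard, and the names `partition_by_patterns`, `matches`, and `REST` are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple

Triple = Tuple[str, str, str]
# Hypothetical pattern form: None in a position means "match anything".
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]

REST = "__rest__"  # partition for triples that match no pattern


def matches(pattern: Pattern, triple: Triple) -> bool:
    """Return True if every non-wildcard pattern term equals the triple term."""
    return all(p is None or p == t for p, t in zip(pattern, triple))


def partition_by_patterns(
    graph: Set[Triple], patterns: Dict[str, Pattern]
) -> Dict[str, Set[Triple]]:
    """Place each triple in the partition of the first pattern it matches."""
    partitions: Dict[str, Set[Triple]] = {name: set() for name in patterns}
    partitions[REST] = set()
    for triple in graph:
        for name, pattern in patterns.items():
            if matches(pattern, triple):
                partitions[name].add(triple)
                break
        else:
            # No pattern matched this triple.
            partitions[REST].add(triple)
    return partitions
```

In this simplified setting, each partition behaves like a materialised view over the graph: a query whose pattern coincides with a partition's pattern only needs to read that partition, which hints at the connection to view-based query answering mentioned above.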
On data query languages, this dissertation describes SYRql, a dataflow language
for large-scale processing of RDF data. We describe an implementation of SYRql
on top of the MapReduce platform. The implementation utilises the underlying
data model and the partitioning of the data. Our experiments show that SYRql
performance for query answering is comparable to that of other well-established
big data languages such as IBM Jaql and Apache Pig Latin.