Efficient path query approaches over distributed linked data
MetadataShow full item record
This item's downloads: 99 (view details)
In the past decade, graphs have become the defacto data model in some of the popular application domains such as bioinformatics, social, and road networks. One of the most used variants of the graph model is the labeled graph where vertices and edges are given names. In the con- text of Semantic Web technologies, this graph representation is called Linked Data, expressed using RDF (Resource Description Format), where entities are given their names with Uniform Resource Identifiers (URIs) which uniquely identify each entity. A plethora of such information, corresponding to various domains, exists over the Web and is exposed as knowledge bases. Some of the prominent examples of such knowledge bases or datasets are YAGO, DBPedia and Bio2RDF. Things or entities described within these datasets can be interconnected within a single graph or these connections may lead to different datasets. This large collection of interlinked and distributed data over the web has created new chal- lenges in the context of data management, data integration and knowledge discovery. One of the key challenges is querying and analysing relevant information from such large-scale knowledge bases. For instance, in the context of graph analysis, navigational queries are at the basis of core tasks and processes. Several approaches have been proposed to perform path queries, with some supporting declarative queries, while others are based on imperative constructs. More- over, some approaches work in a distributed manner while others in a central way. For instance, SPARQL1.1, which is a declarative query language, introduced a feature called Property Paths (PPs). Using PPs, a user can check the existence of paths between two entities within a single graph. However, this variant is not very expressive, does not capture some of the important properties of linked data, and has performance issues when dealing with large-scale data. In order to exploit the full potential of linked data, the standard declarative query language of SPARQL needs (1) to be equipped with expressive features to capture some important graph analytical tasks and (2) to perform better even for large-scale graphs. We propose an extension to SPARQL1.1 property paths which we believe complements the current version of PPs with analytical features in a more expressive and robust manner. The algorithm proposed in our extension is a relatively straightforward solution but performs efficiently compared to tailored solutions in the literature. Our solution allows users to compute k shortest paths matching a property expression between two nodes. While the constant growth of linked data on the web in various domains has led to a challenging environment for data processing, it has also justified the need for novel approaches to handle and analyse such big data. Ecosystems such as Hadoop or Pregel have become defacto standards for such tasks and are capable of navigating the path queries in a distributed manner. However, these systems are developed for customised data distribution within a controlled environment, and in the context of Linked Data such systems are not considered purely heterogeneous. In our second major contribution in this thesis, considering the limitations and management efforts to handle homogeneous systems, we propose different solutions which we believe are easy to adopt and scalable. They do not require complicated management settings when the data is either customized in a controlled environment or even if the data is accessible publicly in a heterogeneous environment. Our first implementation towards distributed path query processing is FedS, which we pro- pose as a P2P solution. However, our solution in contrast with the core P2P architecture is of a hybrid type. Generally, queries in P2P networks are blindly forwarded from node to node, and also a loose guarantee of resource discovery in P2P network can make it possible not to find a resource node although it does exist. We introduced the concept of source selection within the P2P network, which reduces the number of query requests by selecting the relevant source nodes. Our second implementation for distributed architecture is QPPDS, which is an index-based path query engine with the concept of finding paths within a heterogeneous environment. We proposed a source selection mechanism based on a pre-computed index. We also provide a dataset, exposed as a SPARQL endpoint, for the paths calculated by QPPDS. This dataset contains the statistical and provenance information for those paths. We provide a comprehensive evaluation of this approach on real-world life sciences data with different execution strategies. Our third implementation is a cache-based and index-free distributed path query engine called DpcLD, which computes the paths distributed within a homogeneous environment. We provide extensive evaluations on both synthetic and real-world data. We evaluate DpcLD not only against the distributed approaches but also do a comparison with the centralized approach and show trade-offs in different contexts. In this thesis, we proposed different approaches to address the pathfinding task, especially in a distributed environment. From the individual contributions described above, the overall conclusion of this work is that graph management and distribution have a strong impact on the efficiency and accuracy of pathfinding, and therefore that different solutions are required depending on their characteristics.