Cataloguing and linking publicly available biomedical SPARQL endpoints for federation - addressing aPosteriori data integration

Hasnain, Syed Muhammad Ali

dc.contributor.advisor	Schuhmann, Dietrich Rebholz
dc.contributor.author	Hasnain, Syed Muhammad Ali
dc.date.accessioned	2017-05-15T13:20:49Z
dc.date.available	2017-05-15T13:20:49Z
dc.date.issued	2017-05-12
dc.identifier.uri	http://hdl.handle.net/10379/6518
dc.description.abstract	In recent years, the increasing adoption of Open Data initiatives and Linked Data principles has led to the creation of a globally distributed space of Linked Data that covers domains such as Government, Libraries, Life Sciences, Media, Geography and the Social Web. Approaches that conceive of this data space as a huge distributed database and enable the execution of declarative queries over it hold enormous potential; they allow users to benefit from a virtually unbounded set of up-to-date data. As a consequence, several research groups have started to study such approaches. The Life Sciences domain has been one of the early adopters of Linked Data, and at present a considerable portion of the Linked Open Data cloud comprises datasets from Life Sciences Linked Open Data, known as LS-LOD. Although the publication of datasets as RDF is a necessary step towards unified querying of biological datasets, it is not enough to achieve the interoperability needed for a queryable Web of Life Sciences data. This interoperability can be achieved either by “a priori integration”, which ensures that multiple datasets use the same vocabularies and ontologies, or by “a posteriori integration”, which uses mapping rules that change the topology of graphs so that integrated queries become possible. “A posteriori integration” of Biomedical and Life Science data sources is the topic of this thesis. This dissertation first provides an analysis of freely and openly available data sources (SPARQL endpoints). Public SPARQL endpoints were analysed with two questions in mind: (i) What is the content of a public SPARQL endpoint? and (ii) How self-descriptive are these endpoints? To analyse public SPARQL endpoints, we defined a set of self-descriptive SPARQL queries.
After this analysis, we introduce the notion of Autonomous Resource Discovery and Indexing (ARDI) for facilitating “a posteriori integration” of Biomedical and Life Science data sources. In particular, we introduce a Cataloguing and Linking mechanism that enables us to formally query Biomedical and Life Sciences Linked Open Data on the World Wide Web (WWW). As of 31st March 2016, ARDI consists of 263,731 triples representing 12,658 distinct classes, 1,792 distinct properties and 13,027 distinct Orphan classes catalogued from 137 public SPARQL endpoints. Based on these Cataloguing and Linking approaches, we propose BioFed, a federated query processing engine for Life Sciences Linked Open Data. BioFed offers a single point of access to distributed Life Science data, which enables scientists to access data from reliable sources without extensive expertise in SPARQL query formulation. BioFed federates SPARQL queries over more than 137 public SPARQL endpoints. After demonstrating ARDI and its practical applications, this dissertation presents the Linked Biomedical Dataspace (LBDS), which enables the semantically enriched representation, exposure, interconnection, querying and browsing of Biomedical data and knowledge in a standardised and homogenised way. We provide three practical scenarios, referred to as workflows, for using the proposed LBDS. We also list the Lessons Learned and Recommendations for developing the different components of LBDS, as we believe these insights will be useful for Linked Data practitioners and researchers working on topics similar to those covered in this thesis.	en_IE
dc.rights	Attribution-NonCommercial-NoDerivs 3.0 Ireland
dc.rights.uri	https://creativecommons.org/licenses/by-nc-nd/3.0/ie/
dc.subject	SPARQL	en_IE
dc.subject	Semantic web	en_IE
dc.subject	Linked open data	en_IE
dc.subject	Linked biomedical dataspace	en_IE
dc.subject	Federated query processing	en_IE
dc.subject	Biomedical data	en_IE
dc.subject	aPosteriori integration	en_IE
dc.title	Cataloguing and linking publicly available biomedical SPARQL endpoints for federation - addressing aPosteriori data integration	en_IE
dc.type	Thesis	en_IE
dc.local.note	SPARQL Endpoints Federation addressing “a posteriori integration” using mapping rules that change the topology of graphs such that integrated queries become possible in Biomedical and Life Science data sources.	en_IE
dc.local.final	Yes	en_IE
nui.item.downloads	17701

Files in this item

Name: license.txt
Size: 5.659Kb
Format: Text file

Name: SyedMuhammadAliHasnain_PhD_The ...
Size: 7.223Mb
Format: PDF

This item appears in the following Collection(s)

University of Galway Theses (PhD Theses)

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 Ireland