A deep learning approach to genomics data for population scale clustering and ethnicity prediction
Date
2017-05-28
Author
Karim, Md. Rezaul
Zappa, Achille
Sahay, Ratnesh
Rebholz-Schuhmann, Dietrich
Recommended Citation
Karim, Md. Rezaul, Zappa, Achille, Sahay, Ratnesh, & Rebholz-Schuhmann, Dietrich (2017). A Deep Learning Approach to Genomics Data for Population Scale Clustering and Ethnicity Prediction. Paper presented at the Proceedings of the ESWC workshop on Semantic Web solutions for large-scale biomedical data analytics (SeWeBMeDA), Portoroz, Slovenia, May 28, 2017.
Abstract
Understanding variations in genome sequences helps us identify
people who are predisposed to common diseases, solve rare diseases, and find
the population group to which an individual belongs within a larger population.
Although classical machine learning techniques allow researchers to identify groups
or clusters of related variables, the accuracy and effectiveness of these methods diminish
for large, high-dimensional datasets such as the whole human genome. On the other hand,
deep learning (DL) can form better representations of large-scale datasets and build models
that learn these representations extensively. Furthermore, Semantic Web (SW)
technologies have already acted as useful adaptors in life sciences research for large-scale data
integration and querying. Thus, standardized public data created using SW plays an
increasingly important role in life sciences research. In this paper, we propose a novel and
scalable genomic data analysis approach for population-scale clustering and for predicting
geographic ethnicity using SW and DL-based techniques. We used genotype data from
the 1000 Genomes Project, obtained by whole-genome sequencing of
2504 individuals and comprising 84 million variants across 26 ethnic origins.
Experimental results in terms of accuracy and scalability show the effectiveness of our approach and its
superiority over the state of the art. In particular, our deep-learning-based
analytics technique, using classification and clustering algorithms, can predict and group
targeted populations with a prediction accuracy of 98% and an adjusted Rand index (ARI) of 0.92, respectively.
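
For readers unfamiliar with the ARI figure quoted above, the following is a minimal Python sketch of how an adjusted Rand index can be computed from two labelings of the same individuals. It is not the authors' evaluation code. The function name `adjusted_rand_index` and the toy population labels are assumptions made purely for illustration. The ARI counts pairs of samples that are grouped together in both labelings and corrects that count for chance agreement, so that 1.0 means identical groupings (up to renaming of clusters) and values near 0 mean agreement no better than random.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same samples.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same samples")

    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        # Fewer than two samples: the groupings agree trivially.
        return 1.0

    # Contingency counts: samples sharing a (true group, predicted cluster) pair.
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of sample pairs placed together in both, in true, and in predicted.
    together_both = sum(comb(c, 2) for c in cell_counts.values())
    together_true = sum(comb(c, 2) for c in true_counts.values())
    together_pred = sum(comb(c, 2) for c in pred_counts.values())

    # Expected agreement under random labeling with the same group sizes.
    expected = together_true * together_pred / total_pairs
    maximum = 0.5 * (together_true + together_pred)

    if maximum == expected:
        # Degenerate case, e.g. every sample in a single group in both labelings.
        return 1.0
    return (together_both - expected) / (maximum - expected)


# Toy example with made-up labels (not data from the 1000 Genomes Project).
true_groups = ["GBR", "GBR", "GBR", "JPT", "JPT", "JPT"]
clusters = [0, 0, 1, 1, 1, 1]
score = adjusted_rand_index(true_groups, clusters)
print(f"ARI = {score:.3f}")
```

Because the index depends only on which samples are grouped together, cluster identifiers do not need to match the population names. This is why it suits the evaluation of unsupervised clustering against known population labels.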