Utilising Wikipedia for text mining applications

Qureshi, Muhammad Atif

dc.contributor.advisor	O'Riordan, Colm
dc.contributor.advisor	Pasi, Gabriella
dc.contributor.author	Qureshi, Muhammad Atif
dc.date.accessioned	2015-10-12T14:21:12Z
dc.date.available	2015-10-12T14:21:12Z
dc.date.issued	2015-10-08
dc.identifier.uri	http://hdl.handle.net/10379/5304
dc.description.abstract	The process whereby inferences are made from textual data is broadly referred to as text mining. To ensure the quality and effectiveness of the derived inferences, several approaches have been proposed for different text mining applications. Among these applications, classifying a piece of text into pre-defined classes using training data falls under supervised approaches, while arranging related documents or terms into clusters falls under unsupervised approaches. In both approaches, processing is undertaken at the level of documents to make sense of the text within those documents. Recent research efforts have begun exploring the role of knowledge bases in solving the various problems that arise in the domain of text mining. Of all the knowledge bases, Wikipedia, on account of being one of the largest human-curated online encyclopaedias, has proven to be one of the most valuable resources for dealing with these problems. However, previous Wikipedia-based research efforts have not taken both Wikipedia categories and Wikipedia articles together as a source of information. This thesis serves as a first step towards closing this gap, and throughout its contributions we show the effectiveness of the Wikipedia category-article structure for various text mining tasks. Wikipedia categories are organised taxonomically and serve as semantic tags for Wikipedia articles, which provides a strong abstraction and an expressive mode of knowledge representation. In this thesis, we explore the effectiveness of this mode of Wikipedia's expression (i.e., the category-article structure) via its application in the domains of text classification, subjectivity analysis (via a notion of "perspective" in news search), and keyword extraction.
First, we show the effectiveness of exploiting Wikipedia for two classification tasks: (1) classifying tweets as relevant or irrelevant to an entity or brand, and (2) classifying tweets into different topical dimensions, such as tweets related to the workplace, innovation, etc. To do so, we define the notion of relatedness between the text in a tweet and the information embedded within the Wikipedia category-article structure. Then, we present an application in the area of news search that uses the same notion of relatedness to show more information about each search result, highlighting the amount of perspective, or subjective bias, in each returned result towards a certain opinion, topical drift, etc. Finally, we present a keyword extraction strategy that uses community detection over the Wikipedia categories to discover related keywords arranged in different communities. The relationship between Wikipedia categories and articles is explored via a textual phrase matching framework whose starting point is the set of textual phrases that match Wikipedia articles' titles or redirects. The categories associated with each matched Wikipedia article are then extracted, and these Wikipedia categories are used to derive various structural measures, such as those relating to taxonomical depth and to the Wikipedia articles they contain. These measures are utilised in our proposed text classification, subjectivity analysis, and keyword extraction frameworks, and their performance is analysed via extensive experimental evaluations. These evaluations compare our approach with standard text mining approaches in the literature, and our Wikipedia framework, based on the category-article structure, outperforms the standard text mining techniques.	en_US
dc.rights	Attribution-NonCommercial-NoDerivs 3.0 Ireland
dc.rights.uri	https://creativecommons.org/licenses/by-nc-nd/3.0/ie/
dc.subject	Wikipedia	en_US
dc.subject	Category-article structure	en_US
dc.subject	Text mining	en_US
dc.subject	Information technology	en_US
dc.subject	Engineering & Informatics	en_US
dc.title	Utilising Wikipedia for text mining applications	en_US
dc.type	Thesis	en_US
dc.local.note	Making decisions from textual data is a complex undertaking that requires manual labour. The processing ability of computers has made this task easier, giving rise to the research field known as text mining. The work in this thesis aims to improve text mining by utilising knowledge in Wikipedia.	en_US
dc.local.final	Yes	en_US
nui.item.downloads	2038

Files in this item

Name:: license.txt
Size:: 5.659Kb
Format:: Text file

Name:: thesismain.pdf
Size:: 2.450Mb
Format:: PDF

This item appears in the following Collection(s)

University of Galway Theses (PhD Theses)

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 Ireland