Leveraging Wikipedia-based features for entity relatedness and recommendation
Entities such as people, locations, and organizations play a key role in natural language understanding. Most approaches to natural language processing tasks require a method to measure the relatedness between such entities. Is "Tom Cruise" more related to "Brad Pitt" than to "Steve Jobs"? A human can easily make this judgement using common sense and knowledge about these entities. A computer, however, would require an immense amount of world knowledge to reason about the semantic relatedness between such entities. Moreover, a human can draw on their knowledge to understand an entity, whereas a computer requires an algorithmic approach to process the background knowledge to do the same. Wikipedia is a rich source of background knowledge about millions of such entities. In this thesis, we introduce Wikipedia-based Distributional Semantics for Entity Relatedness (DiSER), which analyzes the semantics of an entity by its distribution in a high-dimensional concept space derived from Wikipedia. DiSER measures the semantic relatedness between two entities by quantifying the distance between the corresponding high-dimensional vectors. The DiSER model is built by considering only the manually linked entities provided in a corpus such as Wikipedia. Thus, it provides an unambiguous and more accurate distributional vector for an entity compared to existing approaches, which do not distinguish between an entity and its textual surface form. We evaluate the approach on a benchmark dataset that contains relative entity relatedness scores for 420 entity pairs. DiSER improves accuracy by more than 10% over state-of-the-art methods for computing entity relatedness. In order to provide a resource that can be used to obtain the related entities for a given entity, we construct a graph called the Entity Relatedness Graph (EnRG), where nodes represent Wikipedia entities and edges carry the relatedness scores.
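As an illustration of the distributional idea behind DiSER, the following minimal sketch represents each entity by how often it is explicitly linked in each Wikipedia article (the concept space) and compares two entities by cosine similarity. The toy `articles` data, the raw link-count weighting, the choice of cosine as the measure, and the function names `entity_vector` and `cosine` are all assumptions made for this example; the abstract does not specify the weighting scheme or distance function used in the thesis.

```python
import math


def entity_vector(entity, articles):
    """Build a sparse concept vector: article title -> number of times
    `entity` is explicitly linked in that article (assumed weighting)."""
    vec = {}
    for title, linked_entities in articles.items():
        count = linked_entities.count(entity)
        if count:
            vec[title] = float(count)
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus: each article lists the entities linked in it.
articles = {
    "Top Gun": ["Tom_Cruise", "Tom_Cruise"],
    "Interview with the Vampire": ["Tom_Cruise", "Brad_Pitt"],
    "Fight Club": ["Brad_Pitt"],
    "Apple Inc.": ["Steve_Jobs", "Steve_Jobs"],
}

tom = entity_vector("Tom_Cruise", articles)
brad = entity_vector("Brad_Pitt", articles)
steve = entity_vector("Steve_Jobs", articles)

print(cosine(tom, brad) > cosine(tom, steve))
```

Because the vectors are built from link targets rather than surface strings, an ambiguous mention never contributes to the wrong entity in this sketch, which mirrors the motivation given above for using only manually linked entities.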
Wikipedia contains more than 4 million entities, so a fully connected graph has roughly 16 trillion entity pairs, and the relatedness scores between them must be computed efficiently. We present the processing behind EnRG to efficiently compute the relatedness scores between Wikipedia entities. EnRG can be seen as an entity recommendation system similar to the entity explorer provided by commercial search engines. However, most current approaches make use of search-engine-specific features such as co-occurrence information in query logs and user-click logs. Therefore, only major companies that have a large user base and associated activities can build entity recommendation systems with the existing approaches. However, publicly available knowledge resources such as Wikipedia can also provide an immense amount of associativity information about millions of entities. We propose Wikipedia-based Features for Entity Recommendation (WiFER), which combines different features extracted from Wikipedia with DiSER-based relatedness scores. We evaluate EnRG and WiFER on a dataset of 4.5K search queries, where each query has around 10 related entities tagged by human experts. We investigate the contribution of different features and compare Wikipedia-based features with those extracted from proprietary data such as query logs and user activities. Since DiSER provides relatedness scores between Wikipedia entities, it can be used to compute the pairwise similarity between concepts in a distributional concept space built over Wikipedia entities. On this basis, we present Non-Orthogonal Explicit Semantic Analysis (NESA), which improves over the existing text relatedness model by considering correlation between explicit concepts. We compare NESA with several WordNet-based relatedness measures and other distributional semantic models against different gold standard datasets of word and text relatedness.
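To make the notion of correlation between explicit concepts concrete, the sketch below scores two concept vectors with a bilinear form a^T C b, where C holds pairwise concept relatedness (for example, scores of the kind DiSER produces). With C equal to the identity matrix, the concepts are treated as orthogonal and the score reduces to ordinary cosine similarity. The normalisation, the toy matrix values, and the function names `bilinear` and `relatedness` are assumptions for illustration, not the formulation used in the thesis.

```python
import math


def bilinear(a, C, b):
    """Compute a^T C b for dense vectors a, b and a square matrix C."""
    return sum(a[i] * C[i][j] * b[j]
               for i in range(len(a))
               for j in range(len(b)))


def relatedness(a, b, C):
    """Normalised relatedness under concept correlation matrix C.
    Assumes C is symmetric positive semi-definite, so the terms under
    the square root are non-negative."""
    denom = math.sqrt(bilinear(a, C, a) * bilinear(b, C, b))
    return bilinear(a, C, b) / denom if denom > 0.0 else 0.0


# Two texts that activate different, but related, concepts.
text_a = [1.0, 0.0, 0.0]
text_b = [0.0, 1.0, 0.0]

# Orthogonal concepts: no credit for related-but-different concepts.
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

# Hypothetical concept correlations: concepts 0 and 1 are strongly related.
correlated = [[1.0, 0.8, 0.0],
              [0.8, 1.0, 0.0],
              [0.0, 0.0, 1.0]]

print(relatedness(text_a, text_b, identity))
print(relatedness(text_a, text_b, correlated))
```

Under the orthogonality assumption the two texts share no concept and score zero, while the correlated matrix lets the relatedness between their concepts carry over to the texts.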
We perform experiments with different entity relatedness measures used in NESA, and show that NESA with DiSER outperforms state-of-the-art approaches. In order to demonstrate the use cases of the work presented in this thesis, we present several applications, including EnRG-UI, which is an entity recommendation system. EnRG-UI provides different functionalities to users for exploring related information about their favourite topics. We use DBpedia and the Yago ontology to obtain the different filters and facets which can be used to narrow down the search in EnRG-UI. Further, we present an approach to perform Medical Concept Resolution (MCR), which finds the most appropriate medical concept in the Unified Medical Language System (UMLS) for a specific natural language query (e.g., a diagnosis report). To rank the concept candidates, MCR calculates relatedness scores between the context around the mention in a query and the context in UMLS. We evaluate MCR on a gold standard dataset that contains 100 medical queries annotated by human experts, and show that MCR outperforms the state-of-the-art methods. We also present a Cross-Lingual Natural Language Querying (CroNL) approach to retrieve answers from a structured knowledge base for a natural language query in a language other than that of the knowledge base. CroNL uses a cross-lingual extension of our relatedness measure to calculate relatedness scores between terms appearing in the query and the properties in a knowledge base. We evaluate CroNL over 50 natural language queries in German. We show that our cross-lingual relatedness measure outperforms automatic-translation-based methods for cross-lingual NL querying over DBpedia.