Methods for defining dynamic online communities and community detection in fast-paced social media streams
Microblogging social media focuses on fast, open, real-time communication using short messages between users and their followers. Twitter is currently one of the largest and most widely known microblogging online social networks (OSNs) in the world, with more than 330 million monthly active users as of December 2017. Moreover, an average of 500 million Tweets (short messages) per day are generated within the service. Microblogging social media generate large amounts of content, and community-finding techniques are a suitable way to organise it. However, a fundamental challenge in the community detection literature is the lack of a single agreed definition of user community, which makes evaluating and interpreting algorithms difficult. Therefore, in this thesis, two types of user community definition are adopted and investigated for microblogging: functional and structural definitions. A functional community groups its users by a common independent social function, e.g. fans of the same football team, while a structural community is defined exclusively by the connectivity of its members in a network, e.g. by modularity. In this work, functional definitions are built and characterised to be used as user-labelled ground truth using eight types of social functions from Twitter interaction networks. Afterwards, these ground-truth functional communities are evaluated, in static and dynamic scenarios, against thirteen popular structural community definitions from the literature. The goodness, robustness and sensitivity of these structural community definitions for detecting the functional ground truth under different perturbation strategies are investigated. The proposed evaluation is carried out using five different Twitter datasets captured over different periods of time. The results of the study show that definitions based on internal and mixed connectivity, e.g. Triangle Participation Ratio, Fraction Over Median Degree or Conductance, work best for the Twitter use case and are very robust.
On the other hand, other scores such as Modularity are limited and do not perform well due to the sparsity and noise of microblogging data. Furthermore, using user activity as a basis to refine communities into their active hotspots further improves the performance of community detection in microblogging. It is demonstrated in this work that standard community detection algorithms are challenged by the fast-paced dynamics and link sparsity of microblogging data. Therefore, it is argued that temporal characteristics must be considered for community detection methods in microblogging.
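To make the structural scores named in the abstract more concrete, the following is a minimal Python sketch of three of them as they are commonly defined in the community detection literature: Conductance (boundary edges divided by the total degree of the community), Triangle Participation Ratio (fraction of members that sit in at least one triangle inside the community) and Fraction Over Median Degree (fraction of members whose internal degree exceeds the median degree of the whole graph). It assumes an undirected, unweighted interaction graph; the function names and the toy edge list are hypothetical illustrations and are not taken from the thesis, which may use different variants or implementations.

```python
from statistics import median


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbours)}, ignoring self-loops."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def conductance(adj, community):
    """Boundary edges over community volume: c_S / (2 * m_S + c_S). Lower is better."""
    members = set(community)
    internal_ends = 0  # each internal edge is counted twice, i.e. 2 * m_S
    boundary = 0       # edges with exactly one endpoint in the community, c_S
    for u in members:
        for v in adj.get(u, ()):
            if v in members:
                internal_ends += 1
            else:
                boundary += 1
    volume = internal_ends + boundary
    return boundary / volume if volume else 0.0


def triangle_participation_ratio(adj, community):
    """Fraction of members that belong to a triangle whose nodes are all members."""
    members = set(community)
    if not members:
        return 0.0
    in_triangle = 0
    for u in members:
        inner = adj.get(u, set()) & members
        # u closes a triangle if two of its internal neighbours are linked.
        if any(adj[v] & inner for v in inner):
            in_triangle += 1
    return in_triangle / len(members)


def fraction_over_median_degree(adj, community):
    """Fraction of members whose internal degree exceeds the graph-wide median degree."""
    members = set(community)
    if not members:
        return 0.0
    d_m = median(len(neighbours) for neighbours in adj.values())
    over = sum(1 for u in members if len(adj.get(u, set()) & members) > d_m)
    return over / len(members)


# Hypothetical toy interaction graph: a triangle (a, b, c) with a short tail (d, e).
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
toy_adj = build_adjacency(toy_edges)
toy_community = {"a", "b", "c"}

print(conductance(toy_adj, toy_community))
print(triangle_participation_ratio(toy_adj, toy_community))
print(fraction_over_median_degree(toy_adj, toy_community))
```

In this toy graph the community {a, b, c} has one boundary edge against a total member degree of seven, and all three members share a triangle, so it scores well on Conductance and Triangle Participation Ratio. No member has an internal degree above the graph median, which illustrates how the three scores can disagree on the same community.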