Graph-based diffusion methods for node classification and link prediction
MetadataShow full item record
This item's downloads: 64 (view details)
Graphs are common representation tools to organize information from heterogeneous sources. They have been applied in various domains such as the scientific, engineering, political, and business sectors. The large volume of graph data is now becoming a surging research challenge for building highly accurate predictive models. One of the main motivations behind the interest in graphs is due to their structure, since it describes how information spreads through the nodes. In turn, this natural property of graphs makes them ideal for designing prediction models based on spreading functions, which can activate changes in the connections and its nodes. Link Prediction (LP) is one of the fundamental problems in graphs. It is the problem of predicting associations between a pair of nodes. Many LP algorithms addressed in the scientific literature evaluate them as a classification task or a ranking problem. Only a few studies have considered the impacts of diffusion in graphs, and none of the works on diffusion-based methods has investigated link prediction in graphs with heterogeneous nodes. Here, we proposed a 2-layered graph framework to combine two different graphs and analyzed diffusion algorithms such as PageRank, Katz, and heat diffusion model. We further proposed a novel diffusion method in a 2-layered graph framework by integrating matrix factorization with heat diffusion. Our results show the effectiveness of this integration by computing weighted (i.e., ranked) predictions of initially unknown links between two disjoint nodes. A prominent problem in the study of diffusion in graphs is that not all kinds of diffusion are the same. Some network structures support long-range diffusion, whereas others support short-range diffusion. Long-range diffusion has been largely explored in applications such as viral marketing, influence maximization, political campaigning, or ordinary label propagation, especially through graph-based semi-supervised machine learning. Conversely, short-range diffusion is less explored. If we consider the case of a real-world network where nodes are shared across different layers i.e., multiplex network, long-range diffusion might not work and result in node miss-classification. We proposed a novel boundary-based heat diffusion (BHD) method, which has the flexibility to diffuse the information step by step due to its time-dependent property and guarantees a closed-form solution. We evaluated BHD models on different types of real-world networks for a node classification task. We took the same inspiration from a semi-supervised machine learning approach by using a small amount of labeled data and a large number of unlabeled data to classify the nodes. For this task, we applied BHD in (i) multiplex network; (ii) homogeneous network with homophilic labels; (iii) homogeneous network with heterophilic labels; and (iv) homogeneous network with mixed labels. Furthermore, we extended our BHD approach in complex data for semi-supervised graph regression problems to predict the real values of the node labels using fewer labeled data. Experiments from business, biomedical, physical, and social domain data show that the boundary-based heat diffusion method can effectively outperform the top state of the art methods.