Thumbnail for Maxim Panov: Link Prediction with Graph Neural Networks by ODS AI Ru

Maxim Panov: Link Prediction with Graph Neural Networks

ODS AI Ru

29m 56s3,436 words~18 min read
AI audio transcription
Transcript source

AI audio transcription

This transcript was generated from the video's audio because no usable YouTube caption track was available. The transcript below is server-rendered so it can be read, searched, cited, and shared without opening the original YouTube player.

Timestamped outline
Pull quotes
[0:02]Dear colleagues, my name is Maxim Panov, and today I'm going to tell you about link prediction with graph neural networks.
[0:02]In particular, I will focus on some applications of the developed methods to banking area.
[0:02]This is joint work with Valentina Shumovskaia, Kirill Fedianin, Ivan Sukharev and Dmitry Berestnev.
[0:02]In the talk I will consider a standard graph, which I will represent by adjacency matrix.
Use this transcript
Related transcript hubs

[0:02]Dear colleagues, my name is Maxim Panov, and today I'm going to tell you about link prediction with graph neural networks. In particular, I will focus on some applications of the developed methods to banking area. This is joint work with Valentina Shumovskaia, Kirill Fedianin, Ivan Sukharev and Dmitry Berestnev. In the talk I will consider a standard graph, which I will represent by adjacency matrix. A, and an element of this matrix A J is a weight of an edge between vertices I and J. We also usually assume that some features for nodes exist, so some feature matrix X, for which each row XI is a representation of this vertex in some D dimensional space.

[0:53]There exist many common tasks of data analysis on graphs. Three most common probably are classification of a vertex of an edge of the whole graph, clustering or grouping of graph nodes into into some meaningful groups. And link prediction, this is a third problem which is usually considered and it targets the problem of detecting whether there is an edge between these two vertices or not. In this talk, I'm going to focus on so-called graph neural networks or neural networks designed to process graph data.

[1:40]In tabular data, for example, you have a representation of data points by fixed length vectors. In images, each pixel has the same number of neighbors. Graphs are slightly different, they have certain regularity of the structure by which I mean that any node can have any number of neighbors. So, the generalization of neural networks to graphs is slightly non-trivial. Currently, the standard approach for the construction of a layer of a graph network is so-called graph convolutional layer. If you have HI embedding vectors of node I and you consider some trainable weight matrices W0 and W1, then what graph convolutional layer does is simply sums the transformed representations for the neighboring nodes of the target node and adds some other transformation of a target node itself. After that you can do element-wise non-linearity for for the obtained vector, and it gives you the new transformed embedding over vector. However, graph convolutional networks have certain limitations. Importantly, you just average or sum embedding vectors over the neighboring nodes in the convolution operation. In the real world application, you might assume that your neighboring nodes might have unequal contribution to your target node, and it is desired to somehow incorporate such a knowledge into the convolution operation. The particular approach to do that was introduced in Iclear paper 2018 by Velickovic and Coors, where they suggested to add trainable coefficients Alpha IJ, which weigh the nodes of which weight basically the edges of a graph or alternatively which weights the neighbors of a target node and convolution operation. These coefficients are usually computed by a separate neural network which takes as an input embedding vectors or probably feature vectors for the pair of nodes.

[4:30]In classical applications and classical application of graph neural network, usually several convolutional or probably graph attention layers are stacked and then you obtain an embedding for each node as an output of your network. Then, depending on the problem you solve, you can do different operation with this embedding. If you consider node classification, you just take a particular node, apply soft max function to the representation of it and solve the classification tasks. If you want to classify the whole graph, you may, for example, average node embeddings for all the nodes in a graph. Finally, if you solve a link prediction problem, you, for example, may consider scalar product for the embedding of two target nodes and then map them to 01 segment by applying, for example, sigma function to give you a probability.

[5:37]Let us now switch to the particular project we are we were working on. So, this project about link prediction for the transactional data arises in the banking area. The project was performed in collaboration with Sberbank. Let us consider a graph where each node represents a client and an edge between two nodes exists if there was at least one transfer of money between two clients over some time period. On this illustration, I have numbers which represent a number of money transfers between the particular pair of clients. The bank is very interested to know which node have the strongest influence on the node zero as a target node. This is because the bank wants to know, for example, if some client fails to pay credit, whether it influences the ability of of its neighbors to also pay for the credit. Essentially, that's a very important task for a bank. So, looking on this plot we may assume that nodes four and five have the strongest influence of node zero because there were a lot of transfers between them. However, what about node six?

[7:09]It has many transfers with nodes four and five, but this node zero and six never transfer money to each other. However, it's still essential to assume that there exists some link between them. Basically, we want to understand how strong influence is the influence of each node on the target node and we want to reveal some hidden interactions like the one we just discussed. We were given by very interesting data. So, for each node, we had a time series of transactions, like you buy something by your card, so we had a time, amount and some other features for each transaction. And also we had an information about money transfers between pair of nodes like again currency, time, amount and so on. Of course, all the data was in depersonalized format. As a result, we had a very large graph of 86 millions of nodes and it had about four billion of edges. So, four billion pairs of clients had at least one transfer during five years. We understand that the problem we conceive is some kind of link prediction problem, but we need to somehow formalize it. So, informally, link prediction problem is usually stated as the problem of finding hidden links between nodes and probably you also may want to determine which connections are stronger than others. Usually, in the literature, validation is done for link prediction problem in the form of edge sampling. So, if you have some graph like this one, you may hide some edge, for example, this edge can be hidden, and consider a graph without it. And then, so you basically generate graphs with hidden edges and then try to predict them and if your model predicts this edges successfully, then you are doing good.

[9:43]However, for our project, more reasonable statement was different. We wanted basically to understand how stable our connections are. And that's why for the training data we considered a transactional graph over the year, for example. And then we can compute a transactional graph for next three months. And basically the goal, based on a historical data, to predict whether in the future, in the next three months, there will be at least one transaction between pair of nodes. We call such an approach out-of-time validation. There exist numerous classical approaches to doing link prediction in graphs. For example, you may just completely ignore the graph and consider just a pair of nodes, it has some features extracted, for example, from the transaction each client is doing, and just predict whether it will be a transfer between the pair of nodes in the future. Of course, it's not exactly about link prediction, but it's about the problem of finding whether there is a connection between a pair of clients in a bank. If we go to the standard link prediction approaches, you may consider some simple operations over the neighborhoods. For example, you may consider or compute how many neighbors are common for the pair of nodes and then use this score to rank all the nodes in a graph and try to find a specified threshold and if the score is more than some threshold then you report that there is a link. Also you may consider more complicated metrics like a ratio of common neighbors to all the neighbors to nodes have. There is also possible to consider some random walk approaches, like ones based on PageRank or you may look on probabilistic models such as, for example, preferential attachment.

[12:02]We did computations on the usage of these simple methods to our problem of link prediction. We considered both out-of-time validation and edge sampling.

[12:17]What we observed that two problems behave very differently, so that different methods score very different for out-of-time validation and for edge sampling. If you focus on out-of-time validations, so the validation which mostly consistent with the problem we considered then the best which we get is a simple method which called Adamik Adar statistic which gave us score of 0.65 approximately.

[12:59]However, classical models have many limitations. First of all, it's sort of hard to adapt them to to particular data, like you just have a formula, it either works for your data or not. So, you don't have much flexibility.

[13:22]Some of these methods are also not really scalable like approaches based on some probability methods. Finally, majority of these method don't leverage rich data we have. They consider fixed graph, they consider some neighborhood of it, but they don't look, for example, on the features you might have for your vertex or an edge.

[13:51]The natural approach can be to consider graph convolutional networks or graph attention network for each for link prediction. So, you may look on embeddings for each vertex, so let us let me remind you that in our case each vertex had a time series of transaction corresponding to it, and you may, for example, create some current neural network, process the corresponding time series and obtain some embedding of fixed dimension for this node. Then you can use any machine learning model for classification based on two vectors, for example, so you have a graph neural network, it processes embeddings, and finally outputs you two vectors, and then you can consider, for example, fully connected neural network or you may consider just a scalar product as I discussed previously, and and that's it. However, such an approach can't be used directly on really big graphs because if you process the whole graph with your graph neural network then it might be very costly. Of course, this problem was discussed in the literature before. One of the most popular approaches to construct graph neural network is so-called GraphSAGE. It's more like a framework than a model, and it suggests the following. So, first of all, you divide your graph in subgraphs. For example, if you consider some pair of nodes, you may consider a subgraph of certain width for these two nodes. And so, neighborhood subgraph, I mean. And you may train graph convolutional network which processes a subgraph and does the convolution operation in a certain way, by graph convolutional network or by graph attention layer. And so finally, as discussed before, you get embeddings for the target nodes and train some model on them.

[16:10]In the literature there was also suggested to do some modification, particularly for link prediction program. So, you do basically as in graph SAGE, so you divide in subgraph as well, but you also add structural labels as a feature. Structural labels basically look on the distance between two target nodes in a graph. Okay, it's slightly more complicated and you can look the details in the paper. Then you do graph convolutions as usual, but they also propose not just to consider the embeddings for two target nodes as in graph SAGE, but to consider so-called sort pooling. So, you have your subgraph, you have embeddings for each node in a subgraph and they suggest to sort these embeddings of these nodes by some sorting function and took top K of them and finally consider sort of a combination of convolutional and dense neural network over them. And this approach was successfully applied to a number of problems and you may look on the details of this article which was published at Nips.

[17:37]In our case, considering the prediction of future money transfers, we also need to consider several important peculiarities of the data. Most importantly, our graph is really huge, so we had to go we have to go for subgraphs and also it has a temporal nature and you need to process this time series both in edges and in in nodes in some way.

[18:12]To work with to work with large graph, we decided to follow the seal approach, but to simplify it, we also made a step back towards graph SAGE. So, instead of considering completed sort pooling operation, we again focused on two target nodes and their embeddings. And then we considered a very simple graph where we completely ignored all the time series corresponding to edges and basically just binarized edges, whether there was at least one transaction or not.

[19:00]And it already allowed us to improve. First of all, seal just plane seal as described in the original paper already gives a substantial improvement in rock ox score. But if you simplify it and consider the two seal approach, you get further improvement. So, basically, we think that it just helps to decrease over fitting.

[19:27]The next question is what can you do with with time series corresponding to edges? So, in fact in our data, we have a time series of transaction over the year and we want to predict the transaction in next three months. So, was will will be there at least one transaction or not? And what we suggest, we suggest to to take a simple recurrent neural network based on GRU and first train it to predict the probability of connections. So, you just consider the pair of nodes and transaction between them. You completely ignore all the graph and then you try to predict the probability of connection in the future. It's already gave us a very good model with ROC AUC score of 0.81. So, just looking of on time series of transaction between two nodes, you you make a much better model than you do with simple graph approaches. It's not very surprising because, of course, properties of this time series should be driving factor in our problem just because of its nature. However, it's interesting whether you can somehow combine recursive neural network and a graph and improve your results even further. And what we did, we decided to consider graph attention network but with very special attention mechanism. So, what you do, you process a time series between two nodes and your recursive neural networks output you a probability of connections. And we just reweight our graph, re-elevate each edge in this graph by this probability. So, if my recursive neural network output probability equal to 0.9, then I weight this edge by weight 0.9 and so on for other pairs of of nodes.

[21:54]So, finally, we have an input graph, we process it with recursive neural network, the pre-trained one, we get a weighted graph and then we apply seal on top of it.

[22:07]Importantly, you also may do additional training of all this pipeline together in order to recursive neural network further adapt to to the problem when it's also attached to a graph. And we obtained actually very good results.

[22:33]First of all, if you combine RNN and seal, you already have an improvement. But if you consider two seal, so simplified model, and then combined with recursive neural network, the results are very, very good. So, we get a ROC AUC score of almost 0.86.

[23:04]So, far our link prediction results and the developed approach were very promising. But we also were interested whether our connections which are revealed by the model are meaningful. And we decided to use the obtained link prediction model to improve the credit scoring.

[23:33]So, credit scoring is a problem of determining whether the person is going to pay for a credit in a future or is he or she going to fail it.

[23:51]And it can be solved by say recursive neural network trained on a transactional particular client or it can be also solved using a graph or finally, you may combine it with link prediction. And we suggested a very simple thing. So, you have an original graph, then we do a link prediction on it and again wait the graph with probabilities of a link between a particular pair of nodes. So, we have again sort of attention but based on link prediction. And then we apply a standard graph convolutional network on top of it to compute the score or probability for the credit scoring problem which is a binary classification problem. Here the result was not that impressive, but still we believe useful. So, just a recursive neural network, so the neural network which considers only a particular node and its features gave us ROC AUC of 0.85. If you apply a standard graph convolutional network, you get an improvement, but you can further improve if you combine a graph convolutional network with attention based on link prediction. Improvement may seem not very large, but if you multiplied by a number of clients in a bank, you may find out that money savings maybe might be very, very substantial. So, let me tell a few lessons learned from this study. First of all, you don't need to construct very complicated models, some more simple ones like two seal might be more useful and perform better than complicated models. If you have some very rich data, if you use it properly, then it helps. Also graphs are sparse and if you need to use some tools to process it which are adapted to it like we use pytorch geometric in our project and it worked amazingly. And also it's important to consider strong base lines, like we had a recursive neural network, which was simple, but strong enough and it outperformed majority of standard approaches straight away. This work can be found as a preprint on archive and it was just recently accepted to a journal track on data science and AI in Fintech at DCAA conference, Data Science and Advanced Analytics 2020. It will be further published also in a journal which corresponds to this journal track.

[27:10]On top of what was just discussed, I also want to discuss some things related to graph convolutional networks in general. First of all, the important problem currently is that graph neural networks are not really deep. They don't benefit from depths, so usually you have two, three, maximum four layers. In the other study we made with graph neural networks, we observe that, yes, you can consider three layers, but then your quality starts to degrade. And big challenge is to benefit from from deep structures in graph neural network. The second interesting thing is whether you need the non-linearity in graph neural networks. It was recently observed that if you consider just a linear network, so you consider several convolutions, but without any non-linearities, you may obtain the quality which is very comparable with best models. So, this SGC, simple graph convolutional network is on par with all the top approaches, but its training time is order of magnitude smaller. So, it's an open question whether you really need much of non-linearity and whether the benefits you obtain come from non-linearity in graph neural networks.

[28:42]Also, it is well known that attention mechanisms might be very brittle for graph convolutional networks. And it was also discussed at Newrips last year and construction of robust attention mechanisms is very interesting and important problem. And finally, it's very interesting whether you can do much theory for graph convolutional networks. Currently, it's all focused on so-called isomorphism tests, but it would be interesting to consider something more statistical in our opinion. So, to conclude, I would say that graph neural networks overall are very powerful and convenient instrument if you have attributes in your graph. And if you carefully work with your model structure, you can process very complex data. Still graph convolutional networks are not very well understood and I think there are many directions for future work in both practice and theory. So, thanks a lot for listening and if you have any question, you can write me via my email address. Thanks a lot and stay in touch.

Need another transcript?

Paste any YouTube URL to get a clean transcript in seconds.

Get a Transcript