Graph Analytics: the 3 C's and practical applications
The rave is about graph networks now. Thomson Reuters has a key initiative in Big Open Linked Data (BOLD) with DataFusion as a key RDF triples store. DataFusion functions as a data lake ingesting data from structured and unstructured sources, and linking them by PermIDs and predicates.
Once the graphs are set up, the next question begs - what are the analytical tools to derive insights from them?
There are 3 broad categories of graph analytical techniques with the following 'C' purposes.
1. Centralities - to identify root causes of items or events
2. Connectedness - to identify connections between 2 or more items
3. Clustering - to identify groups of communities that revolve around certain themes
There are various analytical techniques serving these purposes. These are described below.
1. PageRank :
This is the well-known web search algorithm named after Google's founder Larry Page. PageRank identifies the central or most influential website (or nodes in graphs) through the number of nodes that emanates or are emanated from it. It effectively solves a fixed point equilibrium matrix, and assigns importance scores to each node. An important node is characterized by having other important nodes referring to it. The technique belongs to the centrality group of analytical technique.
The figure above is that for a PageRank graph. It shows some key central nodes which 'act' as the sources of the impact.
Examples of use are in fraud detection or anti-money laundering (AML) in the financial industries. Now every analytical technique works on a premise. In this case, PageRank is useful for fraud detection or AML on the premise that a 'dubious' node tries to do 'too much' and is 'too connected'. For example, many AMLs trails end up in a node emanating numerous small payments.
2. Shortest path algorithm (most popular Dijkstra's Algorithm) :
Another technique is the shortest path algorithm. This seeks to find the shortest or most cost effective path between two nodes and belongs to the connectedness category of graph analytics. This is a classical problem and much studied in operations research.
A popular algorithm is Dijkstra's Algorithm.
In the figure above, the shortest path from B to A is computed from the various possible node paths. In DataFusion, plans are underway to weigh the relationships between entities, and compute the closest distance between the entities. This analytical technique is useful to find how a source entity (whether events, companies or persons) may impact another target entity. If this 'shortest distance' is 'long enough', its impact may be dissipated before reaching the target. To be used effectively, the link relationships need to be 'assessed' with appropriate weights for closer types of relationships.
3. K-means clustering
k-means clustering is an unsupervised learning technique to identify and cluster n nodes based on k features into a specified number of groups. It belongs intuitively to the clustering category. An example of its use is for financial portfolio recommendations in robo-advisory. A group of investors can be clustered according to their documented preferences or features like income, social groups, risk aversion etc. These clusters are then matched with suitable portfolio profiles. Another application is to identify node outliers for filtering terrorists/ fraud entities in KYC and compliance. This premises on the these entities having unusual or peculiar characteristics which are flagged out in the clustering.
4. K-nn graphs
A similar sounding technique to k-means clustering is k-nn graphs. Instead of a holistic view on all the graph nodes as in K-means, K-nn resolves the connectedness of a particular node. It does so by building the k-nearest neighbors with a certain node as a focus. Examples of algorithms include brute force graph building, NN-Descent graph building, fast online graph building, LSH SuperBit graph building and online graph building. This analytical technique can be used in financial applications to identify which (closest) nodes/ entities/ risk events are most likely to impact a certain node of interest. For example, suppose a APPL.O stock. A k-nn graph of neighbours is built around it indicating the most relevant news, entities, events etc that most likely impact its future returns.
5. Correlation clustering algorithm
A correlation clustering algorithm works on a Markov matrix and through a series of transition steps, identifies an equilibrium matrix. It is similar to k-means clustering but the number of clusters is not specified in advance. The clusters are chosen to either maximize agreements (sum of positive edge weights within a cluster plus the absolute value of the sum of negative edge weights between clusters) or minimize disagreements (absolute value of the sum of negative edge weights within a cluster plus the sum of positive edge weights across clusters). A use case is in credit risk management to identify correlated risk clusters. The identification of correlated risk clusters allows the identification of otherwise multiple abstract factors to economic factors relevant for risk analysis. Stress testing may then be done on these risk clusters.
In the figure above, the nodes represent portfolio positions which have clustered together based on industry factors, interest rate factors and Fama-French factor of Price to book ratio.
6. Bayesian network causality
Much has been written on Bayesian causality. In most graphs, the links represent correlations or relationships among entities. In Bayesian networks, causality is instead modeled through the application of Bayes laws for conditional probabilities. Some very interesting references include the Coherent Stress testing and the Portfolio management under stress: A Bayesian net approachfrom my respected Professor Ricardo Rebonato. Research has been done on credit risk in a networked economy from Profs Cossin and Schelhorn. This analytical tool can be categorised as a hybrid of the C's - centrality and connectedness. In many of the research work, the network effects are simulated in a connected manner. The most important node is frequently unearthed as part of the algorithm - centrality.
7. The bread and butter of graph analytics
The article is not complete without describing the bread and butter of graph analytics This includes:
1. Group aggregation queries over entities
2. First and nth depth level relationships
3. Links between two different types of entities
4. Partition of graphs - segmenting graphs for easy manipulation
Much of this is already achieved in the Data Explorer, the user interface of DataFusion. Some of these are also covered by SPARQL which is a query language much like SQL for relational databases. This allows for slice-and-dice operations like for aggregations - average, max and min operations.
8. Deep learning via Deep walk
A casual meeting with Adam Gibson, the author of the practical Deep Learning book led me to this. Deep learning can be applied to sparse graphical networks to make latent representations of the graphical relationships. This technique is presently used more in NLP and not so much in financial applications.
Many of the graph analytical techniques are developed based on networks. There is a subtle difference between networks and graphs, although in most cases, they are used synonymously. Networks are graphs, but graphs are not networks necessarily. Many network analytics rest on the solution of large sparse matrices. This will be touched upon in a separate blog post.