Graphs are everywhere!

‘Graphs are everywhere!’ is the title of my first blogpost on my new site: DO IT analytics.nl

The reason for me to create this blog is to share the knowledge I gain during my career with anyone who’s interested, to inspire others to do more with analytics and, more importantly, to stimulate myself to take the time to explain my topics of interest. I want to share all kinds of analytics techniques, such as machine learning, visualization, advanced predictive analytics, data engineering and data modeling. Big data or small data, as long as it has a big impact and a high business value. I want to share this with enthusiasts in computer science, colleagues, current and future clients, and family & friends. Furthermore, I want to give others the possibility to write on my blog as a guest blogger, as long as the topic is in line with analytics. Let’s do it!

So… Graphs: not charts, but graphs as in Graph Theory.

Over the past year I worked with graph databases and I want to explain a bit what the buzz is about. Graph databases are used to do super-fast recommendations for shopping and movies, and for friend suggestions on Facebook and LinkedIn. They are also powerful enough to detect fraudulent transactions. The best-known examples are the Panama and Paradise Papers, where graph analysis helped reveal all kinds of tax havens of the rich elite. Google also uses graphs for its famous PageRank algorithm, which is largely the reason Google became the biggest search engine in the world. They don’t just count the number of words that match on a webpage. No, they also look at the network of webpages and how they are linked together. Every page gets a reputation, and part of that reputation is passed on from one webpage to the pages it links to. You can imagine it as a chain or network of information. This network is extremely large, and you have to store and organize the data differently to serve the best search results.

We propose PageRank, a method for computing a ranking for every web page based on the graph of the web. – Larry Page
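
The idea of reputation being passed along links can be made concrete with a few lines of Python. This is only a minimal sketch, not how Google actually implements it: the four-page web, the function name `pagerank`, the damping factor of 0.85 and the 50 iterations are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. `links` maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal reputation

    for _ in range(iterations):
        # Every page keeps a small base score, independent of its links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its reputation evenly over its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its reputation over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical mini web of four pages (invented for this example).
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
best = max(ranks, key=ranks.get)
print(best, round(ranks[best], 3))
```

In this toy web, page "c" receives links from three pages and ends up with the highest score, while page "d", which nobody links to, ends up with the lowest.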

The concept of a Graph is very simple: it consists solely of nodes and edges, that’s all! In the example of Google, a node is a webpage and an edge is a link between webpages. In this case the edge has a direction. A page (x) can link to another page (y), but y does not necessarily link back. So you can draw an arrow (edge) from page x to page y. This is what we call a Directed Graph, where the direction of the edges is important. As you can see in the image above, there are 10 nodes, so in this example 10 webpages. If you take a closer look at node 2 and count the links (edges), you can see it has the most incoming links. In PageRank not just the number of links is important, but also their weight. In total node 2 has 5 incoming links with weights of 10, 20, 15, 5 and 15. If you sum them up you get a total of 65. None of the other nodes scores this high, so this must be the most important page in the network. Nowadays PageRank is not the only metric Google is using. There are many extra variables to determine the best search results: for example, the geographical distance, page loading speed, the security in place, whether it’s a mobile-friendly webpage and, for news items, the age of the news.
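
The sum for node 2 can be reproduced with a small edge list. This is a rough sketch of a weighted count of incoming links, which is a simplification and not the full PageRank algorithm. Only the five weights into node 2 come from the text above; the source nodes and the remaining edges are invented for illustration.

```python
from collections import defaultdict

# Each edge is (source, target, weight). The arrow points from source to target.
# Assumption: the source nodes and the last three edges are made up;
# only the weights 10, 20, 15, 5 and 15 into node 2 are taken from the text.
edges = [
    (1, 2, 10), (3, 2, 20), (4, 2, 15), (5, 2, 5), (6, 2, 15),
    (2, 7, 5), (7, 8, 10), (9, 10, 20),
]

# Sum the weights of all incoming edges per node.
incoming = defaultdict(int)
for source, target, weight in edges:
    incoming[target] += weight

most_linked = max(incoming, key=incoming.get)
print(most_linked, incoming[most_linked])
```

Because the edges are directed, the link from node 2 to node 7 adds to the score of node 7 and not to that of node 2.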

As said, Graphs are everywhere. You have to remember that a graph is a chain of pieces of information that are linked to each other. Other good examples of highly connected networks are roadways, railroads and energy networks. Even the now so popular Blockchain is a perfect example of a chain of information, where you can track how money is transferred from A to B and from B to C. I recommend the learnmeabitcoin website, where you can quickly search through all the bitcoin blocks and find the path between multiple bitcoin addresses.
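
Following money from A to B to C is a path-finding question on a directed graph. The sketch below uses a plain breadth-first search over a made-up set of transfers. The addresses and the helper name `find_path` are hypothetical, and this says nothing about how the learnmeabitcoin website itself works.

```python
from collections import deque


def find_path(transfers, start, goal):
    """Return a path with the fewest hops from start to goal, or None."""
    queue = deque([[start]])   # each queue entry is a path walked so far
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for receiver in transfers.get(node, []):
            if receiver not in visited:
                visited.add(receiver)
                queue.append(path + [receiver])
    return None


# Hypothetical transfers: each sender maps to the addresses it paid.
transfers = {"A": ["B"], "B": ["C", "D"], "D": ["E"]}
path = find_path(transfers, "A", "E")
print(path)
```

Since transfers have a direction, searching from "E" back to "A" finds no path in this example.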

In a later blog post I will explain more about graph algorithms: how to find patterns in your data and highlight the most important nodes in the network. I’ll explain how to use the famous Dijkstra Shortest Path algorithm, and how to detect communities within your dataset by using ‘label propagation’ to cluster your data. Other methods I’m going to explain are ‘closeness centrality’, to find influencers within a cluster, and ‘betweenness centrality’, to detect the important influencers across clusters. And, of course, I’ll demonstrate the PageRank algorithm! Please post in the comments what is on your mind, and please share your feedback and ideas about this or other interesting topics.

The most famous path-finding algorithm is Dijkstra’s shortest path algorithm

These graph algorithms are very interesting, and so are the ‘out-of-the-box’ possibilities of Neo4j, at the moment the most popular graph database. I’m also happy that I earned this certificate last year, so I can officially call myself a certified Neo4j Professional!

 

I will keep you posted. Let’s DO IT!

Note: Please subscribe to receive all new blog posts as soon as they are published. You can subscribe using the form at the top right-hand side. See you soon! My next blog post is about Spark and Azure and is based on the online course I’m currently attending: Implementing Predictive Analytics with Spark in Azure HDInsight

5 thoughts on “Graphs are everywhere!”

  1. Very much related but also very different from how I’m looking at data in my field. I’d love to read more of your blogs!

  2. I used a graph theory-based model called Fuzzy Cognitive Mapping (FCM) on a database derived from an online platform in my research. I accept your argument about the importance of this simple, robust theory.
    I like your post and look forward to reading the next one soon.

  3. A cognitive map is also a directed graph indeed. It roughly connects facts through a series of options (whereas an option also plays the role of issue) towards values or goals. Such a map can be used to model business rules such that these serve business goals. I can imagine that algorithms may be capable of determining the most efficient business rules model (e.g. the most efficient business plan).
