Scribbling data with graph database
Source: Google, Kaggle


In my current assignment I came across Neo4j, a graph database that uses the Cypher query language to pull information.

A graph database is all about entities and their relationships. Each entity (e.g. a person, a thing, or any particular piece of data) is represented as a node, and relationships describe how nodes are connected. LinkedIn itself is a great example of a graph: each of us is a node, our articles, posts and profiles are other kinds of nodes, and all of this information is linked to us (the nodes) through different relationships.
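To make the LinkedIn analogy concrete, here is a minimal Cypher sketch; the `Person` and `Article` labels and the `WROTE` relationship are purely illustrative, not part of the dataset used below:

```
// create two nodes and connect them with a relationship
CREATE (p:Person {name: 'Madhur'})
CREATE (a:Article {title: 'Scribbling data with graph database'})
CREATE (p)-[:WROTE]->(a)
```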

Cypher is syntactically very similar to SQL, with slightly different semantics. Trust me, it's not that hard to learn.
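For example, a filtered SELECT in SQL maps quite directly to a MATCH with a WHERE clause in Cypher (the `Person` label and `age` property here are illustrative):

```
// SQL:  SELECT p.name FROM person p WHERE p.age > 30
// the same query in Cypher, expressed as a pattern match
MATCH (p:Person) WHERE p.age > 30 RETURN p.name
```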

So to get some hands-on experience, I looked for interesting datasets on Kaggle.com. The one I chose is a Netflix dataset that lists programs (titles), their ratings, year of release and user scores. The dataset is pretty old and is not perfect (it has NA or blank entries, a lot of duplicates and so on), but it nevertheless gave me a good feel for graph databases, so let's move on.

The first thing we have to do is to create a database and upload the data.

After creating the new graph, go to Manage and click Open Browser

The browser opens a cypher-shell from which you can execute queries. First we'll import the data from CSV. I have downloaded the data file netflix_shows.csv (you can alternatively provide a URL in the query).

If the dataset is in the local file system, it must be placed under the folder specified in the Settings tab of the management console to be accessible to the database.

Since I have kept the default "import", the dataset should be under <neo4j install directory>/import. The file is then accessible as file:///netflix_shows.csv
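For reference, the relevant entry in neo4j.conf looks like this (shown with its default value; in newer Neo4j versions the key is named `server.directories.import` instead):

```
# neo4j.conf - LOAD CSV file:/// URLs are resolved relative to this directory
dbms.directories.import=import
```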

Here is the final query to upload the data. I have used Program as the parent node, which has relationships with title, year, rating and user score.

LOAD CSV FROM "file:///netflix_shows.csv" AS row

// create a map from each row
WITH { title: row[0], rating: row[1], year: row[2], score: row[3]} AS data

// get-or-create nodes for title, ratings and user score
MERGE (t:Title {value:data.title})
MERGE (r:Rating {name: data.rating})
MERGE (y:Year {name: data.year})
MERGE (s:Score {name: data.score})

// create the Program node, set the full data as attributes
CREATE (p:Program) SET p = data

//create relationships 
CREATE (t)<-[:Title]-(p)-[:Rating]->(r)
CREATE (y)<-[:Year]-(p)-[:Score]->(s)

Once the import is successful, we can see what the schema looks like.
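In the Neo4j Browser the schema can be inspected with a built-in procedure (on older 3.x versions the equivalent is `CALL db.schema()`):

```
// show node labels and relationship types as a graph
CALL db.schema.visualization()
```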

Now that we have some data to play with, let's try a few more queries. Just a reminder: as the dataset is not perfect, there may be some discrepancies.

Let's say we want to find all the programs rated R (that's what I look for when watching with my 11-year-old):

match path = (r:Rating{name:'R'})<-[:Rating]-(p)-[:Title]->(t) return path


So for example, the title 'Lucky Number Slevin' is related to multiple same-named Program nodes; this is simply because of duplicates in the dataset.
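A quick way to spot such duplicates is to count how many Program nodes point at each title:

```
// titles that appear on more than one Program node
MATCH (t:Title)<-[:Title]-(p:Program)
WITH t.value AS title, count(p) AS copies
WHERE copies > 1
RETURN title, copies ORDER BY copies DESC
```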

Let's find out how many titles were released each year

match (y:Year)<-[:Year]-(p)-[:Title]->(t) return distinct p.year, count(*) order by count(*) desc

As you can see, 2016 was the year with the most releases (this dataset only goes up to 2017).

A graphical view is a bit large, but gives a bigger picture (I trimmed it for demonstration purposes).

If I want to find out which titles have a user score above 90:

match (r:Rating)<-[:Rating]-(p)-[:Score]->(s) where p.score <> 'NA' and toInteger(p.score) > 90 return distinct p.title, p.rating, p.score order by p.title

Here distinct on the title would not be necessary, but I had to use it because of the duplicates. Moreover, since the dataset has NA entries, those have to be filtered out too.

I have only covered fairly entry-level queries here, but Cypher has much more meat 🍖 in it!


