Big Data Graph Databases
We live in a world where the data available is overwhelming, and we need data stores that can handle large data sets effectively, such as Hadoop and DataTorrent RTS. To facilitate faster decisions, we also need a database that can answer questions instantaneously. Doing this with traditional databases requires joining multiple tables quickly, which introduces quite a bit of latency. To avoid the latency issues of an RDBMS, we can look at graph databases whenever instantaneous decisions or recommendations are involved. One of the most popular graph databases is Neo4j; according to the DB-Engines ranking, it is the top graph database with a score of over 38.7.
Choosing a graph database is merely the beginning of the process. There are several questions we need to answer to use one effectively. To begin with, we need to answer at least the following three:
1. What kind of data should we store in a graph database like Neo4j?
2. How can we load the data into the graph database?
3. Once the data is loaded into the database, how can we optimize its performance?
We will share our own experiences with Graph databases through these articles.
Before we begin the journey, we suggest that readers adapt these insights to their own domains and look at graph databases with the following things in mind.
Understand what kinds of questions you are going to pose to the data. The questions you can pose to a graph database, grounded in graph theory, fall into the following kinds (illustrative Cypher sketches follow the list):
- Finding a node type or property based on its associations
- Finding a link type or property based on the nodes it is associated with
- Finding a path between two nodes that may require traversing through various other nodes
- Ranking nodes based on their relationships, weights, etc.
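To make these question types concrete, here are a few illustrative Cypher sketches against a hypothetical retail-style graph of Customer and Product nodes connected by BOUGHT relationships; the labels, relationship types, and property names are assumptions for illustration only, not the model we build later in the series.

```cypher
// 1. Find a node type or property based on its associations:
//    customers who bought a specific product.
MATCH (c:Customer)-[:BOUGHT]->(p:Product {name: 'Widget'})
RETURN c.name;

// 2. Find a link type or property based on the nodes it connects.
MATCH (c:Customer {name: 'Alice'})-[r]->(p:Product {name: 'Widget'})
RETURN type(r), r.quantity;

// 3. Find a path between two nodes, traversing through other nodes.
MATCH path = shortestPath(
  (a:Customer {name: 'Alice'})-[*..6]-(b:Customer {name: 'Bob'})
)
RETURN path;

// 4. Rank nodes based on their relationships and weights
//    (here, products ranked by total quantity bought).
MATCH (:Customer)-[r:BOUGHT]->(p:Product)
RETURN p.name, sum(r.quantity) AS totalBought
ORDER BY totalBought DESC
LIMIT 10;
```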
Unlearn some of the RDBMS concepts. Relational databases, in spite of their name, tend to be poor at expressing the relationships between data elements.
To learn to do anything effectively, one needs to do it. In this article and the ones that follow, we will show how large data sets stored in Hadoop can be brought into Neo4j to effectively answer certain types of questions.
We will cover the data loading process and the preparation of a data load project.
Here is a high-level overview of the process. We will cover each of these steps in greater detail in a later article.
Step 1: Building the Graph Database Prototype
1. Find the data sources related to the business problem.
2. Analyze the data.
3. Learn about the production environment.
4. Install and configure Neo4j.
5. Create the graph structure.
6. Create a sample graph and validate its usability.
7. Identify and create the indexes (a Cypher sketch follows this list).
8. Make corrections to the sample model.
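As a rough illustration of steps 5 through 7, the following Cypher sketch creates a tiny sample graph and the indexes that support look-ups; the Customer/Product model and the property names are hypothetical, chosen only to show the shape of the work.

```cypher
// Create a tiny sample graph to validate the model (hypothetical Customer/Product schema).
CREATE (alice:Customer {customerId: 'C001', name: 'Alice'})
CREATE (widget:Product  {sku: 'P100', name: 'Widget'})
CREATE (alice)-[:BOUGHT {quantity: 2, boughtOn: date('2023-01-15')}]->(widget);

// Index the properties used for look-ups so traversals start quickly
// (Neo4j 4.x+ syntax; older releases use CREATE INDEX ON :Customer(customerId)).
CREATE INDEX customer_id IF NOT EXISTS FOR (c:Customer) ON (c.customerId);
CREATE INDEX product_sku IF NOT EXISTS FOR (p:Product)  ON (p.sku);

// A quick usability check: can the sample graph answer a typical question?
MATCH (c:Customer)-[b:BOUGHT]->(p:Product)
RETURN c.name, p.name, b.quantity;
```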
Step 2: Groom the Import Tool with Sample Data
1. Write the data import queries.
2. Prepare the data headers and check Neo4j import tool readiness.
3. Extract/import the data from the source database.
4. Create the Neo4j database.
5. Refactor the graph.
6. Enrich the graph.
7. Validate record counts between the source database and the Neo4j database.
8. Note the duration of each extraction, loading, enrichment, and retrieval Cypher query.
9. Run the stupid-data test.
Since the data loading process is time-consuming, and to make sure new data is not creating new challenges, perform the data import steps on progressively larger slices of the data, from smaller to larger, and check at each size that the purpose of the database is met. A minimal import sketch follows.
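The sketch below is a hedged example of steps 1, 3, and 7 using LOAD CSV; the file name, headers, and labels are hypothetical, and for very large initial loads the offline Neo4j import tool is usually the faster route.

```cypher
// Load a sample extract from the source system (file name and headers are hypothetical).
// For large files, batch the load (CALL { ... } IN TRANSACTIONS in Neo4j 4.4+,
// USING PERIODIC COMMIT in older releases) so a single transaction does not exhaust memory.
LOAD CSV WITH HEADERS FROM 'file:///customers_sample.csv' AS row
MERGE (c:Customer {customerId: row.customerId})
SET c.name = row.name;

// Step 7 validation: the node count should match the row count of the extract.
MATCH (c:Customer)
RETURN count(c) AS customerCount;
```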
Step 3: Create a Moderate-Sized Database
1. Modify the queries to address the new set of data.
2. Estimate the average time for each query based on the durations observed with the test data.
3. Run Step 2.3 through Step 2.7.
4. Optimize long-running queries (see the PROFILE sketch below).
Keep all the databases created in Step 1 through Step 3; these will be used for testing the incremental load.
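Neo4j's EXPLAIN and PROFILE clauses are the usual starting point for timing and optimizing queries in this phase: they show how a query is planned and where it spends its time. The query below simply reuses the hypothetical Customer/Product model from the earlier sketches.

```cypher
// PROFILE runs the query and reports db hits and rows per operator,
// which usually reveals missing indexes or overly wide traversals.
PROFILE
MATCH (c:Customer {customerId: 'C001'})-[:BOUGHT]->(p:Product)
RETURN p.name;

// EXPLAIN shows the planned execution without actually running the query.
EXPLAIN
MATCH (c:Customer)-[:BOUGHT]->(p:Product)
RETURN c.name, count(p) AS productsBought
ORDER BY productsBought DESC;
```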
Step 4: Build the Full-Scale Database
1. Modify the queries to address the new set of data.
2. Run Step 2.3 through Step 2.7.
Step 5: Maintain the Database – Inserts/Updates/Deletes
1. Analyze the master data and transaction data.
2. Plan and develop queries for master data updates.
3. Plan and develop queries for transaction data updates (see the MERGE sketch after this list).
4. Prepare validation queries for the incremental load.
5. Test the incremental load on the database created in Step 1.
6. Test the incremental load on the database created in Step 2.
7. Test the incremental load on the database created in Step 3.
8. Optimize the incremental load queries if required.
9. Run and validate the incremental load queries on the production database.
10. Schedule the incremental load.
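For the master-data and transaction-data update queries, Cypher's MERGE is the usual idiom for incremental loads: it matches an existing node or relationship and creates it only when absent. The file names, columns, and properties below are hypothetical, and the dates are assumed to arrive in ISO format.

```cypher
// Incremental master-data load: update existing customers, create new ones
// (file name and columns are hypothetical).
LOAD CSV WITH HEADERS FROM 'file:///customers_delta.csv' AS row
MERGE (c:Customer {customerId: row.customerId})
  ON CREATE SET c.name = row.name, c.createdAt = datetime()
  ON MATCH  SET c.name = row.name, c.updatedAt = datetime();

// Incremental transaction-data load: add new BOUGHT relationships only.
LOAD CSV WITH HEADERS FROM 'file:///orders_delta.csv' AS row
MATCH (c:Customer {customerId: row.customerId})
MATCH (p:Product  {sku: row.sku})
MERGE (c)-[b:BOUGHT {orderId: row.orderId}]->(p)
  ON CREATE SET b.quantity = toInteger(row.quantity),
                b.boughtOn = date(row.orderDate);
```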
We are going to cover the details of each step in upcoming articles.
Lessons learned while building extremely large graph databases:
- We normally don’t have control over the data coming in, especially when unstructured data lands in Hadoop and is transformed into semi-structured Parquet files. It is usually a better strategy to be skeptical about the quality of the data. Simply put: always suspect the data up front rather than suffer later.
- Data that is not useful for decision making should be discarded rather than sent to the graph, and the earlier in the process you discard it, the better.
- In the cloud, CPUs, memory, and file sizes may seem to be of no consequence. However, when building large graphs we will be dealing with memory-intensive operations, so be judicious in the choice of data to be sent to the graph database.
- Decide how we are going to use the data and then determine the data types, rather than mirroring how the data is stored in the source.
- We will need several calculated values that we reuse many times. It is better to persist these calculated values rather than re-calculating them each time (see the sketch after this list).
- Expect the graph design to evolve over the iterations, and don't get attached to any design. If the graph we built cannot answer our questions, take a hard look at the design and consider changing it. The datastore should provide insights, and if the design is a hindrance to getting those insights from the data, it is worthwhile spending time on redesigning it.
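As a sketch of the point about persisting calculated values (the totalQuantity property and the model are hypothetical): compute a derived value once with an aggregate, store it as a property, and let later queries read it instead of re-aggregating.

```cypher
// Compute each customer's total purchased quantity once and persist it,
// so later ranking and filtering queries can read the property directly.
MATCH (c:Customer)-[b:BOUGHT]->(:Product)
WITH c, sum(b.quantity) AS totalQuantity
SET c.totalQuantity = totalQuantity;

// Later queries reuse the persisted value instead of re-aggregating.
MATCH (c:Customer)
WHERE c.totalQuantity > 100
RETURN c.name, c.totalQuantity
ORDER BY c.totalQuantity DESC;
```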