Data Analytics for Blockchain
Towards privacy preserving data science

1. INTRODUCTION

This article introduces blockchains and private distributed ledgers from the perspective of privacy-preserving data science. It also surveys the analytics and visualization techniques currently used in blockchain-oriented data science, and summarizes a benchmarking study of popular blockchain frameworks that highlights some significant limitations in Ethereum, Parity, and Hyperledger.

Blockchain technology has become popular in the last decade [1]. A blockchain is a shared ledger that records a list of transactions across a distributed network. The technology was first introduced by Satoshi Nakamoto as the basis of the Bitcoin cryptocurrency. Since then, it has found its place in many different decentralized systems and, having changed how transactions and exchanges are carried out, it has been widely used in industrial applications. Popular blockchain frameworks include Bitcoin, Ethereum, and Hyperledger.

Blockchains allow parties that do not trust each other to exchange transactions and agree on a common view of their balances. They enable auditability and traceability, so that potentially malicious operations on the shared state can be detected. Blockchains can be operated reliably in a completely decentralized manner, without involving a central trusted party.

This report assumes that the general underlying mechanism of the blockchain structure is well understood by the reader.

2. BLOCKCHAIN ORIENTED DATA SCIENCE

A blockchain framework can generate huge volumes of data, because it is designed to track every single transaction in the distributed network. This creates a need for data analytics mechanisms that keep track of money flows, user behaviour, transaction fees, entity relationships, and so on. It also creates a need for an efficient underlying database for querying blockchain data. Blockchains have some limitations that restrict their wide-scale use, which we discuss in section 3. For now, let’s assume that we have an efficient data model for querying our blockchain system, and keep this assumption in mind throughout this section.

There are many visualization techniques being used and produced in industrial and academic literature [1]. Before analysing any blockchain system, it is important to understand every element in the blockchain framework. There are three types of blockchain frameworks:

  • Public - Anyone can participate and join the consensus.
  • Private - Fully controlled by an organization.
  • Consortium - These are semi-private blockchains that restrict the consensus process to selected groups.

These frameworks have most of their components in common. Let’s take Bitcoin as our baseline for understanding how to perform data analytics on a blockchain framework. Before jumping into analytics, let’s look at the different components of the Bitcoin framework:

1. Transaction:

  • It is the most granular level of blockchain data.
  • A transaction records the transfer of value between addresses.

2. Blocks:

  • Transaction records are stored in blocks.
  • They hold and group a certain number of transactions.

3. Ledger:

  • Multiple blocks are connected in a linked list called the ledger.

4. Nodes:

  • Electronic devices that maintain and distribute a copy of the ledger in the blockchain network to maintain synchronization of data.

5. Miners:

  • Special nodes in the peer to peer networks that participate in verifying transactions and adding new blocks to the ledger, with the possibility to receive a reward.

6. Entity:

  • Represents the real user or organization behind a transaction.

Knowing the components of a blockchain system tells us where to look for a particular type of data when doing analytics.
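
The components above can be sketched as simple data structures. The following is a minimal illustration only, assuming a simplified account-style transaction and a plain SHA-256 hash link between blocks; the names `Transaction`, `Block`, and `Ledger` are hypothetical, and real frameworks such as Bitcoin use a richer model (for example Merkle trees and proof of work) that is omitted here.

```python
from dataclasses import dataclass, asdict
import hashlib
import json


@dataclass(frozen=True)
class Transaction:
    """Most granular unit: a transfer of value between two addresses."""
    sender: str      # hypothetical address label
    receiver: str
    amount: float


@dataclass
class Block:
    """Groups a number of transactions and points to the previous block."""
    index: int
    prev_hash: str
    transactions: list

    def hash(self) -> str:
        # Hash the block contents together with the previous block's hash.
        payload = json.dumps(
            {
                "index": self.index,
                "prev_hash": self.prev_hash,
                "transactions": [asdict(t) for t in self.transactions],
            },
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()


class Ledger:
    """Blocks connected as a hash-linked list."""

    def __init__(self):
        self.blocks = [Block(0, "0" * 64, [])]  # genesis block

    def add_block(self, transactions):
        prev = self.blocks[-1]
        block = Block(prev.index + 1, prev.hash(), list(transactions))
        self.blocks.append(block)
        return block

    def is_consistent(self) -> bool:
        # Every block must reference the current hash of its predecessor.
        return all(
            cur.prev_hash == prev.hash()
            for prev, cur in zip(self.blocks, self.blocks[1:])
        )
```

In this sketch, an analyst looking for value transfers would read `Transaction` records, while questions about ordering or tampering would be answered at the `Block` and `Ledger` level.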

Tovanich et al. proposed a taxonomy of blockchain analytics for the Bitcoin framework based on analysis themes:

  1. Analysis of entity-relationship
  2. Metadata
  3. Money flows
  4. User behaviour
  5. Transaction fee
  6. Market wallets

This taxonomy helps to frame the analytics goals for any kind of blockchain framework, depending on the task domain. For example, the following can be analysis goals for the Bitcoin framework:

  1. Anonymity
  2. Market analytics
  3. Cyber Crime
  4. Metadata analysis
  5. Transaction fees

Any data science activity for a blockchain framework can be classified into two types:

  1. Analysis tasks - Based on the analytics goals.
  2. Visual representations - of blockchain data or system.

Tovanich et al. introduced a classification scheme of visualization on blockchain data:

[Figure: classification scheme for visualizations of blockchain data, after Tovanich et al.]

Within this classification scheme, we focus in detail on the task domains, which describe the goals of blockchain analysis tasks:

1. Transaction detail analysis:

  • The goal here is to analyze transaction patterns for individual blockchain components or derived entities in blockchain networks.

2. Transaction network analysis:

  • The aim of this type of analysis is to show three kinds of information: transaction networks, network entities, and value flows, which trace the transfer of cryptocurrency values through transactions over time.
  • They are mostly represented by directed bipartite graphs.

3. Cybercrime detection:

  • This includes visualizations that are able to detect suspicious transactions and entities or investigate cyber attack events.
  • A popular technique is value-flow analysis, which visualizes a deep transaction flow graph to trace money laundering and stolen Bitcoins.

4. Cryptocurrency exchange analysis:

  • This is used to visualize conversions of cryptocurrency values into real-world currencies.
  • Mostly used for visualizing conversion rates for cryptocurrency values and market analysis.

5. P2P network activity analysis:

  • This type of analysis uses time series visualizations to show aggregated statistics of the P2P network.
  • It also uses map-based visualizations to show the geographical distribution of blockchain usage around the world.

6. Casual / Entertainment Purpose:

  • This kind of analytics provides visual encodings that help with understanding blockchain networks in a casual or entertainment setting.
  • An example is Bitcoin VR, an open-source project that visualizes Bitcoin transactions as balloons flying over a 360-degree view.
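
To make the transaction network and value-flow task domains above more concrete, here is a minimal sketch that builds a directed bipartite graph (addresses on one side, transactions on the other) and follows it forward from a starting address. The transaction tuples and the function names are hypothetical, and the traversal only computes reachability; it does not apportion amounts the way a full taint or value-flow analysis would.

```python
from collections import defaultdict, deque

# Hypothetical transactions: (tx_id, input_addresses, {output_address: value})
TRANSACTIONS = [
    ("tx1", ["A"], {"B": 5.0, "C": 2.0}),
    ("tx2", ["B"], {"D": 4.5}),
    ("tx3", ["C", "E"], {"F": 3.0}),
]


def build_bipartite_graph(transactions):
    """Directed edges: address -> tx (spending) and tx -> address (receiving)."""
    edges = defaultdict(set)
    for tx_id, inputs, outputs in transactions:
        for addr in inputs:
            edges[("addr", addr)].add(("tx", tx_id))
        for addr in outputs:
            edges[("tx", tx_id)].add(("addr", addr))
    return edges


def trace_value_flow(edges, start_address):
    """Breadth-first search: addresses reachable downstream of start_address."""
    start = ("addr", start_address)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(
        name for kind, name in seen if kind == "addr" and name != start_address
    )


graph = build_bipartite_graph(TRANSACTIONS)
downstream_of_a = trace_value_flow(graph, "A")  # addresses that may hold value from A
```

A cybercrime investigation would start such a traversal at an address known to hold stolen funds; the resulting node set is what a value-flow visualization would then lay out.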


3. DATA MODELLING FOR BLOCKCHAIN FRAMEWORK

To perform efficient data analytics on a blockchain framework, it is important to have a well-managed underlying data model that enables systematic analysis of blockchain data. El-Hindi et al. introduced Blockchain DB with the aim of providing a shared database on blockchains [3]. Blockchain DB introduces a database layer on top of an existing blockchain framework, extending blockchains with classical data management techniques. Its aim is to increase the performance and scalability of blockchains for data sharing, and also to reduce complexity for organizations that intend to use blockchains as a database.

As El-Hindi et al. mentioned, there are two main reasons why blockchains are not being used widely:

1. Limited scalability and performance:

  • Even state-of-the-art blockchain systems like Ethereum or Hyperledger can only achieve tens or at most hundreds of transactions per second, which is far below the requirements of modern applications.

2. They lack easy to use abstractions:

  • Current blockchain systems lack a simple query interface, as well as guarantees such as well-defined consistency levels that specify when and how updates become visible.
  • Most of them come with proprietary programming interfaces.

Blockchain DB attempts to address these main issues for blockchain data management. It leverages blockchains as a storage layer. This way, existing blockchain systems can be used without modification as tamper-proof and decentralized storage. On top of the storage layer, Blockchain DB adds a database layer that implements partitioning and replication, as well as a query interface and consistency levels.
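
The idea of a database layer that partitions data over blockchain-backed storage can be sketched as follows. This is an illustration under stated assumptions, not the implementation described by El-Hindi et al.: the class name `ShardedStore` is hypothetical, each Python dict stands in for one blockchain used as a storage backend, and hash partitioning is an assumed partitioning scheme.

```python
import hashlib


class ShardedStore:
    """Toy database layer: hash-partitions keys over several storage backends."""

    def __init__(self, num_shards):
        # Each dict stands in for one blockchain network used as a shard.
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        # Stable hash so the same key always maps to the same shard.
        digest = hashlib.sha256(key.encode()).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key, value):
        self.shards[self._shard_for(key)][key] = value

    def get(self, key):
        return self.shards[self._shard_for(key)].get(key)


store = ShardedStore(num_shards=3)
store.put("account:alice", 10)
store.put("account:bob", 7)
```

The client only sees `put` and `get`; which backend holds a given key is hidden behind the database layer, which is the kind of abstraction the text describes.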

Blockchain DB also offers off-chain verification procedures with which clients can detect potentially misbehaving peers in the Blockchain DB network. These procedures provide trust guarantees against attacks in which a peer tries to change the shared state to its own advantage. Verification comes in two flavors: offline verification and online verification. El-Hindi et al. discuss the advantages and disadvantages of each, and state that offline verification is preferred and less costly. The following figure illustrates the structure of a Blockchain DB network as proposed in the paper by El-Hindi et al.

[Figure: structure of a Blockchain DB network, after El-Hindi et al.]


In the database layer, Blockchain DB additionally provides easy-to-use abstractions, including different consistency protocols, namely eventual consistency and sequential consistency, which are described in detail in the paper.

1. Eventual consistency:

  • With eventual consistency, a get-operation is answered immediately; however, no guarantee is given that a potentially outstanding put-operation has already been committed to the blockchain.

2. Sequential consistency:

  • Here, the database layer needs to execute put/get operations in order, which potentially blocks a get-operation until an outstanding put-operation has been committed to the blockchain.
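
The difference between the two consistency levels can be illustrated with a small simulation. This is a sketch under assumptions: the class `ConsistencyDemo` and its methods are hypothetical, `commit_pending` stands in for the blockchain committing outstanding transactions, and the blocking behaviour of a sequential get is simulated by forcing that commit rather than actually waiting.

```python
class ConsistencyDemo:
    def __init__(self):
        self.committed = {}  # state already written to the (simulated) blockchain
        self.pending = []    # puts submitted but not yet committed

    def put(self, key, value):
        self.pending.append((key, value))

    def commit_pending(self):
        """Stand-in for the blockchain committing outstanding put-operations."""
        for key, value in self.pending:
            self.committed[key] = value
        self.pending.clear()

    def get_eventual(self, key):
        # Answered immediately from committed state; outstanding puts are ignored,
        # so the result may be stale.
        return self.committed.get(key)

    def get_sequential(self, key):
        # A real system would block until outstanding puts are committed;
        # the sketch simulates the wait by committing them first.
        if self.pending:
            self.commit_pending()
        return self.committed.get(key)


db = ConsistencyDemo()
db.put("x", 1)
stale = db.get_eventual("x")    # the put is still outstanding
fresh = db.get_sequential("x")  # read happens after the put is committed
```

The trade-off is visible in the two read paths: the eventual read never waits but may miss a recent write, while the sequential read observes every earlier put at the cost of waiting for the commit.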


4. VISUALIZATION CHALLENGES IN BLOCKCHAIN

In section 3, we looked at one popular approach to blockchain data modeling, Blockchain DB. Having such a model in place makes our data analytics and visualization tasks easier. Now, let’s go back to the visualization and data analytics concepts from section 2. We saw how an analytics taxonomy helps us set analysis goals, and that most analytics results end up in a visualization that conveys the information effectively. However, Tovanich et al. identify some significant challenges in their paper. Let’s discuss each of them:

1. Multiple views of visualization of blockchain data:

  • Currently, most of the visualizations are single view charts showing particular blockchain measures over time.
  • These single, disconnected views make it difficult to relate multiple blockchain characteristics to each other.

2. New visual representations for transaction network analysis:

  • Existing visualizations present transaction networks and value flows as static graphs at specific points of interest; there is a need for dynamic network visualizations.

3. Uncertainty visualization:

  • Analysis tools may label certain transaction patterns as fraudulent, and some of these predictions may be false.
  • Any uncertainty in the data should be made evident in the visualization.

4. Progressive visual analytics:

  • The blockchain evolves over time; for example, clustered Bitcoin entities need to be updated continuously to stay up to date.
  • There is a need to have computationally efficient visualizations and progressive visual analytics for processing large amounts of data from the ever-evolving blockchain.
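
One way to keep entity clusters up to date as the blockchain grows, in the spirit of the progressive analytics challenge above, is to update them incrementally instead of recomputing them from scratch. The sketch below uses a union-find structure and assumes the common-input heuristic (all input addresses of one transaction are treated as belonging to the same entity). The heuristic and the class name `EntityClusters` are assumptions made for illustration and are not taken from Tovanich et al.

```python
class EntityClusters:
    """Union-find over addresses, updated one transaction at a time."""

    def __init__(self):
        self.parent = {}

    def find(self, addr):
        # Register unseen addresses as their own cluster.
        self.parent.setdefault(addr, addr)
        while self.parent[addr] != addr:
            # Path halving keeps later lookups cheap.
            self.parent[addr] = self.parent[self.parent[addr]]
            addr = self.parent[addr]
        return addr

    def add_transaction(self, input_addresses):
        # Assumed heuristic: inputs of one transaction share an owner.
        roots = [self.find(a) for a in input_addresses]
        for root in roots[1:]:
            self.parent[self.find(root)] = self.find(roots[0])

    def same_entity(self, a, b):
        return self.find(a) == self.find(b)


clusters = EntityClusters()
clusters.add_transaction(["A", "B"])  # A and B merged into one entity
clusters.add_transaction(["C"])       # C is still on its own
clusters.add_transaction(["B", "C"])  # new block arrives: C joins A and B
```

Each new transaction costs only a few near-constant-time operations, so a visualization built on these clusters can be refreshed as blocks arrive.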


5. DATA ANALYSIS CHALLENGES FOR BLOCKCHAIN

Almost every blockchain system uses a key-value store as its underlying data storage mechanism. Key-value stores offer a great amount of flexibility and are efficient in cost and space. However, they are optimized for retrieving a single value by its key. When a value contains multiple nested key-value pairs, a parser is required to read it. Lookups by anything other than the key are therefore inefficient, as they require scanning the entire collection or creating separate index values.

Since key-value stores are mostly unstructured, it is difficult to define foreign keys, which might be needed for looking up values in a lookup table or in other tables. We can still create an intermediate table that acts as a key-value store holding foreign key references, but this is costly and less efficient.

For data analysis, we often crunch a lot of data, view it from different perspectives, and perform many joins or aggregations, which becomes painful if the underlying data store offers little support for these operations. Maintaining integrity and consistency among data attributes is also important to monitor. Since there are no database constraints, it is easy to create new attribute names for the same value, which may affect maintainability, although with current programming practices such mistakes may not occur frequently.
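
The lookup problem described above can be illustrated with a plain Python dict standing in for a key-value store. The records, field names, and helper functions are hypothetical; the point is only that a query on a non-key attribute either parses and scans every value or relies on a separately maintained index.

```python
import json
from collections import defaultdict

# Hypothetical key-value store: transaction id -> serialized nested record.
kv_store = {
    "tx1": json.dumps({"from": "A", "to": "B", "meta": {"fee": 0.1}}),
    "tx2": json.dumps({"from": "B", "to": "C", "meta": {"fee": 0.3}}),
    "tx3": json.dumps({"from": "A", "to": "C", "meta": {"fee": 0.2}}),
}


def find_by_sender_scan(store, sender):
    """Without an index, every value must be parsed and inspected."""
    return sorted(
        key for key, raw in store.items() if json.loads(raw)["from"] == sender
    )


def build_sender_index(store):
    """A separately maintained secondary index: sender -> transaction ids."""
    index = defaultdict(list)
    for key, raw in store.items():
        index[json.loads(raw)["from"]].append(key)
    return index


def total_fees_by_sender(store):
    """An aggregation also needs a full pass over all parsed values."""
    totals = defaultdict(float)
    for raw in store.values():
        record = json.loads(raw)
        totals[record["from"]] += record["meta"]["fee"]
    return dict(totals)
```

In a relational store, the same questions would be a filtered query and a grouped aggregate; here the parsing, scanning, and index maintenance all fall on the analyst's code.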


6. DATA ANALYTICS FOR BENCHMARKING BLOCKCHAIN SYSTEM

This section describes a key application of blockchain-oriented data analytics: benchmarking. Dinh et al. propose the BlockBench framework for analyzing private blockchains. They present it as the first benchmarking framework for understanding and comparing the performance of permissioned blockchain systems. In their research, they conduct comprehensive evaluations of Ethereum, Parity, and Hyperledger. These evaluations present concrete evidence of the limitations of blockchains in handling data processing workloads and reveal bottlenecks, and the results serve as a baseline for further development of blockchain technologies.

The aim of BlockBench is to provide a better understanding of the performance and design of different private blockchain systems. The research attempts to find the limitations of the Ethereum, Parity, and Hyperledger systems. Interestingly, the understanding gained from sections 2, 3, and 4 can be linked to BlockBench. It takes into account the importance of data modeling for a blockchain framework to perform optimally, and it presents its findings in several information visualizations based on different task domains. Its main objective, however, is to evaluate blockchain systems and bring out possible limitations and bottlenecks with respect to performance and data processing workloads. Since we are focusing on data science, let’s look at the key bottlenecks they report for each of the three blockchain frameworks they analyzed:

1. Hyperledger:

  • BlockBench identifies a trade-off in data models.
  • The key-value model of Hyperledger means that analytical queries are not supported directly, although the model enables optimizations that help to answer queries more efficiently.

2. Parity:

  • They identify that Parity trades performance for scalability by keeping state in memory.
  • They state that the bottleneck in Parity is not due to the consensus protocol, but due to the server’s transaction signing.

3. Ethereum:

  • They found that Ethereum was the most mature of the three systems benchmarked, yet not ready for mass usage.
  • Ethereum incurs large overhead in terms of memory and disk usage for data processing.
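
Findings like these rest on measurements such as throughput and latency under a workload. The following is a minimal sketch of such a measurement, not BlockBench's actual harness: the `benchmark` function is hypothetical, and the workload writes to an in-memory dict as a stand-in for submitting transactions to a real blockchain.

```python
import time


def benchmark(operation, num_ops):
    """Run operation(i) num_ops times; report throughput and mean latency."""
    latencies = []
    start = time.perf_counter()
    for i in range(num_ops):
        t0 = time.perf_counter()
        operation(i)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "ops": num_ops,
        # Operations per second over the whole run.
        "throughput": num_ops / elapsed if elapsed > 0 else float("inf"),
        # Average time per operation in seconds.
        "mean_latency": sum(latencies) / len(latencies) if latencies else 0.0,
    }


# Stand-in workload: writes to an in-memory dict instead of a real blockchain.
store = {}
result = benchmark(lambda i: store.__setitem__(f"key{i}", i), 1000)
```

Replacing the lambda with a call that submits a transaction to a test network would turn the same loop into a crude comparison between frameworks, although a serious benchmark would also need concurrent clients and repeated runs.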


7. CONCLUSION

This report focussed on blockchain-oriented data analytics from the perspective of visualization, data modeling, and benchmarking as one application of blockchain-oriented data science. The key takeaway of the coursework is an understanding of privacy-preserving data science. Since we are analyzing blockchain frameworks, it can be assumed that a main purpose of a blockchain system is to offer anonymity among the peers in the distributed system, which is a default characteristic of the Bitcoin, Parity, and Hyperledger frameworks used as examples in this report. Taking this as a baseline, the privacy-preserving requirement is treated as already met, and the report focussed on what lies beyond it: blockchain-oriented data science.

ACKNOWLEDGMENT

I submitted this work as coursework in Privacy-Preserving Data Science at the University of Victoria, and I would like to extend my gratitude to Dr. Sean Chester for instructing this course.
