Create Real-time Dashboard with Data Stream using Google Pub/Sub, BigQuery, Dataflow & DataStudio

Create Real-time Dashboard with Data Stream using Google Pub/Sub, BigQuery, Dataflow & DataStudio

There are two ways of processing Big Data means complex data, the volume, velocity and variety, which are too big to be handled in traditional ways

1- Batch data

Batch data means you have some historical data in a data warehouse and you are ingesting into a machine learning model for analyzing and finding the information out of it. So it is typical Data in Rest.

2- Streaming Data

Streaming data is a continuous inflow of data which is quickly processed in order to get real time insights. Simply this is Data in Motion.

For example: Zomato , Ola, Uber based on number of requests, deploying more resources for fulfilling the requests in real time, other example can be power grid monitoring throughput etc.

Lets take a use case:

You own a cab service in Newyork city and want to know how your business is doing in real time.

Source of data: New York City Open Data ( https://opendata.cityofnewyork.us/)

STEP -1:Connect a streaming data topic with cloud pub/sub

Lets understand Cloud pub/sub

Pub is publisher and sub is subscriber. Both publisher and subscriber connect with each other through shared String called Topic.  Publisher application creates and sends a message to the topic and the subscriber application creates a subscription to the topic to receive a message from it. Therefore there may be many publishers and subscribers for a topic.

GO to API & Services-->Enable Cloud Pub/Sub API & Dataflow API

No alt text provided for this image

We are using google BigQuery which is a server-less data warehouse and tables in BigQuery are stored in the form of DataSets.

Step- 2: Create taxirides BigQuery dataset. Go to google console and type

$ bq mk taxirides

No alt text provided for this image

Step- 3 : Create taxirides.realtime table

No alt text provided for this image

Here taxirides is dataset and realtime is table and above we have defined the schema for real time table. We are making timestamp as a partitioning field.Now your Datasets is ready. This needs to be stored. We have Cloud Storage Bucket to store Datasets

STEP-4: Create Storage Bucket

In GCP Console→ Storage→ CREATE BUCKET→ NAME (your project id) → Default Storage Class= Multi-regional → Choose location (nearest to you) → CREATE

STEP 5: Setting up Cloud Dataflow pipeline

Cloud Dataflow is a serverless way to carry data for analysis.

How?

In GCP console → Dataflow→ CREATE JOB FROM TEMPLATE→ Enter Job Name→ Cloud Dataflow template, select the Cloud Pub/Sub Topic to BigQuery template→ Cloud Pub/Sub input topic, enter projects/pubsub-public-data/topics/taxirides-realtime → Under BigQuery output table, enter <myprojectid>:taxirides.realtime → Under Temporary Location, enter gs://<mybucket>/tmp/

No alt text provided for this image

This is how your pipeline will look like post processing:

No alt text provided for this image

STEP 6: Analyzing Taxi data using BigQuery

Go to GCP console→ Select BigQuery

Query-1 to verify your data is available in BigQuery

SELECT * FROM taxirides.realtime LIMIT 10

Query-2 is the query for reporting

No alt text provided for this image

Here is how your query output looks

No alt text provided for this image

STEP-7 Real Time Dashboard using DataStudio

And here we go...this is how you can see the real time statistics of total rides vs Total Passengers vs Total Revenue in the dashboard. Congratulation.... you are able to view the real time statistics of you taxi company.

No alt text provided for this image

Keep Learning..Keep Sharing..

Why do you group by minute and timestamp, and then again group in the report? Why do you sort by record number? Makes dashboard sort meaningless. Unfortunately DataStudio can’t show real time data as its max refresh rate is 15 minutes.

where is the part you feed the data to the topic?

Like
Reply

To view or add a comment, sign in

More articles by Rahul Kumar

Others also viewed

Explore content categories