Create a Real-time Dashboard over a Data Stream using Google Pub/Sub, BigQuery, Dataflow & DataStudio
There are two ways of processing Big Data, that is, data whose volume, velocity, and variety are too great to handle with traditional approaches:
1- Batch data
Batch data means historical data sitting in a data warehouse, which you ingest into, say, a machine learning model to analyze and extract information from. This is the typical case of data at rest.
2- Streaming Data
Streaming data is a continuous inflow of data that is processed quickly in order to get real-time insights. Simply put, this is data in motion.
For example, services such as Zomato, Ola, and Uber deploy more resources in real time based on the number of incoming requests; another example is monitoring the throughput of a power grid.
Let's take a use case:
You own a cab service in New York City and want to know how your business is doing in real time.
Source of data: New York City Open Data ( https://opendata.cityofnewyork.us/)
STEP 1: Connect a streaming data topic with Cloud Pub/Sub
Let's understand Cloud Pub/Sub.
Pub stands for publisher and sub for subscriber. Publishers and subscribers connect with each other through a shared string called a topic. A publisher application creates the topic and sends messages to it, and a subscriber application creates a subscription to the topic to receive messages from it. A single topic can therefore have many publishers and many subscribers.
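The publisher/subscriber relationship can be sketched with the gcloud CLI. The topic and subscription names below are illustrative only, not part of this tutorial's pipeline (which uses an existing public topic):

```shell
# Names here are hypothetical, just to show the moving parts.
TOPIC=demo-topic
SUB=demo-sub

# A publisher application creates the topic and sends messages to it.
gcloud pubsub topics create "$TOPIC"
gcloud pubsub topics publish "$TOPIC" --message='{"ride_id": "example"}'

# A subscriber application attaches a subscription to the same topic
# and pulls messages from it.
gcloud pubsub subscriptions create "$SUB" --topic="$TOPIC"
gcloud pubsub subscriptions pull "$SUB" --auto-ack --limit=1
```

Because the subscription, not the subscriber, tracks delivery, several subscribers can share one subscription or each attach their own to the same topic.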
Go to APIs & Services → Enable the Cloud Pub/Sub API & the Dataflow API
We are using Google BigQuery, which is a serverless data warehouse; tables in BigQuery are organized into datasets.
STEP 2: Create the taxirides BigQuery dataset. In the Google Cloud console, open Cloud Shell and type
$ bq mk taxirides
STEP 3: Create the taxirides.realtime table
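The original schema definition did not survive in this copy of the article. Based on the fields published by the public taxirides-realtime feed, the table creation likely looked something like this (the column list is an assumption; adjust it if your source differs):

```shell
# Schema assumed from the fields of the public taxirides-realtime feed.
SCHEMA="ride_id:string,point_idx:integer,latitude:float,longitude:float,timestamp:timestamp,meter_reading:float,meter_increment:float,ride_status:string,passenger_count:integer"

# Create the table, partitioned on the timestamp column.
bq mk --time_partitioning_field timestamp -t taxirides.realtime "$SCHEMA"
```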
Here taxirides is the dataset and realtime is the table. The table schema includes a timestamp column, which we use as the partitioning field. Your dataset is now ready. The pipeline also needs somewhere to stage temporary files, and a Cloud Storage bucket serves that purpose.
STEP 4: Create a storage bucket
In the GCP Console → Storage → CREATE BUCKET → Name (your project id) → Default Storage Class = Multi-Regional → choose the location nearest to you → CREATE
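The same bucket can be created from Cloud Shell instead of the console. `DEVSHELL_PROJECT_ID` is set automatically inside Cloud Shell; the fallback name below is just a placeholder:

```shell
# Bucket name assumed to be your project id, as in the console steps.
PROJECT_ID="${DEVSHELL_PROJECT_ID:-my-project-id}"

# Make a Multi-Regional bucket in the US.
gsutil mb -c multi_regional -l US "gs://${PROJECT_ID}/"
```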
STEP 5: Set up the Cloud Dataflow pipeline
Cloud Dataflow is a serverless way to move and transform data for analysis.
How?
In the GCP console → Dataflow → CREATE JOB FROM TEMPLATE → enter a job name → under Cloud Dataflow template, select the Cloud Pub/Sub Topic to BigQuery template → under Cloud Pub/Sub input topic, enter projects/pubsub-public-data/topics/taxirides-realtime → under BigQuery output table, enter <myprojectid>:taxirides.realtime → under Temporary Location, enter gs://<mybucket>/tmp/
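The console steps above can also be done from the command line with the Google-hosted Pub/Sub-to-BigQuery template. The project id and bucket here are placeholders, as in the console instructions:

```shell
# Launch the same Pub/Sub Topic to BigQuery template job via the CLI.
PROJECT_ID="${DEVSHELL_PROJECT_ID:-my-project-id}"
BUCKET="gs://${PROJECT_ID}"
TOPIC="projects/pubsub-public-data/topics/taxirides-realtime"

gcloud dataflow jobs run taxirides-stream \
  --gcs-location gs://dataflow-templates/latest/PubSub_to_BigQuery \
  --staging-location "${BUCKET}/tmp/" \
  --parameters \
inputTopic="$TOPIC",outputTableSpec="${PROJECT_ID}:taxirides.realtime"
```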
This is how your pipeline will look once it is processing data:
STEP 6: Analyze taxi data using BigQuery
Go to the GCP console → select BigQuery
Query 1 verifies that your data is arriving in BigQuery:
SELECT * FROM taxirides.realtime LIMIT 10
Query 2 is the query used for reporting.
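The reporting query itself is missing from this copy of the article. Judging from the dashboard fields (total rides, total passengers, total revenue), it was along these lines; this is a reconstruction, not the author's exact SQL, and the column names assume the public taxirides-realtime schema:

```shell
# Reconstructed per-minute reporting query over completed rides.
SQL='
SELECT
  FORMAT_TIMESTAMP("%R", timestamp) AS minute,
  COUNT(DISTINCT ride_id) AS total_rides,
  SUM(passenger_count) AS total_passengers,
  SUM(meter_reading) AS total_revenue
FROM taxirides.realtime
WHERE ride_status = "dropoff"
GROUP BY minute
ORDER BY minute DESC
LIMIT 60'

bq query --use_legacy_sql=false "$SQL"
```

Grouping only on the truncated minute keeps one row per minute, which is what a time-series chart in the dashboard expects.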
Here is how your query output looks:
STEP 7: Real-time dashboard using DataStudio
And here we go... this is how you can see the real-time statistics of total rides vs. total passengers vs. total revenue in the dashboard. Congratulations! You can now view the real-time statistics of your taxi company.
Keep learning... keep sharing...