Focusing on implementing govt policies using the big data tool Zeppelin

Deepak Singhvi

Published May 15, 2016

It was good to learn that the government has published a large amount of data, collected over time, at https://data.gov.in/

I picked the amenities data about villages from https://data.gov.in/catalog/village-amenities-census-2011 to do some analysis.

I believe the government is doing sufficient analysis to find where, and with how much force, it should use its machinery to promote its schemes.

I have been doing some analysis using Apache Spark and the ecosystem around it, but I was interested in a quick visualization that would help me understand the data quickly. One option would be R, as I wanted to build the reports quickly. I explored some of the capabilities of R and Shiny apps in my earlier post on cluster analysis of banking data.

Recently I came to know about a fantastic tool: a web-based notebook with built-in support for Apache Spark and for multiple languages such as Scala, Python, and Spark SQL. Most importantly, it is open source.

"Zeppelin"

I picked one of the CSV files from the whole data set, the one for Gulbarga, a district in Karnataka state, and started doing some analysis.

Loading the data into the dataframe/table.
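The original screenshot of this step is not reproduced here, so the following is only a minimal sketch of the idea in plain Python. In Zeppelin the same step would normally be a Spark paragraph that reads the CSV into a DataFrame and registers it as a table. The column names and the sample rows are assumptions made up for illustration; they are not the real headers or values of the census file.

```python
import csv
import io

# Hypothetical sample standing in for the Gulbarga CSV. The column names and
# the values are made up for illustration and are not real census figures.
SAMPLE_CSV = """village_name,total_population,male_population,female_population
Village A,1200,640,560
Village B,450,230,220
Village C,3100,1500,1600
"""

NUMERIC_COLUMNS = ("total_population", "male_population", "female_population")


def load_villages(source):
    """Read CSV rows into a list of dicts, converting the count columns to int."""
    rows = []
    for record in csv.DictReader(source):
        for column in NUMERIC_COLUMNS:
            record[column] = int(record[column])
        rows.append(record)
    return rows


villages = load_villages(io.StringIO(SAMPLE_CSV))
print(len(villages), "villages loaded")
```

To try this on a downloaded file, pass an open file handle instead of the in-memory string, after adjusting the column names to match the real header row.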

It is also easy to use Spark SQL in the notebook's paragraphs (sections).

The following is a very simple query to show the population spread in the villages of Gulbarga district.
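The query itself was shown as a screenshot that is not available here. As one possible reading of "population spread", the sketch below counts villages per population band. It uses Python's built-in sqlite3 module as a stand-in so that it runs without a Spark cluster; the table name, column names, band boundaries, and rows are all assumptions for illustration, not the author's actual query or data.

```python
import sqlite3

# Made-up rows for illustration only (not real census values). The table and
# column names are assumptions, not the names used in the published CSV.
SAMPLE_ROWS = [
    ("Village A", 1200),
    ("Village B", 450),
    ("Village C", 3100),
    ("Village D", 800),
]

# Count villages in each (arbitrarily chosen) population band.
POPULATION_SPREAD_SQL = """
SELECT CASE
         WHEN total_population < 500  THEN 'under 500'
         WHEN total_population < 2000 THEN '500 to 1999'
         ELSE '2000 and above'
       END AS population_band,
       COUNT(*) AS village_count
FROM villages
GROUP BY population_band
ORDER BY MIN(total_population)
"""


def population_spread(rows):
    """Load rows into an in-memory table and return (band, count) pairs."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE villages (village_name TEXT, total_population INTEGER)"
        )
        conn.executemany("INSERT INTO villages VALUES (?, ?)", rows)
        return conn.execute(POPULATION_SPREAD_SQL).fetchall()
    finally:
        conn.close()


spread = population_spread(SAMPLE_ROWS)
for band, count in spread:
    print(band, count)
```

In a Zeppelin SQL paragraph, a result of this shape (one label column, one count column) is the kind of output that the built-in bar and pie charts can display directly.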

The government makes policies, spends money on them, and judges their effectiveness by the results. We can use the collected data to understand where the schemes should penetrate most, i.e. find the villages that need the government schemes most. For example, where the government initiates policies to reduce the gap in the male-female ratio, the available data can show where the focus should be greater.

I changed minbenchmark to 80% and the result updated on the fly.

I have started analysing this data to check the education facilities in the villages. That work is in progress, and I will publish it in later posts.
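The paragraph behind this result is not shown, so the following is a sketch under stated assumptions rather than the original logic. It assumes that minbenchmark is a percentage threshold on the number of females per 100 males, and that villages falling below it are the ones to focus on. The field names and sample rows are hypothetical. In Zeppelin, a value like this can be exposed as a dynamic form input, which would explain the on-the-fly update.

```python
# Made-up rows for illustration only (not real census values).
SAMPLE_VILLAGES = [
    {"village_name": "Village A", "male_population": 640, "female_population": 560},
    {"village_name": "Village B", "male_population": 500, "female_population": 390},
    {"village_name": "Village C", "male_population": 1500, "female_population": 1600},
    {"village_name": "Village D", "male_population": 0, "female_population": 0},
]


def villages_below_benchmark(villages, min_benchmark):
    """Return (name, ratio) pairs whose females-per-100-males ratio is below
    min_benchmark, sorted with the lowest ratio first."""
    flagged = []
    for village in villages:
        males = village["male_population"]
        if males == 0:
            continue  # ratio is undefined for a village with no recorded males
        ratio = 100.0 * village["female_population"] / males
        if ratio < min_benchmark:
            flagged.append((village["village_name"], round(ratio, 1)))
    return sorted(flagged, key=lambda item: item[1])


# Changing the benchmark changes which villages are flagged.
below_80 = villages_below_benchmark(SAMPLE_VILLAGES, 80)
below_90 = villages_below_benchmark(SAMPLE_VILLAGES, 90)
print(below_80)
print(below_90)
```

Keeping the threshold as a function argument mirrors the notebook workflow: the data stays loaded, and only the cutoff is re-evaluated when the input changes.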

Installation details:

a) For this analysis, Zeppelin was deployed on Ubuntu in VirtualBox, with Windows as the host.

b) Set JAVA_HOME (Java 1.7) before starting Zeppelin.

c) To start or stop Zeppelin, run 'zeppelin-daemon.sh start' or 'zeppelin-daemon.sh stop' respectively from ZEPPELIN_HOME/bin.

Sunay Sp 9y

I recently worked with Scalding. Zeppelin supports Scalding as well. We used it to test the code on a small part of the data, not really for reporting.

Deepak Singhvi 9y

I have not done that with Zeppelin yet, but I have done it otherwise, and I think it's possible with Zeppelin too. We can use an AngularJS component with Zeppelin, e.g. Sunburst: https://bl.ocks.org/kerryrodden/477c1bfb081b783f80ad

Anthony Loupos 9y

Hi Deepak! Have you found a way to link visuals to simulate drill down etc?

Deepak Singhvi 9y

thank you. :-)

AJAY SURANA 9y

Is it useful for non-software people?
