The data perspective
This post is going to be about data, or, more precisely, about the lack of data. I first thought of calling it “The Poverty of Data”, but then worried that it would first attract, and then disappoint, economists. It was therefore perhaps best to leave the matter to Amartya Sen, or, now, Abhijit Banerjee.
I then considered “The Paucity of Data”. This certainly seemed less pompous, but what exactly is the difference between poverty and paucity? Or between possibilities and probabilities?
--
Every time you have to explain subtle differences, cricket can help you out. In the 1960s, and indeed even the 1970s, cricket data was what you recorded in the cricket score sheet. This sheet was quite admirable: there was information about every scoring shot by the batsman, and every ball bowled by the bowler. The sheet could tell you that Sunil Gavaskar hit a 4, it could tell you that Chris Old was hit for a 4, but it couldn’t tell you that Sunil Gavaskar hit Chris Old for a 4.
In simple terms, the data did not connect the batsman to the bowler on a ball-by-ball basis, even though cricket is essentially an agglomerate of what happens every time a bowler bowls to the batsman. If the data fails to explicitly tell you this then it is data poverty.
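For readers who think in code, the difference can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `Ball` record and the `runs_by` helper are hypothetical names, and the deliveries are made up rather than taken from any real score sheet. The point is that once each ball is stored as one record naming both bowler and batsman, the two separate tallies of the old sheet can be derived from it, and so can the match-up that the old sheet could never give you.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Ball:
    """One delivery, recorded as a single unit (hypothetical record layout)."""
    bowler: str
    batsman: str
    runs: int


# Made-up deliveries, purely for illustration (not from a real match).
balls = [
    Ball("Old", "Gavaskar", 4),
    Ball("Old", "Gavaskar", 0),
    Ball("Old", "Other batsman", 1),
    Ball("Other bowler", "Gavaskar", 4),
    Ball("Other bowler", "Other batsman", 0),
]


def runs_by(balls, key):
    """Total the runs per group, where key(ball) names the group."""
    totals = Counter()
    for ball in balls:
        totals[key(ball)] += ball.runs
    return totals


# The two views the old score sheet kept, each on its own.
batsman_totals = runs_by(balls, lambda b: b.batsman)
bowler_totals = runs_by(balls, lambda b: b.bowler)

# The view it could not give: who scored how much off whom.
matchup_totals = runs_by(balls, lambda b: (b.batsman, b.bowler))

print(batsman_totals["Gavaskar"])           # runs scored by Gavaskar
print(bowler_totals["Old"])                 # runs conceded by Old
print(matchup_totals[("Gavaskar", "Old")])  # runs Gavaskar scored off Old
```

The derivation only goes one way. You can always add up ball-by-ball records to get the two tallies, but no amount of arithmetic on the tallies will give you back the match-up.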
Let’s continue with the cricket example and talk of Bapu Nadkarni. Bapu, who passed away recently, is best known for his spell of 32-27-5-0 in the first test at Madras against the visiting MCC team in 1964. The score sheet was capable of counting his 27 maiden overs, but couldn’t tell you how many of these 162 dot balls were pitched outside the off or leg stump. With such data one could’ve guessed Bapu’s (and his captain Pataudi’s) bowling strategy better. That’s data paucity for me. Data poverty comes from poor data design while data paucity is often due to poor data recording practices.
When you capture data better, you improve the possibility of learning and inference. I consider the IPL to be cricket’s greatest data learning laboratory: who is the best end-over bowler, which batsmen would you pick to bat between overs 8-15, how much does dot ball pressure influence performance, why would you always ask Parthiv Patel to open the innings?
--
All these data ruminations are the outcome of my visit to Zeal’s Dnyanganga English Medium School in Pune last Saturday. I was part of a team looking at about 75 S&T projects put together by students of classes 6-9 from different schools in their Atal Tinkering Labs.
I feel ashamed to admit that I had known nothing about this Government of India initiative to promote the spirit of innovation and entrepreneurship in our school kids; there are currently over 1500 schools in the country with Atal Tinkering Labs, and, if the smiling faces of the children are any indication, the idea has truly taken off.
The school teachers told me how much the kids were enjoying the experience. “They don’t enjoy the classes as much as their visits to the Lab”, they told me (is there a story there?). The projects deal with specific themes: water management, healthcare, renewable energy etc.
The key idea in many of the projects was to build some response system based on sensors: e.g., build a walking stick for a blind man to cross the road, or sense water levels in dams to trigger a suitable alarm, or train a robot to clean footpaths, or help deliver medicines to a physically challenged elder … and so on. In most cases the communication was via a hand-held device, often a smartphone. The students said that the project idea came from a parent, a teacher, or a YouTube video.
Talking to students who had built an alarm system, I asked if it was possible to count the number of alarms generated in 24 hours, or indeed to ascertain how many of the alarms were false positives. I also asked students measuring water levels in a dam if they could create a chart to record water levels on every day of August.
In every case the reaction was: “Oh, we never thought of that!”. I can hardly blame school children for not thinking of the data aspect. But their teachers and mentors certainly should. I worry that data is still not considered an integral part of an innovative endeavour, and is often an annoying afterthought.
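None of what I asked for needs much more than a timestamp stored with every event. Here is a rough Python sketch of what I had in mind; it is not what any of the student projects actually did. The `AlarmEvent` record, its `confirmed` flag (someone has to check whether each alarm was genuine), the two helper functions and all the sample numbers are my own assumptions, made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class AlarmEvent:
    """One alarm, logged when it fires (hypothetical record layout)."""
    timestamp: datetime
    confirmed: bool  # assumption: someone later marks whether it was genuine


def alarms_in_last_24h(events, now):
    """Return (number of alarms, number of false alarms) in the 24 hours before now."""
    cutoff = now - timedelta(hours=24)
    recent = [e for e in events if cutoff < e.timestamp <= now]
    false_alarms = sum(1 for e in recent if not e.confirmed)
    return len(recent), false_alarms


def daily_max_level(readings):
    """readings: iterable of (datetime, level in metres) -> {date: highest level}."""
    by_day = {}
    for ts, level in readings:
        day = ts.date()
        by_day[day] = max(level, by_day.get(day, level))
    return by_day


# Made-up sample data, purely for illustration.
now = datetime(2019, 8, 3, 12, 0)
events = [
    AlarmEvent(datetime(2019, 8, 3, 9, 0), confirmed=True),
    AlarmEvent(datetime(2019, 8, 2, 18, 0), confirmed=False),
    AlarmEvent(datetime(2019, 8, 2, 13, 0), confirmed=False),
    AlarmEvent(datetime(2019, 8, 1, 10, 0), confirmed=True),  # older than 24 hours
]
readings = [
    (datetime(2019, 8, 1, 6, 0), 41.2),
    (datetime(2019, 8, 1, 18, 0), 41.5),
    (datetime(2019, 8, 2, 6, 0), 41.4),
]

total, false_count = alarms_in_last_24h(events, now)
levels = daily_max_level(readings)

print(total, false_count)
for day in sorted(levels):
    print(day, levels[day])
```

The daily table is already the chart I asked for, in raw form; plotting it is the easy part once the readings have been kept.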
We’re supposed to be in the middle of a data revolution: we’re talking big data, machine learning, artificial intelligence … but nothing is really possible without data. It is actually a cultural thing: when college students do a lab project on machine learning, the data is almost never from India.
We have to quickly graduate from data poverty and data paucity to data possibilities. And only then will we be ready to calculate data probabilities. To start with, let’s tell every kid in every school that every device they build must also be seen as the provider of data, and thereby of knowledge, intelligence and innovation. This will make the wonderful idea of Atal Tinkering Labs even more atal.
Very interesting and important point. Data is the crux and we cannot ignore or neglect that. Thought provoking.
You have made a very important point! Our culture is such that, arguably, we often falter on the fundamentals. Having said that, designing a data collection strategy from the start, for the numerous questions that could be answered by analysing that data, may not be an easy job. Data collection must be part of any system design.
Hi Srinivas, Your article is a discovery by serendipity. I enjoy reading your style of explaining. How are you? We lost touch with each other. Thanks to your blog that helped in rediscovery! Sathya
This is the second level, sir. First we helped them think about practical applications of science here, rather than science practicals only. The next level is towards that only. In fact data analysis was part of the submissions, but not all of them.
Very true... We see the impact of paucity of data at our work almost every time.