First steps into Data Science

In a nutshell:

I used K-Means clustering to find out how a large and very diverse bunch of students can be divided into groups, and what each group would look for in student accommodation. I also used K-Means to classify the currently available housing options, to work out which kind of student should pick which option.

You can find the entire project here

The Problem:

Simply put, the problem is this:

“How can one identify trends in an individual’s* daily routine and leverage them?”

The devil is always in the Data:

I used a variety of sources, chief among them the Foursquare database, accessed via its API:

What's a perk anyway?

As can be seen, it was very messy and not usable out of the box. A lot of cleaning was needed before it was in a usable state.

Another dataset I used was “Food choices” by BoraPajo. Since it was assembled by someone already involved in Data Science (it is hosted on Kaggle, after all), it was much easier to use, and I simply needed to slice out the data relevant to me.

To clean the Foursquare data to a usable state, I dropped all irrelevant fields like categories, hasPerk, etc. and retained only the location name, address, latitude and longitude. The problem with Foursquare data is that the API is very finicky to query: depending on the search word, a query would sometimes return 2 locations and sometimes 50. The results varied wildly with minute changes.
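The cleaning step described above can be sketched roughly as follows. This is a minimal sketch, not the project's actual code: the sample records, their nested `location` layout, and the helper names `clean_venue` and `clean_venues` are all assumptions made for illustration.

```python
# Hypothetical sample shaped like a venue search result. The venue names,
# coordinates and field layout are assumptions, not data from the project.
raw_venues = [
    {
        "name": "Campus Canteen",
        "categories": [{"name": "Cafeteria"}],
        "hasPerk": False,
        "location": {"address": "Main Road", "lat": 12.9716, "lng": 77.5946},
    },
    {
        "name": "Fresh Mart",
        "categories": [{"name": "Grocery Store"}],
        "hasPerk": False,
        "location": {"lat": 12.9720, "lng": 77.5950},  # address missing
    },
]


def clean_venue(venue):
    """Keep only name, address, latitude and longitude; drop everything else."""
    location = venue.get("location", {})
    return {
        "name": venue.get("name"),
        "address": location.get("address"),  # None when no address is present
        "lat": location.get("lat"),
        "lng": location.get("lng"),
    }


def clean_venues(venues):
    """Clean every venue and skip entries without usable coordinates."""
    cleaned = [clean_venue(v) for v in venues]
    return [v for v in cleaned if v["lat"] is not None and v["lng"] is not None]


cleaned_venues = clean_venues(raw_venues)
for row in cleaned_venues:
    print(row)
```

Keeping entries with a missing address but dropping those without coordinates is a design choice for this sketch: the later steps only need a point on the map.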

I had to identify the top categories that represented food and groceries, as well as restaurants, which were amusingly overshadowed by bus stops in the initial queries.

Gathering insights

The first thing I did to understand the students' data was draw a boxplot:

Looking at this, I immediately realised a few things:

  1. In general, students tend to cook now and then, and eat out on the days they don’t.
  2. Ethnic food is enjoyed by a wide range of students.
  3. Nearly everyone exercises daily.
  4. Fruits and vegetables figure pretty high on the food list for students.
  5. Income is high for most students in this set.
  6. A very high number of students stay on campus.
  7. On average, students are willing to pay roughly $25 for a meal. (Very high by Indian standards!)

Of course, since this graph is coded on BoraPajo's own design, you'll have to go to their page and read the code book to figure out the rest yourself!

Next, I used K-means clustering and plotted another boxplot to gain insight into how these students might be divided:

After a lot of time spent, and a lot of tinkering with the number of clusters (the K-value), I gathered that:

K-means Clustering on Student Data

  • One cluster of high-income students seems to eat out more often and spend more per meal, cares less about fruits, vegetables and ethnic food, and stays almost exclusively on campus.
  • The second cluster of high-income students seems to eat out less, pay less per meal, and is more likely to stay off campus.
  • The cluster of low-income students eats out less and cooks more often than the high-income groups. They eat as many vegetables as the second cluster but fewer fruits (perhaps because fruits are more expensive than vegetables?) and are the most likely to stay off campus.
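To make the "tinkering with K" concrete, here is a dependency-free sketch of Lloyd's algorithm, the standard K-means procedure, run for several values of K. It is not the project's code: in practice a library such as scikit-learn's `KMeans` would normally be used, and the student rows, feature choice and function names below are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=100, seed=0):
    """Lloyd's algorithm. Returns (centroids, labels, inertia)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    inertia = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, inertia


# Hypothetical students as (income code, meals eaten out per week).
students = [
    (5.0, 6.0), (5.5, 7.0), (6.0, 6.5),
    (5.0, 2.0), (5.5, 1.5), (6.0, 2.5),
    (1.0, 1.0), (1.5, 2.0), (2.0, 1.5),
]

# Try several K values, keeping the best of a few random restarts for each.
for k in range(1, 6):
    runs = [k_means(students, k, seed=s) for s in range(10)]
    best_inertia = min(run[2] for run in runs)
    print(k, round(best_inertia, 2))
```

Inertia (the total squared distance from each point to its centroid) always falls as K grows, so one common heuristic is to pick the K after which the drop flattens out. Features on very different scales would usually be standardised first, which this sketch skips.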

Alright, now that we know a fair amount about these students, let's find them homes!

I queried the Foursquare database for possible locations where student accommodation could be set up, and got results in the same rough format as before. Next, I plotted a map using Folium:

Then, I used K-means again and clustered the locations into 3 (the magic number again!) clusters:

K-means Clustering on Location Data

Three prominent clusters emerged after applying the method on the data:

  • Cluster 0 (Green): Both groceries (fruits and vegetables) and restaurants are abundant.
  • Cluster 1 (Yellow): Restaurants are plentiful, but groceries less so.
  • Cluster 2 (Red): Restaurants and groceries are relatively hard to find.
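The post does not say which features were fed to K-means for the locations. One plausible input, assumed here purely for illustration, is a count of grocery stores and restaurants within walking distance of each candidate site. The sketch below computes such counts with the haversine great-circle formula; the coordinates, the `kind` labels, the 1 km radius and the function names are all hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def count_nearby(site, venues, radius_km=1.0):
    """Count grocery and restaurant venues within radius_km of a candidate site."""
    counts = {"grocery": 0, "restaurant": 0}
    for venue in venues:
        distance = haversine_km(site["lat"], site["lng"], venue["lat"], venue["lng"])
        if distance <= radius_km and venue["kind"] in counts:
            counts[venue["kind"]] += 1
    return counts


# Hypothetical candidate site and venues (not real places).
site = {"lat": 12.9700, "lng": 77.5900}
venues = [
    {"kind": "restaurant", "lat": 12.9750, "lng": 77.5900},  # roughly 0.56 km away
    {"kind": "grocery", "lat": 12.9705, "lng": 77.5900},     # very close
    {"kind": "restaurant", "lat": 13.0200, "lng": 77.5900},  # roughly 5.6 km away
    {"kind": "bus stop", "lat": 12.9701, "lng": 77.5900},    # ignored category
]

print(count_nearby(site, venues))
```

Under this assumption, a Green location would show high counts in both columns, a Yellow one a high restaurant count only, and a Red one low counts in both.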

Making sense of everything:

Ideally, as many students as possible should be housed at the Green (Cluster 0) locations, since both kinds of students can be catered to there, and obviously, unless students rent their own house, it’s very difficult to open new housing for just a few of them!

Another aspect to think of is cost. It is easy to notice that the further from the college and the closer to the city centre one gets, the more food options one finds. The same can be said of other amenities. However, the closer to the city centre, the more expensive property and the cost of living become. In reality, therefore, Cluster 1 locations might be better value for money.

Finally, Cluster 2 locations, while not ideal, offer the shortest travel times to college, and may be viable for students willing to compromise on food or make alternative arrangements. With the advent of food delivery apps, it is quite easy to get both groceries and prepared meals delivered, so a few of these locations might be classified as Yellow or Green depending on delivery coverage.

One thing I would like to note is that the Foursquare data seems incomplete; many locations seem to be missing or misclassified. India definitely needs better location datasets!

This was my first serious foray into Data Science, and I feel much wiser and more knowledgeable now that I've successfully finished this task. A Paulo Coelho quote comes to mind:

“People never learn anything by being told, they have to find out for themselves.”





