Data Science Resources for Beginners
In my coursework and projects I have had to search for many resources (practice datasets, intuitive explanations of statistical and ML models, coding tutorials), and I would like to share the best ones with anyone looking to build on skills acquired from coursework. While 'real-world' data science is a tad more complex (for a start, you have to build your own modeling dataset accurately), these resources should be excellent preparation.
These days, anyone with no knowledge of data science can look up a few lines of code on the internet and build a model. What matters is understanding the math behind the model well enough to adapt it to different business situations, or at least understanding it intuitively enough to explain the output. I will share some great learning resources, found after a lot of research, that blend intuition, theory and application, as well as some great ways to practice these skills.
Learning:
Decision trees (a 'classification' algorithm) are a great place to start. They are easily interpretable, not very math-heavy, and a gateway to much more powerful algorithms. Because a tree is not a 'black box' model, it is useful even when the goal is not prediction: it can be used for segmentation, for example breaking the millions of customers of a credit card company into segments with higher or lower than average borrowing need.
The following tutorial provides an intuitive explanation and visualization of decision trees. As a bonus, it teaches R and builds on trees to introduce more powerful algorithms and concepts like random forest and feature engineering:
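To make the segmentation idea concrete, here is a minimal sketch in Python using scikit-learn. Everything in it is an assumption for illustration: the customer data is synthetic, the feature names (income, utilization) are made up, and the rule used to label customers as 'high borrowing need' is invented. It only shows how a shallow tree turns into readable rules and how each leaf can be treated as a segment.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic, made-up customer data (not from any real card issuer).
rng = np.random.default_rng(0)
n = 1000
income = rng.uniform(20, 150, n)      # hypothetical annual income, in thousands
utilization = rng.uniform(0, 1, n)    # hypothetical share of credit limit in use

# Invented labeling rule, used only to generate example targets.
high_need = ((utilization > 0.6) & (income < 80)).astype(int)

X = np.column_stack([income, utilization])
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, high_need)

# Print the learned splits as readable if/else rules.
rules = export_text(tree, feature_names=["income", "utilization"])
print(rules)

# Each leaf is a segment: compare its high-need rate with the overall rate.
leaf_ids = tree.apply(X)
overall_rate = high_need.mean()
for leaf in np.unique(leaf_ids):
    mask = leaf_ids == leaf
    print(f"segment {leaf}: size={mask.sum()}, "
          f"high-need rate={high_need[mask].mean():.2f} "
          f"(overall {overall_rate:.2f})")
```

Keeping the tree shallow (max_depth=2 here) is a deliberate choice: a handful of segments with clear rules is usually easier to explain than a deep tree, at the cost of some accuracy on messier data.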
While we are on the topic of Random Forests, see below for another intuitive explanation:
Like decision trees, regression is a good model to start with. While its full interpretation can be complex, only part of it is needed for prediction purposes. For predicting a numerical output, the following tutorial is an excellent guide, blending intuition, math and application (in Python):
https://www.analyticsvidhya.com/blog/2017/06/a-comprehensive-guide-for-linear-ridge-and-lasso-regression/
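As a rough companion to that guide, the sketch below fits linear, ridge and lasso regression with scikit-learn and compares their test error. It is not taken from the tutorial: the data is synthetic, only the first three of ten features actually drive the target, and the alpha values are arbitrary choices that would normally be tuned.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: 10 features, but only the first 3 influence the target.
rng = np.random.default_rng(42)
n_samples, n_features = 200, 10
X = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:3] = [3.0, -2.0, 1.5]
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Alpha values are illustrative assumptions, not recommendations.
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    n_zero = int(np.sum(model.coef_ == 0))  # coefficients set exactly to zero
    results[name] = (mse, n_zero)
    print(f"{name}: test MSE={mse:.3f}, zero coefficients={n_zero}")
```

The thing to look at is the count of zero coefficients: lasso can drop features entirely, while ridge only shrinks them toward zero, which is why lasso is often used as a simple form of feature selection.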
Practice:
Armed with this knowledge, it is a good time to start some hands-on work. Kaggle competitions are a good bridge between coursework and what is required in a job. They offer messy, real-world datasets (the only difference from the real world being that these datasets don't need to be built from scratch).
The following is a very good first competition to take part in to practice classification (predicting 0's and 1's):
For numeric prediction, see the competition below and the great data preprocessing code below it:
These resources should help solidify and refine the skills of someone graduating with a data science degree and prepare them to tackle their first real-world project.