Data Science Resources for Beginners
KDnuggets

During my coursework and projects I have had to search for many resources (practice datasets, intuitive explanations of statistical and ML models, coding tutorials), and I would like to share the best ones with anyone looking to build on skills acquired from coursework. While 'real-world' data science is a tad more complex (for a start, you have to build your own modeling dataset accurately), these resources should be excellent preparation.

These days, anyone with no knowledge of data science can look up a few lines of code on the internet and build a model. What matters is understanding the math behind the model well enough to adapt it to different business situations, or at least understanding it intuitively enough to explain the output. Below I share some great learning resources, found after a lot of research, that blend intuition, theory and application, as well as some great ways to practice these skills.

Learning:

The decision tree (a 'classification' algorithm) is a great place to start. Decision trees are easily interpretable, not super math heavy, and a gateway to much more powerful algorithms. Because a tree is not a 'black box' model, it is useful even when the purpose is not prediction: it can be used for segmentation (say, breaking millions of customers of a credit card company into segments with higher or lower than average borrowing need).
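To make the intuition concrete, here is a minimal sketch of the single step a decision tree repeats at every node: trying candidate thresholds on a feature and keeping the one that leaves the purest groups, measured here by Gini impurity. The data, the variable names and the helper functions are all made up for illustration; a real library implementation handles many features, stopping rules and pruning.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best 'value <= threshold' split."""
    best_threshold, best_score = None, float("inf")
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        # Weight each side's impurity by its share of the rows.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Hypothetical toy data: monthly card spend (in $100s) and whether the
# customer's borrowing need is above average (1) or not (0).
spend = [2, 3, 4, 5, 11, 12, 14, 15]
high_need = [0, 0, 0, 0, 1, 1, 1, 1]

threshold, impurity = best_split(spend, high_need)
print(f"Split at spend <= {threshold}: weighted Gini = {impurity}")
```

The resulting rule ("spend at or below the threshold" versus "above it") is exactly the kind of readable segment definition that makes trees useful for segmentation.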

The following tutorial provides an intuitive explanation and visualization of decision trees. As a bonus, it teaches R and builds on trees to introduce more powerful algorithms and concepts such as random forests and feature engineering:

While we are on the topic of Random Forests, see below for another intuitive explanation:

Like decision trees, regression is a good model to start with. While its full interpretation can be complex, only a portion of it is needed for prediction purposes. For predicting numerical output, the following tutorial is an excellent guide that blends intuition, math and application (in Python):

https://www.analyticsvidhya.com/blog/2017/06/a-comprehensive-guide-for-linear-ridge-and-lasso-regression/
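As a rough illustration of the idea behind linear and ridge regression (a sketch of my own, not code from the guide above), the one-feature case has a simple closed form: the slope is the covariance term divided by the variance term, and ridge adds a penalty to the denominator, which shrinks the slope toward zero. The function name `fit_line`, the `ridge_penalty` parameter and the data are assumptions made for this example.

```python
def fit_line(x, y, ridge_penalty=0.0):
    """Fit y = intercept + slope * x by least squares.

    With ridge_penalty > 0 the slope (not the intercept) is penalized,
    which shrinks it toward zero.
    """
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / (sxx + ridge_penalty)
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical data lying exactly on the line y = 2x + 1.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

ols_slope, ols_intercept = fit_line(x, y)
ridge_slope, ridge_intercept = fit_line(x, y, ridge_penalty=10.0)

print("OLS:  ", ols_slope, ols_intercept)
print("Ridge:", ridge_slope, ridge_intercept)
```

Plain least squares recovers the line used to generate the data, while the ridge fit has a smaller slope. With a single feature the shrinkage only costs accuracy; it tends to pay off when there are many correlated features.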


Practice:

Armed with this knowledge, it is a good time to start some hands-on work. Kaggle competitions are a good bridge between coursework and what is required in a job. They offer messy, real-world datasets (the only difference from real-world work being that these datasets don't need to be built from scratch).

The following is a very good first competition for practicing classification (predicting 0s and 1s):

For numeric prediction, see the competition below and the great data preprocessing code below it:
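For a flavor of what preprocessing a messy dataset involves, here is a small sketch of two common steps: filling missing numeric values with the median and turning a categorical column into 0/1 indicator columns. It is not the preprocessing code referred to above; the column names, values and helper functions are hypothetical.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def one_hot(categories):
    """Turn a list of category labels into 0/1 indicator columns."""
    levels = sorted(set(categories))
    return {level: [1 if c == level else 0 for c in categories] for level in levels}


# Hypothetical messy columns, not taken from any specific competition.
age = [34, None, 52, 41, None]
city = ["NY", "SF", "NY", "NY", "SF"]

age_clean = impute_median(age)
city_encoded = one_hot(city)

print(age_clean)
print(city_encoded)
```

In practice a data frame library would do this in a line or two, but writing it out once makes it clearer what those library calls are doing.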

These resources should help solidify and refine the skills of someone graduating with a data science degree and prepare them to tackle their first real-world project.



By Mehdi Mujtaba