Machine Learning - Lessons Learned as an Intern

My Background

I am a junior at Rutgers University, majoring in computer science and minoring in data science. I have completed courses in AI, deep learning, data science, and related topics. I am also Python certified and have spent many years working with Java and C. I have been learning computer science since my freshman year of high school, starting with Java, and have been building my knowledge of other languages and tools ever since. Data analytics has always interested me because of how much information can be drawn from a single data set. When I started taking data science and machine learning classes at Rutgers, I was eager to learn how machine learning differs from traditional programming, and how it could help me analyze data for small businesses, the stock market, and sports.

I was fortunate to intern in AI and machine learning at a startup focused on AI and data science. The problem the team was brainstorming and building a solution for required applying machine learning models to make predictions from the data we received. This inspired me to write about my understanding and experience, and to explain the fundamentals of a machine learning application and how it differs from traditional programming.

This article describes what I learned and how I understand it.

Traditional Programming vs. Machine Learning Programming Pattern

In a broad sense, the traditional programming pattern follows the principle associated with the mathematician John von Neumann, often called the father of the modern computer. The principle consists of three main building blocks: input, processing, and output. The following diagram depicts them:

[Diagram: input, processing, and output units]

  • Input unit - collects the data and the program statements needed to build the logic for the desired output.
  • Processing unit - executes the program statements on the input data and produces results in the specified output format.
  • Output unit - presents the results on an output device in the specified format.
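
As a minimal sketch of this input-processing-output pattern, the Python snippet below hard-codes the decision logic. The function name `flag_high_glucose` and the cut-off of 140 are hypothetical, chosen only for illustration; they are not a medical rule and are not taken from the internship project. The readings are the glucose values from the sample rows later in this article.

```python
# Traditional programming: the programmer writes the rule by hand.
GLUCOSE_THRESHOLD = 140  # hypothetical cut-off, for illustration only


def flag_high_glucose(glucose: float) -> int:
    """Processing step: apply a fixed, hand-written rule to one input."""
    return 1 if glucose >= GLUCOSE_THRESHOLD else 0


# Input: data supplied to the program.
readings = [148, 85, 183, 89, 137]

# Processing: the same rule is applied to every input.
flags = [flag_high_glucose(g) for g in readings]

# Output: results written to an output device (here, the screen).
print(flags)  # [1, 0, 1, 0, 0]
```

The point of the sketch is that the rule never changes unless a programmer edits it, no matter how much data passes through.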


The machine learning programming pattern is data driven. It starts by dividing the available data into training and test sets (commonly 80%/20%). For this data, the expected outputs are already known. We build a model by running a learning algorithm on the training data so that its outputs come as close as possible to the known outputs. We then check the model against the test data, which it did not see during training, and keep tuning it until its accuracy is acceptable. Ideally, the repeated tuning uses a separate validation portion of the data so that the test set remains an unbiased final check. The result is the tuned, trained model. This can be depicted in the following diagram:

[Diagram: machine learning programming pattern]
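
The data-driven pattern can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the model our team used: the data is made up, the helper names (`split_train_test`, `learn_threshold`, `accuracy`) are hypothetical, and the "model" is just a single threshold learned from the training data.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and split them into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(threshold, rows):
    """Fraction of rows where 'value >= threshold' matches the known label."""
    correct = sum(1 for value, label in rows if int(value >= threshold) == label)
    return correct / len(rows)


def learn_threshold(train_rows):
    """Training: pick the threshold that best reproduces the known labels."""
    candidates = sorted({value for value, _ in train_rows})
    return max(candidates, key=lambda t: accuracy(t, train_rows))


# Made-up (value, label) pairs; the label is the output known beforehand.
data = [(v, int(v >= 120)) for v in range(60, 200, 7)]

train_rows, test_rows = split_train_test(data)   # 80% / 20%
model_threshold = learn_threshold(train_rows)    # learned, not hand-written

print("train size:", len(train_rows), "test size:", len(test_rows))
print("learned threshold:", model_threshold)
print("test accuracy:", accuracy(model_threshold, test_rows))
```

Unlike a hand-written rule, the threshold here comes out of the training data, and the held-out test rows show how well it carries over to data the model has not seen.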

Machine Learning Pipeline

The ML pipeline has three major steps:

  1. Data preparation
  2. Applying models
  3. Integrating and visualizing results


This can be depicted in the following diagram:

[Diagram: machine learning pipeline]
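
One way to picture the three steps is as three small functions chained together. The sketch below is only illustrative: the function names, the toy rows, and the `always_zero` stand-in for a trained model are all hypothetical.

```python
def prepare_data(raw_rows):
    """Step 1: turn comma-separated text into numeric features and a label."""
    prepared = []
    for line in raw_rows:
        values = [float(x) for x in line.split(",")]
        prepared.append((values[:-1], int(values[-1])))
    return prepared


def apply_model(prepared, predict):
    """Step 2: run a prediction function on every feature list."""
    return [predict(features) for features, _ in prepared]


def summarize_results(prepared, predictions):
    """Step 3: compare predictions with known labels for reporting."""
    correct = sum(1 for (_, label), p in zip(prepared, predictions) if p == label)
    return {"rows": len(prepared), "correct": correct}


def always_zero(features):
    """Hypothetical stand-in for a trained model: always predicts class 0."""
    return 0


raw = ["1.0,2.0,0", "3.0,4.0,1", "5.0,6.0,0"]  # toy rows: two features, one label
prepared = prepare_data(raw)
predictions = apply_model(prepared, always_zero)
summary = summarize_results(prepared, predictions)
print(summary)  # {'rows': 3, 'correct': 2}
```

In a real project the last step would more likely feed a chart or dashboard than a `print` call.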

Data Preparation Learning 

Before applying models, we learned a very important data preparation step: one-hot encoding, which makes categorical data usable in the train/test data. If a column is non-numeric but categorical, such as colors (red, red, green), it must be converted to numbers before a model can use it. One-hot encoding creates one 0/1 column per category, so for these three values the "red" column becomes 1, 1, 0 and the "green" column becomes 0, 0, 1.
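
A minimal sketch of one-hot encoding in plain Python, using the color example, might look like this. The helper name `one_hot_encode` is hypothetical and the column order (alphabetical) is simply a choice made for this sketch.

```python
def one_hot_encode(values):
    """Return the sorted category names and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == cat else 0 for cat in categories] for value in values]
    return categories, rows


colors = ["red", "red", "green"]
categories, encoded = one_hot_encode(colors)

print(categories)  # ['green', 'red']
print(encoded)     # [[0, 1], [0, 1], [1, 0]]
```

In practice a library routine such as pandas `get_dummies` or scikit-learn `OneHotEncoder` would usually do this work; the hand-written version is here only to show what the encoding produces.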

Machine Learning Models 

The two main categories of machine learning models/algorithms are:

  • Supervised learning
  • Unsupervised learning


The main difference between these categories is that supervised models learn from labeled data sets, in which every record carries the known output (the label) alongside its input attributes, whereas unsupervised models work on data without such labels. Supervised models perform either classification or regression: classification predicts discrete classes, and regression predicts continuous values. I have worked on a supervised classification model. For example, the diabetes prediction dataset is a binary classification problem with 8 input attributes and 1 output attribute. Its columns/attributes are:

  1. Number of times pregnant.
  2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
  3. Diastolic blood pressure (mm Hg).
  4. Triceps skinfold thickness (mm).
  5. 2-Hour serum insulin (mu U/ml).
  6. Body mass index (weight in kg/(height in m)^2).
  7. Diabetes pedigree function.
  8. Age (years).
  9. Class variable (0 or 1). [Output: 1 = has diabetes, 0 = does not]

A sample of the dataset looks as follows:

  1. 6,148,72,35,0,33.6,0.627,50,1
  2. 1,85,66,29,0,26.6,0.351,31,0
  3. 8,183,64,0,0,23.3,0.672,32,1
  4. 1,89,66,23,94,28.1,0.167,21,0
  5. 0,137,40,35,168,43.1,2.288,33,1
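
To show how rows like these separate into inputs and a label, here is a short sketch using only Python's standard `csv` module. The short column names in `COLUMNS` are hypothetical labels I chose for the nine attributes listed above; they are not official names from the dataset.

```python
import csv
import io

# Hypothetical short names for the nine attributes, in the order listed above.
COLUMNS = [
    "pregnancies", "glucose", "blood_pressure", "skin_thickness",
    "insulin", "bmi", "pedigree", "age", "outcome",
]

# The five sample rows from this article.
SAMPLE = """\
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
"""

reader = csv.DictReader(io.StringIO(SAMPLE), fieldnames=COLUMNS)
records = [{name: float(value) for name, value in row.items()} for row in reader]

# Inputs (8 attributes) and the output (the label) are kept separate.
X = [[rec[name] for name in COLUMNS[:-1]] for rec in records]
y = [int(rec["outcome"]) for rec in records]

print(len(X), "rows with", len(X[0]), "input attributes each")
print("labels:", y)  # labels: [1, 0, 1, 0, 1]
```

Keeping `X` and `y` apart mirrors the "8 inputs, 1 output" description: a supervised classifier is trained to map each row of `X` to its entry in `y`.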

Conclusion

Drawing on my understanding, experience, and learning in building solutions in the machine learning and data science world, I have tried to write down what I learned. I understand that this world:

  • Is data driven, so the data must be explored and pre-processed into a suitable format
  • Requires applying a model or algorithm to predict outcomes
  • Requires data for training, testing, and validating the models to make them stable


I will continue writing about my experiences and sharing them in the future. Thank you for reading.

Interesting read, I appreciate you taking time to share your knowledge to the world.


Good Job Mukund. I liked the structure and simplicity in handling the topic. This gives good head start to audience of your age explore further.
