Machine Learning Models
Machine learning models are built with algorithms that learn from data, much as humans learn from experience. We can classify these models by their task and the nature of their output as follows.
- Regression
- Classification
- Clustering
Regression:
In predictive analytics, regression uses one or more features, also known as independent variables, to predict the value of a dependent variable, which is typically a continuous number.
Regression is a commonly used supervised learning technique that models the relationship between a dependent variable, also called the target variable, and one or more independent variables, also called predictor variables. It helps us understand how the value of the dependent variable changes with respect to the independent variables.
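In its simplest form, a regression model fits a line through the data such that the vertical distances between the data points and the line are as small as possible overall; the line generally does not pass through every point. Ordinary least squares, for example, minimizes the sum of the squares of these distances. The size of the distances indicates how well the line captures the relationship.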
A classic example of a regression problem is the marketing vs. sales problem, in which a company wants to estimate how much its sales would go up if it invested a certain amount in advertising.
Regression is mainly used for making predictions, forecasting, time series modeling, and studying relationships among variables (a fitted relationship by itself does not prove causation).
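The advertising example above can be sketched in a few lines of Python. This is a minimal illustration of simple linear regression fitted by ordinary least squares, not a production implementation. The function name `fit_line` and the spend and sales figures are made up for illustration.

```python
from statistics import mean

def fit_line(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    x_bar, y_bar = mean(x), mean(y)
    # Covariance-like and variance-like sums around the means
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical data: advertising spend (predictor) and sales (target)
ad_spend = [10.0, 20.0, 30.0, 40.0, 50.0]
sales = [24.0, 47.0, 63.0, 88.0, 103.0]

slope, intercept = fit_line(ad_spend, sales)

# Vertical distances (residuals) between each data point and the fitted line
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(ad_spend, sales)]
sum_squared_error = sum(r ** 2 for r in residuals)

# Predicted sales for a new, unseen advertising budget
predicted_sales = intercept + slope * 60.0
print(slope, intercept, sum_squared_error, predicted_sales)
```

Here the slope estimates how much sales change per unit of extra advertising spend, and the sum of squared residuals is the quantity the fit minimizes.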
Classification:
Classification is another supervised learning technique used in predictive analysis, in which the output variable is categorical and is predicted from the independent predictor variables.
Say we want to classify an incoming email as spam or not spam (also called ham). We can solve this with a classification model in which the output variable, spam/ham, is the target and features such as the words present in the subject or body are the predictor variables.
Classification lets you make predictions from labeled data, and the problem can be either binary (two classes) or multiclass (more than two classes).
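As a rough sketch of the spam/ham example, the code below trains a small word-count classifier. It uses multinomial naive Bayes with add-one smoothing, which is one possible choice of classifier and is not prescribed by the text above. The function names and the four training messages are hypothetical.

```python
import math
from collections import Counter

def train(messages):
    """Count words per label and messages per label from (text, label) pairs."""
    word_counts = {}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts

def predict(text, word_counts, label_counts):
    """Return the label with the highest naive Bayes log-score."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_messages = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Start from the log prior of the label
        score = math.log(label_counts[label] / total_messages)
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero the score
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled data: the words are the predictors, the label is the target
training = [
    ("win a free prize now", "spam"),
    ("free money claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
word_counts, label_counts = train(training)

label_a = predict("claim your free prize", word_counts, label_counts)
label_b = predict("agenda for the team lunch", word_counts, label_counts)
print(label_a, label_b)
```

This is a binary problem because there are two labels; the same code handles a multiclass problem if the training data contains more than two labels.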
Clustering:
In supervised learning, we work with a known dataset in which the predictor (independent) variables and the target (dependent) variable are defined; in other words, we work with a defined notion of labels.
Clustering is an unsupervised learning technique where there is no such notion; it simply groups the given inputs based on similarities in their features, such as shape, size, color, and behavior.
For a more elaborate example, we could segment customers based on their behavior, their intentions or attitudes, or demographics such as gender, age, location, and income. In short, clustering is an analysis technique, while segmentation is a business problem or use case.
While forming clusters, we have two objectives: maximizing the intercluster variance, so that different clusters are far apart, and minimizing the intracluster variance, so that members of the same cluster are close together. A simple example is segregating the different types of fruits and vegetables sold by a vendor.
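To make the two objectives concrete, here is a minimal sketch of k-means (Lloyd's algorithm), one common clustering method that the text above does not name specifically. The customer points, the starting centroids, and the function names are all assumptions chosen for illustration.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and updating centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of the cluster; keep the old one if the cluster is empty
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

def within_cluster_ss(centroids, clusters):
    """Sum of squared distances of points to their own centroid (intracluster spread)."""
    return sum(math.dist(p, c) ** 2
               for c, cluster in zip(centroids, clusters) for p in cluster)

# Hypothetical customers described by (visits per month, average spend)
customers = [(1.0, 2.0), (2.0, 1.0), (1.5, 1.5),
             (8.0, 9.0), (9.0, 8.0), (8.5, 8.5)]

# Assumption: start from two of the data points as initial centroids
centroids, clusters = kmeans(customers, [customers[0], customers[3]])

intracluster = within_cluster_ss(centroids, clusters)
intercluster_distance = math.dist(centroids[0], centroids[1])
print(centroids, intracluster, intercluster_distance)
```

No labels are supplied anywhere: the groups emerge only from how close the points are to each other. A small intracluster sum of squares together with a large distance between centroids indicates compact, well-separated clusters.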