Features Of Scikit-Learn
Scikit-Learn, also known as sklearn, is a Python library for implementing machine learning models and statistical modelling. Through scikit-learn, we can implement various machine learning models for regression, classification, and clustering, along with statistical tools for analyzing these models. It also provides functionality for dimensionality reduction, feature selection, feature extraction, ensemble techniques, and inbuilt datasets.
This library is built upon NumPy, SciPy, and Matplotlib.
Datasets:
Scikit-learn comes with several inbuilt datasets such as the iris dataset, house prices dataset, diabetes dataset, etc. The main advantage of these datasets is that they are easy to understand and ML models can be applied to them directly, which makes them good for beginners.
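As a minimal sketch, the iris dataset can be loaded directly from sklearn.datasets:

from sklearn.datasets import load_iris

# Load the inbuilt iris dataset as a Bunch object
iris = load_iris()
print(iris.data.shape)     # (150, 4): 150 samples with 4 features each
print(iris.target_names)   # the three iris species to classify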
Data Splitting:
Sklearn provides the functionality to split a dataset into training and testing sets. Splitting the dataset is essential for an unbiased evaluation of prediction performance; we can define what proportion of the data to include in the train and test sets.
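A minimal sketch using train_test_split on the iris dataset (the 80/20 split and the random_state value are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data for testing; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)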
Linear Regression:
This supervised ML model is used when the output variable is continuous and follows a linear relationship with the independent variables. For example, it can be used to forecast sales in the coming months by analyzing the sales data from previous months.
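A minimal sketch of this idea; the monthly sales figures below are made up purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: month number vs. sales (invented values)
months = np.array([[1], [2], [3], [4], [5], [6]])
sales = np.array([120, 135, 148, 162, 175, 190])

model = LinearRegression()
model.fit(months, sales)

# Forecast sales for month 7 by extrapolating the fitted line
print(model.predict([[7]]))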
Logistic Regression:
Logistic Regression is also a supervised algorithm like linear regression, but it is used for classification: the output variable is categorical rather than continuous. It can be used to predict whether a patient has heart disease or not.
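No heart disease dataset ships with sklearn, so the sketch below uses the inbuilt breast cancer dataset for a comparable binary classification task; scaling the features first is a common convenience so the default solver converges quickly:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Binary classification: predict malignant vs. benign tumours
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out test set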
Decision Trees:
A Decision Tree is a powerful tool that can be used for both classification and regression problems. It uses a tree-like model to make decisions and predict the output: internal nodes represent decisions to split on a feature, and leaf nodes represent values of the output variable.
Decision trees are useful when the dependent variable does not follow a linear relationship with the independent variables, i.e. when linear regression does not give accurate results.
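A minimal sketch with DecisionTreeClassifier (the max_depth=3 limit is an illustrative choice to keep the tree small):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Limit tree depth so the model stays simple and less prone to overfitting
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))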
Bagging:
Bagging is a technique in which multiple models of the same type are trained on random samples (bootstrap samples) drawn from the training set. The models are trained independently of each other, and their predictions are combined.
For example, multiple decision trees can be trained and used for prediction instead of just one; this is the idea behind the random forest.
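A minimal sketch of bagging decision trees with BaggingClassifier (50 trees is an arbitrary illustrative choice):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train 50 decision trees, each on a bootstrap sample of the training data,
# and combine their predictions by majority vote
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))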
Boosting:
Boosting is a technique in which multiple models are trained sequentially, so that the input of each model depends on the output of the previous one. In boosting, the data points that were predicted incorrectly are given more weight in the next round of training.
Examples: AdaBoost, Gradient Boosting.
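A minimal sketch using AdaBoostClassifier on the inbuilt breast cancer dataset (the number of estimators is illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each successive weak learner puts more weight on previously misclassified samples
boost = AdaBoostClassifier(n_estimators=100, random_state=0)
boost.fit(X_train, y_train)
print(boost.score(X_test, y_test))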
Random Forest:
Random Forest is a bagging technique in which a large number of decision trees (often hundreds or thousands) are combined to build the model. Random Forest can be used for both classification and regression problems, for example to classify loan applicants, identify fraudulent activity, and predict diseases.
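A minimal sketch with RandomForestClassifier (100 trees is the library default, written out here for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree sees a bootstrap sample and a random subset of features at every split
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))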
Support Vector Machines (SVM):
A Support Vector Machine is a supervised ML algorithm in which we plot each data item as a point in n-dimensional space, where n is the number of features in the dataset. We then perform classification by finding the hyperplane that best separates the classes. The data points closest to the hyperplane are called support vectors. SVMs can also be used for regression problems, but they are generally used for classification. They appear in many applications such as face detection, classification of emails, etc.
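A minimal sketch with SVC; the RBF kernel and C=1.0 are the library defaults, shown explicitly for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# C trades off a wide margin against misclassified training points
svm = SVC(kernel="rbf", C=1.0)
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))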