The Most Important Machine Learning Algorithms in 2023
Over the past few years, I've compiled the most important machine learning algorithms based on my work experience, conversations with other data scientists, and what I've read online.
This year I want to expand on last year's article by offering more categories and more models within each category. With this, I hope to provide a repository of tools and techniques that you can bookmark to help you solve a variety of data science problems.
The most important algorithms in wide use in 2023:
1 - Pattern mining algorithms
2 - Explanatory algorithms
3 - Time series algorithms
4 - Ensemble learning algorithms
5 - Clustering algorithms
6 - Similarity algorithms
1 - Pattern mining algorithms

Pattern mining algorithms are a family of data mining techniques used to identify patterns and relationships in a data set. They can be used for many purposes, such as identifying customer purchase patterns in retail, understanding typical user behavior on a website or app, or finding relationships between variables in research.

Pattern mining algorithms typically work by scanning large data sets for recurring patterns or relationships between variables. Once these patterns are identified, they can be used to predict future trends or outcomes, or to understand the relationships underlying the data.
Algorithms:
Apriori Algorithm: An algorithm for finding frequent itemsets in a transactional database - powerful and widely used in association rule mining tasks.
Recurrent Neural Network (RNN): A type of neural network designed to process sequential data due to its ability to capture temporal dependencies in the data.
Long Short-Term Memory (LSTM): A type of recurrent neural network designed to remember information over a longer period of time. LSTMs can capture long-term dependencies in data and are often used for tasks such as language translation and language generation.
Sequential Pattern Discovery Using Equivalence Class (SPADE): a method for finding repeated patterns in sequential data by grouping elements that are equivalent in some sense. This method can handle large data sets and is relatively efficient, but it may not work well for sparse data.
PrefixSpan: An algorithm for finding repeated patterns in sequential data by building a prefix tree and pruning infrequent items. PrefixSpan can handle large data sets and is relatively efficient, but it may not perform well with sparse data.
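To make the Apriori idea concrete, here is a minimal, pure-Python sketch of frequent-itemset mining on hypothetical shopping baskets (the function name, data, and threshold are illustrative, and the candidate-generation step skips Apriori's subset-pruning optimization for brevity):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all itemsets whose support (fraction of transactions
    containing them) is at least min_support."""
    n = len(transactions)
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}  # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: cnt / n for c, cnt in counts.items()
                     if cnt / n >= min_support}
        frequent.update(survivors)
        # Join surviving k-itemsets to form (k+1)-item candidates.
        keys = list(survivors)
        current = {a | b for a, b in combinations(keys, 2) if len(a | b) == k + 1}
        k += 1
    return frequent

# Hypothetical retail baskets.
baskets = [{"milk", "bread"}, {"milk", "bread", "eggs"},
           {"bread"}, {"milk", "eggs"}]
freq = apriori(baskets, min_support=0.5)
# {milk, bread} is frequent (appears in 2 of 4 baskets); {bread, eggs} is not.
```

The surviving itemsets are the raw material for association rules such as "customers who buy milk also tend to buy bread".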
2 - Explanatory algorithms

One of the biggest challenges in machine learning is understanding how different models arrive at their predictions: we often know the "what" but struggle to explain the "why". Explanatory algorithms help us identify the variables that have a significant effect on the outcome we are interested in. They allow us to understand the relationships between the variables in a model, rather than just using the model to make predictions.
Algorithms:
Linear/Logistic Regression: a statistical method for modeling a linear relationship between a dependent variable and one or more independent variables. It can be used to understand relationships between variables based on the magnitude and statistical significance (t-tests) of the coefficients.
Decision Trees: A type of machine learning algorithm that creates a tree-like model of decisions and their possible consequences. They are useful for understanding the relationships between variables by looking at the branching rules.
Principal Component Analysis (PCA): a dimensionality reduction technique that projects data into a lower dimensional space while preserving as much variance as possible. PCA can be used to simplify data or determine the importance of features.
Local Interpretable Model-Agnostic Explanations (LIME): an algorithm that explains the predictions of any machine learning model by locally approximating the model around the prediction, creating a simpler model using techniques such as linear regression or decision trees.
Shapley Values: a method from cooperative game theory that explains the predictions of any machine learning model by computing each feature's exact marginal contribution to the prediction, averaged over all possible feature coalitions. Exact Shapley values are accurate but expensive to compute.
SHAP (SHapley Additive exPlanations): a method that explains the predictions of any machine learning model by efficiently approximating Shapley values, using estimators such as KernelSHAP and TreeSHAP. SHAP is generally much faster than computing exact Shapley values.
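As a small worked example of the simplest explanatory tool above, linear regression, the sketch below fits coefficients with NumPy's least-squares solver on synthetic data (the data-generating weights 3.0 and 0.5 are assumptions chosen for illustration):

```python
import numpy as np

# Synthetic data: y depends strongly on feature 1, weakly on feature 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Add an intercept column and solve the least-squares problem.
A = np.column_stack([np.ones(len(X)), X])
coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, b1, b2 = coefs
# b1 and b2 recover roughly 3.0 and 0.5, quantifying each feature's effect.
```

Inspecting the fitted coefficients (and, in a full analysis, their standard errors) is the "explanatory" use of regression: it tells you how much each variable moves the outcome, not just what the model predicts.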
3 - Time series algorithms

Time series algorithms are methods used to analyze time-dependent data. These algorithms account for the temporal dependencies between data points, which is particularly important for forecasting future values. Time series algorithms are used in a number of business applications, such as forecasting product demand, forecasting sales, or analyzing customer behavior over time. They can also be used to detect anomalies or changes in data.
Algorithms:
Prophet Time Series Modeling: A forecasting algorithm developed by Facebook that is intuitive and easy to use. Its main strengths are handling missing data and changes in trend, robustness to outliers, and fast fitting.
Autoregressive Integrated Moving Average (ARIMA): a statistical method for forecasting time series data that models the correlation between the data and its lagged values. ARIMA can handle a variety of time series data, but it can be more difficult to implement than some other methods.
Exponential Smoothing: A method for forecasting time series data that uses a weighted average of past data to make predictions. Exponential smoothing is relatively easy to implement and can be used with a variety of data, but it may not perform as well as more complex methods.
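Exponential smoothing is simple enough to sketch in a few lines. This minimal version (function name and demand figures are illustrative) blends each new observation with the previous smoothed value, so older observations receive geometrically decaying weights:

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: each smoothed value is a weighted
    average of the current observation (weight alpha) and the previous
    smoothed value (weight 1 - alpha)."""
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

# Hypothetical weekly product demand.
demand = [10, 12, 13, 12, 15, 16, 18]
fit = exponential_smoothing(demand, alpha=0.5)
forecast = fit[-1]  # one-step-ahead forecast is the last smoothed value
```

A higher `alpha` reacts faster to recent changes; a lower `alpha` smooths out noise more aggressively, which is the main tuning decision in practice.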
4 - Ensemble learning algorithms

Ensemble algorithms are machine learning techniques that combine the predictions of multiple models to make more accurate predictions than any single model. Ensembles can outperform traditional machine learning algorithms for several reasons, chiefly that aggregating diverse models reduces variance and averages out the errors of individual models.
Algorithms:
Random Forest: a machine learning algorithm that builds a collection of decision trees and makes predictions by majority vote across the trees.
XGBoost: A type of gradient boosting algorithm that uses decision trees as its base model and can be one of the strongest predictive ML algorithms.
LightGBM: Another type of gradient boosting algorithm designed to be faster and more efficient than other boosting algorithms.
CatBoost: A gradient boosting algorithm specially designed to handle categorical variables well.
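The aggregation step that all of these ensembles share can be illustrated with a tiny majority-vote combiner. This sketch (model names and predictions are hypothetical) mirrors how a random forest merges its trees' class votes:

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine class predictions from several models by majority vote -
    the same aggregation step a random forest applies to its trees."""
    n_samples = len(predictions_per_model[0])
    combined = []
    for i in range(n_samples):
        votes = Counter(model[i] for model in predictions_per_model)
        combined.append(votes.most_common(1)[0][0])
    return combined

# Hypothetical predictions from three models on four samples.
model_a = ["spam", "ham", "spam", "ham"]
model_b = ["spam", "spam", "spam", "ham"]
model_c = ["ham", "ham", "spam", "spam"]
ensemble = majority_vote([model_a, model_b, model_c])
# Each model errs on a different sample, but the vote recovers the majority.
```

Boosting methods like XGBoost, LightGBM, and CatBoost use weighted sums of sequentially trained trees rather than a flat vote, but the core idea - many weak models combined into one strong one - is the same.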
5 - Clustering algorithms

Clustering algorithms tackle an unsupervised learning task and are used to group data into "clusters". Unlike supervised learning, where the target variable is known, clustering has no target variable.
This technique is useful for finding natural patterns and trends in data and is often used in the data exploration phase to better understand the data. In addition, clustering can be used to divide data into separate segments based on different variables. A common application of this is customer or user segmentation.
Algorithms:
K-modes clustering: a clustering algorithm designed specifically for categorical data. It can handle large, high-cardinality categorical data sets and is relatively easy to implement.
DBSCAN: a density-based clustering algorithm that can identify clusters of arbitrary shape. It is relatively robust to noise and can detect anomalies in the data.
Spectral Clustering: a clustering algorithm that uses the eigenvectors of a similarity matrix to group data points into clusters. It can handle non-linearly separable data and is relatively efficient.
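To show how density-based clustering behaves, here is a minimal DBSCAN sketch on one-dimensional points (the function, data, and parameters are illustrative, and a real implementation would use a spatial index instead of scanning all points for each neighborhood query):

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN on 1-D points: returns a cluster label per point,
    with -1 marking noise. Points with at least min_pts neighbors
    (including themselves) within eps are core points that grow clusters."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, p in enumerate(points) if abs(p - points[i]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neigh = neighbors(j)
            if len(j_neigh) >= min_pts:  # only core points expand the cluster
                queue.extend(j_neigh)
    return labels

# Two dense groups and one isolated point.
data = [1.0, 1.1, 1.2, 5.0, 5.1, 9.9]
labels = dbscan(data, eps=0.3, min_pts=2)
# Two clusters are found; the point at 9.9 is labeled -1 (noise).
```

Note that, unlike K-means or K-modes, the number of clusters is not specified up front - it emerges from the density parameters `eps` and `min_pts`.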
6 - Similarity algorithms

Similarity algorithms are used to measure the similarity between pairs of records, nodes, data points, or texts. These algorithms can be based on the distance between two data points (e.g. Euclidean distance) or on text similarity.
Algorithms:
Euclidean distance: a measure of the straight-line distance between two points in Euclidean space. Euclidean distance is easy to calculate and widely used in machine learning, but it may not be the best choice when features are on very different scales or the data is high-dimensional.
Cosine Similarity: A measure of similarity between two vectors based on the angle between them.
Levenshtein Algorithm: An algorithm to measure the distance between two strings as the minimum number of single-character edits (insertions, deletions, or substitutions) needed to convert one string into the other. The Levenshtein algorithm is often used for spell checking and string matching.
Jaro-Winkler Algorithm: An algorithm to measure the similarity between two strings based on the number of matching characters and the number of transpositions. It is similar to the Levenshtein algorithm and is often used for record linkage and entity resolution.
Singular Value Decomposition (SVD): A matrix decomposition method that decomposes a matrix into the product of three matrices - an important component of modern recommendation systems.
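The Levenshtein distance described above has a classic dynamic-programming implementation, sketched here in a memory-efficient two-row form (the function name is illustrative):

```python
def levenshtein(a, b):
    """Edit distance: the minimum number of single-character insertions,
    deletions, or substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

# The textbook example: "kitten" -> "sitting" takes 3 edits.
assert levenshtein("kitten", "sitting") == 3
```

The full distance matrix is never materialized; only the previous row is kept, giving O(len(b)) memory while preserving the O(len(a) x len(b)) time of the standard algorithm.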