Rules of thumb for Machine Learning
Machine learning is the process by which a machine attempts to accomplish a task without human intervention, becoming more efficient and more effective with every attempt, i.e. with every learning phase. It is a subset of artificial intelligence in which computer algorithms learn autonomously from data. Machine learning has been around far longer than our ability to handle large data sets. Regression models have existed for well over a century, while artificial neural networks date back to the 1940s. In those days data was scarce, data collection and storage were unreliable, and computation was very expensive. As new technology emerged, these tasks became much easier.
In this article, I list a set of rules of thumb that help address the challenges associated with machine learning. The rules are stated in terms of a few metrics that describe the dataset, such as its number of rows, columns and classes. Unstructured data such as images, text or video needs to be converted into data frames before these metrics can be measured and predictive methods applied.
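As a rough illustration of turning unstructured text into rows and columns, the sketch below builds a simple word-count table using only the Python standard library. The function name `text_to_rows` and the whitespace tokenisation are assumptions made for this example; real pipelines usually use richer feature extraction.

```python
from collections import Counter


def text_to_rows(documents):
    """Convert raw strings into a table: one row per document, one column per word.

    Hypothetical helper for illustration; tokenisation is a plain whitespace split.
    """
    # Count lowercase tokens in each document.
    counts = [Counter(doc.lower().split()) for doc in documents]
    # Columns are the sorted vocabulary, so the column order is stable.
    columns = sorted(set().union(*counts))
    # Each row holds the count of every vocabulary word (0 when absent).
    rows = [[c[col] for col in columns] for c in counts]
    return columns, rows


if __name__ == "__main__":
    columns, rows = text_to_rows(["good movie", "bad movie bad"])
    print(columns)
    for row in rows:
        print(row)
```

Once the data is in this shape, the number of rows and columns can be read directly from the table.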
Rule 1: The number of columns (attributes or features) should be greater than the number of classes for a statistically significant model.
Rule 2: The number of rows (examples) should be greater than the number of columns (attributes).
Rule 3: The ratio of the number of minority-class samples to the number of majority-class samples should be roughly 0.5. This is the hardest constraint to satisfy, because many datasets suffer from class imbalance or a low signal-to-noise ratio. A simple way to meet this rule is to undersample the majority class.
Rule 4: This is less a rule than a question: if we add new data, a new feature or a new class, will the model still be applicable? Yes, but the model needs to be retrained as the new data accumulates.
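Rules 1 to 3 can be checked mechanically once the data is tabular. The sketch below is one possible way to do so in plain Python. The names `check_rules` and `undersample_majority`, the `target_ratio` parameter, and the choice to treat a ratio at or above 0.5 as acceptable are all assumptions made for this example, and the undersampling step assumes a two-class problem.

```python
import random
from collections import Counter


def check_rules(rows, labels, target_ratio=0.5):
    """Compute the dataset metrics behind Rules 1-3 and report whether each holds."""
    if not rows or not labels:
        raise ValueError("rows and labels must be non-empty")
    n_rows = len(rows)
    n_features = len(rows[0])
    class_counts = Counter(labels)
    n_classes = len(class_counts)
    # Ratio of the smallest class to the largest class.
    ratio = min(class_counts.values()) / max(class_counts.values())
    return {
        "rule1": n_features > n_classes,  # more columns than classes
        "rule2": n_rows > n_features,     # more rows than columns
        "rule3": ratio >= target_ratio,   # assumption: >= 0.5 counts as balanced enough
        "ratio": ratio,
    }


def undersample_majority(rows, labels, target_ratio=0.5, seed=0):
    """Randomly drop majority-class rows until minority/majority reaches target_ratio."""
    class_counts = Counter(labels)
    majority_label = max(class_counts, key=class_counts.get)
    minority_count = min(class_counts.values())
    # Number of majority rows to keep so that minority / majority is about target_ratio.
    keep = int(minority_count / target_ratio)
    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]
    if len(majority_idx) <= keep:
        return list(rows), list(labels)  # already balanced enough
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    kept = set(rng.sample(majority_idx, keep))
    selected = [i for i, y in enumerate(labels) if y != majority_label or i in kept]
    return [rows[i] for i in selected], [labels[i] for i in selected]


if __name__ == "__main__":
    rows = [[i, i + 1, i + 2] for i in range(110)]
    labels = [1] * 10 + [0] * 100
    print(check_rules(rows, labels))
    new_rows, new_labels = undersample_majority(rows, labels)
    print(check_rules(new_rows, new_labels))
```

Undersampling discards data, so in line with Rule 4 the model would be retrained on the resampled set and again whenever new examples accumulate.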