When Is Semi-Supervised Machine Learning Useful?

Data science attempts to create artificial intelligence that assists with important decisions. Intelligence is generally established by consuming either pre-learnt knowledge or large amounts of raw data, and artificial intelligence is intelligence learnt from raw data. Artificial intelligence therefore usually comes hand-in-hand with big data. But in a real commercial environment, it can be time-consuming or too expensive to collect large amounts of data, especially for start-up companies. Two important questions for data scientists in start-ups are therefore:

1.   Which type of data is less time-consuming and cheaper to collect?

2.   How can we build effective predictive models by optimally leveraging the data that can be collected?


Nowadays, data can be collected automatically and cheaply from the software people use every day. The most difficult data to collect is data that requires extra attention from people and offers them no direct utility (in the economic sense). For example, people may feel unmotivated, or even uncomfortable, telling a machine whether they like a movie, a restaurant or an advertisement. Looking one step further, it is precisely the data that “supervises” a machine’s learning process that is in short supply. Supervised machine learning, which defines a clear objective function for the machine to learn and achieve, is therefore more expensive to train.


Unsupervised learning naturally comes in, not only because such data is much easier to collect, but also because it enables representation learning. Representation learning can be very effective when the data available for unsupervised learning accumulates to a much larger scale than the supervised data. In the best case, distributed representations, a key concept in deep learning, can be learnt in an unsupervised way, and can be equivalent to or even better than features manually crafted by human experts. Combining this unsupervised learning with supervised learning leads naturally to a solution in the field of semi-supervised machine learning. Moreover, if we rethink how people learn, it is close to the principle of semi-supervised learning: a mix of “teaching” and “self-learning”.
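The pipeline described above can be sketched with NumPy alone, using PCA as the representation learner (PCA is the linear special case of an auto-encoder). All data sizes and the code dimension below are illustrative, not from the article:

```python
import numpy as np

rng = np.random.default_rng(0)
# Abundant unlabeled data (X only) and scarce labeled data (X, y).
X_unlabeled = rng.normal(size=(5000, 50))
X_labeled = rng.normal(size=(200, 50))
y = X_labeled[:, :3].sum(axis=1) + 0.1 * rng.normal(size=200)

# Step 1: learn a representation from the unlabeled pool.
# PCA via SVD: encoding = projection onto the top principal components.
mu = X_unlabeled.mean(axis=0)
_, _, Vt = np.linalg.svd(X_unlabeled - mu, full_matrices=False)
components = Vt[:10]  # a 10-dimensional learned code

def encode(X):
    return (X - mu) @ components.T

# Step 2: fit the supervised model on the learned representation.
Z = encode(X_labeled)
w, *_ = np.linalg.lstsq(np.c_[np.ones(len(Z)), Z], y, rcond=None)
```

The key point is that step 1 never touches `y`, so it can exploit the full unlabeled pool before the small labeled set is used.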


I performed experiments on the data at hand. In my regression setting, I have a data set with 300 raw features and 740 observations, which is extremely small for machine learning. With straightforward supervised learning, the best r-squared achieved is 24%, after trying regularised linear models, random forests, gradient-boosted tree models, support vector machines and deep learning models. I also have another 23,000 observations with no supervised information at all (i.e. only X is available, not Y). After performing unsupervised learning before training the supervised model, I obtained an r-squared of 35%, using an auto-encoder built in Keras (with TensorFlow as the backend). This is a relative improvement of around 50%. I also tried a Generalised Low Rank Model (GLRM), which achieves 34% r-squared, only marginally lower than the auto-encoder. I would also like to try a Restricted Boltzmann Machine (RBM), which I expect could improve model performance too.
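A minimal Keras sketch of this kind of auto-encoder pretraining follows. The article does not give the architecture, so the single hidden layer, the 64-dimensional code, and all training settings are assumptions; the synthetic data stands in for the real 23,000 unlabeled and 740 labeled observations (the unlabeled pool is shrunk here for speed):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic stand-ins: 300 raw features, as in the article.
rng = np.random.default_rng(42)
n_features, code_dim = 300, 64  # code_dim is an assumption
X_unlabeled = rng.normal(size=(2000, n_features)).astype("float32")
X_labeled = rng.normal(size=(740, n_features)).astype("float32")

# A single-hidden-layer auto-encoder: compress to code_dim, reconstruct X.
inputs = keras.Input(shape=(n_features,))
code = layers.Dense(code_dim, activation="relu")(inputs)
reconstruction = layers.Dense(n_features, activation="linear")(code)
autoencoder = keras.Model(inputs, reconstruction)
encoder = keras.Model(inputs, code)

# Train on the unlabeled pool only: the target is the input itself.
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_unlabeled, X_unlabeled, epochs=5, batch_size=256, verbose=0)

# Encode the labeled observations; the supervised regressor is then
# trained on Z_labeled and y instead of the raw features.
Z_labeled = encoder.predict(X_labeled, verbose=0)
```

After this step, any of the supervised models mentioned above (regularised linear models, gradient-boosted trees, etc.) can be fitted on `Z_labeled`.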


These experiments have led me to believe that unsupervised learning can be highly useful when we have far more observations with features only (X) than with supervised labels (Y). In that case, a semi-supervised model is highly likely to significantly outperform a directly supervised model. Furthermore, my other experiments showed that it is typically unnecessary to train an unsupervised model when the observation counts for X and Y are roughly equal: a representation learnt by an unsupervised model is not necessarily beneficial to the later objective-driven supervised learning.


More articles by Simon Xing Zhao
