Some Practical Tips of Applying Deep Learning to Your Own Problems
More and more companies are building and applying deep learning models in their business. Several practical issues should be taken into consideration before these models are put into production. Consider this scenario: you may build a model that works perfectly with training and validation data, but it doesn’t perform well after deploying the model in real scenarios. Or, you may struggle with getting better performance compared to traditional machine learning models. While the latter case will make you rethink whether to invest more resourcing on this, the former situation is more risky and you may not realize it until you put your models into production.
To build a robust deep learning model, it can be much more than training or fine tuning some existing models (e.g. inception-v3, resnet, LSTM, etc.) with your own dataset. These winning models are your best friend and can usually serve as base models. However, even for the most well-studied problems such as image classification or object detection, fine tuning these models could just be the first step. There are further important issues that should be considered.
The first and most important thing you should think about is whether the distribution of data you have in model development will be the same as the distribution of data in real scenario, regardless how much data you have. I remember in one of Andrew Ng’s talks I attended a year ago he emphasized that your test data distribution should match your validation and training data in order for your deep learning models to work well (or work as expected). This makes sense but it may not be the case in real world. It’s possible that you do not even know what distribution of the real data will be, and the distribution could also be changing frequently as well.
I’ll take our crop disease identification research for example. We are diagnosing disease in real-time from pictures taken in the field. There are two main sources that could cause the distribution differences. The first one is how people take pictures. The way your target users take pictures may be different from the way those training pictures are taken. We know many convolutional models are not good at handling scale and rotation variances. If the variances are large, it will be a problem that could not be resolved by data augmentation, or by utilizing a transformation-based network such as spatial transformer network or deformable convolutional layers. One simple and efficient way to tackle this could be increasing the diversity of collected data and restrict the way user taking pictures.
The second, more challenging source of distribution difference is only having access to data from a subset of categories (which cannot represent the whole population). In the disease identification example, when we collect corn disease images, the images are only from a certain types of corn hybrids (type of seeds). The disease symptoms may look very different from hybrid to hybrid. Can we generalize our model to most or even all hybrids?
The answer is yes, but probably with some domain knowledge. The data distributions are different in raw image space does not mean they have to be different in other spaces. If we can find some latent space that our training data can be a good representation of the general population, we can just build models in that latent space. Another way to tackle this is to make the learned features being more general instead of being specific to several hybrids or specific to your dataset. This can be done by enforcing certain regularizations or partially engineering the features rather than letting the model learn the features all by itself.
Another interesting issue can be the necessity of adding extra classes. For example, we like to have healthy-leaf class and non-leaf class in addition to different diseases. The non-leaf class can be trained with a diverse of non-leaf pictures (you may be careful selecting these images). However, the healthy-leaf class could be more tricky than you think. The reason is that it is not an exclusive class. Being more specific, diseased leaves may also contain tissues that are healthy. The way softmax classifier works is discriminative and exclusive, i.e., the one with highest score will be selected as the final class. Whenever the healthy parts activate neurons more than the diseased parts do, the leaf will be classified as healthy.
Detection frameworks could be a good solution for the above issue. This leads to another question, detection vs. classification. A detection framework does allow you to identify and locate multiple objects in an image, but it’s also a slightly harder problem so the accuracy may not be as good. Moreover, many popular frameworks today like faster R-CNN, single shot detection (SSD), R-FCN, or YOLO, do require ground truth of bounding boxes, which might be costly to have. Besides, unlike objects in dataset like ImageNet, COCO, many objects may not have a well defined shape. It makes less sense to model the conditional probability of a bounding box given part of the object (a subwindow or proposal in an image). There are certainly ways to work around this but the work and effort may be more complex.
Some less critical but also common challenges include: what if the diversity is too large within a class? Does it benefit to divide further into subclasses so the features are more consistent? What if some classes have much fewer training samples? What’s the best way to manage cases when certain misclassification is less tolerant than others? Those are all practical issues to consider.
The recent popularity of generative adversarial networks (GANs) and its improved variants like Wasserstain-GAN provide us a new way to model data distribution and leverage large amount of unlabeled data. Though semi-supervised learning does require further research, it could be helpful to try out these models to see if the unlabeled data does boost the performance of your model.
While convolutional models are successful in extracting useful features across different vision problems, this is not the case for sequential data. Sequential data is more problem dependent and each type of data may have its own characteristics. LSTM achieving good results in language processing does not necessarily lead to success for other types of sequence data, and it may not be helpful to add more hidden nodes or stack multiple LSTM layers. Getting the right architecture may be your first critical and time consuming step.
There is a lot of buzz right now about AI and people are interested in trying out deep learning models for different problems. However, to solve a practical problem, always try your best using traditional machine learning methods (unless they are classic DL problems). In many cases it can give you better insight whether deep neural network is a promising direction to go with, and it can also serve as a baseline model. If your traditional model is equally good and the features are more interpretable, you do not want to deploy a black-box (or grey-box) model just for marketing purposes. Or if you do, you may want to make sure your DL model is generalizable and as efficient as your traditional ML models.
Also posted at company tech blog with minor changes here.
Thanks Wei for sharing your insights with us. Great article!