What is so “deep” about Deep Learning?
Recently my interest in AI and deep learning was rekindled when Andrew Ng announced his new specialization on Coursera. After completing the first course of the sequence (which I enjoyed very much), I’d like to share some thoughts. Some 15 years ago, during one of our lunch breaks, my PhD supervisor, Prof. Winfried Pohlmeier, exclaimed: “These artificial neural networks are just a fancy name for non-linear econometric models” – and he was very right about this, at least when it comes to supervised learning.
In an interview with Andrew Ng, Geoffrey Hinton mentioned that backpropagation and the whole “learning” bit is basically not much more than taking first derivatives of various functions in order to apply gradient descent – ideas that have been around for a very long time (gradient descent itself is usually traced back at least to Cauchy’s work in the 1840s). Next time someone claims “But I will never use linear algebra and (matrix) calculus in my job” you can easily prove them wrong. 😊
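To make that concrete, here is a minimal sketch (in Python, with a toy one-parameter loss of my own choosing, not anything from the course) of what “learning” by gradient descent boils down to: compute a first derivative and take a small step against it, over and over.

```python
# Toy example: minimise f(w) = (w - 3)^2, whose derivative is f'(w) = 2 * (w - 3).
# The starting point and learning rate are arbitrary illustrative choices.

def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0              # initial guess
learning_rate = 0.1  # step size

for step in range(50):
    w -= learning_rate * grad(w)  # move against the gradient

print(round(w, 4))  # approaches 3.0, the minimiser of the loss
```

Backpropagation is “just” an efficient way of computing those first derivatives for every parameter in the network, after which the update rule is exactly this one line.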
Andrew Ng also jokingly reflected that deep learning became rather popular after it was rebranded as “deep” – before that it was just plain old “neural networks with many hidden layers” – yawn. We like “deep” – it’s just cool.
It’s also quite amazing to realize that one of the recent “breakthroughs” in deep learning was to replace sigmoid functions (e.g. the logistic function) with the so-called rectified linear unit (ReLU), which is simply f(x) = max(0, x). It has the nice property that its derivative is equal to one – and thus “not close” to zero – for any x > 0, as opposed to sigmoid functions like the logistic or tanh, whose derivatives tend to zero for very small or very large x. Having a function with this property makes learning (aka gradient descent) faster. It is interesting to reflect that such simple ideas can have a big impact.
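As a small illustration (a NumPy sketch I put together, not any particular framework’s implementation), compare the gradients of the two activations: the sigmoid’s derivative collapses towards zero once the input is far from zero, while ReLU’s stays at exactly one for any positive input, so the gradient signal does not shrink as it flows backwards.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # tends to 0 for large |x| (the "vanishing gradient")

def relu_grad(x):
    return (x > 0).astype(float)  # exactly 1 for any x > 0, else 0

xs = np.array([-10.0, -1.0, 0.5, 5.0, 10.0])
print(sigmoid_grad(xs))  # roughly [4.5e-05, 0.197, 0.235, 0.0066, 4.5e-05]
print(relu_grad(xs))     # [0., 0., 1., 1., 1.]
```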
Had it not been for the millions (maybe billions?) of pictures of cats out there and the rapid advance in computing power, we would still be doing old-fashioned logistic regressions and “estimating parameters” using “maximum likelihood” rather than “deep learning” using “back prop”. 😊
Great article, Valeri Voev! I very much agree with it! The real reason for the deep learning hype lies in the accelerated optimization/learning made possible by the significant improvement in hardware (GPUs), and not in the theory itself.
I should also mention that the above cat image hasn't been published before and I'd like to donate it to the image recognition community 🙂