Deep Learning: Alchemy or Electricity?
“AI is the new electricity.” - Andrew Ng
In his NIPS “Test of Time” award talk this year, Ali Rahimi said that AI is more like alchemy; see the video here. This prompted a rebuttal by Yann LeCun in a Facebook post; see the post here.
Ali was lamenting the lack of rigor in current deep learning work, where empirical trial-and-error dominates. Practitioners use tricks and techniques, for example batch normalization or dropout, without a rigorous understanding of why they work. Yann came to the defense of deep learning and asserted that applications often get built before one gains theoretical understanding.
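To make those "tricks" concrete, here is a minimal, hypothetical PyTorch model (not code from either talk) that uses both; practitioners drop such layers in because they help empirically, not because theory demands them.

```python
import torch.nn as nn

# Hypothetical minimal classifier showing the kind of tricks in question:
# batch normalization and dropout are added as ordinary layers, and they
# usually help training even without a rigorous theory of why.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),   # normalizes activations across the batch
    nn.ReLU(),
    nn.Dropout(p=0.5),     # randomly zeroes half the activations during training
    nn.Linear(256, 10),
)
```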
I called up my daughter and asked whether that is true in physics, which she is studying in college. Superconductivity and the photoelectric effect were discovered before the theories explaining them were developed. Einstein’s amazing E = mc² came well before nuclear energy. Faraday and Maxwell were near contemporaries: Faraday’s experimental discoveries came first, and Maxwell’s theory followed soon after. In computer science, Alan Turing’s great foundational work in theory came before computers; computational complexity and the analysis of algorithms came after them.
In deep learning, why is theory lagging behind practice?
Because many more resources are devoted to the latter than to the former. With industry now leading AI work, there is significant interest in practical applications, and if black-box techniques work, then fine. Why bother with theory? Ali is right that this is indeed the current state of affairs.
And building that theory will take non-trivial effort. Deep learning works in mathematical spaces with an enormously large number of dimensions. These monsters are non-intuitive, defy our common sense, and we have only begun to glimpse them. One is reminded of spin glasses in physics and of tiling problems in computer science: problems on 2-D lattices with constraints and rules on local couplings that often lead to computational intractability and undecidability. In fact, there has been recent work by Yann LeCun and other researchers applying results from spin glasses to deep networks.
Source: screenshot from a talk by Jarkko Kari titled "Tilings and Undecidability in Cellular Automata." The image shows one evolution step of an Ising model run as a cellular automaton, starting from a random spin configuration. Spin glasses, Ising models, cellular automata, tilings, and the traveling salesman problem all exhibit computationally intractable behavior. See the presentation here.
Deep learning seems analogous to approximation algorithms for NP-hard problems: they work well in practice, though rigorous analysis of the optimization landscapes of such problems has a long way to go. In deep learning, these approximations happen to be largely data-driven, aided by tricks discovered experimentally.
It came as a surprise that the local minima of large neural networks trained for useful AI tasks aren’t as problematic as feared when you optimize in such mysterious spaces. Local minima were a big deal in the neural networks of the last century; we now have a better understanding. As you run stochastic gradient descent (SGD), you slide down the optimization landscape towards a seemingly large basin, unlikely to be hampered by local minima along the way: with so many dimensions, it is rare to be trapped from all directions at once, so you will almost always find some direction along which to continue your slide towards a good-enough minimum. You may slow down at times, but thanks to the stochastic nature of the updates you eventually get lucky, escape through one of the open doors, and continue your march towards the valley.
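As a toy illustration of this intuition (a hypothetical sketch, not an experiment from the talks), the snippet below runs noisy gradient descent on a made-up, rippled 1,000-dimensional surface. The noise and the sheer number of directions keep the loss falling despite the many shallow dips along the way.

```python
import numpy as np

# Toy illustration only: noisy gradient descent on a high-dimensional,
# non-convex surface. The function is invented for this sketch; the point
# is that with many dimensions plus noise, some downhill direction is
# almost always available.
rng = np.random.default_rng(0)
d = 1_000                      # number of dimensions

def loss(x):
    # Convex bowl plus a sinusoidal ripple that creates many shallow dips.
    return np.sum(x**2 + 2.0 * np.sin(3.0 * x))

def grad(x):
    return 2.0 * x + 6.0 * np.cos(3.0 * x)

x = rng.normal(size=d) * 3.0   # random start, far from the bottom
lr = 0.01
for step in range(2_000):
    g = grad(x) + rng.normal(scale=0.5, size=d)  # "stochastic" gradient
    x -= lr * g
    if step % 500 == 0:
        print(f"step {step:5d}  loss {loss(x):.1f}")
print("final loss:", round(loss(x), 1))
```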
All this is governed by hyperparameters and tricks, which pose their own problem. We don’t rigorously grasp how they interact with each other, with the underlying complexity of the task, or with the size of the training data, nor how to select them quickly for the very best results, leaving us no choice but to invest time in experiments. You may have to babysit the process. If we have fast enough hardware or enough budget for the cloud, we can spawn several such experiments, guided primarily by past experience, and gradually get closer to a solution we are happy with. Then you can do inference, which again happens in high-dimensional spaces, transforming points on high-dimensional manifolds of cats and dogs with complex geometries into probability values.
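A minimal, hypothetical sketch of what "spawning several experiments" can look like in practice: plain random search over a small hyperparameter space. The search space, the `train_and_evaluate` stand-in, and its scoring formula are all invented here purely for illustration.

```python
import random

# Illustrative random search over hyperparameters. In real use,
# train_and_evaluate would launch an actual training run and return a
# validation metric; here it is a made-up stand-in so the sketch runs.
search_space = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "batch_size": [32, 64, 128],
    "dropout": [0.0, 0.3, 0.5],
}

def train_and_evaluate(config):
    # Stand-in score: pretends runs near lr=1e-3 with some dropout do best.
    rng = random.Random(str(sorted(config.items())))
    base = 0.85 - abs(config["learning_rate"] - 1e-3) * 50
    return base + 0.05 * config["dropout"] + rng.uniform(-0.02, 0.02)

def random_search(n_trials=20, seed=0):
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = {k: rng.choice(v) for k, v in search_space.items()}
        score = train_and_evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

print(random_search())
```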
Mathematicians should now start rolling up their sleeves, as there is a clear need for theorems. And, hopefully, academia will take the lead here.
As deep neural networks become more mainstream, work is already underway to overcome their weaknesses. Max pooling gives limited local translation invariance: parts of an object can jiggle a little bit and that is still okay. And deep network layers build a hierarchical representation of an object, from parts to whole. We understand that well. But how do we incorporate general pose invariance? Recent novel work by Geoffrey Hinton on capsule networks is a step in that direction. This is very cool, because improvements and understanding are like twins that grow up together, supporting each other.
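A small illustrative sketch (assuming PyTorch; not code from the capsule-network work) of what "limited local translation invariance" means: a one-pixel jiggle that stays inside a 2x2 pooling window leaves the pooled feature map unchanged, even though the raw pixels clearly change.

```python
import torch
import torch.nn.functional as F

# A bright pixel stands in for a "part" of an object. Shifting it by one
# pixel keeps it inside the same 2x2 pooling window, so the pooled map is
# identical. A larger shift would change the output, which is why the
# invariance is only local.
image = torch.zeros(1, 1, 8, 8)
image[0, 0, 2, 2] = 1.0                        # a small "part" of an object
shifted = torch.roll(image, shifts=1, dims=3)  # jiggle it right by one pixel

pooled = F.max_pool2d(image, kernel_size=2)
pooled_shifted = F.max_pool2d(shifted, kernel_size=2)

print(torch.equal(pooled, pooled_shifted))  # True: pooling hides the jiggle
print(torch.equal(image, shifted))          # False: raw pixels do not
```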
Ali Rahimi made a great point highlighting the need for rigor and a better understanding of how deep networks work. And Andrew Ng is right about the wide applications of AI. There is no conflict between their viewpoints. AI will turn out to be a very successful engineering field, one that will inspire mathematicians to give it rigor.
It is likely that practice and theory, applications and insights, will progress hand in hand in the future. We have the Faradays of AI; in the future, there will be Maxwells. Yes, though we can barely visualize more than 3 or 4 dimensions, we can still tame millions of them through the magic of math and science, two great creations of the human mind.
Update: See the follow-up post "New results in Deep Learning: Making AI recognize Taj Mahal consistently".