Deep Learning, Deep Reasoning & Dense Models
Many recent advances in computer science have been driven by the convergent availability of large amounts of data and of fast machines on which to analyse them. This availability has enabled us to acquire implicit partial models of the underlying generators for the data and apply those models to tasks such as translation, transcription, and image captioning. To date, though, few if any of these models have been dense, in the sense of thoroughly modelling some aspect of the world in a way that can facilitate any relevant task. Dense models should at least support:
- Prediction: What might happen next in this situation, or what might be true in the vicinity?
- Interpolation: What may have happened between these situations? What might be located between these things?
- Causal reasoning: Why did this happen?
- Purpose reasoning: What is this configuration of things for? For what purpose is that happening?
- Task performance: The model should be able to aid (e.g.) a robot or on-line agent performing a domain task.
- Explanation: The model should be at a level that supports communication.
In short, a dense model is the sort of model — including both implicit and explicit components — humans form about aspects of their worlds: aspects like meetings, plants, lawnmowers, rivers and kitchens. These models support pretty much any kind of relevant reasoning.
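One way to make the capability list concrete is as an interface: a hypothetical sketch, in Python, of the methods a dense-model implementation might expose, one per capability above. All names and signatures here are illustrative inventions, not drawn from Cyc or any existing system.

```python
# Hypothetical interface for a "dense model", one method per capability
# in the list above. Purely illustrative: no such API exists.
from abc import ABC, abstractmethod
from typing import Any, Sequence


class DenseModel(ABC):
    @abstractmethod
    def predict(self, situation: Any) -> Sequence[Any]:
        """Prediction: what might happen next, or be true in the vicinity?"""

    @abstractmethod
    def interpolate(self, before: Any, after: Any) -> Sequence[Any]:
        """Interpolation: what may have happened, or lie, between these?"""

    @abstractmethod
    def explain_cause(self, event: Any) -> str:
        """Causal reasoning: why did this happen?"""

    @abstractmethod
    def explain_purpose(self, configuration: Any) -> str:
        """Purpose reasoning: what is this configuration of things for?"""

    @abstractmethod
    def advise(self, agent_state: Any, goal: Any) -> Any:
        """Task performance: suggest a next action for a domain agent."""

    @abstractmethod
    def describe(self, item: Any) -> str:
        """Explanation: communicate the model's account at a human level."""
```

The point of writing it this way is that density is the requirement that *one* model back all six methods over the same domain, rather than a separate specialised model per task.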
These dense models are also the sorts of models that builders of large-scale “common-sense” knowledge bases (KBs) have been working to construct. But, to date, although some such knowledge bases support particular instances of each kind of reasoning task, they do not do so comprehensively, even within quite narrow domains. Cyc, which Doug Lenat, the team at Cycorp, and I have been building for many years now, and which Lucid is commercialising, comes closest; but it remains incomplete. Although some work is being done on automating KB construction, this work generally aims at breadth, rather than depth and density.
Can YouTube be the Font of All Knowledge Bases?
Similarly, although machine vision and NLP researchers have long discussed using background knowledge in scene and text understanding, demonstrating that utility in any general way has been hampered by the vast incompleteness of available KBs.
The time is ripe for a 5-10 year AI challenge problem in production of dense models directly from data and pre-existing background knowledge bases. As a particular example, kitchens are somewhat limited in complexity, from a human point of view, and are densely modelled by most humans; we are not frequently surprised by what we find in a kitchen, or by what happens there. And we are not lacking for data; there are more than 6 million YouTube hits for “kitchen”, and around 5 million for “cooking”. If each were a mere 1 minute long, this represents 22 years of kitchen video. Dull perhaps, but also, presumably, enough grist for building a very dense model indeed. The proposed challenge is this: to have computers automatically build, from just the vast amount of video found on the web, a sufficiently dense local world model to enable that video to be thoroughly understood for prediction, interpolation, explanation and the other tasks that count, in humans, as understanding.
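The video-hours arithmetic is easy to check. Using the round figures of 6 million and 5 million hits at one minute each (the actual counts are "more than" these, which is where the extra year or so comes from):

```python
# Back-of-envelope check of the kitchen-video figure:
# ~6 million hits for "kitchen" plus ~5 million for "cooking",
# each assumed to be a mere one minute long.
hits = 6_000_000 + 5_000_000      # round hit counts from the post
minutes_per_year = 60 * 24 * 365  # minutes in a (non-leap) year
years = hits / minutes_per_year
print(round(years, 1))            # on the order of two decades of footage
```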
This post is based on a talk I gave recently at Carnegie Mellon University, very kindly hosted by Drs Alex Hauptmann and Florian Metze, and inspired by work with the SRI Sarnoff vision group, Cycorp, and by ancient work with the CMU Informedia and Sphinx Groups, all of whose members have my deep gratitude. This idea, like many others, was incubated at a seminar at Schloss Dagstuhl at which the organisers and the German government were wonderfully accommodating.
In your Gaia paradigm, at what stage does such a constraint get 'confirmed', and is it at least soft? E.g.:
D: Robot, we have Egg Beaters in the fridge, so please use that for my omelette.
R: I'm sorry Dave, I can't do that. I can't make an omelette if I don't break any eggs.
Vijay Saraswat: I'm not sure, yet, what it means to do that sort of diagnosis and repair in sophisticated professional domains. One idea does occur to me, though. I've thought for some time now that the virtual environments of video games, while impoverished by any measure compared to the real world, do provide a rich enough environment that a real agent might have a complex "life" within them. Certainly rich enough to serve as a test-bed for learning by observation and experimentation. The work from Google DeepMind only serves to reinforce this point with respect to very simple environments.

The idea, then, is to co-evolve the agent and the world, so that the world evinces the environmental behaviours that are required for learning. From the computer learner's point of view, there is little to distinguish the world, as something with behaviours that must be modelled, from the agent. The world, then, would learn to challenge the agent, while still accurately reproducing the behaviour of the "real" world (as found in training data).

In the DeepMind "breakout" case, there would be two learning tasks: 1) learn to maximise the score, by controlling the paddle position input; 2) learn to produce the sequence of screen images in the breakout game, given the paddle position input and the screen image history. Now, task 2 will be challenging, but not, I think, ludicrously challenging (breakout is, after all, a simple world IF one understands that the world is made of walls, spaces, and bricks with uniform behaviour). And learning that structure is the essence of the Dense Model task.

An agent-world learning pair of this sort would learn both the agent (the paddle twitcher) and the "world agent" (I'm tempted to assign the name "Gaia" here [General Acquired Interactive Ambiance]). Now, in the case of breakout, of course, we don't need this Gaia in which to practise, because we *have* a breakout program. But usually we don't.
RoboWatch might benefit mightily from a virtual stovetop, virtual eggs, and virtual pans and spatulas with which to practise making omelettes. In RoboWatch, the robot learns that breaking eggs is a first step in making omelettes, but it would be up to the Gaia learner, using the same videos and the robot's actions, to learn to enforce the fact that in the real world, you can't make an omelette without breaking eggs. This would allow the robot to confirm such a hypothesis by experimentation. So, that's two steps of generalisation. Of course the worlds that exercise professional competence are even more complex, and the learning tasks even more challenging, but I think that the notion of exploiting the isomorphism between learning to act in the world, and learning to be the world, applies just as strongly here.
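The two-task pairing described above can be sketched in miniature. The following is a drastic simplification, assuming a toy one-dimensional "breakout" in place of the Atari game: the transition table standing in for the learned "Gaia" world model, and the ball-tracking paddle policy standing in for the score-maximising agent, are both illustrative inventions.

```python
# Toy sketch of the agent/"Gaia" pairing: task 2 (learn the world) is
# reduced to filling a transition table by watching the real game, and
# task 1 (learn to act) is a paddle policy practised inside that
# learned world rather than the real one.
import random

WIDTH = 5  # cells; the ball bounces between cell 0 and cell WIDTH-1


def real_step(state, action):
    """Ground-truth world: ball moves and bounces; paddle follows action."""
    ball, vel, paddle = state
    ball += vel
    if ball in (0, WIDTH - 1):
        vel = -vel  # bounce off a wall
    paddle = max(0, min(WIDTH - 1, paddle + action))
    return (ball, vel, paddle)


# --- Task 2: learn the world by observing (state, action) -> next state ---
gaia = {}  # the "General Acquired Interactive Ambiance", as a lookup table
state = (2, 1, 2)
for _ in range(500):
    action = random.choice((-1, 0, 1))
    nxt = real_step(state, action)
    gaia[(state, action)] = nxt
    state = nxt


# --- Task 1: practise a policy inside the learned world ---
def agent(state):
    """Move the paddle toward where the ball will be next."""
    ball, vel, paddle = state
    target = ball + vel
    return (target > paddle) - (target < paddle)


state = (2, 1, 2)
hits = 0
for _ in range(100):
    action = agent(state)
    # Use the learned model; fall back to reality for unseen transitions.
    state = gaia.get((state, action)) or real_step(state, action)
    ball, _, paddle = state
    hits += (ball == paddle)
print("paddle under ball on", hits, "of 100 steps")
```

In the real proposal the lookup table would of course be a learned predictive model of screen images, and the fallback to `real_step` is exactly what is unavailable when, unlike breakout, we don't *have* a program for the world.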
This seems to be happening more quickly than even I would have expected: the RoboWatch project (http://robo.watch/#sec1) is learning tasks from YouTube; here's a story that explains what they are doing (http://www.extremetech.com/extreme/220110-robots-can-now-learn-household-tasks-by-watching-youtube-videos) and connects it to work at Carnegie Mellon University (West) on video summarization (https://www.cmu.edu/silicon-valley/research/smartspaces/). This is separate from the work at CMU in Alex Hauptmann's group that inspired the Dense Model Challenge in the first place.
Michael, there is one important element that I think you miss. I believe it is also very important for us to start developing mechanized theories for self-diagnosis and self-repair in deep domains (and for a variety of tasks): for learning how to learn from the get-go. That is, for bots (built on deep domain models) to start performing the variety of tasks you describe at high levels of competence, it is critical that they be able to operate in the real world (e.g. in a safe environment such as test-taking), make mistakes, and learn from these mistakes: by fixing elements of their theory, by adjusting probabilities, by determining that there are concepts they know nothing about (and hence need to go hit the books / webpages etc. to find out more about), that there are procedures they need to learn, and so on. We have got to close the feedback loop and build in a controller that can help improve performance over time.

For this, some form of reflection — some knowledge about what is known and how it is known, and what is not known — should be critical, together with procedures to diagnose and repair. In the limited context of reinforcement learning (where the world is a POMDP, an unambiguous reward signal is available, and the domain of operation is extremely limited), as Gerry Tesauro showed in the early 90s and the DeepMind people showed more recently, the right learning algorithm applied over millions of interactions (e.g. via self-play) can dramatically improve performance. But how do we apply these ideas to the setting of learning professional-level competence on a variety of cognitive tasks over deep domains? What aspects of the structure of the domain knowledge and of the tasks involved need to be reflectively reified so that the bot can reason about causal chains, track dependencies, assign blame, and know how to repair, automatically? Is there work out of Cyc that bears on this?
Very good idea. The economics would work better, however, if one were to choose a domain which robots are now starting to enter, e.g. the management of a warehouse or a greenhouse/vineyard. This would give you a closed-loop system where the model extracted from the video feed could immediately be put to the test by the actuators on the robots. Gains in the management of the {ware|green}house would pay for the construction of the model in a positive feedback loop. Once you are done modelling environment X, you can use the lessons learned to move to the next domain Y where the economics equally make sense (self-driving cars, perhaps?). Once you have done this for two domains X and Y, you can start exploring industrial processes that can be automated *across* the two domains, and so on, until what used to be separate islands of densely modelled parts of the world all get connected with one another, each time generating enough gains/savings to justify the investment.