Machines are learning to process simple commands by exploring 3-D virtual worlds.
Devices like Amazon’s Alexa and Google Home have brought voice-controlled technology into the mainstream, but they still handle only simple commands. Making machines smart enough to hold a real conversation remains a very tough challenge. And it may be difficult to achieve without some grounding in the way the physical world works.
Attempts to solve this problem by hard-coding relationships between words, objects, and actions require endless rules, leaving a machine unable to adapt to new situations. And efforts to have machines learn about language for themselves normally require substantial human assistance.
Now teams at DeepMind, an AI-focused subsidiary of Alphabet, and Carnegie Mellon University have developed a way for machines to figure out simple principles of language for themselves inside 3-D environments based on first-person shooter computer games.
“Being able to do this in 3-D is definitely a significant step closer to being able to do it in the real world,” says Devendra Chaplot, a master’s student at CMU who will present his paper at the annual meeting of the Association for Computational Linguistics. The ultimate goal, he says, is creating a simulation so close to real life that an AI trained in it can transfer what it learns into the real world.
Both the DeepMind and CMU approaches use deep reinforcement learning, popularized by DeepMind’s Atari-playing AI. A neural network is fed raw pixel data from a virtual environment and uses rewards, like points in a computer game, to learn by trial and error (see “10 Breakthrough Technologies 2017: Reinforcement Learning”).
Normally, the goal might be to achieve a high score inside the game, but here the two AI programs were given commands like “go to the green pillar” and then had to navigate to the correct object to receive rewards.
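To make the trial-and-error setup concrete, here is a minimal sketch (not the papers’ code; the environment, names, and reward values are stand-ins) of how an agent can be rewarded only when it reaches the object named in a command. A real agent would update a neural network from the pixels, the instruction, and the rewards rather than acting randomly.

```python
# Toy illustration of instruction-conditioned reward, assuming a simple grid
# stand-in for the 3-D simulator. All names and numbers are hypothetical.
import random


class ToyGridWorld:
    """Hypothetical stand-in for the simulator: a 5x5 grid with one target object."""
    ACTIONS = ["up", "down", "left", "right"]

    def __init__(self):
        self.agent = (0, 0)
        self.target = (random.randint(0, 4), random.randint(0, 4))
        self.instruction = "go to the green pillar"  # example command from the article

    def step(self, action):
        x, y = self.agent
        dx, dy = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}[action]
        self.agent = (min(max(x + dx, 0), 4), min(max(y + dy, 0), 4))
        reached = self.agent == self.target
        reward = 1.0 if reached else -0.01  # reward only for reaching the named object
        return reward, reached


# Trial and error over many episodes at accelerated speed.
for episode in range(1000):
    env, total = ToyGridWorld(), 0.0
    for _ in range(50):
        reward, done = env.step(random.choice(ToyGridWorld.ACTIONS))
        total += reward
        if done:
            break
```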
By running through millions of training scenarios at accelerated speeds, both AI programs learned to associate words with particular objects and characteristics, which let them follow the commands. They even learned to understand relational terms like “larger” or “smaller” to discriminate between similar objects.
Most important, both programs could “generalize” what they learned to unseen situations. If training scenarios contained pillars and also red objects, they could carry out the command “go to the red pillar” even if they had never seen a red pillar during training.
This makes them much more flexible than previous rule-based systems, says Chaplot. The CMU team mixed the visual and verbal input in a way that focused the AI’s attention on the most relevant information, while DeepMind gave their system extra learning objectives—like guessing how its view will change as it moves—that boosted its overall performance. Since the two approaches tackle the problem from different angles, Chaplot says combining them could provide even better performance.
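One common way to fuse visual and verbal input so that the network attends to instruction-relevant features is multiplicative gating: the instruction is encoded into a vector that scales each channel of the convolutional image features. The sketch below (a PyTorch illustration with made-up layer sizes, not the CMU or DeepMind architecture) shows the general idea.

```python
# Minimal sketch of gating convolutional features with an instruction embedding.
# Layer sizes and vocabulary are illustrative assumptions.
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    def __init__(self, vocab_size=32, embed_dim=64, channels=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, embed_dim, batch_first=True)
        self.gate = nn.Linear(embed_dim, channels)  # one gate value per feature channel
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, channels, 4, stride=2), nn.ReLU(),
        )

    def forward(self, frames, instruction_tokens):
        # frames: (batch, 3, H, W) raw pixels; instruction_tokens: (batch, T) word ids
        feats = self.cnn(frames)                     # (batch, C, h, w) visual features
        _, hidden = self.gru(self.embed(instruction_tokens))
        gate = torch.sigmoid(self.gate(hidden[-1]))  # (batch, C) relevance weights
        # Scale each visual channel by how relevant it is to the instruction.
        return feats * gate.unsqueeze(-1).unsqueeze(-1)
```

The gated features would then feed a policy network trained with reinforcement learning; an auxiliary objective like predicting the next frame, as described above for DeepMind’s system, would simply add another loss term on top of the same representation.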
The DeepMind researchers didn’t respond to requests for comment.
“These papers are preliminary, but I think it’s pretty exciting to see the progress they are making,” says Pedro Domingos, a professor at the University of Washington and the author of The Master Algorithm, a book about different machine-learning methods.
The research follows a trend in AI that involves bringing together hard problems like language and robotic control. Counterintuitively, this makes both challenges easier, he says. That’s because understanding language is easier if you can access the physical world it refers to, and learning about the world is easier with some guidance.
Because millions of training runs are required, Domingos is not convinced pure deep reinforcement learning will ever crack the real world. He thinks DeepMind’s AlphaGo, often held up as a benchmark of AI progress, actually shows the importance of incorporating a variety of AI approaches.
Michael Littman, a professor at Brown University who specializes in reinforcement learning, says the results are “impressive” and the visual input is far more challenging than that used in earlier work. Most previous attempts to use simulators to ground language have been restricted to simple 2-D environments, he notes.
But Littman echoes Domingos’s concerns about the approach’s real-world scalability and points out that the commands are generated in a formulaic way based on goals set by the simulator. That means they’re not really representative of the imprecise and contextual commands humans are likely to give machines in real life.
“I’m worried that people might look at the examples of the network responding intelligently to verbal commands and extrapolate that these networks understand language and navigation much more deeply than they actually do,” says Littman.