DOI: 10.1126/science.aaz7597
The foundation of efficient robot learning
Leslie Pack Kaelbling
2020-08-21
Journal: Science
Year: 2020
Abstract: The past 10 years have seen enormous breakthroughs in machine learning, resulting in game-changing applications in computer vision and language processing. The field of intelligent robotics, which aspires to construct robots that can perform a broad range of tasks in a variety of environments with general human-level intelligence, has not yet been revolutionized by these breakthroughs. A critical difficulty is that the necessary learning depends on data that can only come from acting in a variety of real-world environments. Such data are costly to acquire because there is enormous variability in the situations a general-purpose robot must cope with. It will take a combination of new algorithmic techniques, inspiration from natural systems, and multiple levels of machine learning to revolutionize robotics with general-purpose intelligence.

Most of the successes in deep-learning applications have been in supervised machine learning, a setting in which the learning algorithm is given paired examples of an input and a desired output and learns to associate them. For robots that execute sequences of actions in the world, a more appropriate framing of the learning problem is reinforcement learning (RL) (1), in which an “agent” learns to select actions to take within its environment in response to a “reward” signal that tells it when it is behaving well or poorly. One essential difference between supervised learning and RL is that the agent's actions substantially influence the data it acquires; the agent's ability to control its own exploration is critical to its overall success. The original inspirations for RL were models of animals learning behavior through reward and punishment.

If RL is to be applied to interesting real-world problems, it must be extended to handle very large spaces of inputs and actions and to work when rewards may arrive long after the critical action was chosen. New “deep” RL (DRL) methods, which use complex neural networks with many layers, have met these challenges and have produced stunning results, including solving the games of chess and Go (2) and physically solving Rubik's Cube with a robot hand (3). They have also found useful applications, including improving the energy efficiency of computer installations. On the basis of these successes, it is tempting to imagine that RL might completely replace traditional methods of engineering for robots and other systems with complex behavior in the physical world. There are technical reasons to resist this temptation.

Consider a robot that is designed to help in an older person's household. The robot would have to be shipped with a considerable amount of prior knowledge and ability, but it would also need to be able to learn on the job. This learning would have to be sample efficient (requiring relatively few training examples), generalizable (applicable to many situations other than the ones it learned from), compositional (represented in a form that allows it to be combined with previous knowledge), and incremental (capable of adding new knowledge and abilities over time). Most current DRL approaches do not have these properties: They can learn surprising new abilities, but generally they require a great deal of experience, do not generalize well, and are monolithic during training and execution (i.e., neither incremental nor compositional). How can sample efficiency, generalizability, compositionality, and incrementality be enabled in an intelligent system?
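To make the agent-reward-action loop concrete, here is a minimal sketch of tabular Q-learning, a classic RL algorithm, on a hypothetical five-state corridor task; the environment, reward function, and hyperparameters are all illustrative assumptions, not anything specified in the article. The epsilon-greedy step shows the point made above: the agent's own action choices determine the data it learns from.

```python
import random

# A minimal sketch of tabular Q-learning on a toy 5-state corridor.
# The environment, rewards, and hyperparameters are illustrative
# assumptions; they are not from the article.

N_STATES, ACTIONS = 5, [-1, +1]   # move left or right along the corridor
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Reward of +1 only for reaching the right end of the corridor."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward

for episode in range(500):
    s = 0
    for t in range(20):
        # Epsilon-greedy exploration: the agent's choices shape its data.
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2, r = step(s, a)
        # Temporal-difference update toward reward plus discounted value.
        best_next = max(Q[(s2, act)] for act in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s2

# Greedy policy after learning: every state should point right (+1).
print({s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES)})
```

Even on this toy problem, note how much repeated experience the update rule consumes; the sample-efficiency concern above is about exactly this cost when each step is a physical action rather than a simulated one.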
Modern neural networks have been shown to be effective at interpolating: Given a large number of parameters, they are able to remember the training data and make reliable predictions on similar examples (4). To obtain generalization, it is necessary to provide “inductive bias,” in the form of built-in knowledge or structure, to the learning algorithm. As an example, consider an autonomous car with an inductive bias that its braking strategy need only depend on cars within a bounded distance of it. Such a car could learn from relatively few examples because only a limited set of possible strategies fit well with the data it has observed. Inductive bias, in general, increases sample efficiency and generalizability. Compositionality and incrementality can be obtained by building in particular types of structured inductive bias, in which the “knowledge” acquired through learning is decomposed into factors with independent semantics that can be combined to address exponentially many more problems (5).

The idea of building in prior knowledge or structure is somewhat fraught. Richard Sutton, a pioneer of RL, asserted (6) that humans should not try to build any prior knowledge into a learning system because, historically, whenever we try to build something in, it has been wrong. His essay incited strong reactions (7), but it identified the critical question in the design of a system that learns: What kinds of inductive bias can be built into a learning system that will give it the leverage it needs to learn generalizable knowledge from a reasonable amount of data, without incapacitating it through inaccuracy or overconstraint? There are two intellectually coherent strategies for finding an appropriate bias, with different time scales and trade-offs, that can be used together to discover powerful and flexible prior structures for learning agents.

One strategy is to use the techniques of machine learning at the “meta” level; that is, to use machine learning offline at system design time (in the robot “factory”) to discover the structures, algorithms, and prior knowledge that will enable the robot to learn efficiently online when it is deployed (in the “wild”). The basic idea of meta-learning has been present in machine learning and statistics since at least the 1980s (8). The fundamental idea is that in the factory, the meta-learning process has access to many samples of possible tasks or environments that the system might be confronted with in the wild. Rather than trying to learn strategies that are good for an individual environment, or even a single strategy that works well in all the environments, a meta-learner tries to learn a learning algorithm that, when faced with a new task or environment in the wild, will learn as efficiently and effectively as possible. It can do this by inducing the commonalities among the training tasks and using them to form a strong prior or inductive bias that allows the agent in the wild to learn only the aspects that differentiate the new task from the training tasks. Meta-learning can be formalized elegantly and generally as a type of hierarchical Bayesian (probabilistic) inference (9), in which the training tasks provide evidence about what the task in the wild will be like, evidence that is then combined with the data actually obtained in the wild.
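One standard way to write that hierarchical Bayesian picture down (a textbook-style formulation assumed here for illustration, not quoted from (9)) treats the factory tasks $D_1, \dots, D_n$ as evidence about shared structure $\phi$, which then acts as the prior for the task-specific parameters $\theta$ learned in the wild:

$$
p(\theta \mid D_{\text{wild}}, D_{1:n}) = \int p(\theta \mid D_{\text{wild}}, \phi)\, p(\phi \mid D_{1:n})\, d\phi,
\qquad
p(\theta \mid D_{\text{wild}}, \phi) \propto p(D_{\text{wild}} \mid \theta)\, p(\theta \mid \phi).
$$

The second factor is what the factory supplies (a posterior over the shared structure); the first is ordinary learning in the wild under the resulting prior, which is why a good factory posterior can make learning in the wild sample efficient.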
The Bayesian view can be computationally difficult to realize, however, because it requires reasoning over the large ensemble of tasks experienced in the factory that might potentially include the actual task in the wild. Another approach is to characterize meta-learning explicitly as two nested optimization problems. The inner optimization happens in the wild: The agent tries to find the hypothesis, from some set of hypotheses generated in the factory, that has the best “score” on the data it has in the wild. This inner optimization is characterized by the hypothesis space, the scoring metric, and the computer algorithm that will be used to search for the best hypothesis. In traditional machine learning, these ingredients are supplied by a human engineer. In meta-learning, at least some of them are instead supplied by an outer “meta” optimization process that takes place in the factory. Meta-optimization tries to find parameters of the inner learning process itself that will enable that learning to work well in new environments drawn from the same distribution as the ones used for meta-learning.

Recently, a useful formulation of meta-learning, called “model-agnostic meta-learning” (MAML), has been reported (10). MAML is a nested optimization framework in which the outer optimization selects initial values of some internal neural network weights that will be further adjusted by a standard gradient-descent optimization method in the wild. The RL² algorithm (11) uses DRL in the factory to learn a general small program that runs in the wild but does not necessarily have the form of a machine-learning program. Another variation (12) seeks to discover, in the factory, modular building blocks (such as small neural networks) that can be combined to solve problems presented in the wild. The process of evolution in nature can be considered an extreme version of meta-learning, in which nature searches a highly unconstrained space of possible learning algorithms for an animal. (Of course, in nature, the physiology of the agent can change as well.)
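As a concrete illustration of the two nested optimizations, the following is a minimal sketch in the spirit of MAML, applied to a toy family of one-dimensional linear-regression tasks. Everything here (the task distribution, the single inner gradient step, the first-order outer update, and all constants) is an assumption made for illustration; full MAML (10) additionally differentiates through the inner adaptation step, which this first-order simplification omits.

```python
import numpy as np

# First-order sketch of MAML-style nested optimization on toy tasks.
# Each "environment" is a 1-D linear regression y = a * x with a
# task-specific slope a; the factory learns an initialization w0 that
# adapts to a new task in one inner gradient step. All constants are
# illustrative assumptions.

rng = np.random.default_rng(0)
ALPHA, BETA = 0.4, 0.1            # inner ("wild") and outer ("factory") step sizes

def task_grad(w, slope, x):
    """Gradient of mean squared error for the model y_hat = w * x."""
    return np.mean(2 * x * (w * x - slope * x))

w0 = 0.0                          # meta-learned initialization
for meta_step in range(2000):     # outer loop: runs in the "factory"
    slope = rng.uniform(-2.0, 2.0)                     # sample a training task
    x = rng.normal(size=20)
    w_adapted = w0 - ALPHA * task_grad(w0, slope, x)   # inner loop: the "wild"
    # First-order meta-update: move w0 along the post-adaptation gradient.
    w0 -= BETA * task_grad(w_adapted, slope, x)

# Deployment: adapt to an unseen task from the same distribution.
new_slope = 1.3
x = rng.normal(size=20)
w_new = w0 - ALPHA * task_grad(w0, new_slope, x)
print(f"init w0={w0:.3f}, adapted w={w_new:.3f}, target slope={new_slope}")
```

The inner loop here is deliberately cheap (one gradient step); all of the expensive sampling of tasks happens in the outer loop, which is exactly the factory-versus-wild trade-off discussed next.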
The more flexibility there is in the inner optimization problem solved during a robot's lifetime, the more resources are needed to learn robustly: example environments in the factory, broken robots in the wild, and computing capacity in both phases. In some ways, this returns us to the initial problem. Standard RL was rejected because, although it is a general-purpose learning method, it requires an enormous amount of experience in the wild. Meta-RL, however, requires substantial experience in the factory, which could make development infeasibly slow and costly. Thus, perhaps meta-learning is not a good solution either. What is left? There are a variety of good directions to turn, including teaching by humans, collaborative learning with other robots, and changing the robot hardware along with the software. In all these cases, it remains important to design an effective methodology for developing robot software. Applying insights from computer science and engineering, together with inspiration from cognitive neuroscience, can help to find algorithms and structures that can be built into learning agents and provide leverage for learning both in the factory and in the wild. A paradigmatic example of this approach has been the development of convolutional neural networks (13).

The idea is to design a neural network for processing images in such a way that it performs “convolutions”: local processing of patches of the image using the same computational pattern across the whole image. This design simultaneously encodes the prior knowledge that objects have basically the same appearance no matter where they are in an image (translation invariance) and the knowledge that groups of nearby pixels are jointly informative about the content of the image (spatial locality). A network designed this way needs far fewer parameters, and hence far less training, than one without convolutional structure. The idea of image convolution comes from both engineers and nature: it was a foundational concept in early signal processing and computer vision (14), and it has long been understood that there are cells in the mammalian visual cortex that seem to perform a similar kind of computation (15).
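The leverage that convolutional structure provides can be made concrete by counting parameters. The sketch below (an illustrative example, not code from the article or from (13)) slides a single shared 3×3 kernel over an image and compares the number of learned weights against a fully connected layer mapping the same input to the same output size.

```python
import numpy as np

# Illustrative sketch: a single-channel 2-D convolution implemented as
# an explicit sliding window, showing weight sharing (one 3x3 kernel
# reused at every image location). Sizes are arbitrary assumptions.

def conv2d(image, kernel):
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The same kernel weights are applied to every patch:
            # this is the translation-invariance inductive bias.
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.random.default_rng(0).normal(size=(32, 32))
kernel = np.ones((3, 3)) / 9.0           # a simple averaging (blur) kernel
print(conv2d(image, kernel).shape)       # (30, 30)

# Parameter counts for one layer mapping this image to that output map:
print("conv parameters:", kernel.size)              # 9
print("dense parameters:", (32 * 32) * (30 * 30))   # 921,600
```

Nine weights instead of roughly a million for the same input-output shapes: that ratio is the sample-efficiency gain that a correct built-in bias buys.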
It is necessary to discover more ideas like convolution: fundamental structural or algorithmic constraints that provide substantial leverage for learning but will not prevent robots from reaching their potential for generally intelligent behavior. Some candidate ideas include the ability to do some form of forward search using a “mental model” of the effects of actions, similar to planning or reasoning; the ability to learn and represent knowledge that is abstracted away from individual objects but can be applied much more generally (e.g., for all A and B, if A is on top of B and I move B, then A will probably move too); and the ability to reason about three-dimensional space, including planning and executing motions through it, as well as using it as an organizing principle for memory. There are likely many other plausible candidate principles. Many other problems will also need to be addressed, including how to develop infrastructure for training both in the factory and in the wild, as well as methodologies for helping humans to specify rewards and for maintaining safety. It will be through a combination of engineering principles, biological inspiration, learning in the factory, and ultimately learning in the wild that generally intelligent robots can finally be created.

References

1. A. Barto, R. S. Sutton, C. W. Anderson, IEEE Trans. Syst. Man Cybern. 13, 834 (1983).
2. D. Silver et al., Science 362, 1140 (2018).
3. OpenAI, arXiv:1910.07113 (2019).
4. M. Belkin, D. Hsu, S. Ma, S. Mandal, Proc. Natl. Acad. Sci. U.S.A. 116, 15849 (2019).
5. P. W. Battaglia et al., arXiv:1806.01261 (2018).
6. R. Sutton, “The bitter lesson”; www.incompleteideas.net/IncIdeas/BitterLesson.html.
7. R. Brooks, “A better lesson.”
8. J. Schmidhuber, Evolutionary Principles in Self-Referential Learning (Technische Universität München, 1987).
9. D. Lindley, A. F. M. Smith, J. R. Stat. Soc. B 34, 1 (1972).
10. C. Finn, P. Abbeel, S. Levine, Proceedings of the 34th International Conference on Machine Learning (2017), pp. 1126–1135.
11. Y. Duan et al., arXiv:1611.02779 (2016).
12. F. Alet et al., Proc. Mach. Learn. Res. 87, 856 (2018).
13. Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Proc. IEEE 86, 2278 (1998).
14. A. Rosenfeld, ACM Comput. Surv. 1, 147 (1969).
15. D. H. Hubel, T. N. Wiesel, J. Physiol. 195, 215 (1968).

Acknowledgments: The author is supported by NSF, ONR, AFOSR, Honda Research, and IBM. I thank T. Lozano-Perez and students and colleagues in the CSAIL Embodied Intelligence group for insightful discussions.
Domain: Climate Change; Resources and Environment
Document type: Journal article
Identifier: http://119.78.100.173/C666/handle/2XK7JSWQ/291202
Collections: Climate Change; Resources and Environment Science
Recommended citation:
GB/T 7714: Leslie Pack Kaelbling. The foundation of efficient robot learning[J]. Science, 2020.
APA: Leslie Pack Kaelbling. (2020). The foundation of efficient robot learning. Science.
MLA: Leslie Pack Kaelbling. "The foundation of efficient robot learning". Science (2020).