LOCOMOTION, VISION AND INTELLIGENCE

HANS P. MORAVEC
THE ROBOTICS INSTITUTE
CARNEGIE-MELLON UNIVERSITY
PITTSBURGH, PA 15213

2 JULY 1983

The thoughts presented here never appeared in research proposals, but nevertheless grew at the Stanford University Artificial Intelligence Laboratory over the years 1971 through 1980 under support from the National Institutes of Health, the Defense Advanced Research Projects Agency, the National Science Foundation and the National Aeronautics and Space Administration, and more recently at the Carnegie-Mellon University Robotics Institute under Office of Naval Research contract number N00014-81-K-0503.

ABSTRACT

Much robotics work is motivated by a vision of machines able to function with human competence in intellectual, perceptual and physical endeavors. This paper addresses the questions "What are the major missing ingredients for intelligent robots?", "What directions of exploration are most profitably pursued to obtain them?", "When can human performance be expected?" and "What consequences of such machines can we anticipate?". I believe that computer size and speed is the pacing factor in the development, and that broad human level performance will be achieved as processors with 10^12 bytes of memory able to do 10^12 instructions per second become widely available in about twenty years. I also believe that mobile robotics leads most directly to the necessary software and mechanics. Superhuman performance by machines in many narrow intellectual and physical areas will precede general human level competence. It is hard to overestimate the effect of rapidly evolving superintelligent machines on humanity and on the universe.

Introduction

The obvious topic of this paper would be a progress report on the CMU Rover, our expensive mobile robot project. Since readily available reports on this have recently been published [17] [3], I feel free to present a wider ranging speculation that is intertwined with my research. It begins with a personal view of the history of Artificial Intelligence.

History

When computers first became widely available in the mid 1950s some visionaries sought to apply their unprecedented power and flexibility to the problems of building mechanisms that thought and acted like humans. To some the power of the available machines seemed more than adequate - like large power machinery before them, computers were built to do work that would otherwise require an army of unaided workers. With that assumption the problem seemed the exciting scientific one of finding the correct algorithms to mechanize thought.

The early fruits of these efforts seemed ample confirmation of the premise. During the late 1950s and early 1960s programs were written that proved theorems in geometry and logic, solved problems in algebra, calculus and wider domains, gave creditable performances in intellectual games, exhibited learning and in general functioned near the epitome of human thought, doing things that only some humans could do well, and no other creatures could do at all [6]. These pioneering programs were almost exclusively in carefully hand-optimized machine language and typically ran in 32K words at 10K instructions/second, on machines like the IBM 650 and 704.

Perhaps encouraged by this first wave of success, a small group at MIT pressed on through the mid 1960s into new areas - hands, eyes and ears to go with the artificial brains. The winning streak seemed to hold.
In a few years systems controlling remote manipulator arms [5], seeing through computer-connected scanners and cameras, and interpreting three dimensional scenes of simple planar-faced objects [20], began to work, eventually together, to accomplish tasks like clearing a tabletop of scattered blocks [13]. Like the early high-level reasoners, these pioneering perceptual and motor programs were written in machine language. They ran in about 64K words of memory on early transistor computers like the TX-2 and the PDP-1, at 50K instructions/second.

There was a qualitative difference between the relative performances of the eye-hand systems and the reasoning systems. Whereas at their best the problem solvers could mimic competent adult humans, the robot systems rarely achieved the co-ordination of a four year old. Some of the difference was ascribed to the greater difficulty of working with the robots - more than half the effort went into building and simply maintaining complicated and trouble-prone hardware. The pure thought programs, on the other hand, operated in a much cleaner environment, as by this time computers were pretty reliable. I think it was also suspected that workers good at the manual skills required for dealing with robot hardware were likely to be second-rate when it came to the theoretical insights necessary for thinking computers [9].

The mid to late 1960s saw a flowering of interest, effort and money in the area. Among its pioneers the task of giving computers intelligence became a full time job [14]. Faster, bigger new computers, notably the PDP-6 and the SDS-940, able to handle 128K word jobs at rates of about 100K instructions/second, were harnessed not only to run the smart programs, but to provide a working environment that fostered their development. Time sharing and high level languages became the dominant form of expression, greatly facilitating experimentation, though at some cost. Overheads and inefficiencies in the operating systems, interpreters and non-optimizing compilers ate up most of the size and speed gains of the new machines. Although many new ideas were tried in this period, the performance of these second generation thinking, perceiving and acting programs was not spectacularly better than that of the first generation [15].

In the late 1960s the new centers, running largely on optimism and momentum from the first generation of intelligent programs and their own successful system-building startups, attempted major integrations. Reasoning programs were applied to practical problems, and whole robot systems, preceded by great promises, were built up using the none-too-reliable perceptual and high level methods of the previous work [19]. The promises greatly exceeded the results. The high level reasoners continued to hover at amateur adult performance level in narrow areas, while the robot systems, containing elements such as speech recognition, sight and planning, almost never worked at all. By the early 1970s there were hints of pessimism and cynicism. This negative spirit was made official in 1973 by the report of a British Government commission [12] and in 1974 when ARPA, almost the sole source of funding for this work in the U.S., announced large cutbacks for the major centers.

In the 1970s there was a diversification in the funding and the focus, as small groups independently attacked the many problems that had been revealed in the second generation efforts.
The PDP-6 was replaced by the program-compatible PDP-10, able to execute 400K instructions per second and handle multiple 256K word programs, and then by the KL-10, able to run at a true million instructions/second. Many incremental advances were made, and while little that was amazingly different from the second generation was evident, some of the more conservative programs began to work reliably on a usably wide domain - so much so that commercial applications became possible.

The workaday nature of the progress in the 1970s was disappointing to many who remembered the optimism and momentum of the first generation. Many reasons for the unmet expectations were offered, common among them that the exciting scientific problems so enthusiastically attacked fifteen years before were simply harder than they seemed, and significant further progress would require conceptual breakthroughs, probably to be made by future geniuses.

Now, in the early 1980s, largely because of the immense commercial implications, we find ourselves in a new boom period. There is a cacophony of voices and a cornucopia of opportunities pulling the still small community of workers in all directions. I suppose this paper is my contribution to the melee.

Opinion

I believe the genius theory of the holdup is incorrect. Newton, Maxwell and Einstein are famous for being first to present solutions to important puzzles in physics, but others in the race would have stumbled onto the same answers given only an extra clue or two, or a little more time. As long as there is a steady stream of experimental results and an intellectual ferment trying to make sense of it, the answers will come. The field of artificial intelligence has, and has had, the necessary human prerequisites for rapid progress.

In many ways the task of building intelligent machines is less chancy than comparable problems like the discovery of physical theories or the construction of controllable space transportation systems. In the latter cases there is no prior guarantee that the problem being attacked has a solution, whereas each one of us is a tangible existence proof of the possibility of intelligent mechanisms. The real thrill will come when our enterprise begins to consider entities with superhuman intelligence, an area where the answers can no longer be found at the back of the book.

The presence of naturally evolved intelligence gives us more than an existence proof - it provides, via the evolutionary record, an estimate of the difficulty of the task of designing something like ourselves. Once in a while it offers a glimpse at the blueprints. Relating this information to the experiences of the intelligent machine effort provides grounds, I believe, for some specific predictions.

Lessons from the Evolution of Life

The process that developed terrestrial intelligence is, by best evidence and semantics, not itself intelligent. While high intelligence permits limited peeks into hypothetical futures, most powerfully by abstractions that allow whole classes of possibilities to be examined as single entities, Darwinian evolution tests alternatives individually in the real world. The natural process requires many more experiments to explore the parameters of a design, and thus is slow, but in the long run its limitations are not so different from those of the artificial method. When the solution space of a problem is dense, successful designs are generated rapidly, and locally optimum solutions are soon discovered.
If the space is sparse both methods take a long time to stumble upon an answer, and may fail to do so indefinitely. The evolutionary record is thus a rough guide to the difficulty of achieving particular design goals. If a function exists in nature, diligent effort should reveal an artificial counterpart. Success is particularly likely if the function evolved rapidly and more than once.

Besides confidence and timing, a naturally evolved solution can provide design hints. Though we understand few natural structures well enough to do a complete "reverse engineering", parameters such as scale and complexity are usually evident. We can guess, for instance, that to uproot and move trees we need a machine with about the size and power of an elephant. The next section applies this reasoning to intelligence.

Lessons from Natural Intelligence

The intelligent machine effort has produced computer programs that exhibit narrow reasoning abilities at the performance level of amateur adult humans, and perceptual and motor skills on a par with a grasshopper. The level of research effort on the two areas is the same. Why do the low level skills seem so much harder than the high level ones?

The human evolutionary record provides the clue. While our sensory and muscle control systems have been in development for a billion years, and common sense reasoning has been honed for probably about a million, really high level, deep, thinking is little more than a parlor trick, culturally developed over a few thousand years, which a few humans, operating largely against their natures, can learn [2] [8] [21]. As with Samuel Johnson's dancing dog, what is amazing is not how well it is done, but that it is done at all.

Computers can challenge humans in intellectual areas, where humans perform inefficiently, because they can be programmed to carry on much less wastefully. An extreme example is arithmetic, a function learned by humans with great difficulty, which is instinctive to computers. These days an average computer can add a million large numbers in a second, which is more than a million times faster than a person, and with no errors. Yet one hundred millionth of the neurons in a human brain, if reorganized into an adder using switching logic design principles, could sum a thousand numbers per second. If the whole brain were organized this way it could do sums one hundred thousand times faster than the computer.

Computers do not challenge humans in perceptual and control areas because these billion year old functions are carried out by large fractions of the nervous system operating as efficiently as the hypothetical neuron adder above. Present day computers, however efficiently programmed, are simply too puny to keep up. Evidence comes from the most extensive piece of reverse engineering yet done on the vertebrate brain, the functional decoding of some of the visual system by D. H. Hubel, T. N. Wiesel and colleagues [11].

Vertebrate Vision, Speed and Storage

The visual system in humans occupies ten percent of the brain. Using a debatable figure of 10^11 neurons for the whole, seeing is done by an organization of 10^10 neurons. In computer terms, the retina captures images with 10^6 pixels and, with the visual cortex, processes them at a frame rate of 10 per second (flicker at higher rates is detected by special circuits).
Neurons have been identified that respond selectively to many kinds of local feature: small points on contrasting backgrounds, intensity boundaries at particular angles, lines, corners, motion and more complex patterns. Each response is produced by a circuit involving between 10 and 100 neurons per pixel. With 10^6 pixels in the image, each of these "local operators" involves about 10^8 neurons. The higher reaches of the visual cortex are not as well understood, and likely involve structures different from the local operators found near the retina. If we nevertheless assume that the level of optimization of the higher functions is comparable to that found in the lower reaches, the neuron count suggests that the visual system is as computationally intense as at least 100 local operators.

Local operators similar to the natural ones are used in many computer vision programs. Highly optimized versions of such operators, which must examine local neighborhoods around each pixel, require between 10 and 100 instructions per pixel. On a 10^6 instruction/second machine a million pixel image can be processed by one local operator in about a minute. By the reasoning of the last paragraph the human visual system does tasks of the computational difficulty of 100 low level operators simultaneously, moreover at the rate of 10 per second, i.e. 1000 per second or about 10^5 per minute. The computer is thus 10^5 times too slow to mimic the human visual system.

The whole brain is 10 times larger than the visual system, suggesting that a computer able to execute 10^12 instructions/second is adequate to emulate it, even if the bulk is as efficiently organized as the lower reaches of vision.

We next address storage capacity. Recent evidence strongly suggests nervous system memory resides in the synapses between neurons [10], with short term conditioning and memory being controlled by the migration within the synapse of small molecules (that dissipate in time) and long term memory depending on synthesis of large, stable proteins at the same site. Both mechanisms affect the firing threshold of the synapse coarsely, an effect that can be encoded by a few bits. There are 10 to 100 synapses associated with each of the 10^11 neurons, so the maximum memory capacity of a human may be 10^13 bits, or one terabyte. Probably less would suffice for a brain emulator, since many of the synapses are in circuitry, like the low level visual system, that seems to require little memory.
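The chain of estimates above is easy to lose track of. The short Python sketch below merely restates the figures assumed in the text - 10^6 pixels, 10 frames per second, up to 100 instructions per pixel for an optimized local operator, 100 operator-equivalents for the visual system, a brain ten times the size of the visual system, and on the order of a bit per synapse - and reproduces the speed and storage conclusions. The numbers are this paper's rough assumptions, not measurements.

    # Back-of-envelope restatement of the text's estimates (assumptions, not measurements).
    pixels            = 10**6     # pixels per retinal image
    frames_per_sec    = 10        # effective frame rate of the visual system
    instr_per_pixel   = 100       # upper figure for one optimized local operator
    operators         = 100       # local-operator equivalents in the whole visual system
    brain_over_vision = 10        # whole brain is roughly 10x the visual system

    vision_ips = pixels * frames_per_sec * instr_per_pixel * operators   # ~10^11
    brain_ips  = vision_ips * brain_over_vision                          # ~10^12

    neurons          = 10**11     # debatable figure for the whole brain
    synapses_per     = 100        # 10 to 100 synapses per neuron
    bits_per_synapse = 1          # "a few bits"; 1 used here as a round lower figure

    memory_bits = neurons * synapses_per * bits_per_synapse              # ~10^13

    print(f"visual system: ~{vision_ips:.0e} instructions/second")
    print(f"whole brain:   ~{brain_ips:.0e} instructions/second")
    print(f"memory:        ~{memory_bits:.0e} bits")

On these figures a 10^6 instruction/second machine falls short of the visual system alone by the factor of 10^5 cited above.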
In a way these estimates are unfair to the machine. A teraop/second, terabyte computer, as I imagine it, would be a general purpose device programmable for a (stupendous!) range of functions besides human equivalence. This malleability is ideal for development, but comes at a price. The switches of the computer are organized to execute convenient machine instructions, while those in the nervous system are organized to control, perceive and think. The computer would do its controlling, perceiving and thinking at a higher, less efficient, level of modularization. A teracomputer just powerful enough to run the human equivalence programs could also run other code, and would consequently have serious advantages over a natural human. Imagine it temporarily overlaying a theorem proving, chess playing or air traffic controlling program over the human emulation code when the need arose. Alternatively, sacrificing these superhuman abilities, algorithms discovered on a general purpose system could be condensed into much smaller specialized hardware by expensive optimizing processes.

Later in this paper I express my confidence that sufficiently powerful general purpose machines will become available gradually during the professional lifetimes of most reading this. If such computers are the engines of the first true artificial intelligences, the job of the AI and robotics community is to lay the track over which they will run. In the next section I argue that the most fruitful direction for this track is roughly along the oxtrail forged by natural evolution before stumbling on us. Goal directedness, and appreciation of some of nature's successes and failures, may help us skip a few meanders.

The Utility of Mobility

It is my view that developing a responsive mobile entity is the surest way to approach the problem of general intelligence in machines. The argument hinges on the observation made earlier that instinctive skills are much better developed in humans than high level thinking, and are thus the difficult part of the human emulation problem. From the performance of present programs and from calculations like those in the last section, I guess that amateur quality high level thinking can be done by an efficiently organized system doing 10^8 instructions/second, while average quality perception and action requires 10^11 instructions/second. Master quality high level thinking by people may happen when large parts of the task are mapped into the computationally efficient perceptual or motor parts of the brain (enabling solutions to be seen or felt?), and such expert performance should be intermediate in difficulty between the 10^11 of low level and the 10^8 instructions/second of routine high level thought.

Computing power is essential to the intelligent machine effort, and I believe it has been the pacing factor, but it alone is not sufficient. An intelligent teracomputer will depend on myriads of special programs, and we know too little to specify most of them. Without guidance we are sure to spend much effort on the wrong problems. Reverse engineering from living creatures offers clues, but few and slowly. I believe the best course is the natural-evolution-paralleling process of technical development suggested in the previous section, where guidance comes from the universe in the outcomes of many experiments. Once on this course the evolutionary record becomes our guidebook.

I have argued that instinctive skills are overwhelmingly the difficult part of human intelligence, and the area most in need of development. Many animals share our instinctive skills, providing a basis for induction. A major conclusion I draw from this host of examples is that all animals that evolved perceptual and behavioral competence comparable to that of humans first adopted a mobile way of life. This is moot for vertebrates, which share much human evolutionary history, but is dramatically demonstrated among the invertebrates. Most molluscs are sessile shellfish controlled by thousand neuron nervous systems. Octopus and squid are molluscs that abandoned life in the shell for one of mobility; they developed imaging eyes, a large (annular!) brain, dexterous manipulators, an unmatched million channel color display on their surfaces, and behavior and learning complexity rivalling that of mammals [1]. Conversely, no sessile animal nor any plant is remotely this near to human behavioral competence.
I conclude that a mobile way of life favors general solutions that tend towards intelligence, while non-motion favors deep specializations. A fixed organism is repeatedly exposed to a limited set of problems and opportunities, and will do better in the long run if it becomes good at dealing with this limited range. A roving creature encounters fewer instances of a greater variety of different conditions, and does better with general methods, even if such generality is more expensive, or results in poorer performance in specific instances. The cumulative effect of this difference in selection pressure is enormous, as evidenced by clams and octopus, or plants and animals. Trees are as successful and dominant in their niche as humans are in theirs, but the life of a tree does not demand high speed general purpose perception, flexible planning and precisely controlled action.

I see the same pressures at work in the robotics effort. Most arm systems have special grippers, special sensors, and vision systems and controllers that work only in limited domains. Economics favors this, since an arm on an assembly line repetitively encounters nearly identical conditions. Methods that handle the frequent situations with maximum efficiency beat more expensive general methods that deal with a wide range of circumstances that rarely arise, while performing less well on the common cases.

Mobile robots have completely different requirements. Simple shape recognition methods are of little use to a machine that travels through a cluttered three dimensional world. Special grippers don't pay off when many different objects in arbitrary orientations must be handled. Linear, algorithmic control systems are not adequate for a rover that often encounters surprises in its wanderings.

I feel experiences with the Cart at Stanford and the Rover at CMU vividly illustrate this selection pressure. "Blocks world" vision, still in fashion when I began the Cart work [16], was completely inappropriate for the natural indoor and outdoor scenes encountered by the robot, as were many of the other specialized methods then under study [19]. Much experimentation with the Cart eliminated several initially promising approaches that were insufficiently reliable when fed voluminous and variable data from the robot. The product was a vision system with a different flavor than most. It was "low level" in that it did no object modelling, but by exploiting overlapping redundancies it could map its surroundings in 3D reliably from noisy and uncertain data. The reliability was necessary because Cart journeys consisted of typically twenty moves, each a meter long, punctuated by vision steps, and each step had to be accurate for the journey to succeed.

Our ambitious new work on the CMU Rover has produced another example [3]. We need a language to express Rover tasks and a hardware and software system to embody it. We considered something similar to Stanford's AL arm controlling language [7], from which the commercial languages VAL at Unimation [24] and the more sophisticated AML [22] at IBM were derived. Paper attempts at defining the structures and primitives required for the mobile application revealed that the linear control structure of these state-of-the-art arm languages was inadequate for a rover. The essential difference is that a rover, in its wanderings, is regularly "surprised" by events it cannot anticipate, but with which it must deal. This requires that contingency routines be activated in arbitrary order, and run concurrently. We are experimenting with a structure similar to that developed for the CMU Hearsay II speech understanding project [4]. Independent processes communicate via messages posted on a commonly accessible data structure called a blackboard.
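To make the flavor of this organization concrete, here is a minimal sketch in present-day Python, rather than in any language actually running on the Rover, of a blackboard shared by independently triggered routines. Everything named below - the Blackboard class, the "sonar", "obstacle" and "heading" topics, the sonar_watcher and path_planner routines - is invented for illustration, and true concurrency on multiple processors is replaced by synchronous callbacks.

    # Illustrative blackboard: independent routines post and react to messages.
    # All names here are hypothetical examples, not the Rover's real modules.

    from collections import defaultdict

    class Blackboard:
        """A commonly accessible store of posted messages, keyed by topic."""
        def __init__(self):
            self.messages = defaultdict(list)   # topic -> list of postings
            self.watchers = defaultdict(list)   # topic -> routines to trigger

        def watch(self, topic, routine):
            """Register a contingency routine to run whenever 'topic' is posted."""
            self.watchers[topic].append(routine)

        def post(self, topic, data):
            """Post a message and activate every routine watching that topic."""
            self.messages[topic].append(data)
            for routine in self.watchers[topic]:
                routine(self, data)             # routines may post further messages

    # Two independent example processes communicating only through the blackboard.

    def sonar_watcher(board, reading):
        """Posts an obstacle report when a (simulated) sonar range is too short."""
        if reading["range_m"] < 1.0:
            board.post("obstacle", {"bearing_deg": reading["bearing_deg"]})

    def path_planner(board, obstacle):
        """Reacts to obstacle reports by posting a revised heading."""
        board.post("heading", {"turn_deg": -30 if obstacle["bearing_deg"] >= 0 else 30})

    board = Blackboard()
    board.watch("sonar", sonar_watcher)
    board.watch("obstacle", path_planner)

    # A surprise arrives in arbitrary order; the contingency chain fires on its own.
    board.post("sonar", {"range_m": 0.6, "bearing_deg": 15})
    print(board.messages["heading"])            # [{'turn_deg': -30}]

The point of the structure is that neither routine calls the other or appears in any fixed program sequence; a new contingency is handled simply by watching one more topic, which is just the property the linear arm languages lack.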
The Pace of Power

If processing power is the pacing factor in the development of intelligent machines, and if one million times the power of present computers is needed, we can estimate the real time until full artificial intelligence arrives. Since the 1950s computers have gained a factor of 1000 in speed per constant dollar every decade [23]. There are enough developments in the technological pipeline, and certainly enough will, to continue this pace for the foreseeable future.

The processing power available to AI programs has not increased proportionately. Hardware speedups and budget increases have been dissipated on convenience features: operating systems, time sharing, high level languages, compilers, graphics, editors, mail systems, networking, personal machines, etc., and have been spread more thinly over ever greater numbers of users. I believe this hiatus in the growth of processing power explains the disappointing pace of AI in the past 15 years, but nevertheless represents a good investment. Now that basic computing facilities are widely available, and thanks largely to the initiative of the instigators of the Japanese Supercomputer and Fifth Generation Computer projects [18], attention worldwide is focusing on the problem of processing power for AI.

The new interest in crunch power should ensure that AI programs share in the thousandfold per decade increase from now on. This puts the time for human equivalence at twenty years. The smallest vertebrates, shrews and hummingbirds, derive interesting behavior from nervous systems one ten thousandth the size of a human's, so we can expect fair motor and perceptual competence in less than a decade. By my calculations and impressions present robot programs are similar in power to the control systems of insects. Some principals in the Fifth Generation Project have been quoted as planning "man capable" systems in ten years. I believe this more optimistic projection is unlikely, but not impossible. The fastest present and nascent computers, notably the Cray X-MP and the Cray 2, compute at 10^9 operations/second, only 1000 times too slowly.
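The timing argument reduces to a few lines of arithmetic. The Python sketch below replays the assumptions stated above - 10^12 instructions per second for human equivalence, roughly 10^6 effectively available to AI programs today, a thousandfold gain per decade, and a shrew-scale nervous system at one ten-thousandth of human size. The figures are this paper's estimates, nothing more.

    import math

    # Assumptions taken from the text (estimates, not measurements).
    human_ips       = 10**12     # instructions/second for human equivalence
    available_ips   = 10**6      # roughly what AI programs can use today
    gain_per_decade = 1000       # speed per constant dollar, per decade

    def decades_until(target_ips):
        """Decades of thousandfold-per-decade growth needed to reach target_ips."""
        shortfall = target_ips / available_ips
        return math.log(shortfall, gain_per_decade)

    print(f"human equivalence: ~{decades_until(human_ips):.1f} decades")         # ~2.0
    print(f"shrew equivalence: ~{decades_until(human_ips / 10**4):.1f} decades")  # ~0.7

On the same assumptions, a 10^9 operation/second Cray-class machine sits a single thousandfold, or one decade of such growth, below the target.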
The Future

Machines as intelligent as humans will, in their generality, be capable of superhuman feats and will be able to do the science and engineering to build yet more powerful successors. I leave (almost certainly futile) speculation about the future evolution of this process for a future essay. It is clear to me we are on the threshold of a change in the universe comparable to the transition from non-life to life.

REFERENCES

[1] Boycott, B. B. Learning in the Octopus. Scientific American 212(3):42-50, March 1965.

[2] Buchsbaum, R. Animals Without Backbones. University of Chicago Press, Chicago, IL 60637, 1948.

[3] Elfes, A. and S. N. Talukdar. A Distributed Control System for the CMU Rover. In The 8th International Joint Conference on Artificial Intelligence, Karlsruhe, West Germany. IJCAI, August 1983.

[4] Erman, L. D. and V. R. Lesser. The HEARSAY-II speech-understanding system: Integrating knowledge to resolve uncertainty. Communications of the ACM 23(6), June 1980.

[5] Ernst, H. A. MH-1, A Computer-Operated Mechanical Hand. PhD thesis, Massachusetts Institute of Technology, December 1961.

[6] Feigenbaum, E. A. and J. Feldman. Computers and Thought. McGraw-Hill Book Company, San Francisco, California, 1963.

[7] Goldman, R. and Shahid Mujtaba. AL User's Manual, Third Edition. Computer Science STAN-CS-81-889 (AIM-344), Stanford University, December 1981.

[8] Goodrich, E. S. Studies on the Structure and Development of Vertebrates. Dover Publications Inc., New York, NY 10014, 1958.

[9] Scientific American. Special issue on Information. Scientific American, Inc., New York, NY, September 1966. Later published as a book by W. H. Freeman, San Francisco, CA.

[10] Kandel, E. R. and J. H. Schwartz. Molecular Biology of Learning. Science 218(4571):433-443, October 29, 1982.

[11] Kuffler, S. W. and J. G. Nicholls. From Neuron to Brain. Sinauer Assoc., Inc., Sunderland, MA 01375, 1976.

[12] Lighthill, Sir James. Machine Intelligence Research in Britain. Technical Report, Science Research Council of Great Britain, 1972.

[13] Minsky, M. Project MAC Robotics. MAC M-75, M-258, TR-37, Massachusetts Institute of Technology, 1965.

[14] McCarthy, J. Plans for the Stanford Artificial Intelligence Project. Stanford AI Memo 31, Stanford University, 1965.

[15] McCarthy, J., A. Samuel, E. Feigenbaum and J. Lederberg. Project Technical Report. Stanford AI Memo AIM-143, CS-209, Stanford University, March 1971. ARPA contract report.

[16] Moravec, Hans P. Obstacle Avoidance and Navigation in the Real World by a Seeing Robot Rover. PhD thesis, Stanford University, September 1980. Published as Robot Rover Visual Navigation by UMI Research Press, Ann Arbor, Michigan, 1981.

[17] Moravec, H. P. The Stanford Cart and the CMU Rover. Proceedings of the IEEE 71(7), July 1983. Also in IEEE Transactions on Industrial Electronics, July 1983.

[18] Moto-Oka, T. (editor). Fifth Generation Computer Systems. Elsevier Science Publishing Co. Inc., 52 Vanderbilt Avenue, New York, NY 10017, 1981.

[19] Raphael, B. The Thinking Computer. W. H. Freeman and Company, San Francisco, California, 1976.

[20] Roberts, L. G. Machine Perception of Three-Dimensional Solids. In Tippett, J. T. et al. (editors), Optical and Electro-Optical Information Processing, pages 159-197. MIT Press, Cambridge, Massachusetts, 1965. Originally Technical Report No. 315, Lincoln Laboratory, MIT, May 1963.

[21] Ross, H. H. Understanding Evolution. Prentice-Hall, Inc., Englewood Cliffs, NJ, 1966.

[22] Taylor, R. H., P. D. Summers and J. M. Meyer. AML: A Manufacturing Language. Research Report RC-9389, IBM, April 1982.

[23] Turn, R. Computers in the 1980s. Rand Corporation, Columbia University Press, New York, NY, 1974.

[24] Unimation, Inc. User's Guide to VAL, A Robot Programming and Control System, Version 11. Technical Report, Unimation, Inc., February 1979.