LOCOMOTION, VISION AND INTELLIGENCE

A THESIS PROPOSAL

HANS MORAVEC

Version 2: JUNE 11, 1974

VISION, INTELLIGENCE AND LOCOMOTION IN NATURE

An interesting aspect concerning the evolution in living organisms of the abilities AI is trying to imitate is the fact that these characteristics occur exclusively in large mobile animals. The existence of a minimum size for nerve cells seems to explain the complexity constraints on small animals, such as insects, but the role of mobility in the development of imaging vision and intelligence is more subtle. That such a relationship exists is indicated by the fact that no plants or sessile animals (what few there are) have a very complex nervous system, and that there are several independent instances in which vision and comparative intelligence have evolved in the presence of mobility.

What we usually (chauvinistically) regard as the mainstream of the development of higher organisms, namely the evolution of mammals from fishes, through the intermediate amphibian and reptile stages, represents one such instance. The imaging eye, and presumably some of the processing circuitry required to make effective use of it, seems to have been developed in this branch at roughly the same time as a backbone (i.e. in motile protofish), sometime in the Paleozoic, about 450 million years ago. The level of intelligence (as indicated by brain size) seems to have remained largely unchanged through the amphibian and reptile stages (characterized by slow moving animals), accelerating sharply coincident with the transition to the more mobile mammalian form, about 100 million years ago.

The development of intelligence in birds, which also have reptilian ancestry, parallels that of mammals, and seems also to have been spurred by the needs of mobility.
Although the dynamics of flying places an upper bound on their size, several kinds of birds have reached a level of intellectual complexity exceeded only by the highest mammals, and then only in some respects. Crows, for instance, have a long history of outsmarting farmers. There have been experiments which indicate that the intuitive number sense (the ability to perceive the cardinality of a given set of objects, without counting) of birds of this kind extends to seven, while it is limited to no more than three or four in man. Admittedly such a sense is probably more useful to a bird, which can use it to keep track of the number of eggs in its nest. The most astounding evidence for the link between mobility and imaging vision and intelligence comes from the phylum mollusca, which includes (among other things) snails, oysters, chitons, squid and octopi. The most recent (i.e. last) common ancestor that we share with these creatures is a bilaterally symmetric pre-worm, dated about one billion years ago, little more than a cell colony and probably one of the first reasonably mobile creatures more than a few cells big. This animal probably had a nervous system consisting of a small number of neurons, distributed approximately homogeneously throughout its body, and it certainly did not have anything more than the most primitive light sensing ability. Some descendants of this animal became fishes (over a period of 500 million years), and eventually mammals and birds. An independent evolutionary branch from the same root produced the modern mollusks. Most of these have shells, are almost sessile and have poor senses, light sensitive patches of skin being the closest approximation to eyes, and very simple nervous systems. Octopi and squid are a dramatic exception to this rule. These animals seem to have shed their shells at some (unknown) time in the past and opted for mobility. 
Presumably in response to the needs of this mobility, they developed an imaging eye and a complex nervous system, as big as that of most mammals. The eye differs from the mammalian (fish origin) one in the fact that the light sensitive cells in the retina point outwards, towards the lens (as one would think reasonable), rather than inwards, thus eliminating the need for the blind spot found in our eyes, and that the eye is hemispherical, as opposed to spherical, and firmly attached to the skin around it, instead of pivoting in a socket. The nervous center is also peculiar by our mainstream standards. It is annular, encircling the esophagus, and is organized into several connected clumps of ganglia. The intelligence of these animals, as viewed from the outside, has not been extensively investigated, and is not traditionally held in high esteem. This point of view is probably more a function of our ignorance and the unnatural conditions under which squid and octopi are usually observed, than of their actual mental abilities. There is a Cousteau film about octopi, in which an octopus' response to a "monkey and bananas" problem is investigated. A fishbowl sealed with a large cork, and containing a small lobster, is dropped into the water near the animal. The octopus is immediately attracted, seemingly recognizing the food by sight. It spends a while probing the container and attempting to reach the lobster from various angles, unsuccessfully. Then, apparently purposefully, it wraps three or four tentacles around the bowl, and one about the cork, and pulls. The cork comes free and shoots to the surface, and the octopus reaches a free tentacle into the bowl to retrieve the lobster, and eats. This is, of course, not conclusive evidence of high intelligence, but it is suggestive. Few dogs (say) could have done as well. 
It is clear, in any case, that squid and octopi, the most mobile of all the large invertebrates, also have, by an incredibly large factor, the most highly developed nervous system, and the only real eyes.

Unfortunately there are no other known examples of the evolution of an imaging eye and a complex nervous system. Insect compound eyes, which evolved independently of both mainstream and octopus eyes, seem to be organized as they are to minimize the amount of neural processing needed to make their output useful. They have no focusing or iris mechanism, lack of which would make an imaging eye almost useless, yet the output of each cell in the eye corresponds to the light intensity of a particular portion of the scene surrounding the insect. This makes for easy detection of the direction of an intrusion, but is useless for resolving fine detail. In any case, the size limitations on insects, probably caused by an evolutionary trap involving the difficulty of expanding an exoskeleton in small steps and of improving their diffusion limited breathing mechanism, constrain their nervous systems to the order of a thousand neurons, as opposed to a hundred billion in the higher vertebrates (and some mollusks).

LOCOMOTION, VISION AND INTELLIGENCE IN TECHNOLOGY

The rate of technological evolution is much greater than that of even the latest (and fastest) forms of biological evolution, for various reasons, involving much better cross-fertilization of new developments, existence of symbolic methods which can optimize a process without the need for time consuming experiments, a higher level of goal direction, and other things. The general trend of technological development, however, is much the same as it has been for living organisms. Machines can be viewed as creatures living in a symbiotic relationship with human beings. We provide them with their means of reproduction, and sometimes maintenance, in return for various services.
There are many ecological niches in human society into which machines can fit, and the competition for these niches is just as fierce and bloodthirsty as the similar competition in nature, many kinds of machine having become extinct when a more effective competitor came along. Interestingly, this contest is not entirely restricted to mechanical devices, and in the past many of the living spaces now occupied by machines belonged to human beings and animals. In a very real sense, for instance, the various species of automobile have pushed the less effective domestic horse out of the transportation niche, and thus reduced its numbers relative to the human population, and its probability of long term survival. Arthur Clarke (I think) has said that an extraterrestrial intelligence, on first seeing the earth, would conclude that the automobile was the dominant form of life. It is clear to me that, by many reasonable measures, at least in America, such an observation would be quite correct.

Since the conditions under which machines are evolving are similar to those which shape biological development, we would expect some of the same results. This is, in fact, the case. The relation between mobility and vision and intelligence, in particular, is almost as evident as in animals, in spite of the fact that most of the seeing and thinking in early vehicles was done by a human driver. Airplanes and missiles now have radar eyes and autopilots. Submarines have evolved imaging sonars and inertial guidance systems. Automobiles are slowly acquiring some of the same characteristics, in the form of automatic braking systems and collision avoidance radars, among other things.

LOCOMOTION, VISION AND ARTIFICIAL INTELLIGENCE

The facts that motility provided the only conditions under which vision and intelligence have ever evolved in nature, and that a similar process is underway in our technology, do not constitute a complete proof that the only road to artificial intelligence with any hope of success involves a mobile vehicle (the examples from nature, in fact, could be used almost as effectively as an argument for doing our research underwater). They do indicate that an approach from this direction should be tried, since, after all, it is the only proven path.

The things which make locomotion such a powerful force in the shaping of intelligence (and vision) probably have something to do with the variety of situations a mobile organism encounters, and the kinds of reaction (e.g. investigating, running away, ignoring, etc.) open to it. This variety places a great premium on general techniques, and makes highly specialized methods, which may be optimal for a sessile creature, less valuable. These general, and complex, processes seem to have led to relative intelligence in animals in two entirely independent instances. It is my hope that intensive work on making a real vehicle negotiate a real environment will lead to the discovery, mostly by accident, of general techniques of the kind possessed by vertebrates and by octopi and squid. These techniques might then (sooner or later) be adapted for use in a system as clever as these animals.

There is of course a possibility that AI can bypass some of the steps nature required to achieve intelligence, and that a sessile hand-eye, or, more likely, a sessile theorem prover, based on a language capable of representing all knowledge, and obtaining its experience of the world through communication instead of locomotion, will achieve intelligence (by whatever standard) before the descendants of a cart project do.
There is also the possibility that these direct approaches are meeting so much difficulty because some crucial, as yet unappreciated, mechanisms are missing, mechanisms which biological evolution discovered while solving the problems of mobility, and which AI may be able to find in the same way.

VISION, INTELLIGENCE AND THE CART PROJECT

It is not possible, by definition, to make brilliant discoveries on demand. It is possible to create interesting problems of a kind that might inspire such discoveries, and to work on them diligently. The cart project presents a mass of such problems, and the fact that nature discovered intelligence twice while solving similar ones puts an aura of credibility and excitement on the whole enterprise.

For my thesis I intend to write programs which make the cart navigate, avoid obstacles, dodge moving objects, recognize objects that may affect its motion in the future (such as cars, animals and people, which may begin to move unexpectedly), and do other things that may yet be discovered to be important for safe and effective control. The order in which these things are done will probably be guided by a schedule of tasks similar to the one outlined in my proposal of May 20. It might be useful if the environment were considered more actively hostile than that list presumes.

The value in my results will stem from the fact that my emphasis will be on the isolation or discovery of methods that really work. I expect to examine and modify or discard many existing algorithms and techniques, and to develop a few myself (and to discard these just as ruthlessly if they prove ineffective). These methods will include many approaches to vision (at the moment I can think of correlation, edge fitting, texture measuring, region growing), to control heuristics (what distance should an obstacle be avoided by, when is it better to run in the opposite direction, what kind of a trajectory works best, etc.)
and to planning (avoidance of hostile moving obstacles might involve feints and other such subtleties, which have to be worked out a little in advance). My thesis would include a thorough description of the methods I had tried, and how effective (or ineffective) each had been, and an analysis, to the best of my ability, of the reasons for its success or failure.

Time, of course, is the critical ingredient in this scheme. The amount of work is open ended, and I am unable to estimate the rate at which it will progress. It seems clear that the most effective way to proceed is slowly and carefully, since undue haste could lead to premature rejection of workable methods, or acceptance of inadequate ones. This probably means that the amount of work associated with a thesis would only be a small fraction of the effort needed to carry out a truly effective experimental program of this kind. I would expect to continue it after completing a degree.

DETAILS

Charting the course of the thesis is difficult because the exact path to be taken depends on the hardware facilities available, and on intermediate discoveries made along the way. It is possible to describe the overall goal, the early stages and the general approach at the later decision points. Since I hope to obtain a degree within two or three years, it is clear that the general problem of intelligence via locomotion (whatever the details of that process might be), will not be solved in this thesis.

The most visible aspect of what should have been achieved in this time will be a program which, on command, causes a vehicle outside the lab to journey from its present location to a desired one, under as wide a range of circumstances as possible.
Conditions that the system should definitely be able to deal with are unexpected obstacles, fixed and moving, lighting conditions which make seeing less than optimal (although the dynamic range of a conventional vidicon is sufficiently bad that under some conditions it would be reasonable for the system to balk) and specific special types of terrain features, such as people and animals, whose behaviour is considered to be erratic, and with whom collision is to be avoided even more carefully than usual. Depending on how difficult and time consuming development of all of this turns out to be, additional features are contemplated. It might be interesting, for instance, to provide a mode in which people are considered hostile, and one of the primary considerations in the vehicle's travels is the minimization of visual contact with them (i.e. the cart skulks from place to place), or one in which it displays a tropism for people and hovers just outside of collision distance from them. Another possibility is a road sign recognizing and rule of the road obeying mode. The less visible, but more significant, result will be a sorting out of possible vision, navigation and control techniques into categories which indicate whether or not they are good for particular portions of the above tasks. This classification could be used to guide later research, and to aid in the design of future vision equipped vehicles (and perhaps of less closely related things). The most difficult immediate problems concern short range obstacle detection. The visual problems faced by the cart are quite different from those so far tackled by the hand-eye crowd, and the few methods they have developed, except for the most primitive, are largely inapplicable. Instead of a static scene consisting of a few simple objects on a well defined ground plane, we have a rapidly changing image occasionally containing a few complicated objects on a usually poorly defined background. 
The motion is a definite asset, since it gives the effect of many cameras, positioned along the path. These multiple views can be used to determine the location, in three dimensions, of various points in the scene. What is needed is a mechanism for finding the same points in different images, and a procedure for deciding where the ground lies, and with that information, which points are above the ground and in the way of the motion. The vehicle can then be directed around such an obstacle by another procedure.

The simplest, and currently most promising, method for locating the same small feature in two pictures involves moving a small window around in one, trying to maximize a correlation coefficient between the points in it and in a similar fixed window in the other. This technique can be efficiently extended for use over large areas by growing the matched region outwards in small increments, the extent of the search required for these additional matches being small compared to that needed for the first one. Additional refinements will undoubtedly be required to prevent unnecessary computations from taking place, such as preliminary passes to decide which areas have a high enough variability to be amenable to the correlator, and predictors which, from past data, and knowledge about the cart's motion, indicate where particular features in successive pictures will probably lie, so as to shorten the search time.

The control problems for the simple avoidance task can probably be solved more easily than the vision problems, but they are not entirely trivial. The route taken when avoiding an obstacle, for instance, should probably be chosen so as to minimize the portion of it not on a currently visible piece of road, so as to reduce the probability of encountering another obstruction. Experiment will undoubtedly reveal many other necessary and desirable features.

The next major vision task concerns constraining the vehicle's motion to a desired section of road.
This requires the ability to detect and recognize which parts of the scene are road and which are not. A leading candidate for doing this is a texture guided region grower. Asphalt road can probably be distinguished from grass and dirt with a fairly simple measurement, and the assumption that the vehicle is on the road at the start of the process can be used to provide a seed for the region growing. Later improvements might make the initial conditions more general. In case of failure of this approach, a line fitter, applied close to where the edges of the road are expected to be, might be tried.

The task of detecting a specific type of non-geometric object, such as a person, in the scene can be approached by a (scaled) template match, in which the template consists of a two dimensional mask, portions of which contain (coded) descriptions like "this region should be light", "this part should be the same intensity as the local background", "this part doesn't matter", etc. and weighting coefficients. The template can be scaled by the measured distance of the portion of the scene on which it is being applied, or by trial and error. A simple correlation with a stored picture of the thing being tested for could be a first approximation, and this has the advantage that new objects can be easily introduced. Control strategies for interacting with beings recognized in this manner can be decidedly non-trivial, depending on the task set for the vehicle, and developing, and observing, them will probably be one of the most entertaining aspects of this research.

The methods that will be tried to accomplish the more advanced goals are not yet well defined, and await the amassing of experience with the easier problems.
The general approach, however, will be to choose two or three techniques which look as if they might work, to implement them as painlessly as possible, to get a rough idea of their actual utility, and then to spend whatever effort is required to refine the most promising approach, possibly backtracking if a particular decision turns out to have been a mistake.
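
As an illustration of the window correlation method described under DETAILS, the search for one small feature in a second picture might be sketched as follows. This is only a minimal sketch under stated assumptions, not the program proposed here: it is written in Python, pictures are assumed to be lists of rows of intensity values, the correlation coefficient is taken to be the ordinary (Pearson) one, and the names correlation, window and find_match are hypothetical.

```python
import math


def correlation(a, b):
    """Pearson correlation coefficient of two equal-length intensity lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # Featureless window: the coefficient is undefined, so report no match.
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def window(image, top, left, size):
    """Flatten the size-by-size window whose top-left corner is (top, left)."""
    return [image[r][c]
            for r in range(top, top + size)
            for c in range(left, left + size)]


def find_match(fixed_image, search_image, top, left, size, search_radius):
    """Slide a window over search_image near (top, left).

    Returns (score, row, col) for the window position in search_image that
    correlates best with the fixed window at (top, left) in fixed_image.
    """
    target = window(fixed_image, top, left, size)
    rows, cols = len(search_image), len(search_image[0])
    best = (-2.0, top, left)  # any real coefficient is at least -1
    row_lo = max(0, top - search_radius)
    row_hi = min(rows - size, top + search_radius)
    col_lo = max(0, left - search_radius)
    col_hi = min(cols - size, left + search_radius)
    for r in range(row_lo, row_hi + 1):
        for c in range(col_lo, col_hi + 1):
            score = correlation(target, window(search_image, r, c, size))
            if score > best[0]:
                best = (score, r, c)
    return best
```

The difference between the returned position and (top, left) is the displacement of the feature between the two pictures. The zero-variance test in correlation plays the part of the preliminary pass mentioned above, rejecting areas with too little variability for the correlator, and a motion predictor would enter simply as a better choice of the starting (top, left) and a smaller search_radius.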