CART PROJECT PROGRESS REPORT
HANS MORAVEC
JULY 24, 1974, revised JAN 7, 1975

The Previous Month

Following the cart meeting of June 13, in which it was agreed that McCarthy would buy the cart project a TV transmitter if I could demonstrate my ability to do work on vision by writing a program which extracted three dimensional information from a motion stereo sequence of pictures, I began work on this task.

So that there would be no doubt as to who had done the work, and because I operate most comfortably and effectively in a programming environment made to my own tastes, the first order of business was the writing of a set of graphics and picture utility routines. The resulting package provides more flexibility than Quam's III buffer based routines, and is more cohesive than Baumgart's set, which is spread out among several programs. Currently the package accesses Data Disc displays only; in the future it will be extended to provide output to the XGP and maybe the III's. As an example of increased capabilities, it allows display of filled-in polygons (including convex and star polygons), and permits displays (including grey scale pictures and text) to be shrunk or expanded arbitrarily merely by changing the declared dimensions of the screen or the requested sizes of the individual parts. Also, selected areas of the display may be erased or complemented, as well as added to. The programs described below use no previously existing software other than the FAIL and SAIL compilers and the operating system.

A sequence of 30 pictures taken by Baumgart several years ago, along the center of the road around the lab at about 12 ft intervals, provided the data for the first try. The general approach I had in mind was simply to isolate a reasonably small number of interesting features in one picture, to find these in the next one (or several), to locate the horizon, and to deduce from this the location of the other points.

The first problem was the initial selection of promising features in one image. I decided, for lack of being able to think of anything better, that each feature would be a square window n pixels by n pixels, n being an adjustable parameter. The problem now was to select a scattering of such windows over the picture, preferably with the property that the selected areas would be better candidates for correlation than the neighboring regions. Variance, the sum of the squares of the differences of the samples from the mean, is a conceivable measure which can guide the program in this selection, but it is unable to distinguish genuinely good regions from those which consist of parallel streaks in a particular direction. A correlator can slide along such streaks without the calculated correlation changing, not a desirable state of affairs. Discussions did not reveal any established measures that were better.

Instead of using variance, I decided to create a measure of my own with the right properties. My operator calculates, for each of the horizontal, vertical and two diagonal directions, the sum, over the entire window, of the squares of the differences of pairs of pixels adjacent in the indicated direction. The measure for the window is the minimum of the four sums so obtained. (I considered using the product instead of the minimum, but this made the numbers unwieldy. I also tried using absolute values instead of squares, but squares, which accentuate large differences, work better; the squaring is done at almost no expense by table lookup.)
Higher powers result in overly large numbers, though this can be fixed by scaling. The possibility of resolving direction more finely than the 45 degrees implied by my four sums also occurred to me, but in small windows of the kind I contemplated this did not seem worthwhile.

This measure has the property that, if all the variation in a window is in one direction, one of the sums, and consequently the measure itself, will be close to zero. The measure is also small, of course, if a region simply has little variation in general. If the numbers in the digitized picture are proportional to the logarithm of the amount of light (read "number of photons") reaching the camera, the above measure has the desirable property of being invariant under changes of illumination. If the numbers are linear in the amount of light, the measure should be divided by the square of the mean over the window, to remove the effect of illumination. I found that doing this caused the dark areas to be overly accentuated, suggesting that the camera photometry function is more nearly logarithmic than linear. The measure finally decided upon is the simple minimum of the four sums of squares of differences of adjacent pixels.

Because it resulted in a faster running program, I decided to partition the picture into disjoint n by n windows, to each of which the operator is applied, instead of moving the operator over the picture in single pixel steps. Doing this, and selecting those windows where the operator resulted in a number bigger than a given threshold (I usually use the average of the operator over all windows for this), resulted in clumps of adjacent windows (e.g. over the image of stands of trees) being selected. This was undesirable, since it meant that many points corresponding to pretty much the same object would have to be examined in the subsequent steps. I tried to solve this by keeping only those windows which were not only above the threshold but which were also local maxima, as determined by comparison with their eight neighbors. This weeding out resulted in a gratifyingly sparse scattering of selections. This "interest" operator, including the local maximum test, was tightly hand coded for the PDP-10 (sketched below) and observed to take about 6 seconds of run time with a 250 by 250 picture and a window size of 10, increasing slightly for smaller window sizes.

Contemplating the run time of this simplest and most tightly coded of operators made it obvious to me that serious vision was essentially impossible with computers of the kind we are currently using (including the signal processor). A 500 by 500 point picture has a quarter of a million pixels (not 25,000 as indicated in a McCarthy article I recently ran across), and any operation applied to all of it takes one second for every four microseconds spent on each pixel. Even the simplest processing needs several instructions, occupying many microseconds, for each pixel, so the simplest operators require many seconds per picture. Simple operators (like correlation) which require a number of computations more than linear in the number of pixels take, at best, on the order of minutes. More complicated operators can be conceived which take hours or days. In practice these more complicated processes are not even tried, or are quickly abandoned when their run time is noted.
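For concreteness, here is a rough sketch of the interest operator and the local maximum selection in modern notation (Python, used purely for illustration; the actual routine is hand coded, and the function names and array conventions below are only illustrative):

    import numpy as np

    def interest_operator(img, n):
        # Directional "interest" measure over disjoint n x n windows: for each
        # window, sum the squared differences of adjacent pixels along the
        # horizontal, vertical and two diagonal directions, and take the
        # minimum of the four sums as the window's measure.
        img = img.astype(np.int32)
        rows, cols = img.shape[0] // n, img.shape[1] // n
        measure = np.zeros((rows, cols))
        dh = (img[:, 1:] - img[:, :-1]) ** 2       # horizontal neighbours
        dv = (img[1:, :] - img[:-1, :]) ** 2       # vertical neighbours
        dd = (img[1:, 1:] - img[:-1, :-1]) ** 2    # one diagonal
        du = (img[1:, :-1] - img[:-1, 1:]) ** 2    # the other diagonal
        for r in range(rows):
            for c in range(cols):
                i, j = r * n, c * n
                measure[r, c] = min(dh[i:i+n,   j:j+n-1].sum(),
                                    dv[i:i+n-1, j:j+n  ].sum(),
                                    dd[i:i+n-1, j:j+n-1].sum(),
                                    du[i:i+n-1, j:j+n-1].sum())
        return measure

    def select_features(measure):
        # Keep windows whose measure exceeds the average and which are also
        # local maxima among their eight neighbouring windows.
        thresh = measure.mean()
        picks = []
        for r in range(1, measure.shape[0] - 1):
            for c in range(1, measure.shape[1] - 1):
                if (measure[r, c] > thresh and
                        measure[r, c] == measure[r-1:r+2, c-1:c+2].max()):
                    picks.append((r, c))
        return picks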
It is my contention that much of the difficulty and lack of success in computer vision stems from the fact that we are trying to do it with an inanely small amount of compute power, and that passable vision may not be possible without the equivalent of (say) 24 hours of PDP-10 time per image (which would not be conducive to many experiments being made). It is to be noted that the human eye processes million point images at the equivalent rate of about ten per second, and applies many complicated operators over the whole image in the earliest stages of the optic nerve, with even more neural circuitry devoted to vision further downstream, in the visual cortex. Similar observations may hold in other aspects of AI, and it is possible that the attitude that serious results in this field can be obtained with conventional computers is hardly more reasonable than the opinion that the results can be obtained without computers entirely, since we can always simulate the programs by hand, on pencil and paper.

While thinking about this sad state of affairs, and about other problems, I hit upon a design for a multiprocessor, a small version of which could be constructed even today with microprocessors, and which would be entirely general purpose but specially applicable to vision, physical simulations (e.g. three dimensional graphics) and combinatorial problems. Further discussion of it is out of place here, other than to say that it could reasonably be built on a scale to equal or better the processing power of the optic nerve or even the whole brain, while being easy (even fun) to program (unlike, say, Illiac IV, which is really a special purpose machine).

The feature extraction routine now worked satisfactorily on most cart images, but Baumgart's pictures are noisier than most, and in those the program tended to pick out many uninteresting patches of road as well as good features. My operator is quite sensitive to high frequency noise, since the comparisons in it are strictly between adjacent elements in the picture. I was able to alleviate the problem by writing a preprocessor which constrains the value of each pixel to the range delimited by its four (vertical and horizontal) neighbors; that is, it may be no bigger than the second largest nor smaller than the third largest of the neighbors. Points above this maximum are set to it, and those below the minimum are set to that. This has the effect of removing aberrations a single pixel in extent in either direction, and noticeably blurs the picture (but improves the functioning of my other operator). Surprisingly, the run time of this cleaning up process is longer (by about 50%) than that of the interest operator, in spite of the fact that it too is tightly hand coded. These two operators can obviously be combined into a single one which would run faster than the sum of the run times of both, but at the moment doing this would probably be a waste of time, since the Baumgart pictures are atypical, and since better techniques for finding interesting features in noisy pictures can and will undoubtedly be thought of.

Now that interesting things in one image could be located, it was time to write programs which could find them in the next. The first problem here was to decide the measure by which a good match would be detected. Quam uses normalized correlation, a measure derived for finding the correlation between the x and y values of a set of points in a two dimensional scatter diagram.
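For reference, a minimal sketch of the neighbor-clamping clean-up described above (again Python purely for illustration; the working version is hand coded, and the function name is only illustrative):

    def clean(img):
        # Clamp each interior pixel of a 2-D numpy array to lie between the
        # third largest and the second largest of its four horizontal and
        # vertical neighbours, removing aberrations a single pixel in extent
        # in either direction.
        out = img.copy()
        for i in range(1, img.shape[0] - 1):
            for j in range(1, img.shape[1] - 1):
                nbrs = sorted((img[i-1, j], img[i+1, j],
                               img[i, j-1], img[i, j+1]))
                lo, hi = nbrs[1], nbrs[2]
                out[i, j] = min(max(img[i, j], lo), hi)
        return out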
If the values of the pixels in each of the two windows between which the correlation is to be calculated are considered as an n^2 dimensional vector (n^2 being the number of points in each window), then this correlation coefficient is the normalized dot product of the two vectors (i.e. the cosine of the angle between them) after each one has had the mean of its components subtracted from each component. The critical property is that it is invariant under a linear transformation of the points in either window. This may be desirable if it is expected that the illumination or albedo of an area can change between the two pictures, as is the case in the Mars pictures (although it is difficult to imagine what multiplication by a constant factor means if the encoding is the log of brightness), but it probably throws information away unnecessarily if the pictures are taken under similar conditions (as on the cart). Also, it requires the computation of three sums of squares in the inner loop of the correlation, which is executed a number of times approximately equal to the number of pixels in the correlation window (n^2) multiplied by the number of pixels in the search window (which can be a good fraction of the entire image). In addition, each trial window requires a square root and several multiplications and divisions (this happens a number of times equal to the number of pixels in the search window), which make a non-trivial contribution to the run time.

An alternative measure is the sum of squares of differences, which requires the computation of only one sum in the inner loop, and hardly any other computation. Implementation of an algorithm which moved a small test window around a larger rectangular window, calculating sums of squares of differences, and returned the window position which resulted in the smallest such sum, provided moderately satisfactory results, and revealed an unexpected problem. All of the pictures it was tried on have a radial intensity gradient, running from bright at the center to dark at the edges. This darkening is caused by the fact that the vidicon target is flat (as opposed to spherical), so light from the lens strikes it perpendicularly at the center and at a slant towards the edges, giving less light per unit area there. It had the effect on my correlator of preventing a good match when a feature was significantly closer to an edge (and hence darker) in one picture than in the other. Normalized correlation would, of course, not be affected by this, since it subtracts out the mean of each window during the correlation. As mentioned before, however, the penalty paid for this is the computation of three sums in the inner loop of the correlation, instead of one. An operator which could correct the variation once and for all for each picture would be considerably more efficient. The most satisfactory possibility would be one which determined the extent of the darkening and removed it, leaving the picture otherwise untouched. This is unfortunately difficult to achieve, since the extent of the gradient varies from camera to camera, day to day and scene to scene, and since, in a single picture, the effects of the gradient and the real image are difficult to isolate. A simpler fix is high pass filtering, which can be implemented by subtracting the average of a window about each pixel from the pixel.
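To make the two pieces concrete, here is a sketch of the sum of squares of differences search together with the high pass filtering just mentioned (Python for illustration only; the names are illustrative, and a practical implementation would use running sums rather than recomputing each window mean):

    import numpy as np

    def high_pass(img, k):
        # Subtract from each pixel the mean of the k x k window about it,
        # removing slow intensity gradients such as the edge darkening.
        img = img.astype(np.float64)
        out = np.zeros_like(img)
        h = k // 2
        for i in range(img.shape[0]):
            for j in range(img.shape[1]):
                win = img[max(0, i - h):i + h + 1, max(0, j - h):j + h + 1]
                out[i, j] = img[i, j] - win.mean()
        return out

    def ssd_search(source, search):
        # Slide the small source window over the larger search window and
        # return the offset with the smallest sum of squared differences.
        sh, sw = source.shape
        best, best_pos = None, None
        for i in range(search.shape[0] - sh + 1):
            for j in range(search.shape[1] - sw + 1):
                diff = search[i:i+sh, j:j+sw].astype(np.float64) - source
                ssd = (diff * diff).sum()
                if best is None or ssd < best:
                    best, best_pos = ssd, (i, j)
        return best_pos, best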
The combination of this preprocessing with the sum of squares of differences correlation is similar in effect to subtracting out the mean during the correlation, but faster, since the subtraction is done only once for each pixel, instead of n^2 times. It has the drawback, in common with normalized correlation, that it unnecessarily throws away some information. I tried it in spite of this, and found that it solved the edge darkening problem and did not noticeably degrade the performance in other ways.

A related problem stemming from the geometry of the optics is a distortion which causes the images of objects to be stretched outwards by an amount proportional to their distance from the center, especially with a wide angle lens such as the 60 degree one carried by the cart. Since the angular motion from frame to frame in a cart sequence is primarily from side to side (pan), I was able to correct for most of this effect by transforming the picture array, shifting columns around in a way tantamount to projecting the original picture onto a portion of a cylinder of the correct radius. It is necessary to know the angular extent represented by the original picture to do this correctly.

A feature of Baumgart's sequence of pictures is that there are few foreground features (mostly empty road), and those that there are (e.g. occasional pieces of white line) are so changed from picture to picture, due to the perspective change caused by the large forward motion between pictures, or by the fact that they have moved out of the field of view, as to be just about unrecognizable by human beings. Needless to say, my program does no better, and usually finds the wrong feature under such circumstances. One way of detecting such lossages would be by means of a measure which indicates the goodness of the match in some absolute way. The simple sum of squares of differences does not have this property; it generally results in small numbers when the window being looked for has a low variance, and large numbers otherwise. Also, of course, the sum is proportional to the number of points in the window, and varies with the number of bits per sample. Dividing it by the sum of the squares of the differences of each point in the source window from the mean of the window results in a satisfactory normalization. Note that this division is done only once in a correlation search, on the result of the best match.

Another property of the picture sequence is that there is a fairly large amount of camera pan between some consecutive images. Since it is desirable to keep the search windows for the correlations as small as possible, my program first attempts to find a particular point on the horizon in both pictures, so as to line them up.
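The flow described below relies on this normalized "goodness of match" measure; a short sketch of it, building on the hypothetical ssd_search above (Python for illustration only):

    def match_quality(source, best_ssd):
        # Divide the best sum of squared differences by the sum of squared
        # deviations of the source window from its own mean.  The division is
        # done only once per search, on the winning window; small values
        # indicate a credible match.
        denom = ((source - source.mean()) ** 2).sum()
        return best_ssd / denom if denom > 0 else float('inf')

    # usage (the rejection threshold of 1.0 is an arbitrary illustrative value):
    # pos, best = ssd_search(source, search)
    # if match_quality(source, best) > 1.0:
    #     ...  # reject this match and try another feature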
The general flow for this portion of the program is as follows. Read in two pictures; clean and high pass filter them. Apply the interest operator to the first picture to find likely features. Pick the selected feature with the highest "interest" measure in the top third of the picture and, using a fairly wide search window, attempt to find it in the second. If the "goodness of match" measure described above indicates that this attempt has failed, pick another feature in the top third of the image and try again, unless there are no such features left untried, in which case terminate with the error message "Couldn't find horizon". Then attempt to find the other features in the second image, in small search windows centered on the positions corresponding to the locations of the features in the first picture, offset by an amount equal to the shift of the horizon feature found in the previous step. If the "goodness of match" indicates failure for any given feature, forget about that feature.

Now that features have been located in two pictures it becomes possible to solve for their distances from the camera. If four or five points can be reliably found in both pictures, the relative locations of the two cameras and the points, modulo a scaling factor, can be determined without any additional information. The pictures on which the first part of the program had been working rarely produced more than one or two good features, so a naive solver of this kind was considered unreasonable. If some prior assumptions about the movement of the vehicle are made, fewer points are necessary to fix things. If we trust that the horizon feature found in the first part of the program really represents an object at virtual infinity, two additional points suffice. If the vehicle is assumed to move on a plane, and the central axis of the camera is assumed to be parallel to this plane, then one point is sufficient. Camera calculations based on this model were tried, and gave erratic results. The probable reason is that the assumption that the camera looks parallel to the road in the sample pictures is incorrect, and that in fact the camera was angled downwards. Another assumption that makes single point camera solving possible is that the camera is pointed down at some unknown angle, but otherwise looks in the direction of motion, and that the motion of the vehicle between the two pictures is a simple arc of a circle (of initially unknown radius). A solver based on this model worked a little better, but suffered from the fact that the relative change in the position of various features was either so large that the correlator was unable to find them, or so small that the solver results were numerically unstable. An obvious solution to this dilemma is to track a feature through several successive pictures, and to do the camera calculations with the first and last images in such a sequence. Since the correlator typically takes about four minutes (of raw run time) for a picture's worth of features, I decided to put off extending the program this way until sometime in the future, when my algorithms are transplanted to the SPS41, where they may, perhaps, run over ten times as quickly. The current form of the program uses the arc assumption, and eliminates any features which give erratic results under it.

The Next Year

Several ideas need investigation. The concept of using closely spaced pictures for correlation, but doing the camera solving over larger intervals, is one. Another potentially important technique to be tried concerns reducing the time required for the correlations. It may be possible to find the approximate location of a feature more quickly by reducing the source and search windows, that is, substituting for a number of adjacent picture points a single pixel containing their average. This can be done at several levels, with the correlation being done first at the coarsest; the process is analogous to a binary search (the current method being like a linear search).
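As an illustration of this reduction idea, here is a sketch of a single coarse-then-fine step (Python; it builds on the hypothetical ssd_search sketched earlier, and the two pixel refinement margin is an arbitrary choice):

    import numpy as np

    def reduce_by_2(img):
        # Replace each 2 x 2 block of pixels by its average, halving the
        # picture in both directions.
        h, w = (img.shape[0] // 2) * 2, (img.shape[1] // 2) * 2
        img = img[:h, :w].astype(np.float64)
        return (img[0::2, 0::2] + img[0::2, 1::2] +
                img[1::2, 0::2] + img[1::2, 1::2]) / 4.0

    def coarse_to_fine_search(source, search):
        # Locate the reduced source window in the reduced search window, then
        # refine in a small full-resolution region around that estimate.  The
        # same step could be repeated over several levels of reduction.
        (ci, cj), _ = ssd_search(reduce_by_2(source), reduce_by_2(search))
        i0 = min(max(0, 2 * ci - 2), search.shape[0] - source.shape[0])
        j0 = min(max(0, 2 * cj - 2), search.shape[1] - source.shape[1])
        region = search[i0:i0 + source.shape[0] + 4,
                        j0:j0 + source.shape[1] + 4]
        (di, dj), _ = ssd_search(source, region)
        return i0 + di, j0 + dj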
A related technique would use a fairly large search window, but with the points thinned out towards the edges and dense in the center, so that a match must have a very good fit in the middle and general agreement on a larger scale (one possible realization is sketched at the end of this report). This would make spurious matches less likely without increasing the amount of computation inordinately.

There is a certain fallacy in the current algorithm, which assumes that features will remain unchanged between pictures. In fact, with a wide field of view lens such as the cart carries, there is an aberration towards the edges, which could be corrected for. Also, as features come closer to the camera they enlarge, and this could be anticipated by distorting the source window, guided by a priori information about the feature's position, by its approximate location once it has been found, or continuously during the search for it, with the help of a camera model.

The plan is to examine these and other ideas, to implement the better ones on the SPS, and to incorporate them into a working system which (slowly) drives the cart down the road, avoiding obstacles and perhaps doing more interesting things (I expect that, with sufficiently well designed primitives, interesting behaviour can be achieved with a relatively small amount of high level programming). Primitives which need to be thought about include an edge-of-the-road detector, and perhaps some sort of template driven object recognizer.
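Finally, one possible realization of the thinned-out window idea mentioned above, with the particular thinning schedule being an arbitrary choice made only for illustration (Python; names are illustrative):

    import numpy as np

    def centre_weighted_mask(n, dense_radius):
        # Boolean mask over an n x n window: every position within
        # dense_radius of the centre is kept; farther out, positions are kept
        # with decreasing density, so a match must fit well in the middle and
        # only roughly on a larger scale.
        c = (n - 1) / 2.0
        mask = np.zeros((n, n), dtype=bool)
        for i in range(n):
            for j in range(n):
                r = max(abs(i - c), abs(j - c))
                if r <= dense_radius:
                    mask[i, j] = True
                else:
                    step = int(r - dense_radius) + 2
                    mask[i, j] = (i % step == 0) and (j % step == 0)
        return mask

    def masked_ssd(source, candidate, mask):
        # Sum of squared differences taken only at the masked positions.
        d = source[mask].astype(np.float64) - candidate[mask]
        return (d * d).sum()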