CART PROJECT PROGRESS REPORT
HANS MORAVEC
JULY 24, 1974
revised JAN 7, 1975
The Previous Month
Following the cart meeting of June 13, in which it was agreed
that McCarthy would buy the cart project a TV transmitter if I could
demonstrate my ability to do work on vision by writing a program which
extracted three dimensional information from a motion stereo sequence
of pictures, I began work on this task. So that there would be no
doubt as to who had done the work, and because I operate most
comfortably and effectively in a programming environment made to my
own tastes, the first order of business was the writing of a set of
graphics and picture utility routines. The resulting package provides
more flexibility than Quam's III buffer based routines, and is more
cohesive than Baumgart's set, which is spread out among several
programs. Currently the package accesses Data Disc displays only; in
the future it will be extended to provide output to the XGP and maybe
the III's. As an example of increased capabilities, it allows display of
filled in polygons (including convex and star polygons), and permits
displays (including grey scale pictures and text) to be shrunk or
expanded arbitrarily merely by changing the declared dimensions of the
screen or the requested sizes of the individual parts. Also, selected
areas of the display may be erased or complemented, as well as added
to. The programs described below use no previously existing software
other than the FAIL and SAIL compilers and the operating system.
A sequence of 30 pictures taken by Baumgart several years ago,
along the center of the road around the lab at about 12 ft intervals,
provided the data for the first try. The general approach I had in
mind was simply to isolate a reasonably small number of interesting
features in one picture, and to find these in the next one (or
several), locating the horizon and deducing from this the location of
the other points.
The first problem was the initial selection of promising
features in one image. I decided, for lack of being able to think of
anything better, that each feature would be a square window n pixels
by n pixels, n being an adjustable parameter. The problem now was to
select a scattering of such windows over the picture, preferably with
the property that the selected areas would be better candidates for
correlation than the neighboring regions. Variance, the sum of the
squares of the difference of the samples from the mean, is a
conceivable measure which can guide the program in this selection,
but it is unable to distinguish genuinely good regions from those
which consist of parallel streaks in a particular direction. A
correlator can slide along such streaks without the calculated
correlation changing, not a desirable state of affairs. Discussions
did not reveal any established measures that were better.
Instead of using variance, I decided to create a measure of my
own with the right properties. My operator calculates, for each of
horizontal, vertical and two diagonal directions, the sum, over the
entire window, of the squares of the differences of pairs of pixels
adjacent in the indicated direction. The measure for the window is the
minimum of the four sums so obtained. (I considered using the product
instead of the minimum, but this made the numbers unwieldy. I also
tried using absolute values instead of squares, but squares, which
accentuate large differences, work better; the squaring is done at
almost no expense by table lookup. Higher powers result in overly
large numbers, though this can be fixed by scaling. The possibility
of resolving direction more finely than the 45 degrees implied by my
four sums also occurred to me, but in small windows of the kind I
contemplated, this didn't seem worthwhile.)
This measure has the property that, if all the variation in a
window is in one direction, one of the sums, and consequently the
measure itself, will be close to zero. The measure is also small, of
course, if a region simply has little variation in general.
If the numbers in the digitized picture are proportional to
the logarithm of the amount of light (read "number of photons")
reaching the camera, the above measure has the desirable property of
being invariant under changes of illumination. If the numbers are
linear in the amount of light, the measure should be divided by the
square of the mean over the window, to remove the effect of
illumination. I found that doing this caused the dark areas to be
overly accentuated, suggesting that the camera photometry function is
more nearly logarithmic than linear. The measure finally decided upon
is the simple minimum of the four sums of squares of differences of
adjacent pixels.
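
As a sketch (in Python; the actual implementation is tightly hand
coded for the PDP10, and the names here are illustrative only):

    # Interest operator for the n by n window with corner (r, c) of
    # picture "pic" (a 2-D array of brightness values): the minimum,
    # over four directions, of the sum of squared differences of
    # adjacent pixels in that direction.
    def interest(pic, r, c, n):
        dirs = [(0, 1), (1, 0), (1, 1), (1, -1)]  # horiz, vert, 2 diagonals
        sums = []
        for dr, dc in dirs:
            s = 0
            for i in range(r, r + n):
                for j in range(c, c + n):
                    i2, j2 = i + dr, j + dc
                    if r <= i2 < r + n and c <= j2 < c + n:
                        s += (pic[i][j] - pic[i2][j2]) ** 2
            sums.append(s)
        return min(sums)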
Because it resulted in a faster running program, I decided to
partition the picture into disjoint n by n windows, to each of which
the operator is applied, instead of moving the operator over the
picture in single pixel steps. Doing this, and selecting those
windows where the operator resulted in a number bigger than a given
threshold (I usually use the average of the operator over all windows
for this), resulted in clumps of adjacent windows (e.g. over the
image of stands of trees) being selected. This was undesirable, since
it meant that many points corresponding to pretty much the same object
would have to be examined in the subsequent steps. I tried to solve
this by keeping only those windows which were not only above the
threshold but which were also local maxima, determined by
comparisons with their eight neighbors. This weeding out resulted in a
gratifyingly sparse scattering of selections.
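
A sketch of the selection step, under the same assumptions as the
sketch above (disjoint windows, threshold at the average measure,
strict local maximum among up to eight neighboring windows):

    # Partition the picture into disjoint n by n windows, apply the
    # interest operator to each, and keep a window only if its measure
    # is above threshold and a local maximum among its neighbors.
    def select_features(pic, n, interest):
        rows, cols = len(pic) // n, len(pic[0]) // n
        m = [[interest(pic, i * n, j * n, n) for j in range(cols)]
             for i in range(rows)]
        thresh = sum(map(sum, m)) / (rows * cols)   # average measure
        picks = []
        for i in range(rows):
            for j in range(cols):
                nbrs = [m[a][b]
                        for a in range(max(0, i - 1), min(rows, i + 2))
                        for b in range(max(0, j - 1), min(cols, j + 2))
                        if (a, b) != (i, j)]
                if m[i][j] > thresh and all(m[i][j] > v for v in nbrs):
                    picks.append((i * n, j * n))    # window corner
        return picks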
This "interest" operator, including the local maximum test,
was tightly hand coded for the PDP10, and observed to take about 6
seconds run time with a 250 by 250 picture and a window size of 10,
increasing slightly for smaller window sizes.
Contemplating the run time of this simplest and most tightly
coded of operators made it obvious to me that serious vision was
essentially impossible with computers of the kind we are currently
using (including the signal processor). A 500 by 500 point picture
has a quarter of a million pixels (not 25,000 as indicated in a
McCarthy article I recently ran across), and any operation applied to
all of it takes one second for every four microseconds spent on each
pixel. Even the simplest processing needs several instructions for
each pixel, occupying many microseconds, so the simplest operators
require many seconds to do a picture. Simple operators (like
correlation), which require a number of computations more than linear
in the number of pixels, take, at best, on the order of minutes. More
complicated operators can be conceived which take hours or days. In
practice these more complicated processes are not even tried, or are
quickly abandoned when their run time is noted.
It is my contention that much of the difficulty and lack of
success in computer vision stems from the fact that we are trying to
do it with an inanely small amount of compute power, and that passable
vision may not be possible without the equivalent of (say) 24 hours of
PDP 10 time per image (which would not be conducive to many
experiments being made). It is to be noted that the human eye
processes million point images at the equivalent rate of about ten per
second, and applies many complicated operators over the whole image in
the earliest stages of the optic nerve, with even more neural
circuitry devoted to vision occurring further downstream, in the
visual cortex. Similar observations may hold in other aspects of AI,
and it is possible that the attitude that serious results in this
field can be obtained with conventional computers is hardly more
reasonable than the opinion that the results can be obtained without
computers entirely, since we can always simulate the programs by
hand, with pencil and paper.
While thinking about this sad state of affairs, and about
other problems, I conceived a design for a multiprocessor, a small
version of which could be constructed even today with
microprocessors. It would be entirely general purpose, but specially
applicable to vision, physical simulations (e.g. three dimensional
graphics) and combinatorial problems. Further discussion of it is
out of place here, other than to say that it could reasonably be built
on a scale to equal or better the processing power of the optic nerve
or even the whole brain, while being easy (even fun) to program
(unlike, say, Illiac IV, which is really a special purpose machine).
The feature extraction routine now worked satisfactorily on
most cart images, but Baumgart's pictures are noisier than most and in
those the program tended to pick out many uninteresting patches of
road, as well as good features. My operator is quite sensitive to high
frequency noise, since the comparisons in it are strictly between
adjacent elements in the picture. I was able to alleviate the problem
by writing a preprocessor for the picture which constrains the value
of each pixel to the range delimited by its four (vertical and
horizontal) neighbors; that is, it may be no bigger than the second
largest nor smaller than the third largest of the neighbors. Points
above this maximum are set to it, and those below the minimum are set
to that. This has the effect of removing aberrations a single pixel
in extent in either direction, and noticeably blurs the picture (but
improves the functioning of my other operator). Surprisingly, the run
time of this cleaning up process is longer (by about 50%) than that of
the interest operator, in spite of the fact that it too is tightly hand
coded. These two operators can obviously be combined into a single
one which would run faster than the sum of the run times of both, but
at the moment doing this would probably be a waste of time since the
Baumgart pictures are atypical, and since better techniques for
finding interesting features in noisy pictures can and will
undoubtedly be thought of.
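
A sketch of this cleaning step (again in Python; the real version is
hand coded, and border pixels, which lack four neighbors, are simply
left alone here):

    # Clamp each interior pixel to the range bounded above by the
    # second largest and below by the third largest of its four
    # vertical and horizontal neighbors, removing aberrations a
    # single pixel in extent.
    def clean(pic):
        h, w = len(pic), len(pic[0])
        out = [row[:] for row in pic]
        for i in range(1, h - 1):
            for j in range(1, w - 1):
                nbrs = sorted([pic[i - 1][j], pic[i + 1][j],
                               pic[i][j - 1], pic[i][j + 1]])
                lo, hi = nbrs[1], nbrs[2]   # third and second largest
                out[i][j] = min(max(pic[i][j], lo), hi)
        return out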
Now that interesting things in one image could be located, it
was time to write programs which could find them in the next. The
first problem here was to decide the measure by which a good match
would be detected. Quam uses normalized correlation, a measure
derived for finding the correlation between the x and y values of a
set of points in a two dimensional scatter diagram. If the values of
the pixels in each of the two windows between which the
correlation is to be calculated are considered as an n^2 dimensional
vector (n^2 being the number of points in each window), then this
correlation coefficient is the normalized dot product of the two
vectors (i.e. the cosine of the angle between them) after each one has
had the mean of its components subtracted from each component. The
critical property is that it is invariant under a linear
transformation of the points in either window. This may be desirable
if it is expected that the illumination or albedo of an area can
change between the two pictures, as is the case in the Mars pictures,
(although it is difficult to imagine what multiplication by a constant
factor means if the encoding is the log of brightness), but probably
throws information away unnecessarily if the pictures are taken under
similar conditions (as on the cart). Also, it requires the
computation of three sums of squares in the inner loop of the
correlation (which happens a number of times approximately equal to
the number of pixels in the correlation window (n^2), multiplied by
the number of pixels in the search window (which can be a good
fraction of the entire image)). In addition, each trial window
requires a square root and several multiplications and divisions
(this happens a number of times equal to the number of pixels in the
search window), which makes a non-trivial contribution to the run time.
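
For reference, a sketch of the measure being described (windows
flattened to lists of n^2 values; the expensive inner sums are
apparent):

    from math import sqrt

    # Normalized correlation: the cosine of the angle between the two
    # mean-subtracted n^2 dimensional vectors. Three sums in the inner
    # loop, plus a square root and divisions per trial window.
    def normalized_correlation(a, b):
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        den = sqrt(sum((x - ma) ** 2 for x in a) *
                   sum((y - mb) ** 2 for y in b))
        return num / den if den else 0.0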
An alternative measure is the sum of squares of differences,
which requires computation of only one sum in the inner loop, and
hardly any other computation. Implementation of an algorithm which
moved a small test window around a larger rectangular window,
calculating sums of squares of differences, and returned the window
position which resulted in the smallest such sum, provided moderately
satisfactory results, and revealed an unexpected problem. All of the
pictures it was tried on have a radial intensity gradient, running
from bright at the center to dark at the edges. This darkening is
caused by the fact that the vidicon target is flat (as opposed to
spherical), and light from the lens strikes it perpendicularly at the
center, and at a slant towards the edges, giving less light per unit
area there. It had the effect on my correlator of preventing a good
match when a feature was significantly closer to an edge (and hence
darker) in one picture than in the other. Normalized correlation
would, of course, not be affected by this, since it subtracts out the
mean of each window during the correlation. As mentioned before,
however, the penalty paid for this is the computation of three sums
in the inner loop of the correlation, instead of one. An operator
which could correct the variation once, for the whole of each picture,
would be considerably more efficient. The most satisfactory possibility
would be one which determined the extent of the darkening and removed
it, leaving the picture otherwise untouched. This is unfortunately
difficult to achieve, since the extent of the gradient varies from
camera to camera, day to day and scene to scene, and since, in a
single picture, the effects of gradient and real image are difficult
to isolate. A simpler fix is high pass filtering, which can be
implemented by subtracting the average of a window about each pixel
from the pixel. The combination of this preprocessing with the sum of
squares of differences correlation is similar in effect to subtracting
out the mean during the correlation, but faster, since the subtraction
is done only once for each pixel, instead of n^2 times. It has the
drawback, in common with normalized correlation, that it unnecessarily
throws away some information. I tried it in spite of this, and found
that it solved the edge darkening problem, and did not noticeably
degrade the performance in other ways.
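
A sketch of the combination (the averaging window size k and all
names are illustrative; the real inner loop is hand coded):

    # High pass filter: subtract from each pixel the average of a
    # (2k+1) by (2k+1) window about it. Done once per picture.
    def high_pass(pic, k):
        h, w = len(pic), len(pic[0])
        out = [[0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                box = [pic[a][b]
                       for a in range(max(0, i - k), min(h, i + k + 1))
                       for b in range(max(0, j - k), min(w, j + k + 1))]
                out[i][j] = pic[i][j] - sum(box) // len(box)
        return out

    # Sum of squares of differences search: slide the n by n source
    # window over the search area ((r0, c0) inclusive, (r1, c1)
    # exclusive), accumulating a single sum per trial position, and
    # return the position with the smallest sum.
    def ssd_search(src, pic, r0, c0, r1, c1):
        n = len(src)
        best, best_pos = None, None
        for r in range(r0, r1 - n + 1):
            for c in range(c0, c1 - n + 1):
                s = sum((src[i][j] - pic[r + i][c + j]) ** 2
                        for i in range(n) for j in range(n))
                if best is None or s < best:
                    best, best_pos = s, (r, c)
        return best_pos, best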
A related problem stemming from the geometry of the optics is
that there is a distortion which causes the images of objects to be
stretched outwards by an amount proportional to their distance from the
center, especially with a wide angle lens such as the 60 degree one
carried by the cart. Since the angular motion from frame to frame in a
cart sequence is primarily from side to side (pan), I was able to
correct for most of this effect by transforming the picture array, by
shifting columns around, in a way tantamount to projecting the
original picture onto a portion of a cylinder of the correct radius. It is
necessary to know the angular extent represented by the original
picture to do this correctly.
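
A sketch of the column mapping (assuming, as the text requires, that
the horizontal angular extent "fov" of the picture is known; the
resampling scheme here is my own illustration):

    from math import tan

    # For each destination column (uniformly spaced in angle), find
    # the source column of the original flat image: a flat image at
    # focal length f maps angle theta to column f * tan(theta).
    def cylinder_column_map(width, fov):
        f = (width / 2) / tan(fov / 2)       # focal length in pixels
        mapping = []
        for xd in range(width):
            theta = (xd / (width - 1) - 0.5) * fov
            xs = int(round(width / 2 + f * tan(theta)))
            mapping.append(min(max(xs, 0), width - 1))
        return mapping

Shifting each output column to its mapped source column is then
tantamount to projecting the picture onto the cylinder.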
A feature of Baumgart's sequence of pictures is that there are
few foreground features (mostly empty road), and those that there are
(e.g. occasional pieces of white line) are so changed from picture
to picture, due to the perspective change caused by the large forward
motion between pictures, or by the fact that they have moved out of
the field of view, as to be just about unrecognizable by human beings.
Needless to say, my program does no better, and usually finds the
wrong feature under such circumstances. One way of detecting such
lossages would be by means of a measure which indicates the goodness
of the match in some absolute way. The simple sum of squares of
differences does not have this property, and generally results in
small numbers when the window being looked for has a low variance, and
large numbers otherwise. Also, of course, the sum is proportional to
the number of points in the window, and varies with the number of bits
per sample. Dividing it by the sum of the squares of the difference
of each point in the source window from the mean of the window results
in a satisfactory normalization. Note that this division is done
only once in a correlation search, on the result of the best match.
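
A sketch of this normalization (names illustrative):

    # Divide the best sum of squared differences by the source
    # window's sum of squared deviations from its own mean. Small
    # quotients indicate good matches, independently of window size,
    # variance, and bits per sample. (The interest operator ensures
    # that selected source windows have nonzero variance.)
    def match_quality(best_ssd, src):
        flat = [v for row in src for v in row]
        mean = sum(flat) / len(flat)
        return best_ssd / sum((v - mean) ** 2 for v in flat)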
Another property of the picture sequence is the fact that
there is a fairly large amount of camera pan between some consecutive
images. Since it is desirable to keep the search windows for the
correlations as small as possible, my program first attempts to find a
particular point on the horizon in both pictures, so as to line them
up. The general flow for this portion of the program is as follows:
Read in two pictures, clean and high pass filter them.
Apply the interest operator in the first picture to find
likely features.
Pick the selected feature with the highest "interest" measure
in the top third of the picture and, using a fairly wide
search window, attempt to find it in the second. If the
"goodness of match" measure described in the last paragraph
indicates that this attempt has failed, pick another feature
in the top third of the image and try again, unless there
are no such features left untried, in which case terminate
with the error message "Couldn't find horizon".
Attempt to find the other features in the second image, in
small search windows centered on the position within the
image corresponding to the location of the features in the
first picture, offset by an amount equal to the shift of the
horizon feature found in the previous step. If the "goodness
of match" indicates failure for any given feature, forget
about that feature.
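
A sketch of this flow in terms of the pieces sketched earlier (the
filter size, search half-widths "wide" and "small", and the match
threshold "ok" are illustrative parameters):

    # Line up two pictures on a horizon feature, then find the
    # remaining features offset by the horizon shift.
    def match_pictures(pic1, pic2, n, wide, small, ok):
        pic1 = high_pass(clean(pic1), 5)
        pic2 = high_pass(clean(pic2), 5)
        h, w = len(pic1), len(pic1[0])
        feats = select_features(pic1, n, interest)

        def window(pic, r, c):
            return [row[c:c + n] for row in pic[r:r + n]]

        def try_find(r, c, cr, cc, half):
            # search a square of pic2 centered on (cr, cc)
            src = window(pic1, r, c)
            pos, best = ssd_search(src, pic2, max(0, cr - half),
                                   max(0, cc - half),
                                   min(h, cr + half), min(w, cc + half))
            if pos is None:
                return None
            return pos if match_quality(best, src) < ok else None

        # horizon: features in the top third, best "interest" first
        top = sorted((f for f in feats if f[0] < h // 3),
                     key=lambda f: interest(pic1, f[0], f[1], n),
                     reverse=True)
        shift = None
        for r, c in top:
            pos = try_find(r, c, r, c, wide)
            if pos:
                shift = (pos[0] - r, pos[1] - c)
                break
        if shift is None:
            raise RuntimeError("Couldn't find horizon")

        # the rest, in small windows offset by the horizon shift
        matches = {}
        for r, c in feats:
            pos = try_find(r, c, r + shift[0], c + shift[1], small)
            if pos:
                matches[(r, c)] = pos
        return shift, matches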
Now that features have been located in two pictures it becomes
possible to solve for their distances from the camera. If four or five
points can be reliably found in both pictures, the relative location
of the two cameras and the points, modulo a scaling factor, can be
determined without any additional information. The pictures on which
the first part of the program had been working rarely produced more
than one or two good features, so a naive solver of this kind was
considered unreasonable. If some prior assumptions about the movement
of the vehicle are made, fewer points are necessary to fix things. If
we trust that the horizon feature found in the first part of the
program really represents an object at virtual infinity, two
additional points suffice. If it is assumed that the vehicle moves on
a plane, and that the central axis of the camera is parallel to this
plane, then one point is sufficient. Camera calculations based on
this model were tried, and gave erratic results. The probable reason
for this is that the assumption that the camera looks parallel to the
road in the sample pictures is incorrect, and that in fact the camera
was angled downwards.
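
For illustration, the relation which makes one point sufficient under
this model (my own statement of the standard flat-ground geometry,
not the program's actual code): a road point imaged y pixels below
the horizon, by a camera of focal length f (in pixels) at height h
above the plane, lies at range

    range = f * h / y

so a single road point, tracked between two pictures, fixes the
distance traveled, up to the scale set by h. A downward camera tilt
shifts the apparent horizon and so corrupts the measured y, which is
consistent with the erratic results observed.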
Another assumption that makes single point camera solving
possible is that the camera is pointed down at some unknown angle, but
otherwise looks in the direction of motion, and that the motion of the
vehicle between the two pictures is a simple arc of a circle (of
initially unknown radius). A solver based on this model worked a
little better, but suffered from the fact that the relative change in
the position of various features was either so large that the
correlator was unable to find them, or so small that the solver
results were numerically unstable. An obvious solution to this dilemma
is to track a feature through several successive pictures, and to do
the camera calculations with the first and last images in such a
sequence. Since the correlator typically takes about four minutes (of
raw run time) for a picture's worth of features, I decided to put off
extending the program this way to sometime in the future, when my
algorithms are transplanted to the SPS41, where they may, perhaps, run
over ten times as quickly.
The current form of the program uses the arc assumption, and
eliminates any features which give erratic results under it.
The Next Year
Several ideas need investigation. The concept of using
closely spaced pictures for correlation, but doing the camera solving
over larger intervals, is one. Another potentially important
technique to be tried concerns reducing the time required for the
correlations. It may be possible to find the approximate location of a
feature more quickly by reducing the source and search windows, that
is, by substituting a single pixel containing the average of a number
of adjacent picture points for those points. This can be done at several
levels, with the correlation being done first at the coarsest, the
process being analogous to a binary search (the current method being
like a linear search). A related technique would use a fairly large
search window, but with the points thinned out towards the edges, and
dense in the center, so that a match must have a very good fit in the
middle, and general agreement on a larger scale. This would make
spurious matches less likely, without increasing the amount of
computation inordinately.
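
A sketch of the reduction step such a scheme would rest on (names
illustrative; the search itself could use the same sum of squares of
differences correlator, run first on the coarsest pair of pictures):

    # Halve a picture by replacing each 2 by 2 block of pixels with
    # one pixel containing their average. Applied repeatedly, this
    # yields a pyramid of pictures; a feature located at a coarse
    # level centers a small search window at the next finer level.
    def reduce(pic):
        h, w = len(pic) // 2, len(pic[0]) // 2
        return [[(pic[2 * i][2 * j] + pic[2 * i][2 * j + 1] +
                  pic[2 * i + 1][2 * j] + pic[2 * i + 1][2 * j + 1]) // 4
                 for j in range(w)]
                for i in range(h)]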
There is a certain fallacy in the current algorithm, which
assumes that features will remain unchanged between pictures. In
fact, with a wide field of view lens, such as the cart carries, there
is an aberration towards the edges, which could be corrected for.
Also, as features come closer to the camera they enlarge, and this
could be anticipated by distorting the source window, guided either by
a-priori information about the feature's position, or after it has
been found approximately, or continuously during the search for it,
guided by a camera model.
The plan is to examine these and other ideas, to implement the
better ones on the SPS and to incorporate them into a working system
which (slowly) drives the cart down the road, avoiding obstacles and
perhaps doing more interesting things (I expect that, with
sufficiently well designed primitives, interesting behaviour can be
achieved with a relatively small amount of high level programming).
Primitives which need to be thought about include an edge-of-the-road
detector, and perhaps some sort of template driven object recognizer.