The goal of this research was a program that could drive the cart
through realistic environments. I perceived the task to be barely
possible, and approached it in an uninhibited frame of mind; no code was
too dirty nor method too specialized, so long as it worked. Though I
expected to achieve some scientific insights, they would come as
byproducts of honest work, not from a direct assault.
\heading{Stereo Vision}
Computer vision, especially for rovers, is in such a tender and
undeveloped state (and so marginally feasible) that very little of the
prior progress in the field was of any help in writing my program.
Nevertheless, my efforts can be seen to have some relation to the other
work. With further development, some of the other methods might even be
useful in future cart-like systems.
For understandable reasons, much of the work on extracting three
dimensional information from pictures has been done with picture pairs, in
imitation of human binocular vision.
Viking Mars Lander picture pairs processed by Gennery \ref{G1}
originated from a camera pair. The Mariner 9 and Apollo 15 photograph
pairs processed by Quam and Hannah \ref{QH1} were taken by the same camera at
different locations while the spacecraft were orbiting Mars and the Moon.
The aerial photograph pairs used by Arnold \ref{A1} were taken successively by
a single camera aboard an airplane flying over the scene.
There have been a few efforts besides mine that use more than two
pictures. Baker \ref{B4} followed edges between multiple views of an object
that rotated in front of a camera, and Baumgart \ref{B5} used the silhouettes
in images obtained similarly to build models of objects, except for
concavities. Burr and Chien \ref{BC1} use picture triplets. Nevatia \ref{N1}
suggested using multiple images, taken successively by a single camera.
He foresees using his techniques on orbital, planetary surface, or
industrial data. In the latter case, successive images might be available
of some mechanical part travelling along a conveyor belt.
Area correlation of small patches, used at times by Quam, Hannah,
Nevatia, Gennery, and myself, is not the only way to find corresponding
parts of two images.
Arnold \ref{A1} matches edges, obtained from each image using a
standard edge operator, as do Burr and Chien. Quam and Hannah \ref{QH1} match
small areas, then grow them into larger areas, to obtain matches between
whole shapes in the images. Marr and Poggio \ref{MP1} use a method which
examines the whole image, choosing the features to be matched in parallel
with the matching itself.
If a sampling of depths in the image, rather than a dense depth
map, is sufficient for an application, then one must select the points to
range. A good idea is to pick points in one image which will be easy to
find in a second. Pingle and Thomas \ref{PT1} chose windows with high
brightness variance, which also got a positive response from a corner
finder.
My interest operator is a cheaper version of the same idea, though
by depending on local maxima of an ``interestingness'' measure, it is more
tolerant of regions that have low contrast. Yakimovsky and Cunningham
\ref{YC1} use an autocorrelation technique that is almost equivalent to my
interest measure.
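The flavor of such an operator can be sketched as follows (a minimal
version; the window size and the four-direction variance measure are
chosen for illustration):

```python
import numpy as np

def interest_measure(img, y, x, half=2):
    """Minimum over four directions of summed squared differences
    between adjacent pixels in the window around (y, x)."""
    w = img[y - half:y + half + 1, x - half:x + half + 1].astype(float)
    return min(
        ((w[:, 1:] - w[:, :-1]) ** 2).sum(),     # horizontal
        ((w[1:, :] - w[:-1, :]) ** 2).sum(),     # vertical
        ((w[1:, 1:] - w[:-1, :-1]) ** 2).sum(),  # diagonal
        ((w[1:, :-1] - w[:-1, 1:]) ** 2).sum())  # anti-diagonal

def pick_features(img, half=2):
    """Keep pixels whose interestingness is a local maximum."""
    h, w = img.shape
    m = np.zeros((h, w))
    for y in range(half, h - half):
        for x in range(half, w - half):
            m[y, x] = interest_measure(img, y, x, half)
    return [(y, x) for y in range(1, h - 1) for x in range(1, w - 1)
            if m[y, x] > 0 and m[y, x] == m[y-1:y+2, x-1:x+2].max()]
```

Taking the minimum over directions suppresses simple edges, which are
self-similar along one direction; requiring a local maximum spreads the
chosen features over the image.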
\heading{Gennery}
Although the current program works quite differently, my initial
approach to the obstacle avoider was with a vehicle carrying a single
fixed camera. The stereo baseline in that case would be generated by the
vehicle motion. Since the motion was known very imprecisely, it would
have to be deduced simultaneously with the feature distances. Gennery had
written a camera-solving subroutine which tackled this problem.
(Fortunately the whole approach proved too error-prone. If it had not, my
program would have had a major component that was someone else's prior
work. (shudder). It may still be used in future systems.)
Gennery's camera solver \ref{G1} is able to determine five of the six
parameters which describe the relative position and orientation of two
cameras viewing the same scene, from the image plane co-ordinates of five
or more feature points seen by both cameras. The algorithm determines
the azimuth and elevation of the second camera relative to the position of
the first, and the pan, tilt and roll of the second relative to the
orientation of the first. The sixth parameter, the distance between the
cameras, cannot be determined, in principle, from image co-ordinates
alone.
Gennery has developed another program that may be useful to future
carts. This one deduces the position and shape of the ground directly
from a stereo pair of pictures of a littered scene \ref{G2}. It assumes the
ground surface can be approximated by a two dimensional second degree
polynomial (though a first degree fit can be forced {\sl a priori} if
desired). The algorithm (which begins by finding a rough camera model
using points found with my interest and correlation operators!) works
with a large number of three dimensional positions of points in the scene,
found with a high resolution correlator that gives probability ratings to
matches, based on a detailed statistical analysis of the errors expected
in two noisy images. It separates the points into ground and not ground
by successively doing a least squares fit, and then selecting all points
within tolerance of the candidate surface or below it for the next
fitting. This has the effect of finding lower and lower surfaces, with
non-ground clutter above them. Additional heuristics help eliminate
spurious points.
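The separation step can be sketched as follows (a toy version; the
tolerance, iteration count, and surface degrees are placeholders):

```python
import numpy as np

def basis(x, y, degree):
    """Polynomial terms of the candidate surface z = f(x, y)."""
    cols = [np.ones_like(x), x, y]
    if degree == 2:
        cols += [x * x, x * y, y * y]
    return np.column_stack(cols)

def find_ground(points, degree=1, tol=0.2, iters=5):
    """Iteratively fit a surface by least squares, each time keeping
    only the points within tolerance of, or below, the current fit,
    so the surface settles down onto the ground under the clutter."""
    pts = np.asarray(points, float)
    x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]
    keep = np.ones(len(pts), bool)
    for _ in range(iters):
        A = basis(x[keep], y[keep], degree)
        coeffs, *_ = np.linalg.lstsq(A, z[keep], rcond=None)
        pred = basis(x, y, degree) @ coeffs   # ground height everywhere
        keep = z <= pred + tol                # at, near, or below the fit
    return coeffs, keep
```

Because points above the surface are discarded while points below it are
retained, each refit is pulled downward, which is what lets the fit slide
under the clutter.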
\heading{Hannah and Quam}
Quam and Hannah \ref{QH1} combined techniques they had developed
earlier (Quam \ref{Q1} and Hannah \ref{H1}) to produce contour maps of portions
of the Lunar and Martian surfaces from stereo pairs of photographs from
the Mariner 9 and Apollo 15 spacecraft. The program is started by hand,
with the operator indicating a particular planetary surface point in both
images. Around
this seed, the program grows a region of point pairings whose surrounding
windows show a local maximum at the same disparity. When all such
contiguous points have been exhausted, a nearby non-matching point is
taken, and a local search performed to find a corresponding point which
maximizes the correlation of the surrounding windows. If the correlation
is sufficiently high, the region growing procedure is repeated. If the
program fails to find any suitable new seeds, it appeals to the human
operator again. Finally the depth estimates are refined by interpolating
the correlation measure for a three by three cell around each point.
An early version of Gennery's camera solver then generates 3D
co-ordinates of the surface points, which are used to make contour maps
that sometimes look right.
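The growth step can be sketched like this (a simplified version: a single
fixed disparity per region, normalized correlation, and a hypothetical
window size and threshold):

```python
import numpy as np

def ncc(a, b):
    """Normalized correlation of two equal-size windows."""
    a = a - a.mean()
    b = b - b.mean()
    d = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / d if d > 0 else 0.0

def grow_region(left, right, seed, disp, half=2, thresh=0.9):
    """Grow outward from a hand-given seed pairing: a neighboring
    point joins the region if its window in the left image correlates
    well with the window at the same disparity in the right image."""
    h, w = left.shape
    region, frontier = set(), [seed]
    while frontier:
        y, x = frontier.pop()
        if (y, x) in region:
            continue
        if not (half <= y < h - half and half <= x < w - half
                and half <= x + disp < w - half):
            continue          # window would fall off an image
        wl = left[y - half:y + half + 1, x - half:x + half + 1]
        wr = right[y - half:y + half + 1, x + disp - half:x + disp + half + 1]
        if ncc(wl, wr) < thresh:
            continue          # correlation too poor; stop growing here
        region.add((y, x))
        frontier += [(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)]
    return region
```

A fuller version would let the disparity drift within the region and,
when growth stalls, search locally for a new seed, as Quam and Hannah do.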
\heading{Arnold}
Area correlation techniques tend to fail on object boundaries
because the edge of an object appears against a different background in
each picture. Arnold's stereo vision system \ref{A1} avoids this problem by
matching edgels (short edge segments), extracted from the images with the
Hueckel operator. He incorporates local context information in the
matching process, and eventually determines a depth map of height above
the ground, for the edges of the regions.
Edge detectors are unable to operate well in the presence of
texture or smooth shading, so Arnold suggests that eventually edge based
methods, and area correlation methods should be combined. I say the same
thing.
Arnold's system was developed on aerial photographs of airport and
parking lot scenes. After finding a camera model and a ground plane with
Gennery's solver, the program applies a 3.19 pixel radius circular Hueckel
operator to both images, typically producing around 1200 edgels. Each
edge has position, angle, and the brightness on each side. The spatial
information is normalized using the information from the camera solver and
ground finder, sharply limiting the search area for matching edgels.
Thresholds on the allowable differences between angles and brightness
levels further constrain the plausible matches. Typically there might be
8 possible matches in the right hand image for an edgel in the left.
Disparities are calculated for each of these possible matches and attached
to the edges in one of the pictures.
Local context further resolves the ambiguity. Two edgels are
considered to belong to the same physical edge if they are within 3 pixels
and $90\deg$ of each other, if the angle of a line connecting their
centers lies between their angles, and if the brightness levels are
consistent on at least one side. A voting scheme weighted by edgel
distances from
each other and by disparity disagreements finds a likely disparity, and
hence a height above the ground, for such physical edges.
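That pairing test might look like this in outline (the edgel
representation and the brightness tolerance are invented for the sketch):

```python
import math

def same_physical_edge(e1, e2, max_dist=3.0, max_angle=math.pi / 2):
    """Edgels are (x, y, angle, bright_left, bright_right) tuples."""
    x1, y1, a1, l1, r1 = e1
    x2, y2, a2, l2, r2 = e2
    if math.hypot(x2 - x1, y2 - y1) > max_dist:
        return False                     # more than 3 pixels apart
    da = abs((a1 - a2 + math.pi) % (2 * math.pi) - math.pi)
    if da > max_angle:
        return False                     # angles differ by over 90 degrees
    # angle of the connecting line should lie between the edgel angles
    c = math.atan2(y2 - y1, x2 - x1)
    lo, hi = min(a1, a2), max(a1, a2)
    if not (lo <= c <= hi or lo <= c + math.pi <= hi
            or lo <= c - math.pi <= hi):
        return False
    # brightness consistent on at least one side (hypothetical tolerance)
    return abs(l1 - l2) < 10 or abs(r1 - r2) < 10
```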
\heading{Burr and Chien}
Burr and Chien's system \ref{BC1} is given piece-wise linear wire
frame models of objects to be found in scenes. It assumes three fixed
cameras, one in the center, another rotated $20\deg$ east, and a third
rotated $10\deg$ north. The program compares center and east pictures
when matching vertical edges, and center and north for horizontal edges.
Depth maps are made using the region matching algorithms of Hannah \ref{H1},
then an edge operator is used on the center image. Edgels are linked into
chains of near neighbors with consistent directionality. Then the depth
maps guide a local search in one of the two other images for a matching
edge. The 3D edges so obtained are matched against the wire frame models
to locate objects in the scene.
\heading{Mori et al.}
Mori et al. \ref{MKA1} try to make extremely accurate depth maps from
aerial photographs. The camera model is determined from accurately known
ground points. A fixed size correlation window is inadequate for the
desired accuracy because a patch of sloped surface can vary significantly
in size between two views. They make an initial approximation using a
standard feature correlator, then correct one of the images for
distortions that the estimated ground surface gradients would have caused.
The process is iterated, and eventually an accurate map is produced.
\heading{Yakimovsky and Cunningham}
Yakimovsky and Cunningham's \ref{YC1} stereo program was developed
for the JPL Robotics Research Vehicle, a testbed for a Mars rover and
remote processing systems (Levine et al. \ref{LOY1} describe an earlier stereo
system using the same hardware). The RRV has a pair of rigidly attached
cameras, with highly linear lenses, so the search space is a very narrow
line parallel to the projection of the stereo axis. A small correlation
window is chosen in one image and expanded until its correlation with
itself shifted one pixel is sufficiently poor, showing it has adequate
information content, or until it gets too large (this is essentially an
interest operator step). Windows that pass this test are correlated with
the other image.
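In outline, the test works like this (the sizes and threshold here are
placeholders; the self-correlation is against the window shifted one
pixel horizontally):

```python
import numpy as np

def adequate_window(img, y, x, max_half=8, thresh=0.9):
    """Grow a window at (y, x) until its normalized correlation with
    itself shifted one pixel drops below a threshold, i.e. until it
    contains enough texture to be matchable, or give up."""
    for half in range(2, max_half + 1):
        w = img[y - half:y + half + 1, x - half:x + half + 1].astype(float)
        a = w[:, 1:] - w[:, 1:].mean()    # window minus first column
        b = w[:, :-1] - w[:, :-1].mean()  # window minus last column
        d = np.sqrt((a * a).sum() * (b * b).sum())
        corr = (a * b).sum() / d if d > 0 else 1.0
        if corr < thresh:
            return half       # enough information at this size
    return None               # too bland: reject this window
```

A textured patch decorrelates quickly under a one-pixel shift and is
accepted small; a bland or smoothly shaded patch stays self-similar and
is rejected.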
\heading{Marr and Poggio}
Marr and Poggio \ref{MP1} have a different approach. Their matching
pattern has no precise edges; it is fuzzy and implicit, arising from a
cooperative (or relaxation) algorithm in which independent measurements
made at nearby parts of the image inhibit or reinforce each other. They
start with two general principles: each image point may be assigned at
most one disparity and, except for a few boundaries, disparity varies
smoothly almost everywhere.
For each pixel in one image and for a range of disparities there
is a node representing the likelihood that the corresponding pixel in
the other image shows the same physical object. The nodes are
initialized by a simple pixel comparison. An iterative process then
repeatedly updates each pixel/disparity node with a normalized measure of
how surrounding nodes of the same disparity agree.
Marr and Poggio's program works well on random-dot stereograms
(Julesz \ref{J1}), taking about 15 iterations to produce a solid representation
of the hidden figures.
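The update rule can be sketched on a random-dot pair (the excitation and
inhibition weights and the threshold below are invented for the sketch;
the real algorithm shapes its neighborhoods more carefully):

```python
import numpy as np

def cooperative_stereo(left, right, max_disp=3, iters=15):
    """Toy cooperative net: one binary node per (disparity, pixel),
    initially set wherever the two images agree at that disparity.
    Nodes are excited by same-disparity neighbors and inhibited by
    competing disparities at the same pixel."""
    c0 = np.stack([(left == np.roll(right, d, axis=1)).astype(float)
                   for d in range(max_disp + 1)])
    c = c0.copy()
    for _ in range(iters):
        # excitation: sum over the 3x3 same-disparity neighborhood
        excit = np.zeros_like(c)
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                excit += np.roll(np.roll(c, dy, axis=1), dx, axis=2)
        # inhibition: active nodes at other disparities, same pixel
        inh = c.sum(axis=0, keepdims=True) - c
        # threshold the combined support (weights are hypothetical)
        c = ((excit - inh + c0) > 4.0).astype(float)
    return c.argmax(axis=0)   # a surviving disparity per pixel
```

Spurious matches at wrong disparities lack coherent neighborhood support
and die off over the iterations, while the true disparity layer, which is
consistent everywhere, reinforces itself.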
In 1979 Marr and Poggio \ref{MP2} published, and Grimson and Marr \ref{GM1}
implemented, a different approach that involves low pass filtering the images to
various resolutions, then correlating on the zero crossings in the
filtered images. A low frequency version provides a coarse positioning,
and the less filtered images are then used to refine the positions. In
some ways the process is similar to the binary search correlation method
described in this thesis.
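The coarse-to-fine idea can be sketched for a pure horizontal shift (the
pyramid depth and search radius here are arbitrary, and simple column
averaging stands in for proper low pass filtering):

```python
import numpy as np

def coarse_to_fine_shift(img1, img2, levels=3):
    """Estimate the horizontal shift between two images by matching
    coarse, low resolution versions first, then refining the estimate
    at each successively finer level."""
    pyr1, pyr2 = [np.asarray(img1, float)], [np.asarray(img2, float)]
    for _ in range(levels - 1):
        # halve the horizontal resolution by averaging column pairs
        pyr1.append((pyr1[-1][:, ::2] + pyr1[-1][:, 1::2]) / 2)
        pyr2.append((pyr2[-1][:, ::2] + pyr2[-1][:, 1::2]) / 2)
    shift = 0
    for a, b in zip(reversed(pyr1), reversed(pyr2)):
        shift *= 2   # the coarse estimate, in this level's pixels
        # search a small neighborhood around the doubled estimate
        shift = min(range(shift - 2, shift + 3),
                    key=lambda s: ((a - np.roll(b, s, axis=1)) ** 2).sum())
    return shift
```

Each level needs only a constant-radius search, since the level above has
already localized the answer to within a pixel or two at its scale.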
\heading{Path Planning}
Plausible or minimum cost path finding is a simpler and, in an
abstract sense, less interesting problem than stereo vision. A number of
approaches have been published, some heuristic, many depending on exhaustive
search. Some of these depend intimately on a ``Manhattan distance''
geometry (paths run exclusively in the X or Y directions), which is good
for circuit board layout, but uninteresting to us.
Many of the more general approaches, including the one
implemented in the cart program, translate the problem into
a connected graph with arc costs. The best general solution
for finding the shortest distance in such a graph is still
the $O(n^2)$ algorithm published by Dijkstra \ref{D1}. My program uses a
variant of it.
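For reference, the graph step looks like this (sketched with a priority
queue, a common refinement of the original $O(n^2)$ scan; the graph
encoding is arbitrary):

```python
import heapq

def dijkstra(graph, start):
    """Shortest path cost from start to every reachable node, for a
    graph given as {node: [(neighbor, cost), ...]}."""
    dist = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float('inf')):
            continue                    # stale queue entry; skip it
        for v, c in graph.get(u, []):
            nd = d + c
            if nd < dist.get(v, float('inf')):
                dist[v] = nd            # found a shorter route to v
                heapq.heappush(heap, (nd, v))
    return dist
```

In an obstacle avoider the nodes would be tangent points on the obstacle
circles and the arc costs the tangent and arc lengths between them.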
A number of heuristic methods have been presented. SRI's Shakey
system used one such. A generalization is described by Hart, Nilsson and
Raphael \ref{HNR1}. They call it the A* algorithm.
Recently Lozano-P\'erez and Wesley \ref{LW1} published an obstacle
avoidance algorithm whose core is nearly identical to mine. Instead
of circular obstacles, theirs avoids polygons. Instead of circle
to circle tangents, they consider vertex to vertex paths (which
is a little easier, because there are fewer vertices than tangents).
Their program, like mine, maps the possible path space into a graph,
and solves with an efficient graph algorithm.