The goal of this research was a program that could drive the cart through realistic environments. I perceived the task to be barely possible, and approached it in an uninhibited frame of mind; no code was too dirty nor method too specialized, so long as it worked. Though I expected to achieve some scientific insights, they would come as byproducts of honest work, not from a direct assault.

Computer vision, especially for rovers, is in such a tender and undeveloped state (and so marginally feasible) that very little of the prior progress in the field was of any help in writing my program. Nevertheless, my efforts can be seen to have some relation to the other work. With further development, some of the other methods might even be useful in future cart-like systems.

For understandable reasons, much of the work on extracting three dimensional information from pictures has been done with picture pairs, in imitation of human binocular vision.

Viking Mars Lander picture pairs processed by Gennery \ref{G1} originated from a camera pair. The Mariner 9, and Apollo 15 photograph pairs processed by Quam and Hannah \ref{QH1}, were taken by the same camera at different locations while the spacecraft were orbiting Mars and the Moon. The aerial photograph pairs used by Arnold \ref{A1} were taken successively by a single camera aboard an airplane flying over the scene.

There have been a few efforts besides mine that use more than two pictures. Baker \ref{B4} followed edges between multiple views of an object that rotated in front of a camera, and Baumgart \ref{B5} used the silhouettes in images obtained similarly to build models of objects, except for concavities. Burr and Chien \ref{BC1} use picture triplets. Nevatia \ref{N1} suggested using multiple images, taken successively by a single camera. He forsees using his techniques on orbital, planetary surface, or industrial data. In the latter case, successive images might be available of some mechanical part travelling along a conveyor belt.

Area correlation of small patches, used at times by Quam, Hannah, Nevatia, Gennery and myself is not the only way to find corresponding parts of two images.

Arnold \ref{A1} matches edges, obtained from each image using a standard edge operator, as do Burr and Chien. Quam and Hannah \ref{QH1} match small areas, then grow them into larger areas, to obtain matches between whole shapes in the images. Marr and Poggio \ref{MP1} use a method which examines the whole image and chooses the features matched in parallel with doing the matching.

If a sampling of depths in the image, rather than a dense depth map, is sufficient for an application, then one must select the points to range. A good idea is to pick points in one image which will be easy to find in a second. Pingle and Thomas \ref{PT1} chose windows with high brightness variance, which also got a positive response from a corner finder.

My interest operator is a cheaper version of the same idea, though by depending on local maxima of an “interestingness” measure, it is more tolerant of regions that have low contrast. Yakimovsky and Cunningham \ref{YC1} use an autocorrelation technique that is almost equivalent to my interest measure.

Although the current program works quite differently, my initial approach to the obstacle avoider was with a vehicle carrying a single fixed camera. The stereo baseline in that case would be generated by the vehicle motion. Since the motion was known very imprecisely, it would have to be deduced simultaneously with the feature distances. Gennery had written a camera-solving subroutine which tackled this problem. (Fortunately the whole approach proved too error-prone. If it had not, my program would have had a major component that was someone else's prior work. (shudder). It may still be used in future systems).

Gennery's camera solver \ref{G1} is able to determine five of the six parameters which relate the relative position and orientation of two cameras viewing the same scene, from the image plane co-ordinates of five or more features points seen by both cameras. The algorithm determines the azimuth and elevation of the second camera relative to the position of the first, and the pan, tilt and roll of the second relative to the orientation of the first. The sixth parameter, the distance between the cameras, can not be determined, in principle, from image co-ordinates alone.

Gennery has developed another program that may be useful to future
carts. This one deduces the position and shape of the ground directly
from a stereo pair of pictures of a littered scene \ref{G2}. It assumes the
ground surface can be approximated by a two dimensional second degree
polynomial (though a first degree fit can be forced *a priori* if
desired). The algorithm (which begins by finding a rough camera model
using points found with my interest and correlation operators!) works
with a large number of three dimensional positions of points in the scene,
found with a high resolution correlator that gives probability ratings to
matches, based on a detailed statistical analysis of the errors expected
in two noisy images. It separates the points into ground and not ground
by successively doing a least squares fit, and then selecting all points
within tolerance of the candidate surface or below it for the next
fitting. This has the effect of finding lower and lower surfaces, with
non-ground clutter above them. Additional heuristics help eliminate
spurious points.

Quam and Hannah \ref{QH1} combined techniques they had developed earlier, (Quam \ref{Q1} and Hannah \ref{H1}), to produce contour maps of portions of the Lunar and Martian surfaces from stereo pairs of photographs from the Mariner 9 and Apollo 15 spacecraft. They start their program by hand by indicating a particular planetary surface point in both images. Around this seed, the program grows a region of point pairings whose surrounding windows show a local maximum at the same disparity. When all such contiguous points have been exhausted, a nearby non-matching point is taken, and a local search performed to find a corresponding point which maximizes the correlation of the surrounding windows. If the correlation is sufficiently high, the region growing procedure is repeated. If the program fails to find any suitable new seeds, it appeals to the human operator again. Finally the depth estimates are refined by interpolating the correlation measure for a three by three cell around each point.

An early version of Gennery's camera solver then generates 3D co-ordinates of the surface points, which are used to make contour maps that sometimes look right.

Area correlation techniques tend to fail on object boundaries because the edge of an object appears against a different background in each picture. Arnold's stereo vision system \ref{A1} avoids this problem by matching edgels (short edge segments), extracted from the images with the Hueckel operator. He incorporates local context information in the matching process, and eventually determines a depth map of height above the ground, for the edges of the regions.

Edge detectors are unable to operate well in the presence of texture or smooth shading, so Arnold suggests that eventually edge based methods, and area correlation methods should be combined. I say the same thing.

Arnold's system was developed on aerial photographs of airport and parking lot scenes. After finding a camera model and a ground plane with Gennery's solver, the program applies a 3.19 pixel radius circular Hueckel operator to both images, typically producing around 1200 edgels. Each edge has position, angle, and the brightness on each side. The spatial information is normalized using the information from the camera solver and ground finder, starkly limiting the search area for matching edgels. Thresholds on the allowable differences between angles and brightness levels further constrain the plausible matches. Typically there might be 8 possible matches in the right hand image for an edgel in the left. Disparities are calculated for each of these possible matches and attached to the edges in one of the pictures.

Local context further resolves the ambiguity. Two edgels are considered to belong to the same physical edge if they are within 3 pixels and 90° of each other, and the angle of a line connecting their centers lies between their angles and the brightness levels are consistent on at least one side. A voting scheme weighted by edgel distances from each other and by disparity disagreements finds a likely disparity, and hence a height above the ground, for such physical edges.

Burr's and Chien's system \ref{BC1} is given piece-wise linear wire frame models of objects to be found in scenes. It assumes three fixed cameras, one in the center, another rotated 20° east, and a third rotated 10° north. The program compares center and east pictures when matching vertical edges, and center and north for horizontal edges. Depth maps are made using the region matching algorithms of Hannah \ref{H1}, then an edge operator is used on the center image. Edgels are linked into chains of near neighbors with consistent directionality. Then the depth maps guide a local search in one of the two other images for a matching edge. The 3D edges so obtained are matched against the wire frame models to locate objects in the scene.

Mori et al. \ref{MKA1} try to make extremely accurate depth maps from aerial photographs. The camera model is determined from accurately known ground points. A fixed size correlation window is inadequate for the desired accuracy because a patch of sloped surface can vary significantly in size between two views. They make an initial approximation using a standard feature correlator, then correct one of the images for distortions that the estimated ground surface gradients would have caused. The process is iterated, and eventually an accurate map is produced.

Yakimovsky's and Cunningham's \ref{YC1} stereo program was developed for the JPL Robotics Research Vehicle, a testbed for a Mars rover and remote processing systems (Levine et al. \ref{LOY1} describe an earlier stereo system using the same hardware). The RRV has a pair of rigidly attached cameras, with highly linear lenses, so the search space is a very narrow line parallel to the projection of the stereo axis. A small correlation window is chosen in one image and expanded until its correlation with itself shifted one pixel is sufficiently poor, showing it has adequate information content, or until it gets too large (this is essentially an interest operator step). Windows that pass this test are correlated with the other image.

Marr and Poggio \ref{MP1} have a different approach. Their matching pattern has no precise edges, it is fuzzy and implicit from a cooperative (or relaxation) algorithm in which independent measurements made at nearby parts of the image inhibit or reinforce each other. They start with two general principles; Each image point may assigned at most one disparity and, except for a few boundaries, disparity varies smoothly almost everywhere.

For each pixel in one image and for a range of disparities there is a node which represents the likelihood that the corresponding pixel in the other image represents the same physical object. The nodes are initialized by a simple pixel comparison. An iterative process then repeatedly updates each pixel/disparity node with a normalized measure of how surrounding nodes of the same disparity agree.

Marr's and Poggio's program works well on random-dot stereograms. (Julesz \ref{J1}), taking about 15 iterations to produce a solid representation of the hidden figures.

In 1979 Marr and Poggio \ref{MP2} published and Grimson and Marr \ref{GM1} implemented a different approach that involves low pass filtering the images to various resolutions, then correlating on the zero crossings in the filtered images. A low frequency version provides a coarse positioning, and the less filtered images are then used to refine the positions. In some ways the process is similar to the binary search correlation method described in this thesis.

Plausible or minimum cost path finding is a simpler, and in an abstract sense, less interesting problem than stereo vision. A number of approaches have been published, some heuristic, many depending on exhaustive search. Some of these depend intimately on a “Manhattan distance” (runs exactly in X or Y directions exclusively) geometry, which are good for circuit board layout, but uninteresting to us.

Many of the more general approaches, including the one implemented in the cart program, translate the problem into a connected graph with arc costs. The best general solution for finding the shortest distance in such a graph is still the $O(n^2)$ algorithm published by Dijkstra \ref{D1}. My program uses a variant of it.

A number of heuristic methods have been presented. SRI's Shakey system used one such. A generalization is described by Hart, Nilsson and Raphael \ref{HNR1}. They call it the A* algorithm.

Recently Lozano-Perez and Wesley \ref{LW1} published an obstacle avoidance algorithm whose core is nearly identical to mine. Instead of circular obstacles, their's avoids polygons. Instead of circle to circle tangents, they consider vertex to vertex paths (which is a little easier, because there are fewer vertices than tangents). Their program, like mine, maps the possible path space into a graph, and solves with an efficient graph algorithm.