Robot Spatial Perception
by Stereoscopic Vision
and 3D Evidence Grids

Hans Moravec
Robotics Institute
Carnegie Mellon University
Pittsburgh, PA 15213 USA

September 1996


Acknowledgements: Thanks to my hosts Frieder Lohnert, Dimitris Georgianis, Volker Hansen and the other members of Daimler Benz Research, Berlin, who provided the encouragement and means to complete this work. Special thanks go to Ralph Sasse for several times providing critical help in key areas, taking time from his own consuming research. Thanks also to my CMU support network, especially Martin Martin, Mike Blackwell and Kevin Dowling. A major component of the program described here, the evidence ray thrower, was written during my 1992 sabbatical year at Thinking Machines Corp. in Cambridge, Massachusetts - many thanks for that opportunity. Long term support for this work was provided since 1982 by the US Office of Naval Research under N00014-93-1-0765 and previous contracts. The impetus for the evidence grid idea came from a research contract from Denning Mobile Robotics Inc. in 1983. Final thanks go to Christopher Priem for so carefully studying my program, and generating the large data set whose images are found in the report's appendix.


Abstract

Very encouraging results have been obtained from a new program that derives a dense three-dimensional evidence grid representation of a robot's surroundings from wide-angle stereoscopic images. The program adds several spatial rays of evidence to a grid for each of about 2,500 local image features chosen per stereo pair. It was used to construct a 256x256x64 grid, representing 6 by 6 by 2 meters, from a hand-collected test set of twenty stereo image pairs of an office scene. Fifty nine stereo pairs of an 8 by 8 meter laboratory were also processed. The positive (probably occupied) cells of the grids, viewed in perspective, resemble dollhouse scenes. Details as small as the curvature of chair armrests are discernible. The processing time, on a 100 MIPS Sparc 20, is less than five seconds per stereo pair, and total memory is under 16 megabytes. The results seem abundantly adequate for very reliable navigation of freely roaming mobile robots, and plausibly adequate for shape identification of objects bigger than 10 centimeters. The program is a first proof of concept, and awaits optimizations, enhancements, variations, extensions and applications.



Introduction

This report describes a new program that transforms stereoscopic images, as could be obtained from a camera-equipped roving robot, into dense three-dimensional maps of the robot's immediate surroundings. The quantity of good map data generated by the program is 100 to 1,000 times greater than that of previous systems relying on 2D or sparse 3D representations, suggesting that the statistical reliability of navigation programs using it should be very high. The program was tested with 20 stereo image pairs of an office, and a second data set of 59 stereo pairs of a larger laboratory. The resulting 3D maps show object shape detail to a scale of about 10 cm. By using higher resolution grids, features of a few centimeters can be resolved.

Before mapping, the stereo camera fixture is calibrated. The fixture is leveled and positioned, with reasonable accuracy, perpendicular to and a known distance in front of a screen patterned with a precise square array of about 400 black spots on a white background. A program designed to deal with the fish-eye distortion of wide-angle lenses finds the spots in the cameras' digitized images, and constructs a "rectification" function for each camera that corrects for the distortion and small mounting misalignments. In use, the rectification functions are compiled into image-sized lookup tables, which geometrically transform raw camera images so they appear to have come from a specified ideal optical geometry.

During mapping, a sequence of stereo image pairs is processed. The results accumulate in a three-dimensional array called an "evidence grid", whose cells represent regions of space. Each grid cell accumulates evidence, positive and negative, that its region is occupied. The grid cells are initialized to zero, indicating no evidence for or against the cell being occupied. After sufficient data has accumulated, blocks of negative cells indicate free space, while positive cells define objects. The program's computationally efficient additive "weight of evidence" metric can be interpreted in probability theory as Bayesian combination using a log(p/(1-p)) "log-odds" representation of probability.
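As a rough illustration of this bookkeeping, the sketch below (not the actual "volsense" code; the fixed-point scale is an arbitrary assumption) shows how probabilities stored as scaled integer log-odds make Bayesian combination a single addition per cell.

/* Minimal sketch of log-odds evidence accumulation in one grid cell.
   Probabilities are stored as scaled integer log-odds, so a Bayesian
   update is a single integer addition.  LOGODDS_SCALE is hypothetical. */
#include <math.h>
#include <stdio.h>

#define LOGODDS_SCALE 16.0

/* convert a probability to scaled integer log-odds */
static int prob_to_logodds(double p)
{
    return (int)floor(LOGODDS_SCALE * log(p / (1.0 - p)) + 0.5);
}

/* convert scaled integer log-odds back to a probability */
static double logodds_to_prob(int lo)
{
    double x = exp(lo / LOGODDS_SCALE);
    return x / (1.0 + x);
}

int main(void)
{
    int cell = 0;                     /* 0 means p = 0.5, no evidence either way */
    cell += prob_to_logodds(0.7);     /* a reading suggesting "occupied"         */
    cell += prob_to_logodds(0.7);     /* an independent second reading           */
    cell += prob_to_logodds(0.4);     /* weak contrary evidence                  */
    printf("combined occupancy probability: %.3f\n", logodds_to_prob(cell));
    return 0;
}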

Processing of each stereo pair begins with image rectification. An "interest operator" then selects about 2,500 local features in the two images, each chosen to be a good candidate for locating in the other image of the pair. For each, a "correlator" scans the possible corresponding locations in the other image, producing a curve of goodness of match. The peaks in this curve each correspond to a possible distance for the feature. For each of the possibilities, the program throws two rays of evidence into the grid, one from each camera position towards the inferred feature position. The rays have positive evidence of occupancy at the feature, and negative evidence between the camera and feature positions, and are weighted for the estimated overall probability of correctness of each peak.

In our example, we used a grid 256 cells wide by 256 cells deep by 64 cells high, scaled to represent a volume 6 meters by 6 meters by 2 meters. Images in the data set were obtained by moving the tripod-mounted stereo fixture, tilted down about 15 degrees, by hand, to vertices on a 50 cm floor grid, maintaining parallel camera orientations. The resulting grid was displayed in multiple 2D X, Y and Z slices, with probabilities shown in gray scale. A more holistic viewing method generated perspective 3D images of the positive cells of the grid, with cells in about a dozen box-shaped regions containing major objects distinctively colored. The latter display clearly shows the shape of furniture-sized objects in the scene, with a level of detail reminiscent of doll-house pictures.

Results from a second, larger, dataset of 59 stereo pairs of an 8 by 8 meter laboratory are shown in the appendix.

The existing program succeeds as a proof of concept, but awaits improvement in nearly every detail. Some of the work will likely be accomplished by a learning procedure that adjusts many program parameters to maximize a quality criterion. A promising criterion is a comparison of some of the original images of the scene with corresponding "virtual" images of the grid. The grid will first be colorized in original scene colors, by back-projecting other original images into the grid's occupied cells.

Research Background

The results described here build on 25 years of work in mobile robot perception. From 1971 to 1983 the author and colleagues developed stereo-vision-guided robots that drove through clutter by tracking a few dozen image features, presumed to be parts of objects, in camera images. Through several versions, the control program gained speed and accuracy, but a brittle failure mode persisted, misdirecting the robot after approximately 100 meters of travel, when chance clusters of tracking errors fooled geometric consistency checks [Moravec83].

We invented the so-called evidence grid approach in 1983, to handle data from inexpensive Polaroid sonar devices, whose wide beams leave angular position ambiguous. Instead of determining the location of objects, the grid method accumulated the "objectness" of locations, arranged in a grid, slowly resolving ambiguities about which places were empty, and which filled, as the robot moved. The first implementation worked surprisingly well, showing none of the brittleness of the old approach. It could repeatedly map and guide a robot across a cluttered test lab [Moravec-Elfes85]. It worked on a tree lined path, in a coal mine, with stereo vision range data, combined stereo and sonar, and with probability theory replacing an ad-hoc formulation. Its first failure, in uncluttered surroundings with smooth walls, led us to a major extension, the learning of sensor evidence patterns. The evidence contributed by individual sonar readings had been hand-derived from the sensor's signal pattern, a poor model for interaction with mirrorlike walls. We are now able to train the program to work nicely in mirror surroundings, and superbly elsewhere [Moravec-Blackwell93]. A summary of the work to date is found in [Martin-Moravec96].

Past work was with 2D grids of a few thousand cells, all that 1980s computers could handle in near real time. In 1992 we wrote a very efficient implementation of the central operation for three dimensions, that can throw thousands of evidence rays per second into 3D grids with several million cells, on 100 MIPS computers. The work described in this report combines that program with several new components into a complete robot stereoscopic 3D mapping package.

Camera Calibration and Image Rectification

The "flatfish" calibration program assumes its images come from solid state cameras with geometrically precise imaging surfaces, and good quality optics whose distortions are rotationally symmetric about their optical axis. The program constructs rectification functions that transform slightly off-center, rotated fish-eye images from imperfectly mounted wide-angle cameras into images exhibiting specified simple projective geometry, suitable for direct use by stereoscopic vision programs. Though designed for wide-angle optics, it also works with narrow angle lenses exhibiting little distortion.

For each camera, the program inputs an image of a calibration spot pattern, an aim point on the pattern, the measured distance between the lens center and the pattern, and the desired angular field of view of the rectified image. It produces a file of calibration parameters which contains the necessary information to rectify images from the same camera to these specifications. The program also contains code which implements the rectification. A version of this rectification code is packaged in "fisheye", a set of image processing routines, where it is used to compile a lookup table for rapidly applying the rectification. A side effect of running "flatfish" is a trace file and about ten black and white and color images showing its working stages.

The calibration pattern is a wall-mounted square array of dark spots on a light background (figure 1), with one additional spot at a half-grid position at the center, serving as an absolute position reference. We have used dark magnetic spots on a steel whiteboard, and also a 2 meter by 1.5 meter computer printout. The program expects the pattern to be 15 to 35 spot spacings wide, and the spot spacing to be about twice the spot diameter, with wide tolerances. The expectations are defined by program parameters.

The calibration spot input image is assumed to come from a moderately accurate physical setup, with the camera line of sight as perpendicular as possible to the plane of the spot pattern arrayed in front of it, and its pixel axes nearly parallel to the rows and columns of the spot array. The program uses first-order corrections to remove the remaining angular errors in azimuth, elevation and roll, as well as scaling and recentering the result to the desired field of view, after removing radial fish eye distortion and non-unity aspect ratio. To calibrate a pair of stereo cameras, both cameras, held in their stereoscopic mounting configuration, are aimed at the same calibration pattern, as in figure 1. The flatfish program is run for each resulting image, with only the aim point on the calibration pattern specified differently for the two calibrations. The aim point for each camera defines the rectified view direction, and the separation of the two aim points on the pattern should be specified equal to the physical separation of the cameras.


FIGURE 1: A stereo camera assembly being calibrated.


"Flatfish" begins by applying a spot operator to its image. At each pixel position this operator weighs hypotheses that there is a spot of a range of radii centered at the pixel. For each radius hypothesis, the operator calculates the means and variances of the intensities of the spot interior and an annular surrounding region. The score for the hypothesis is the outer minus the inner mean, minus fractions of the variances. The exact variance fractions control the program's tolerance for non-uniform lighting and glare in the background and spot areas. The "spotness"value for the pixel is the maximum value for all the radius hypotheses. That spotness value and the corresponding radius are recorded for each pixel. The program applies the operator up to the edge of the image by giving zero weight to portions of the spot mask that lie beyond the image boundaries. For a wide range of settings of the spot size range and variance fractions, the operator is positive in small central areas of all spots in a calibration image, and negative elsewhere. It is negative everywhere in most general scenes not containing properly sized spot patterns. It is a very trustworthy operator.

After calculating spotness at every pixel, the program catalogs local maximum spot values and the corresponding spot radius, thus identifying each spot in the image. It finds the central spot in the pattern by noting which is closest to four other spots, relative to their spot sizes. It links the other, non-central, spots into a grid, starting with a spot near the middle of the image, and using a rough idea of the spot spacing based on spot size to repeatedly find horizontally and vertically adjacent neighbors. At each step it uses the size and direction of the previous steps to adjust its rough idea of the local spot spacing and orientation. By this means it tracks the significant scale and orientation distortions found in the edges and corners of wide-angle images. The program assigns [row, column] coordinates to spots. The central spot has coordinates [0,0], the spot to its lower right is [0.5, 0.5], the spot below that is [1.5, 0.5], and so on. Via interpolation, each pixel in the image is then identified both by these "spot coordinates" as well as its original pixel coordinates.

The program locates the optical axis in the image by finding the image location OA (and incidentally also the aspect ratio adjustment) that minimizes the scatter when the radial distance of each spot from OA, measured in pixel coordinates, is plotted against the same radial distance measured in spot coordinates. The shape of the spot versus pixel radius curve characterizes the fish-eye distortion, and is represented in the program by a least-squares best-fit polynomial. With a 90 degree field of view lens, we found that the pixel coordinates of the optimum OA remained stable within a few hundredths of a pixel as the camera was moved to various positions in front of a calibration grid. The root-mean-square deviation of the pixel versus spot radius scatter diagram, when OA was chosen randomly near the center of the image, was about 2.5 pixels. The rms error was reduced to less than 0.5 pixels at the optimum OA (figure 3).
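The sketch below is a simplified stand-in for that scatter measure: instead of fitting a polynomial, it bins the spots on spot-coordinate radius and measures the spread of pixel-coordinate radius within each bin. A caller would minimize this value over candidate OA positions and aspect ratios. All names and the bin width are illustrative assumptions.

#include <math.h>

typedef struct { double px, py;   /* spot center in pixel coordinates      */
                 double sx, sy;   /* spot center in spot (grid) coordinates */ } Spot;

#define NBINS 64

double oa_scatter(const Spot *spots, int n, double ox, double oy,
                  double aspect, double bin_width)
{
    double sum[NBINS] = {0}, sum2[NBINS] = {0};
    int count[NBINS] = {0};
    double total = 0;
    int ntotal = 0, i, b;

    for (i = 0; i < n; i++) {
        double dx = (spots[i].px - ox) * aspect, dy = spots[i].py - oy;
        double rpix  = sqrt(dx * dx + dy * dy);
        double rspot = sqrt(spots[i].sx * spots[i].sx + spots[i].sy * spots[i].sy);
        b = (int)(rspot / bin_width);
        if (b >= NBINS) b = NBINS - 1;
        sum[b] += rpix;  sum2[b] += rpix * rpix;  count[b]++;
    }
    for (b = 0; b < NBINS; b++)
        if (count[b] > 1) {
            double mean = sum[b] / count[b];
            total += sum2[b] - count[b] * mean * mean;   /* within-bin spread */
            ntotal += count[b];
        }
    return ntotal ? sqrt(total / ntotal) : 0.0;          /* rms scatter, pixels */
}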

In another least-squares step, the program then finds and applies the image rotation (roll) that most nearly makes the fish-eye corrected spot pattern line up with the pixel rows and columns. It also shifts and scales the image to bring the requested aim point to the exact center, and to provide exactly the requested field of view across the width of the image raster. The largest source of error is probably in the user's estimate of the distance between the lens focal plane and the test pattern, which affects the true angle of view. We use the lens iris ring as an estimate for the focal plane. The uncertainty diminishes as the pattern and the camera distance are made larger, or the lens is made smaller.

The final result is encoded as parameters for an image rectification program. These parameters are read in by stereo programs, and then expanded into rectification tables, which have, for each pixel in the rectified image, the coordinates of the corresponding source pixel in the raw image. During a mapping run, each image digitized by a particular camera is transformed into a rectified image via the table corresponding to that camera.
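Applying such a table might look like the following sketch, assuming the table stores the nearest raw-image pixel for every rectified pixel (the actual code may interpolate between source pixels; the structure and names here are illustrative).

typedef struct { short srcx, srcy; } RectEntry;

void rectify(const unsigned char *raw, int raw_w, int raw_h,
             unsigned char *out, int out_w, int out_h,
             const RectEntry *table)       /* out_w * out_h entries */
{
    int x, y;
    for (y = 0; y < out_h; y++)
        for (x = 0; x < out_w; x++) {
            const RectEntry *e = &table[y * out_w + x];
            if (e->srcx >= 0 && e->srcx < raw_w && e->srcy >= 0 && e->srcy < raw_h)
                out[y * out_w + x] = raw[e->srcy * raw_w + e->srcx];
            else
                out[y * out_w + x] = 0;    /* outside the raw image: black */
        }
}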


FIGURES 2, 3, 4, and 5: Camera calibration results: "flatfish" applied to images from 60, 90, 30 and 120 degree field of view cameras. The .a figures are the original images, .b show the regions where the spot operator is positive (small light splotches in the center of the black spots), and the coordinate grid resulting from linking the spot operator maxima. The .c and .d graphs are plots of pixel coordinate radius versus spot coordinate radius for all the spots, .c for an arbitrarily chosen center marked by a small cross on a spot near the center in the .b image, .d from the position that minimized the scatter in the graph, marked by a large X. The program assumes the scatter-minimizing center marks the true optical axis. The shape of the curve defined by the scatter diagram is the radial distortion function. The .e image is the original spot image rectified by the function constructed by "flatfish", and overlaid with a grid whose spacing should match the spacing of the spots in the rectified image. Figure 5, derived from an image whose field of view was 120 degrees, was rectified to only 90 degrees of view, to eliminate the vignetted corner black areas.


Stereoscopic Mapping Framework

The image set used in this paper was collected with a pair of Sony XC-999 color cameras, with 6 mm lenses, mounted securely in parallel 15.6 cm apart on an aluminum bar. Twenty stereo pairs of images were collected reasonably carefully by hand, 40 images in total, of one end of an office, in parallel directions from locations on an approximately 50 cm grid on the floor, with the cameras 1.25 meters high, angled 14 degrees down. The scene included portions of walls, two office chairs, two large cabinets and an open door with furniture beyond. One of the cabinets was fully open, and contained a long raincoat and several shelves with small objects. The pictures were originally digitized in color, to a resolution of 768 by 576 pixels, then reduced to grayscale. The resulting images "scan.*.L.pgm" and "scan.*.R.pgm", along with a description of their imaging geometry encoded in a file "run1.nav" and the two camera rectification files mentioned above, "sony60.L.cal" and "sony60.R.cal", were processed by the stereo mapping program, currently named "crayfish", into a 256 by 256 by 64 cell evidence grid representing a volume 6 meters by 6 meters by 2 meters high.

"Crayfish" is invoked with the name of a ".nav" file. Its overall flow is as follows:

begin
   Read ".nav" file
   Read and expand ".calib" files (indicated in .nav file)
   Assign image storage
   Initialize 3D evidence grid and sensor models
   Loop on images indicated in ".nav" file
   {
      Read L and R image
      Apply Interest operator to find distinctive areas in both images
      Loop on windows selected by Interest operator
      {
         Correlate window with possible locations in other image
         Extract probability-weighted peak list from correlation curve
         Loop on correlation peak list
         {
            Calculate 3D location for this peak
            Throw two probability-weighted evidence rays
         }
      }
   }
   Write out 3D evidence grid
end

A program "wrapfish" takes the 3D evidence grid made by "crayfish" and generates viewable images. As of August 1996, it makes X, Y and Z "slice" images showing all grid planes parallel to the coordinate planes, with very high occupancy probability cells shown in white, very low probabilities shown in black, and intermediate probabilities in shades of gray. It also produces three dimensional views of the occupied cells of the grid, and a 3D "Open Inventor" file of the occupied cells that can be viewed interactively on Silicon Graphics workstations. The visual quality of the 3D presentations is greatly enhanced by a manual colorization step that spotlights about a dozen box-shaped regions covering objects like the floor, walls, cabinets and chairs. In each of these boxes, a region-specific distinctive color is given to all the occupied cells, helping human observers of perspective views distinguish the contents from foreground and background.

The "flayfish" program, derived from "crayfish", includes additional code to optimize (learn) good parameters shaping the sensor model that defines the evidence rays. It repeats the "crayfish" outer loop and its immediate initialization with different settings of sensor model parameters looking for combinations the maximize a map quality measure.

"Crayfish" invokes the separately compiled procedures in files "fisheye" and "volsense". "Volsense" contains code for manipulating 3D evidence grids. "Fisheye" has procedures for expanding calibration files into rectification tables, and for applying the rectifications to images. It also holds "interest operator" and "correlator" procedures for stereoscopic processing. Following sections describe "crayfish"'s major components.

Interest Operator

The central image processing step in "crayfish" is the identification of small matching windows in both images of a stereo pair. The position of the matching window in either image defines a 3D ray from the camera, and the relative displacement of the window from one picture to the other defines a 3D position along that ray. The program interprets such matches as evidence for the existence of a visible feature in that 3D location.

Not all locations in an image are suitable for matching. Large areas of blue sky or white wall, for instance, look identical, as do linear regions along simple edges, with no way to distinguish one particular small patch from a nearby one. An "interest operator" is applied to an image to select regions likely to be unambiguously matched in its stereo partner. We invented the concept of the interest operator in 1974, along with the idea of correlation through a coarse-to-fine image pyramid, and used them in the first round of work leading to the present results [Moravec81]. Interest operators and image pyramid correlation are now widely used in the initial registration steps of digital photogrammetry [Hellwich+al94] [Stunz-Knopfle-Roth94]. Robot stereoscopy is much sloppier than photogrammetric stereoscopy because robots, immersed in their scene, encounter visual occlusions, perspective scale and view angle distortions, oblique featureless and specular surfaces, enormous depth ratios and other complications.

After a stereo pair of images is rectified, the program applies an interest operator to the left half of the right camera's image, and to the right half of the left camera's image. Features chosen from those half images are most likely to be present in the other camera's image, while still covering the full field of view. A quirk of the approach is that a narrow stripe the width of the camera separation, running to infinity, is seen in both half images, and thus examined doubly.

The interest operator is a variant of one used in our early research. It subdivides the image into a grid of non-overlapping 8x8 windows, and on each computes the sum of squares of differences of pixels adjacent in each of four directions: horizontal, vertical and right and left diagonals. The raw interest measure for each window is the minimum of these four sums, representing the weakest directional variation. If this weakest direction has some variation, the feature is neither a uniform area nor a simple edge.

Our 1974 interest operator picked only the local maxima of the raw interest measure. The 1996 program applies a high pass filter over the array of interest windows, subtracting the average interest value of the eight surrounding neighbors from the value of each window. Windows are chosen for ranging if they are positive after the high pass operation. The idea in both cases is to cover the image as uniformly as possible, while picking the best possible features in each area. The 1974 program found less than 50 features per stereo set, the 1996 program extracts about 2,500 (figure 6).
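The raw per-window measure might be computed as in the sketch below (an illustration, not the "fisheye" code); the high-pass comparison against the eight neighboring windows would follow as a separate pass over the resulting array of window values.

static double interest_measure(const unsigned char *img, int width,
                               int wx, int wy)   /* window origin in pixels */
{
    /* The caller is assumed to keep windows at least one pixel away from
       the image border, since the differences reach one pixel outside. */
    double sum[4] = {0, 0, 0, 0};     /* horizontal, vertical, two diagonals */
    double m;
    int x, y, k;
    for (y = wy; y < wy + 8; y++)
        for (x = wx; x < wx + 8; x++) {
            int p  = img[y * width + x];
            int dh = p - img[y * width + (x + 1)];          /* horizontal     */
            int dv = p - img[(y + 1) * width + x];          /* vertical       */
            int dd = p - img[(y + 1) * width + (x + 1)];    /* right diagonal */
            int da = p - img[(y + 1) * width + (x - 1)];    /* left diagonal  */
            sum[0] += dh * dh;  sum[1] += dv * dv;
            sum[2] += dd * dd;  sum[3] += da * da;
        }
    m = sum[0];                       /* raw measure: the weakest variation */
    for (k = 1; k < 4; k++)
        if (sum[k] < m) m = sum[k];
    return m;
}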

The nominal displacement of matching features between the image pairs, from perfectly horizontally mounted and aligned stereo cameras, is purely horizontal - the so-called epipolar constraint. If true, it eliminates the need for vertical correlation search, and allows the use of any features with horizontal variation, for instance simple vertical or diagonal edges. The "flatfish" calibration and rectification process is intended to simulate perfect horizontal geometry, but we have observed that lens iris and focus adjustments can cause image shifts on the order of one pixel, spoiling perfect calibrations. Mechanical shocks might have a similar effect. We are able to detect and correct tiny vertical misalignments after calibration by occasionally permitting a small vertical extent in the horizontal correlation searches, and noting the vertical offset of each best match. A histogram of these offsets, computed from the features of any of the 20 stereo pairs in the data set of this report, has a peak at about a 3/4 scanline offset. Shifting one image of each pair by one scanline approximately compensates for the misalignment. After the shift, we treat the images as perfectly aligned.

"Crayfish" exploits the perfect alignment to get more features from the image and to minimize the correlation search. Rather than the full interest operator described above, it uses a version that measures only horizontal variation (figure 7). Unlike the full interest operator, the horizontal-only version strongly registers simple vertical edges, which are very common in indoor scenes.


FIGURES 6 and 7: Interest operators applied to a stereo pair. Each image is the juxtaposition of the left half of a right camera image, and the right half of a left camera image of a stereo pair, a configuration which minimizes out-of-bounds problems in correlation searches. In each 8x8 window, an X indicates a very strong interest measure, a dot indicates a weak positive measure. Figure 6 shows the full interest operator, which requires variation in all directions, used to determine fine vertical alignments of images. Figure 7 shows the horizontal-only interest operator, used to interpret aligned stereo images.


Correlation

The "correlator" locates image patches in one image that correspond to interest operator choices in the other image of a stereo pair. As with interest operators, there are two kinds of correlator in "fisheye", one which searches horizontal and vertical extents, and the other which searches only horizontally. The former is used in occasional image fine alignment steps, the latter is used routinely to interpret stereo scenes.

Omitting complications, described below, the correlator computes a match value between a patch in one image and a corresponding patch around every pixel position in a search area in the other image of a stereo pair. In the existing code, the patches are 7x7 pixel squares. Other sizes also work, and soon we may try circular shapes. The comparison measure is the sum of squares of differences of corresponding pixels, with the window means subtracted:

Sum(((a-Sum(a)/n)-(b-Sum(b)/n))^2) = Sum((a-b)^2) - (Sum(a)-Sum(b))^2/n

with a and b, implicitly subscripted 1 to n, representing the n pixels in two overlayed windows.
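Written out directly, the mean-adjusted measure for two overlaid windows (for example the 49 pixels of two 7x7 patches, flattened into arrays) is just the following; lower values indicate better matches. This is an illustrative transcription of the formula, not the "fisheye" routine.

double match_measure(const unsigned char *a, const unsigned char *b, int n)
{
    double ssd = 0.0, suma = 0.0, sumb = 0.0;
    int i;
    for (i = 0; i < n; i++) {
        double d = (double)a[i] - (double)b[i];
        ssd  += d * d;
        suma += a[i];
        sumb += b[i];
    }
    /* Sum((a-b)^2) - (Sum(a)-Sum(b))^2/n */
    return ssd - (suma - sumb) * (suma - sumb) / n;
}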

The horizontal-only correlator returns a "match curve" array of correlation values, one for each pixel position in its search. Traditional stereo programs simply select the best match in such curves, but our program processes multiple candidate matches. The full-area correlator searches a rectangle in an image, but reports detailed results only along the horizontal direction. For each horizontal position, it searches the vertical range for the best match, and returns that value in the "match curve" array. A parallel array gets the vertical offset that produced that best match.

The known camera separation, position and orientation for each set of rectified images determines the bounds of the correlation search and the interpretation of the results. Each candidate match is interpreted as a possible 3D surface feature at a particular heading and distance. The evidence ray thrower, described in the next section, adds evidence to 3D grids along narrow cones from the camera positions to the surface feature. With 20 image pairs and 2,500 correlation searches per pair, the amount of evidence accumulated in our 4 million cell grid is substantial, allowing for quite sensitive statistical evaluations of small changes in the program. A very simple such measure is the count of the number of positive cells found in the floor plane of the grid, which are produced by weak correlations on the subtle smudges in our uniform gray carpet. About 11,000 floor cells are normally found, but the number drops rapidly with very small deteriorations of the correlator quality. For instance, if the correlation match measure is changed from the means-adjusted version:

Sum((a-b)^2) - (Sum(a)-Sum(b))^2/n

to a simple sum of squares of differences:

Sum((a-b)^2)

the number of floor cells drops to about 4,000. Simply rounding the correlation window means from 16 to 8 bits reduces the floor count from 11,000 to 7,000.

The means adjustment is not uniformly beneficial. It costs computation in the inner loop, and it discards information about the absolute brightness of windows, sometimes matching a white patch with a black one. In an experiment with a wider than usual correlation search, where only best matches were considered, for 20 high contrast windows drawn randomly from the data set, the simple and adjusted measures both made six errors in twenty correlations, but only three in common. In the cases where the simple measure was correct and the adjusted measure wrong, an absolute brightness difference overrode a spurious similarity in variations. In cases where the adjusted was right and the simple wrong, an overall difference in brightness between the pictures overrode a match in subtle contrast. The simple measure fared worse with low contrast windows, making 12 errors in 20 against the adjusted's 10 errors. Still, the simple measure was correct in one low contrast case where the adjusted measure was in error.

Optical and electronic adjustment differences, and automatic gain changes with changing views, ensure that the images from the two cameras will rarely match each other exactly in brightness. Subtracting out the window means usually improves the correlation by removing such differences. Sometimes it removes too much, and allows a very dark patch to match a very light one. We found a solution to this problem that has the added benefit of speeding up the correlation. A comparison of the means of the "a" and "b" windows in a correlation can be used as a preliminary discriminant: too different, and a match between the windows can be rejected without doing the expensive Sum((a-b)^2) calculation. The means can be precomputed for the entire images very efficiently by a sliding sum technique. A preliminary test of the idea showed that about half of the square sums could be eliminated by this approach, while the number of errors in the high contrast example above dropped from 6 to 4 out of the 20.
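The sliding-sum idea might look like the sketch below, which computes the 7x7 window sums for one row band of an image in time proportional to the number of pixels; the layout and names are assumptions, not the actual code.

/* Sliding sums for 7-pixel-wide windows over a 7-row band of the image,
   so window means are available before any correlation is attempted.
   Assumes the image is no more than 4096 pixels wide. */
void window_sums_7(const unsigned char *img, int width,
                   int row,            /* top row of the 7-row band          */
                   long *sums)         /* width-6 sums, one per window start */
{
    long colsum[4096];
    long s = 0;
    int x, y;
    for (x = 0; x < width; x++) {      /* per-column sums of the 7-row band  */
        colsum[x] = 0;
        for (y = row; y < row + 7; y++)
            colsum[x] += img[y * width + x];
    }
    for (x = 0; x < 7; x++)            /* the first 7x7 window               */
        s += colsum[x];
    sums[0] = s;
    for (x = 7; x < width; x++) {      /* slide: add a column, drop a column */
        s += colsum[x] - colsum[x - 7];
        sums[x - 6] = s;
    }
}

With such sums precomputed for both images, the preliminary test simply compares the two window sums and skips the full squared-difference calculation when they differ by more than a chosen threshold (the searches illustrated in figures 8 and 9 use about 10% of the total brightness range).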

The program extracts multiple peaks from the remaining correlation curve, each representing a hypothesis of distance to the feature, and translates the match value of each peak into a relative probability using the formula

p[relative] = exp(-sqrt((Sum((a-b)^2) - (Sum(a)-Sum(b))^2/n)/n)/Cscale)

Cscale is an adjustable constant, which will probably be optimized by a future learning process that will also adjust other program parameters. It is in units of grayscale steps, and values of 0.5 to 2.0 seem to give good results. Each p[relative] is divided by the sum of p[relative] over all the peaks, to give a normalized probability. The normalized probabilities are used to weight evidence rays thrown for each distance hypothesis. For the examples in this report, the program generated rays for a maximum of four hypotheses per correlation search. As Cscale is made much smaller than 1, the probabilities for weaker correlations diminish relative to the strongest, eventually leaving only one peak per correlation.
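The weighting and normalization just described amount to something like the following sketch; the cap of four hypotheses follows the text, while the function layout and names are illustrative.

#include <math.h>

int weight_peaks(const double *match,  /* adjusted match value for each peak  */
                 int npeaks,
                 int n,                /* pixels per window, e.g. 49 for 7x7  */
                 double cscale,
                 double *prob)         /* output: normalized weights          */
{
    double total = 0.0;
    int i, kept = npeaks < 4 ? npeaks : 4;
    for (i = 0; i < kept; i++) {
        prob[i] = exp(-sqrt(match[i] / n) / cscale);   /* p[relative]         */
        total += prob[i];
    }
    for (i = 0; i < kept; i++)
        prob[i] /= total;              /* weights for the evidence ray pairs  */
    return kept;
}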

The above scheme was derived from the following reasoning. Good information about correlation trustworthiness seems to be contained in the array of match values produced by a correlation search. All but (at most) one of the match values are simply from samples of the source window compared against unrelated patches of the search image. A histogram of these values looks like a probability distribution with an exponential tail tapering down towards a perfect match. A distribution fitted to the histogram, then integrated over matches better than a given one, estimates the probability that the given match is simply a chance coincidence. One minus that probability is the probability that the match is not random, i.e. correct. Our exp(-match) function for weighting evidence rays approximates this latter probability.


FIGURES 8 and 9: The horizontal-only correlator at work. The large pictures show the images being searched. In each, the small square inset, from the other image of the stereo pair, contains the window being searched for: it is the tiny square at the center of the cross of bounding lines. Two vertical lines running the height of the picture mark the horizontal bounds of the search. The narrow horizontal band between these lines is the track of the search window. The graph at the top of the image between these lines is the match value sqrt((Sum((a-b)^2) - (Sum(a)-Sum(b))^2/n)/n) for each horizontal position, with zero, the best possible match, at the top of the image. Gaps in this curve are locations where the match value was not calculated because the window means, Sum(a)/n and Sum(b)/n, differed by more than about 10% of the total brightness range. The bar graph at the right of the image is a histogram of the match values in the curve. The graph at the bottom of the image shows the probability assigned to peaks in the correlation curve by an exponential model. Above a threshold indicated by the dashed line, each peak results in two rays of evidence, one from each camera position, weighted by the probability. Figure 8 shows an unambiguous match where one correlation peak dominates. In figure 9, the black edge of the chair finds close matches in at least four locations, even though half the search area is eliminated because of overly disparate brightness.


Evidence Ray Throwing

The heart of "crayfish" is the evidence ray code. In 1990 we noted that efficient two dimensional evidence grid programs were able to insert hundreds of thirty degree sonar beams per second into 2D grids sized 64 cells by 64 cells using 1 to 10 MIPS of computation, fast enough for real time use on robots of the day. Three dimensional grids promised a much richer world map, with not only an extra dimension, but higher resolution. Horizontal 2D maps conflate the surroundings at different heights, failing to distinguish projecting handles or overhanging tabletops from basic surfaces, for instance. For this reason, in typical environments, little is gained by choosing 2D grid resolutions better than about 10 cm. The world is consistent in 3D, however, and could usefully be mapped at 1 cm resolution. A high resolution 3D grid, covering an area similar to our several-thousand-cell 2D grids, would contain several million cells. With so many occupancy values to determine, the grid would also be hungry for proportionally more sensor data. This reasoning suggested that 3D grids would be a thousand times as computationally demanding as their 2D counterparts. The author started a 1992 sabbatical year at supercomputer manufacturer Thinking Machines Corporation with this expectation, intending to gain some early experience with 3D grids using a CM-5 supercomputer. A series of innovations and approximations, implemented with efficient representations and coding in an intense seven-month programming effort, resulted instead in a program "volsense" that was about a hundred times faster than anticipated, sufficiently fast for a conventional computer. Without such speed, it would have been impossible to demonstrate "crayfish" in 1996.

The focal point of "volsense" is a procedure to add precomputed evidence functions representing single sensor readings, collectively called a "sensor model", to 3D map grids. Each element of a sensor model is a spatial evidence pattern representing the occupancy information implied by a possible reading, for instance a particular range from a sonar ping or stereoscopic triangulation. The evidence accumulation operation is a simple integer addition, representing a Bayesian update, with quantities interpreted as probabilities in log-odds form, i.e. log(p/(1-p)).

In original conception, the sensor models for 3D grids would themselves be smaller 3D grids, which would be positioned and rotated relative to the map grid, corresponding to the position and orientation of the physical sensor. A first innovation noted that almost every sensor we considered, including sonar, stereo, or laser rangefinders and various proximity and touch sensors, could be approximately represented by evidence patterns symmetric about a view axis. The symmetry allows sensor models to be simple 2D grids, with one dimension radius "r" from the axis of view, the other distance "d" along it. In use, the "rd" planes are swept about the axis into cylinders as they are added to the map grid. A 3D grid map can be seen as a series of "xy" planes layered in the "z" direction. An "rd" cylinder intersects successive "xy" planes in a series of ellipses. The effective "z" direction can be chosen from among the three grid axes to minimize the eccentricity of these ellipses.

If the "xy" coordinates of each plane are shifted to put its origin at the center of its ellipse, the mapping between "xy" and "rd" coordinates on successive planes becomes very regular. Each particular "xy" has identical "r" on successive planes, and its "d" simply increases by a constant from one plane to the next. This allows an "xy"->"rd" addressing ellipse to be precomputed for one plane, and repeatedly reused to fill the cylinder. With all coordinates precombined into single address words in a table representing this ellipse, the inner evidence accumulation loop requires only ten single-cycle operations, mostly integer additions, per cell updated. A straightforward approach to combining 3D grids through arbitrary rotations is easily 10 times as expensive.

A factor of 4 efficiency came from considering only the cells that change the map, typically a cone radiating from the sensor. A cone has one quarter the volume of its bounding box. At the time "rd" sensor models are generated, the sensor model builder also stores a profile of the maximum "r" of significant data for each "d" value. For each ray cast, the slice precalculation puts the "xy"->"rd" addressing table into sorted "r" order. The mapping geometry assures each "xy" slice is updated from a V-shaped wedge in the "rd" array. For each slice, the evidence accumulation inner loop terminates when the "r" value in the addressing table reaches the profile's maximum "r" in the current wedge.

Another factor of 2.5 speedup was obtained for "free" from optimization level O3 in the 1992 vintage GnuCC compiler. This was far better than the optimizations provided by earlier compilers of 2D grid programs. Additional efficiencies occur in the addressing slice precomputation. The "xy"->"rd" table is put into "r" order by a linear-time bucket sort. The program exploits a four-way skew symmetry in the intersection ellipse, and gets advantages from various coding and incremental computation techniques. For the special case of "thin-rays" never more than one grid cell wide, the program employs much simpler code, about twice as fast as the general "fat-ray" method described above.

In 1992, on a 25 MIPS Sparc-2 workstation, the program was able to throw about 200 wide-beam sonar, or 4000 narrow-beam stereo-vision rays per second through a 128x128x128 world. In 1996, on a 100 MIPS Sparc-20, the speeds are four times as high.

Constructing Stereo Rays

In 1992 "volsense" contained a procedure for constructing sensor models inspired by our sonar experiences. A sensor model was shaped by 15 parameters, designed to be adjusted by a learning procedure, controlling evidence cone angle, depth of empty interior, height and thickness of occupied range surface, and how these values changed with distance.

Stereoscopic ranging was a poor fit to the generic model. Stereo distances and uncertainties are well defined by the geometry of the stereo triangulation. A procedure for constructing stereo-specific sensor models was added to "volsense". It contains one evidence pattern for each possible stereo disparity. The beam angle can be varied from that representing a single image pixel to the width of a correlation window. The examples in this paper were generated with the latter setting. The depth range uncertainty is geometrically derived from a single horizontal pixel of correlation uncertainty. The evidence rays have a hand-chosen negative occupancy evidence from the imaging position to the beginning of the range uncertainty, and a much stronger positive value along the region of range uncertainty, diluted by the volume of uncertainty. These parameters and many others will be grist for a future learning procedure, which will vary them to produce better 3D grid maps. Figure 10 shows a graphic representation of a stereoscopic sensor model.
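The triangulation geometry behind these models reduces to the sketch below: for a given integer disparity, the nominal depth and the depth interval implied by plus or minus half a pixel of correlation uncertainty. The example baseline matches the camera fixture described earlier; the function itself is an illustrative helper, not "volsense" code.

typedef struct { double z, z_near, z_far; } DepthRange;

DepthRange disparity_to_range(double baseline_m,    /* e.g. 0.156 m for this fixture */
                              double focal_pixels,  /* lens focal length, in pixels  */
                              int disparity)        /* in pixels, at least 1         */
{
    DepthRange r;
    r.z      = baseline_m * focal_pixels / disparity;
    r.z_near = baseline_m * focal_pixels / (disparity + 0.5);
    r.z_far  = baseline_m * focal_pixels / (disparity - 0.5);
    return r;   /* positive evidence is spread over [z_near, z_far],
                   negative evidence from the camera out to z_near */
}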

In use, the program throws two rays for each correlator match peak, one from each camera position, intersecting at their range. Multiple hypotheses derived from a correlation curve are translated into multiple superimposed pairs of rays. The evidence in each pair is weighted by its match probability. This is accomplished efficiently, using precomputed sensor models with diminished evidence values. We typically configure the program to generate 8 or 16 complete sensor models, coarsely representing probabilities 0 to 1.


Figure 10: A picture summarizing a stereoscopic sensor model. Slices of evidence rays, destined to be swept into cylinders, are shown as thin dark vertical lines running downwards from the top of the image, ending in a bright spot. The dark lines are the "empty" parts of the rays, the bright spots are the "occupied" range. There is one sensor model for each pixel of possible stereo disparity. Large disparities, representing short distances (about a meter) are on the left. Small disparities, representing large distances, are on the right, truncated at a distance the program calculates will always be beyond the grid boundaries. Rays in the left of the image have a radius of one pixel. Two bands are seen on the right, with beams two pixels and three pixels in radius. The latter will be swept into cylinders about five grid cells in diameter. The gray banding effects are caused by the conical shape of the rays. The radius two and three rays are "contaminated" by zero evidence regions outside their narrow starting points, which are gradually eliminated as the ray expands along its length. Zero evidence is represented by a probability of 1/2, which is rendered as a gray value half way between black "empty" and white "occupied", and so lightens the early parts of these rays.


Program Speed

On a 100 MIPS Sparc 20, a typical run of "crayfish" reports the following average timings, per stereo pair processed, for the program's major steps. Each pair generates about 2,500 depth values, each producing two evidence rays.

        Rectification of two images:   0.2 seconds
        Interest Operator:             0.1 seconds
        Correlator:                    0.9 seconds
        Ray Throw:              0.7 to 2.5 seconds

The Ray Throw time depends on the setting of Cscale, which controls how many rays are thrown for each correlation, from a minimum of one pair to a maximum of four pairs.

The Office Scan

"Crayfish" was developed with a data set collected with a pair of Sony XC-999 color cameras, with 6 mm lenses, mounted securely in parallel 15.6 cm apart on a tripod-mounted aluminum bar. Twenty stereo pairs of images were collected reasonably carefully by hand of one end of an office, in parallel directions, from locations on an approximately 50 cm grid on the floor, with the cameras 1.25 meters high, angled 14 degrees down. Figure 11 is from the back of the data set, and encompasses most of the scene. Figure 12 shows the placement of the cameras in the room.


Figure 11: An overview of the room first imaged by "crayfish". Twenty pairs of images were processed, from the view position of this illustration and forward, stopping at the location of the chairs.



                      =============================================    Y=0
                      H               H
                      H    cabinet    H           >=
                      H               H           >=
                      H ===============
                      H   //'               >=    >=                     1
                      H  //                 >=    >=
  tall                H //       chair
  table               O//                   >=    >=
                                            >=    >=

                                      >=    >=    >=    >=    >=    >=   2
                                      >=    >=    >=    >=    >=    >=
                      H           
                      H======o=====         >=    >=    >=    >=    >=
                      H open:               >=    >=    >=    >=    >=
                      H ____:     
                      H cab-I     chair     >=    >=    >=    >=         3
                      H inetI               >=    >=    >=    >=       meters
                      H =====o=====
                      ====================================================

          -1        X=0          1          2          3          4 meters

Figure 12: A schematic overview of the room mapped in the first runs of "crayfish". Camera positions are marked with the symbol ">=" representing a camera looking to the left. The image in figure 11 was taken by the camera just left of the Y = 2 meter mark.


"Wrapfish" produced figures 13, 14, 15 and 16, from the 3D evidence grid output by a "crayfish" run over the room images. Figures 13 to 15 show slices through the evidence grid. Black means "empty", white signals "occupied", while middle gray indicates no evidence. Each image consists of a number of black-bordered frames representing successive slices through scene, in the sequence left to right and top to bottom. Tick marks around the frames represent 1 meter distances in the scene.

Figures 13 and 14 are horizontal slices in the same orientation as figure 12. Each frame maps 3 1/8 vertical centimeters. Figure 13 covers 0 to 19 cm high. The first three slices in Figure 13 are covered with a dense mat of white spots representing the floor. The three or four prominent parallel lines on the left come from correlations on grooves in the tiled floor outside the office door. There are no corresponding perpendicular traces because the interest operator rejects features with only horizontal variation as unsuitable for matching by horizontally separated cameras. The two five-pointed stars seen in the latter three frames are the wheeled bases of the two chairs. Other prominent features are the open cabinet door, the base of a wall at the bottom of the images and the black swath of "emptiness" radiating out of the office door. Small but strong features include the office door frame, the rumpled raincoat in the open cabinet, and the base of the tall table. Figure 14 contains slices from 38 to 56 cm above the floor, and shows the seats and some of the backs of the chairs.

Figure 15 shows vertical slices 2.3 cm thick, parallel to the wall containing the door, from 15 cm outside to 35 cm inside the office. Prominent features are the door frame, the portion of wall with the light switch, the raincoat, and cabinet shelves with several small items. The raincoat's many wrinkles and shadows made it an especially effective subject for the interest operator and correlator.

Figure 16 is a 2D projection of the 90,000 "occupied" cells of the 3D grid. Much information is left out in this image, including the relative strengths of the occupied cells, all the contents of the 1.5 million "empty" and 2.5 million "unknown" cells, as well as an entire dimension of separation. Some depth cues are restored by "spotlighting" 3D volumes in distinctive colors. All the occupied cells in about a dozen box-shaped volumes have been given box-specific colors. One box colors the floor layers cyan. Each of the chairs is enclosed in a box of black color, the door frame is in a magenta box, the coat is encased in red, and so on. Occupied cells outside the major colored areas, probably the result of correlation errors, are colored yellow, and give the image a noisy background. The result lacks the precision and details of the slice images, but does a much better job of presenting the totality of the grid.

"Wrapfish" produces an even more dramatic display, in an "Open Inventor" 3D data file that can be viewed on Silicon Graphics workstations. A scene similar to Figure 16 results, that can be interactively rotated in 3D to give a much better sense of the volume relationships.


Figure 13: Horizontal slices representing the first 19 cm above the office floor, oriented as in figure 12. The carpet, grooves in tiles, the base of the chairs, a cabinet door, a long wall, the office door frame, a raincoat, the base of the tall table and other features are discernible.

Figure 14: Horizontal slices 38 to 56 cm above the floor. The seats, backs and arm rests of the chairs make an appearance.

Figure 15: Vertical slices moving towards the camera positions, from just outside the office door to just inside. The door frame, the raincoat and the cabinet shelves and their contents are visible.

Figure 16: A 2D projection of the 90,000 occupied cells in the 4 million cell 3D evidence grid of the office. About a dozen box-shaped regions have been "spotlighted" in color for clarity.



Work in progress: Learning 3D Sensor Models

"Crayfish" gives encouraging results in its first runs, with correlation and sensor models hand-chosen with modest thought and no experience. It can surely be improved, for instance in background noise level and object surface definition, with a better model. A learning process, optimizing a handful of parameters, made a dramatic improvement in a sonar/2D grid experiment [Moravec-Blackwell93], and we hope for similar results in the new 3D context.

Automatic learning requires a criterion to optimize. In the sonar/2D example we were able to hand-construct an "ideal map", or ground-truth model, against which reconstructed grids were compared. The 2D ideal, representing only a few walls and door at low resolution, was simple. The same approach is impractical in our high resolution 3D scene, which contains thousands of tiny details at the scale of the grid. Ground truth could perhaps have been derived from a high quality scanning laser rangefinder, but none was available while our scene existed, nor is it now.

Several weeks of experimentation failed to produce a good learning criterion among measures that compute various kinds of local solidity in the grid data. Some drove the grid to zero density, others filled it with noise. Some had parameters that could be adjusted between these extremes, or to achieve some specified average density, but none were convincingly measures of the correct answer. We considered hand-massaging the best 3D grids encountered to stand in for ground truth, but a more broadly useful approach has suggested itself.

There is a kind of ground truth in the original images of the scene, albeit in 2D projection and with surface color. The grid, on the other hand, is 3D, and has no color. It is easy to project the grid's 3D cells into 2D pictures (figure 17), recreating the geometry of the original images, but without color the grid projections convey little information. But original colors for the grid can also be obtained from the images!

We plan to evaluate a newly constructed 3D grid by dividing the images from which it was derived into a "coloring" set and an "evaluation" set. In the 20-pair office data, images from more forward positions are probably best for coloring, and those looking from the back for evaluation.

We will scan the 3D grid from far to near and make a list of positive ("occupied") cells in scan order. Typically there are a manageable 90,000 occupied cells out of the 4 million total cells. Each element of this list will get the grid coordinates of the cell, and also an initially blank "cell color".

The cells in the list will be rendered into the viewing geometry of each of the "coloring" images, producing corresponding synthetic images, with each pixel indicating the identity of the closest occupied cell visible at that position, or nothing if no cell projects there. The program will scan the synthetic images, and, at each non-empty pixel, average the color of the corresponding location in its "coloring" image into the "cell color" variable of the indicated cell. When done, each cell in the "occupied" list will have an associated color, averaged over all the images in which it was visible.
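Under some simplifying assumptions (cell centers rather than whole cells, a plain pinhole model, cells already transformed into the coloring camera's coordinates), the colorization pass might look like the sketch below; all names are illustrative.

typedef struct { double x, y, z;              /* cell center, camera coordinates */
                 double color_sum; int color_n; } OccCell;

void colorize(OccCell *cells, int ncells,
              const unsigned char *image, int w, int h,  /* w*h <= 768*576 assumed */
              double focal_pixels)
{
    static int    owner[768 * 576];           /* closest cell seen at each pixel */
    static double depth[768 * 576];
    int i, p;
    for (p = 0; p < w * h; p++) { owner[p] = -1; depth[p] = 1e30; }

    for (i = 0; i < ncells; i++) {            /* z-buffered projection of centers */
        int u, v;
        if (cells[i].z <= 0.0) continue;      /* behind the camera */
        u = (int)(w / 2 + focal_pixels * cells[i].x / cells[i].z);
        v = (int)(h / 2 + focal_pixels * cells[i].y / cells[i].z);
        if (u < 0 || u >= w || v < 0 || v >= h) continue;
        if (cells[i].z < depth[v * w + u]) { depth[v * w + u] = cells[i].z;
                                             owner[v * w + u] = i; }
    }
    for (p = 0; p < w * h; p++)               /* average image gray into the cell */
        if (owner[p] >= 0) {
            cells[owner[p]].color_sum += image[p];
            cells[owner[p]].color_n   += 1;
        }
}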

Then the program will project the newly colorized grid into the geometry of the "evaluation" images, and compute similarity. The similarity test is like the colorization, but instead of averaging the original image color with the cell color, the two are compared.

The evaluation is very similar to a human test of a reconstructed grid: it asks the question, "does it look right?". The approach measures interesting qualities using information in the source images that was not used in constructing the original grid. For the initial implementation, we will probably use monochrome images, averaged down to half size, to do the (grayscale) coloring and evaluation. Color images also exist, and we could use those later. A very interesting side benefit of the colorization process is a grid with the original scene colors, which should be exciting to view.


Figure 17: An original image and a synthetic image with the same imaging geometry. The synthetic scene colors were chosen by hand. When the learning program is complete, the colors will be natural, derived from other images of the scene.


Things to Twiddle

The 3D sensor-grid's many unexplored directions will surely provide years of amusement. A working learning process and a sensitive map quality criterion will provide the opportunity to optimize all parts of the process.

Many variations in the basic stereopsis suggest themselves. Would it be better to preprocess the image for features like edges? Should the correlation window be round, to maximize the number of pixels for statistical significance while minimizing the distance from the central location? Should the windows be taller than wide, because a horizontal change in viewpoint causes more horizontal than vertical image variation? Should they vary in size? Would pre-adjusting the brightness and contrast of the images on the basis of the strong correlations improve the weak correlations? How would this interact with the brightness-difference cutoff threshold in the correlation search? Could the minimum range considered by a correlation search for weak features be modified from image to image, or even region to region, on the basis of strong correlations from the same area? The search costs and error rates are minimized if the search is narrowed as much as possible. How much improvement can be achieved by using other interest operators? Should they return more or fewer points? What about different comparison measures? How should evidence rays be weighted as a function of correlation quality? And many more.

Even more fundamental to the grid function is the sensor model, the shapes and weights of the evidence rays. The existing program uses a very simple, ad hoc sensor model for stereo evidence. An evidence ray, which is often a line, but can become a cone a few cells in diameter, consists of a hand-picked negative value from the camera to the region of stereo range uncertainty, followed by a positive value covering the length of the uncertainty. The positive value is another hand-chosen constant divided by the size of the positive volume. The weights are scaled by the correlation probability. The evidence grid approach derives trustworthy conclusions from very noisy data by accumulating large amounts of it for each spatial location. The accumulation process is compromised if the individual evidences have systematic biases. A good sensor model is a bias-free representation of the average true information from each kind of measurement. A learning process that adjusts angles, weights, sizes, shapes, dependence on distance and correlation and many other subtleties should converge on such a sensor model. Our simple hand-picked model is unlikely to be very close.

For some purposes, including producing synthetic images, it is necessary to classify evidence values into empty, occupied and unknown. The threshold values for these classifications could be optimized during learning.
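
A sketch of that classification, with the two thresholds left as the parameters a learning process would tune; the names are stand-ins.

    /* Sketch: classify a cell's accumulated evidence into empty, unknown or
       occupied using two thresholds, which a learning process could tune. */

    enum cell_class { CELL_EMPTY, CELL_UNKNOWN, CELL_OCCUPIED };

    enum cell_class classify(double evidence,
                             double empty_thresh, double occupied_thresh)
    {
        if (evidence <= empty_thresh)    return CELL_EMPTY;
        if (evidence >= occupied_thresh) return CELL_OCCUPIED;
        return CELL_UNKNOWN;
    }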

Some important improvements are beyond the reach of simple learning. This report's data set was collected by hand, with tripod-mounted cameras. The camera positions varied at least a fraction of a centimeter from nominal, and the pan and tilt angles may have varied by more than a degree. The latter errors, especially, mean measurements from different views were not perfectly registered. Images from moving robots are likely to have even larger position and orientation uncertainties. We and others [Yamauchi96], [Schultz-Adams-Grefenstette96] have had good success in the past registering sonar/2D grids of the same area, made from viewpoints whose relationship is known only approximately, by searching for the best map match over a range of relative positions and orientations. The large number of cells in the grids allows registration accurate to a fraction of a cell, by interpolation of the discrete matches. This works even when one of the grids is built with only a few dozen sonar readings from a single robot position. The statistically stabilizing multitude of cells in the 3D grids assures us that the approach will work even better there. The only question is speed: our grid has four million cells, and a search over large ranges of 6 degrees of freedom could easily take trillions of computations! Fortunately there are many strategies for enormously reducing the cost. For a mobile robot on a flat floor, only 3 degrees of freedom vary very much, so the search extent in the other three can be very small. Even the unconstrained position and angle searches can be kept small by using dead-reckoning information. The search can be reduced almost to its logarithm by a coarse-to-fine strategy, where approximate answers are computed first at the small end of a pyramid of shrunken grids. Coarse-to-fine has worked very well for us in the past, in unconstrained stereoscopic correlation and in matching 2D grids. Well-chosen samples could substitute for the whole: for instance, the unknown areas of a grid could be skipped, or the occupied cells of one grid could be matched against the entirety of the other. A branch and bound strategy might cut short some of the comparisons, when their accumulating partial errors exceed the best previous comparison, a method that works best if the most likely registrations are searched first.
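
The sketch below shows the coarse-to-fine structure in its simplest form: translation only, in two dimensions, with a sum-of-products match score, searched first on the smallest grid of a pyramid of 2x-shrunken copies and then refined one level at a time. The real search adds rotation and a third dimension; all names and the score are assumptions for illustration.

    #include <float.h>

    /* Match score between two same-sized 2D grids of evidence values, with
       grid b offset by (dx, dy): sum over overlapping cells of the product
       of evidences, so agreeing occupied (+,+) and agreeing empty (-,-)
       cells both raise the score. */
    static double match_score(const double *a, const double *b,
                              int size, int dx, int dy)
    {
        double score = 0.0;
        for (int y = 0; y < size; y++)
            for (int x = 0; x < size; x++) {
                int xb = x + dx, yb = y + dy;
                if (xb < 0 || yb < 0 || xb >= size || yb >= size)
                    continue;
                score += a[y * size + x] * b[yb * size + xb];
            }
        return score;
    }

    /* Search +-range cells around the current best offset, updating it. */
    static void refine(const double *a, const double *b, int size,
                       int range, int *best_dx, int *best_dy)
    {
        double best = -DBL_MAX;
        int cx = *best_dx, cy = *best_dy;
        for (int dy = cy - range; dy <= cy + range; dy++)
            for (int dx = cx - range; dx <= cx + range; dx++) {
                double s = match_score(a, b, size, dx, dy);
                if (s > best) { best = s; *best_dx = dx; *best_dy = dy; }
            }
    }

    /* Coarse-to-fine driver: pyr_a[0] and pyr_b[0] are the coarsest level,
       of side base_size; each later level doubles the side.  A wide search
       is done only at the coarsest level; finer levels just double the
       offset found below them and look one cell around it. */
    void register_grids(const double *const *pyr_a, const double *const *pyr_b,
                        int levels, int base_size, int *dx, int *dy)
    {
        *dx = *dy = 0;
        refine(pyr_a[0], pyr_b[0], base_size, base_size / 2, dx, dy);
        for (int l = 1; l < levels; l++) {
            *dx *= 2;
            *dy *= 2;
            refine(pyr_a[l], pyr_b[l], base_size << l, 1, dx, dy);
        }
    }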

We hope to implement an image-pair to scene registration step soon after the learning code is working. It is likely to tighten up the office scene, and will prepare the way for stereo/3D grid use on real robots.

Hardware Helpers

Quirks in our office scene reconstruction suggest some easy hardware alterations likely to significantly improve performance. The dense coverage the correlator obtains on the slightly textured carpet, and more heavily textured raincoat, compared to its sparseness on plain surfaces like doors and walls, suggests that coverage would be greatly increased by the artificial addition of texture across the scene. This could be accomplished simply by mounting a randomly textured floodlight with the cameras, perhaps infrared to minimize effects on humans.

More speculatively, wide angle sonar could provide the range to the nearest feature in the scene, to narrow the correlation search to that range and beyond, boosting speed and accuracy. A preliminary stereoscopic preprocessing of strong features might give similar information without extra hardware.

Precise image rectification allows the correlation search to be exclusively in the direction of the horizontal axis joining the two cameras. The points to be searched for, selected by an interest operator, thus need have only horizontal variation. As a result, the program picks out a dense coverage of features along vertical wall and furniture boundaries, but entirely misses the horizontal boundaries. This problematic blindness could be eliminated by adding a third camera to the stereo pair, vertically separated from the original two. The trio could be used in pairs, with points selected by interest operators specific to each pair's separation direction.
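
One simple form such a direction-specific operator might take is sketched below: it sums squared differences of neighboring pixels along the baseline axis within a small window, so the horizontal pair selects horizontally varying points and a vertically separated pair would select vertically varying ones. This is illustrative, not the interest operator used in the program.

    /* Sketch of a direction-specific interest measure: sum of squared
       differences of adjacent pixels taken along the baseline direction
       (dx=1, dy=0 for a horizontal pair; dx=0, dy=1 for a vertical pair)
       within a small window around (x, y).  The caller is assumed to keep
       the window inside the image. */

    long directional_interest(const unsigned char *img, int width,
                              int x, int y, int half_win, int dx, int dy)
    {
        long sum = 0;
        for (int wy = -half_win; wy <= half_win; wy++)
            for (int wx = -half_win; wx <= half_win; wx++) {
                int p = img[(y + wy) * width + (x + wx)];
                int q = img[(y + wy + dy) * width + (x + wx + dx)];
                int d = p - q;
                sum += (long)d * d;
            }
        return sum;   /* larger means more distinctive for that baseline */
    }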

The 60 degree field of view cameras, though placed on most of the open floor area of the office, saw only a tiny portion of the left wall and none of the right, and viewed other elements of the scene from a very narrow range of angles. The grid might be much enhanced if objects were viewed from many directions. Wider angle lenses would help, as would panning the cameras from side to side during operation, or having several camera pairs looking in different directions. The latter option is probably best, as manufacturers begin to market high quality, inexpensive single-chip cameras with integrated optics, for applications like hand-held videoconferencing. Simply processing more images would also help, which will surely happen when the cameras are used on a mobile robot that needs new data constantly.

Before the fast ray-throwing code was written, speed seemed the bottleneck for 3D grids. Now "crayfish" is quite speedy, but encounters memory limits. Experiments with existing data and small grid extents (see the appendix) show that scene details improve as grid cell size is reduced, even to below 1/2 centimeter. The speed cost for increased resolution is modest. For a wide range of resolutions, most stereo rays remain one cell in width, and the time to throw rays grows only slightly more than linearly with grid resolution. The image processing steps are unaffected, so total runtime grows perhaps 50 percent per resolution doubling. Memory usage, on the other hand, grows dramatically. Our 256x256x64 grid has four million two-byte cells, consuming about half the program's 16 megabytes. Doubling the resolution balloons the grid to 64 megabytes. Lowering the cell size to 1/2 centimeter would require a gigabyte of grid. Memory costs are dropping, and sizes increasing, with a time constant of about 2 years, so even such numbers should soon be within reach. It's comforting to know that the existing approach will produce even better results in the future, simply through hardware progress.
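
The memory arithmetic is simple enough to spell out; the little program below assumes only the grid dimensions quoted above and two-byte cells.

    #include <stdio.h>

    /* Spell out the memory arithmetic: two-byte cells, and each doubling of
       linear resolution multiplies the cell count, and the memory, by 8. */
    int main(void)
    {
        long nx = 256, ny = 256, nz = 64;       /* current office grid */
        long bytes = nx * ny * nz * 2;          /* two bytes per cell */
        printf("current grid: %ld cells, %ld MB\n",
               nx * ny * nz, bytes >> 20);      /* ~4 million cells, 8 MB */
        printf("doubled resolution: %ld MB\n",
               (bytes * 8) >> 20);              /* 64 MB */
        return 0;
    }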

Future Extensions and Applications

"Crayfish" builds 3D maps from a set of image files, a first step to myriad goals. Its descendents will process real-time images from moving robots. The robots will use their 3D understanding of the surroundings to navigate confidently, to locate and interact with objects, eventually to plan and execute complex tasks involving locomotion and manipulation. To do all that, they will need application programs to extract specific answers from thei 3D grids. Here are few answer extractors on our "to do" list:

Obstacle avoidance

Our 3D grids contain over 100 times as much information about robot surroundings as the 2D maps or sparse 3D models derived by previous mobile robot programs. Since robots can be made to maneuver reasonably well with the sparser information, we have very high expectations for 3D-grid based local navigation. In particular, corners, horizontal projections and inconveniently placed small obstacles, which can be missed or misinterpreted by the simpler approaches, sometimes with serious consequences, appear clearly and unambiguously, in shape and position, in a 3D grid, ready to be noticed by a path-planning program. Finding paths in a 3D grid need not be extremely computationally costly. Since the grid specifically marks empty volumes, the frequent case of an unimpeded path from source to local destination can be quickly detected, by sweeping a 3D grid volume in the shape of the robot along a straight path in the 3D map. The collision probability at each position is the integral of the product of occupancy probabilities of corresponding map and robot cells. This computation can be made very cheap by a coarse-to-fine strategy, which refines the resolution only for those few cells where coarse comparison detects a possible conflict. A full path planner could be constructed by embedding such a measure in an A* search. By doing its work in 3D, the planner could produce paths that scoot the robot under overhangs and squeeze its shape through commensurate gaps.
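
A sketch of the straight-path sweep is given below: a small grid of robot-occupancy probabilities is placed at successive positions along the path, and at each position the sum of products of corresponding map and robot cell probabilities serves as the collision measure. The coarse-to-fine refinement is omitted, and the types and names are assumptions for illustration.

    /* Sketch: collision measure for one placement of the robot volume in
       the map, as a sum of products of occupancy probabilities.  The caller
       steps the placement along a straight path and rejects the path if any
       step exceeds a chosen risk threshold. */

    typedef struct {
        const double *cells;    /* occupancy probabilities, 0..1 */
        int nx, ny, nz;
    } grid3d;

    static double cell(const grid3d *g, int x, int y, int z)
    {
        if (x < 0 || y < 0 || z < 0 || x >= g->nx || y >= g->ny || z >= g->nz)
            return 0.0;         /* outside the map: contribute no risk */
        return g->cells[(z * g->ny + y) * g->nx + x];
    }

    double collision_measure(const grid3d *map, const grid3d *robot,
                             int px, int py, int pz)
    {
        double risk = 0.0;
        for (int z = 0; z < robot->nz; z++)
            for (int y = 0; y < robot->ny; y++)
                for (int x = 0; x < robot->nx; x++)
                    risk += cell(robot, x, y, z)
                          * cell(map, px + x, py + y, pz + z);
        return risk;
    }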

Navigation

Nearly every indoor mobile robot task requires the robot to return to previous locations. Existing commercial delivery and cleaning robots do this with the help of buried wires, wall or floor mounted beacons or markers, or painstaking handmade site-specific maps of walls and openings, whose frequent encounter corrects odometric dead-reckoning. Experimental robots that navigate with autonomously made 2D maps (e.g. [Koenig-Simmons96], [Schultz-Adams-Grefenstette96], [Yamauchi96]) are not quite reliable enough for commercial use, which demands months of operation without navigational failure. Using million-cell 3D maps, with 100,000 surface cells, instead of thousand-cell 2D maps, containing at most a few hundred surface points, should decrease the failure probability enormously. Suppose we want a robot to autonomously repeat routes shown it on a guided tour. During the tour, the robot would build and store a sequence of 3D map "snapshots" of the journey. When on its own, it would match these snapshots against its fresh 3D maps. The relative position and orientation of the best match would tell the robot its location relative to its tour route--to a fraction of a cell, if it interpolates. The matching can be done efficiently using the same coarse-to-fine and selective sampling techniques mentioned in the last section for precisely registering new data to old in a developing map. The 3D grids contain so huge a mass of detail, at large and small scales, that the probability of a good match between two maps of the same area, at other than the correct placement, should be astronomically small. By contrast, with simple line maps or blurry 2D grid maps, the probability of a false match, and consequent serious navigational error, is significant. Enormous reduction in confusion probabilities is a major 3D grid promise.
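
The interpolation to a fraction of a cell can be as simple as fitting a parabola through the match scores at the best whole-cell offset and its two neighbors along each axis; the one-axis sketch below is a standard trick shown only to make the claim concrete, not the program's code.

    /* Sketch: fit a parabola through the match scores at offsets -1, 0 and
       +1 cells along one axis and return the fractional offset of its peak,
       in (-0.5, 0.5).  Applied per axis around the best whole-cell match. */

    double subcell_peak(double s_minus, double s_best, double s_plus)
    {
        double denom = s_minus - 2.0 * s_best + s_plus;
        if (denom >= 0.0)       /* not a proper peak; keep the whole cell */
            return 0.0;
        return 0.5 * (s_minus - s_plus) / denom;
    }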

Object recognition

Volumetric matching, as suggested for navigation, might also be used at smaller scales, and possibly higher resolutions, to detect objects by shape. Grid prototypes of objects can be matched to map areas in the same way maps are matched to each other. At a given resolution, the number of cells in a small object, and thus the reliability of the match, will be lower. Reliability can be improved by increasing the grid resolution, in an approach we may call "fine-to-finer". Stereoscopic data collected for navigation can be used again to map small volumes at sub-centimeter resolution, as shown by the chair example in the appendix. There are complications. Office chairs, for instance, have parts that move with respect to one another. The back, seat and base of the chair could each be sought volumetrically, and their match values combined in a formula, with measures of the relative geometric positions, in a hybrid "chair operator" that could be applied to likely positions in the scene. Scene colors could also weigh in the formula. Experience may suggest a bag of such recognition techniques, which can perhaps be orchestrated by an object description language, to conveniently construct operators for even changeable objects like chameleons. (A similar approach should work for general navigation.) Experience with that may then suggest how to automate the object (and route) description processes.
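
Purely as an illustration of the kind of hybrid operator imagined here, the fragment below combines the volumetric match scores of separately found parts with a penalty on how far their relative positions stray from the prototype's geometry; the linear form and the weighting are placeholders for whatever formula experience, or learning, eventually suggests.

    #include <math.h>

    /* Sketch of a hybrid "chair operator": each part (back, seat, base) is
       sought separately; its best volumetric match score is credited, less
       a penalty proportional to how far it sits from its nominal position
       relative to part 0 in the prototype. */

    typedef struct { double x, y, z, match; } part_hit;

    double chair_operator(const part_hit *parts, const double (*offsets)[3],
                          int nparts, double geometry_weight)
    {
        double score = 0.0;
        for (int i = 0; i < nparts; i++) {
            double dx = (parts[i].x - parts[0].x) - offsets[i][0];
            double dy = (parts[i].y - parts[0].y) - offsets[i][1];
            double dz = (parts[i].z - parts[0].z) - offsets[i][2];
            score += parts[i].match
                   - geometry_weight * sqrt(dx * dx + dy * dy + dz * dz);
        }
        return score;
    }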

Scene Motion

Data for "crayfish" was accumulated over time, with no regard to intervening changes in the scene. If there had been changes, the "before" and "after" states would have been averaged together, equally weighted. At each location, the program accumulates a weight of evidence of occupancy. If a given location has been identified as occupied many separate times, this weight can become very large. If a physical scene rearrangement subsequently empties the location, it will take very many additional observations to erase the weight in the grid, and replace it with a negative one. We could deal with overly persistent memories by a deliberate "forgetting" process. As each update is made, affected cells would have a small proportion of their old evidence trimmed, thus exponentially decaying evidence over the number of sensings of an area, and giving new measurements relatively higher weight. An object moving through a repeatedly sensed area would then show up as a fading streak in the grid, just like the slow-phosphor trail of a moving object on an old-style radar screen. Decay over sensings would leave recently unsensed map areas unchanged. There are reasons to decay those also, but probably at a different rate. If, in its wanderings, the robot steadily accumulates navigational errors, the registration of very old data with new will become uncertain. This could be accounted for by spatially blurring the old data - we have demonstrated this in 2D grids [Elfes87]. All these techniques will become more effective as increasing computer speed allows us to do more frequent sensings. With 500 MIPS, it should be possible to process stereo images about once per second. When computer power and memory several times greater becomes available, it will become possible to take the grid idea into the fourth dimension, with 4D grids representing the local spacetime. Doing so would replace motion blurs with a stack of instantaneous frames of the scene, from which detailed moving object dynamics could be derived. For the next several years, however, we will have our hands full with 3D grids.

Conclusion

Since 1984, two-dimensional grids have proven effective in turning mobile robot sense readings into reliable maps, even when individual readings are very ambiguous or noisy. The grid approach catalogs the surroundings by location, a division independent of individual measurements or their errors, allowing arbitrarily many individual readings to be accumulated before any conclusions are derived. It tames sensor noise with large-number statistics. Most competing approaches are much less tolerant of noise, because they must decide on interpretations, for instance about surfaces, with far less data. The grid approach threatened to require more memory and computation, but its simplicity and regularity allow very efficient implementations. Many experimental mobile robots today are controlled by real-time 2D grid programs.

Two dimensional grids offer, at best, a blurry view of a slice of the world. We considered them a prelude to 3D grids, which could represent the complete surroundings in precise detail--the basis of a high-quality robotic sense of space. Their apparent 1,000 times higher computational cost delayed the effort to try 3D until 1992, when a supercomputer became available. Instead of a supercomputer implementation, we produced a surprisingly fast uniprocessor program for the key 3D evidence-ray throwing step. A complete prototype 3D system, described in this report, was finally completed in 1996.

Though it uses unoptimized ad-hoc sensor models, chosen for ease of implementation, the 3D grid program clearly demonstrates the effectiveness of the approach, and that 500 MIPS would suffice for a real-time version. From stereoscopic range data with 20 to 50 percent errors, it constructs maps that, when imaged, look like dollhouse replicas of the sensed area. The existing program is good enough to feed a highly reliable robot navigator, but we anticipate increasingly better results, as nearly every portion is scrutinized and optimized.

Appendix

The nice results from our initial office image set encouraged us to collect a second, larger, example. At the beginning of August 1996, Christopher Priem collected 59 stereo pairs of images of a busy 8 meter by 8 meter laboratory, with the stereo cameras at a height of 75 cm, aimed horizontally. In part of the data set, the cameras looked from the back of the room towards the front. In a second portion, they looked in a 45 degree diagonal direction; in a third set they looked from the side; and three stereo pairs looked backward from the front. Images from the diagonal and backward-looking sets are shown in Figure 18.

Though the physical area mapped in the second data set was larger than the 6 meter room of the first set, memory size limitations prevented us from using a larger grid. The resolution of the 256x256x64 laboratory grid cells is thus only 3 1/8 cm in all three axes. Figures 19 and 20 show horizontal slices through the resulting grid, at floor and chair-seat heights, with the same encoding as figures 13 and 14. Figure 21 contains vertical slices, as seen from the front of the room looking back, through a chair, two boxes and cabinets. Figure 22 is a colorized projection of the approximately 90,000 occupied cells of the entire grid.

The reduced grid resolution results in a loss of quality in the reconstruction. Memory limitations prevented us from increasing the entire grid size, but we did an experiment with higher resolution, by using all the data to map only a small portion of the lab volume, namely the cubic meter enclosing the chair at the middle front of the room. A 128x128x128 grid for this volume has a cell resolution of 0.78 cm. Figure 23 is a projection of the high-res occupied cells. We hope, in the future, to map larger areas at such resolution.


Figures 18a and 18b: Views of the laboratory imaged in the second "crayfish" test. Fifty nine pairs of images were processed, looking forward, left-forward diagonally (as in 18a), from the left side, and three pairs facing backward (18b).

Figure 19: Horizontal slices representing the first 19 cm above the laboratory floor. The front of the room is to the right, the door is at the bottom.

Figure 20: Horizontal slices 38 to 56 cm above the laboratory floor.

Figure 21: Vertical slices marching towards the front of the room, oriented as if viewed from the front looking backward. Slices of the chair and small box seen in figure 18a are visible, as is a larger box, out of view of figure 18. The prominent vertical features running top to bottom in the first frames are cabinets that can just be made out on the left of figure 18b.

Figure 22: A 2D projection of the 90,000 occupied cells in the 4 million cell 3D evidence grid of the laboratory. About a dozen box-shaped regions have been spotlighted in color for clarity.

Figure 23: A 2D projection of the 10,000 occupied cells in the high resolution grid of the chair.


References

[Moravec81]
Author:
Moravec, Hans P.
Robotics Inst., Carnegie Mellon Univ., Pittsburgh, PA, USA
Title:
Robot rover visual navigation
Series Title:
Computer science. Artificial intelligence ; no. 3
Publisher:
Ann Arbor, Mich. - UMI Research Press, c1981.
Abstract:
The Stanford AI Lab cart is a remotely controlled TV equipped mobile robot. A computer program has driven the cart through cluttered spaces, gaining its knowledge of the world entirely from images broadcast by the onboard TV system. The cart uses several kinds of stereo to locate objects around it in 3D and to deduce its own motion. It plans an obstacle avoiding path to a desired destination on the basis of a model built with this information. The plan changes as the cart perceives new obstacles on its journey. The system is reliable for short runs, but slow. The cart moves one meter every ten to fifteen minutes, in lurches. After rolling a meter it stops, takes some pictures and thinks about them for a long time. Then it plans a new path, executes a little of it, and pauses again. It has successfully driven the cart through several 20 meter courses (each taking about five hours) complex enough to necessitate three or four avoiding swerves. Some weaknesses and possible improvements were suggested by these and other, less successful, runs.
Notes:
Revision of the author's thesis (Ph.D.)--Stanford, 1980.
Includes index.

[Moravec83]
Author:
Moravec, Hans P.;
Robotics Institute, Carnegie Mellon University, PA, USA
Title:
The Stanford Cart and the CMU Rover
Source:
Proceedings of the IEEE, vol. 71, no. 7, July 1983, pp. 872-884.
Abstract:
The Stanford Cart was a remotely controlled TV equipped mobile robot. A computer program was written which drove the Cart through cluttered spaces, gaining its knowledge of the world entirely from images broadcast by an onboard TV system. The CMU Rover is a more capable, and nearly operational, robot being built to develop and extend the Stanford work and to explore new directions. The Cart used several kinds of stereoopsis to locate objects around it in 3D and to deduce its own motion. It planned an obstacle avoiding path to a desired destination on the basis of a model built with this information. The plan changed as the Cart perceived new obstacles on its journey. The system was reliable for short runs, but slow. The Cart moved one meter every ten to fifteen minutes, in lurches. After rolling a meter it stopped, took some pictures and thought about them for a long time. Then it planned a new path, executed a little of it, and paused again. It successfully drove the Cart through several 20 meter courses (each taking about five hours) complex enough to necessitate three or four avoiding swerves; it failed in other trials in revealing ways. The Rover system has been designed with maximum mechanical and control system flexibility to support a wide range of research in perception and control. It features an omnidirectional steering system, a dozen onboard processors for essential real time tasks, and a large remote computer to be helped by a high speed digitizing/data playback unit and a high performance array processor. Distributed high level control software similar in organization to the Hearsay II speech understanding system and the beginnings of a vision library are being readied. By analogy with the evolution of natural intelligence, we believe that incrementally solving the control and perception problems of an autonomous mobile mechanism is one of the best ways of arriving at general artificial intelligence.

[Moravec-Elfes85]
Author:
Moravec, H.P.; Elfes, A.E.;
Robotics Institute, Carnegie Mellon University, PA, USA
Title:
High Resolution Maps from Wide Angle Sonar
Source:
Proceedings of the 1985 IEEE International Conference on
Robotics and Automation, St. Louis, March, 1985, pp 116-121;
Abstract:
We describe the use of multiple wide-angle sonar range measurements to map the surroundings of an autonomous mobile robot. A sonar range reading provides information concerning empty and occupied volumes in a cone (subtending 30 deg in our case) in front of the sensor. The reading is modelled as probability profiles projected onto a rasterized map, where somewhere occupied and everywhere empty areas are represented. Range measurements from multiple points of view (taken from multiple sensors on the robot, and from the same sensors after robot moves) are systematically integrated in the map. Overlapping empty volumes reinforce each other, and serve to condense the range of occupied volumes. The map definition improves as more readings are added. The final map shows regions probably occupied, probably unoccupied, and unknown areas. The method deals effectively with clutter, and can be used for motion planning and for extended landmark recognition. This system has been tested on the Neptune mobile robot at CMU.

[Elfes87]
Author:
Elfes, A.;
Robotics Inst., Carnegie-Mellon Univ., Pittsburgh, PA, USA
Title:
Sonar-based real-world mapping and navigation
Source:
IEEE Journal of Robotics and Automation, vol. RA-3, no. 3, June 1987, pp. 249-265.
Abstract:
A sonar-based mapping and navigation system developed for an autonomous mobile robot operating in unknown and unstructured environments is described. The system uses sonar range data to build a multileveled description of the robot's surroundings. Sonar readings are interpreted using probability profiles to determine empty and occupied areas. Range measurements from multiple points of view are integrated into a sensor-level sonar map, using a robust method that combines the sensor information in such a way as to cope with uncertainties and errors in the data. The sonar mapping procedures have been implemented as part of an autonomous mobile robot navigation system called Dolphin. The major modules of this system are described and related to the various mapping representations used. Results from actual runs are presented, and further research is mentioned. The system is also situated within the wider context of developing an advanced software architecture for autonomous mobile robots.

[Moravec-Blackwell93]
Author:
Moravec, H.P.; Blackwell, M.;
Robotics Inst., Carnegie Mellon Univ., Pittsburgh, PA, USA
Title:
Learning Sensor Models for Evidence Grids
Source:
CMU Robotics Institute 1991 Annual Research Review, 1993, pp.8-15.
Abstract:
Evidence grids (aka. occupancy, probability and certainty grids) are a probabilistic, finite-element representation of robot spatial knowledge. The grids allow the efficient accumulation of small amounts of information from individual sensor readings into increasingly accurate and confident maps. Each sensor measurement is translated, via a sensor model, into a spatial evidence distribution that is added to a grid representing the robot's surroundings. In our first applications of the method, on a mobile robot with a ring of 24 Polaroid sonar transducers autonomously navigating a cluttered room, we constructed the sensor model from a cursory examination of the Polaroid literature. Despite the ad-hoc model, the grid approach worked far better than an older program, a decade in development, that used a geometric model. It successfully navigated cluttered rooms, most hallways, a coal mine and outdoors. The original program failed in a smooth-walled narrow corridor, where most sonar pulses, deflected by mirrorlike walls, indicated overlong ranges. The evidence grid method might be able to slowly accumulate evidence from such data, if only the sensor model accurately represented the modest information contained in a reading, as our ad-hoc model did not. This paper reports on a learning program that finds good models automatically. The sensor model is formulated as a closed form expression shaped by several parameters. The parameters are adjusted, in a hill-climbing process, that maximizes the match between a hand-constructed ideal map and a map built by the model with data from a robot test run in the mapped area. Using this approach with a 9- parameter function a program using several weeks of Sparc1+ workstation search time was able to produce a crisp, correct map of the difficult smooth hallway, from data that produces an unrecognizable splatter when interpreted by our original ad-hoc sensor model.

[Hellwich+al94]
Author:
Hellwich, O.; Heipke, C.; Liang Tang; Ebner, H.; Mayr, W.;
Tech. Univ. Munchen, Germany
Title:
Experiences with automatic relative orientation
Source:
ISPRS Commission III Symposium: Spatial Information from Digital Photogrammetry and Computer Vision, Munich, Germany, 5-9 Sept. 1994.
Proceedings of the SPIE - The International Society for Optical Engineering, vol. 2357, pt. 1, 1994, pp. 370-378.
Abstract:
A report on recent experiences with a new procedure for automatic relative orientation is given. A hierarchical approach provides conjugate points using image pyramids. The implemented method is based on feature extraction by an interest operator, feature matching between the two images, area based image correlation, a robust least squares bundle adjustment and feature tracking through several levels of the image pyramid. The automatic procedure proved to be successful for various combinations of terrain relief surface cover and image scale. The paper presents the results of intensive tests. It was found that only a limited number of control parameters of the algorithm has to be individually adjusted to the terrain specifics. Further investigations need to be conducted to find out whether the project dependent parameters can be reliably predicted for certain classes of images.

[Stunz-Knopfle-Roth94]
Author:
Strunz, G.; Knopfle, W.; Roth, A.; Remote Sensing Data Centre,
German Aerosp. Res. Establ., Oberpfaffenhofen, Germany
Title:
Automation of tie pointing procedure for the geocoding of satellite images
Source:
ISPRS Commission III Symposium: Spatial Information from Digital Photogrammetry and Computer Vision, Munich, Germany, 5-9 Sept. 1994.
Proceedings of the SPIE - The International Society for Optical Engineering, vol. 2357, pt. 2, 1994, pp. 793-800.
Abstract:
The comprehensive utilization of the information content provided by remote sensing satellites requires the multitemporal and multisensoral analysis of the image data. Prior to this analysis, processing is necessary to correct for geometric distortions. To handle the large amount of data and the high data rates, the traditional approach of visual identification of control or tie points is not an acceptable solution. The paper describes two developments at the German Remote Sensing Data Centre, which aim at the automatic generation of tie points for optical sensors and radar images, and the operational utilization of automatic tie pointing for ERS-1 SAR data. A concept for the multitemporal registration of satellite images from optical sensors is described. This approach comprises the detection of distinct points by a so-called interest operator in one image. By cross correlation, the corresponding points in the other images are traced. Subpixel accuracy is achieved by a least squares technique for digital image matching. Practical results are shown, where multitemporal image data from the Landsat TM sensor are used. The paper also describes the operational utilization of an automatic tie pointing which has been realized for the geocoding of the ERS-1 SAR data at the German Processing and Archiving Facility. Significant features are extracted from reference data sets. The procedure is described as well as the accuracy achieved. An outlook is given concerning several automatic approaches being investigated.

[Koenig-Simmons96]
Author:
Koenig, S.; Simmons, R.G.;
Sch. of Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, USA
Title:
Unsupervised learning of probabilistic models for robot navigation
Source:
Proceedings of the 1996 IEEE International Conference on Robotics and
Automation, Minneapolis, MN, USA, 22-28 April 1996, vol. 3, pp. 2301-2308.
Abstract:
Navigation methods for office delivery robots need to take various sources of uncertainty into account in order to get robust performance. In previous work, we developed a reliable navigation technique that uses partially observable Markov models to represent metric, actuator and sensor uncertainties. This paper describes an algorithm that adjusts the probabilities of the initial Markov model by passively observing the robot's interactions with its environment. The learned probabilities more accurately reflect the actual uncertainties in the environment, which ultimately leads to improved navigation performance. The algorithm, an extension of the Baum-Welch algorithm, learns without a teacher and addresses the issues of limited memory and the cost of collecting training data. Empirical results show that the algorithm learns good Markov models with a small amount of training data.

[Yamauchi96]
Author:
Yamauchi, B.;
Inst. for the Study of Learning + Experience, Palo Alto, CA, USA
Title:
Mobile robot localization in dynamic environments using dead
reckoning and evidence grids
Source:
Proceedings of the 1996 IEEE International Conference on Robotics and
Automation, Minneapolis, MN, USA, 22-28 April 1996, vol. 2, pp. 1401-1406.
Abstract:
Dead reckoning provides a simple way to keep track of a mobile robot's location. However, due to slippage between the robot's wheels and the underlying surface, this position estimate accumulates errors over time. In this paper, we introduce a method for correcting dead reckoning errors by matching evidence grids constructed at different times. A hill-climbing algorithm is used to search the space of possible translations and rotations used to transform one grid into the other. The transformation resulting in the best match is used to correct the robot's position estimate. This technique has been tested on a real mobile robot and has demonstrated robustness to transient changes (moving people) and lasting changes (rearranged obstacles) in dynamic environments.

[Schultz-Adams-Grefenstette96]
Author:
Schultz, A.C.; Adams, W.; Grefenstette, J.J.;
NCARAI, Naval Research Laboratory, Washington, DC 20375-5337
Title:
Continuous localization using evidence grids
Source:
NCARAI Report AIC-96-007, Naval Research Laboratory,
Washington, DC. schultz@aic.nrl.navy.mil
Abstract:
The evidence grid representation provides a uniform representation for fusing temporally and spatially distinct sensor readings. However, the use of evidence grids requires that the robot be localized within its environment. Odometry errors typically accumulate over time, making localization estimates degrade, and introducing significant errors into evidence grids as they are built. We have addressed this problem by developing a new method for ``continuous localization'', in which the robot corrects its localization estimates incrementally and on the fly. Assuming the mobile robot has a map of its environment represented as an evidence grid, localization is achieved by building a series of ``local perception grids,'' also represented as evidence grids, based on localized sensor readings and the current odometry, and then registering the local and global grids. The registration produces an offset which is used to correct the odometry. Results are given on the effectiveness of this method, and quantify the improvement of continuous localization over dead reckoning. Further results show that maps built of the room using evidence grids while the robot is performing continuous localization show no appreciable systematic errors due to odometric error; it appears that maps of the room can simultaneously be learned and used for continous localization.

[Martin-Moravec96]
Author:
Martin, M.C.; Moravec, H.P.;
Robotics Inst., Carnegie Mellon Univ., Pittsburgh, PA, USA
Title:
Robot Evidence Grids
Source:
Carnegie Mellon University, Robotics Institute Technical Report
CMU-RI-TR-96-06
Abstract:
The evidence grid representation was formulated at the CMU Mobile Robot Laboratory in 1983 to turn wide angle range measurements from cheap mobile robot-mounted sonar sensors into detailed spatial maps. Whereas most older approaches to mapping considered the location of objects, the grid approach weighs the "objectness" of locations. It accumulates diffuse evidence about the occupancy of a grid of small volumes of nearby space from individual sensor readings into increasingly confident and detailed maps of a robot's surroundings. It worked surprisingly well in first implementation for sonar navigation in cluttered rooms. In the past decade its use has been extended to range measurements from stereoscopic vision and other sensors, sonar in very difficult specular environments, and other contexts. The most dramatic extension yet, from 2D grid maps with thousands of cells to 3D grids with millions, is underway. This paper presents the mathematical and probabilistic framework we now use for evidence grids. It gives the history of the grid representation, and its relation to other spatial modeling approaches. It discusses earlier formulations and their limitations, and documents several extensions. A list of open issues and research topics is then presented, followed by a literature survey.