Stanford Artificial Intelligence Laboratory April 1979
Visual Mapping by a Robot Rover
Hans P. Moravec
COMPUTER SCIENCE DEPARTMENT
Stanford University
Abstract
The Cart is a camera equipped rover, computer controlled over
a radio link. This report describes a working program that has
successfully driven the cart through cluttered real world
environments, deducing the cart's motion and building a map of
its surroundings entirely from what the onboard TV camera saw. The
program builds a sparse 3D map of its surroundings, but makes no
attempt to classify the objects it sees, other than to label them as
obstacles or non-obstacles. The method's success is largely due to
high redundancy and cross checking at several levels. The program is
now being extended into one that will drive the cart to a desired
destination, avoiding any obstacles that are encountered.
Figure 1. The Stanford AI Cart. It is a meter square and a meter and a
half tall. It broadcasts a standard B/W television picture on UHF
channel 23, and is remote controlled over a citizen's band frequency.
Introduction
In 1977 we reported on work whose goal was a visually guided
roving robot (called the Cart) able to go to a desired location in the
real world, avoiding obstacles along the way [1]. The results at the
time were discouraging; a major approach had just been abandoned as
unworkable. Some new hardware and an entirely new program have made the
situation far more cheery this year. The vision portion of the project
is essentially complete and working satisfactorily. The cart has been
driven through a real environment (a large cluttered room) several
times, passively observing the world through its TV camera. During
these trips the control programs constructed a correct three
dimensional map of objects in the room, and of the cart's trajectories
through them. A major key to success was the use of massive
redundancy at several levels of the computation.
Work is now underway on code that will use the map generated
and kept updated by the visual system to plan an obstacle avoiding
path to the desired destination, and drive the cart along it.
Constraints
The research was influenced by a pre-existing vehicle with
fairly crude mechanical and electronic attributes, and a computer
system of medium capability.
The cart (Figure 1) is a self contained, remote controlled,
battery powered electric vehicle. It carries a TV camera and
transmitter, and broadcasts a standard black and white television
image on a UHF channel.
The remote control link evolved over the years from a model
airplane proportional controller, and now transmits six bits
continuously with a time resolution of 30 ms. Two of the bits control
the drive motors, permitting the four states: Off, Forward, Reverse
and Dynamic Braking. Two other bits command the steering motor: Motor
Off, Steer Left, Steer Right, Steer to Straight Ahead. The steering
speed is servoed, albeit crudely, and an approximate wheel angle can
be achieved by centering the wheels with the straight ahead command,
and then steering left or right for a precise amount of time. The
errors incurred this way are in excess of 10 degrees. The remaining
two bits once controlled a motor that rotated the TV camera in a
horizontal plane in a manner similar to the steering. This pan
mechanism has been replaced by a stepping motor drive that slides the
camera from side to side on a 52 cm track in precise 0.25 mm steps
(Figure 2). A set of thumbwheel switches on the cart determines how
many steps the camera moves with each right or left command. The
number is currently 256, effectively dividing the track into eight
parts, and giving the camera nine stopping positions across it. These
provide a highly redundant nine-eyed stereo baseline for distance
measurements.
Figure 2. A closeup of the camera slider mechanism. The camera can
translate sideways over a 52 cm range, in 0.25 mm steps.
The only data transmitted from the cart is the camera's TV
image. There are no wheel position sensors, touch or force detectors,
or range finders, nor any means for sending data from such. It was
decided early in this research, partly because of funding limitations,
that while such devices would be an asset in a real application such
as a Mars rover, they were not essential in an experimental vehicle
whose purpose was development of vision software. It is obvious that
some of the problems encountered would have had simpler or more
reliable solutions were other sensors available.
On the positive side, an independent visual navigation system
will likely be a desirable redundant backup on a camera equipped
vehicle that navigates primarily by other means.
Under construction is an onboard microprocessor system for the
cart that will allow bi-directional, error checked, communication and
also loading and execution of simple onboard control and servoing
programs. The obstacle avoiding work described here will be completed
without benefit of this new hardware.
Figure 3. The cart in its calibration posture before the calibration
pattern. A program automatically locates the cross and the spots, and
deduces the camera's focal length and distortion.
Figure 4. Results of the calibration program. The distortion polynomial
it produced has been used to map an undistorted grid of ideal spot
positions into the calculated real world ones. The result is
superimposed on the original digitized spot image to make any
discrepancies obvious.
Overview
From the outside, the cart runs completed so far proceed as follows.
The cart is placed 3 meters in front of and pointed towards a 3 meter
square pattern of dots on a wall (Figure 3). A program that digitizes
the image and automatically locates the spots and determines the
camera's focal length and distortions is run (Figure 4). The cart is
then driven to the starting location of a clear path in a cluttered
space, and a navigating program is started. This reads in the camera
calibration file produced by the first program (actually calibrations
need to be done only occasionally), and takes control of the cart.
At first the camera slides to the left on its track, until it
encounters its left limit. Nothing mechanical happens for a few
seconds while an image is digitized. The camera then slides one
increment of 256 steps, 6.4 cm to the right, digitizes another, and so
on for a total of nine images. Afterward the cart sits immobile for
about ten realtime minutes while our timeshared KL-10 works on the
pictures for 180 compute seconds. Then it lurches about a meter
forward, stops, and repeats the camera motion.
As the cart moves slowly forward, meter by meter, the program
generates and updates a three dimensional world model of objects it
has seen, and deduces the vehicle's motion through them. This model
is displayed in map form. Initially the map is blank; as time (and the
cart) goes by, more and more points are added. The cart's trajectory is
also plotted (Figures 10 and 11).
This concludes the description of the existing program's
behavior. When the last stage of the work is completed the activity
will be slightly augmented. The cart will be told its destination at
the start of the run, and at each stop will plan a path to the goal
that avoids all the obstacles perceived so far. Each meter-long lurch
will begin with a steering motion to keep the cart on that path.
Figure 5. A view of the room mapped in the sample run.
Digitizing
Digitizing images over a sometimes noisy radio link has its
problems. Though obtaining adequate images is a crucial part of the cart
program, most of the techniques used solve problems peculiar to
our system, and are of limited general interest. We summarize them
here.
The picture is transmitted on a frequency of 524 MHz, a
wavelength of 57 cm. Reflections from nearby objects sometimes cancel
the signal to a receiver. The cart program is able to get its picture
through any one of a number of TV receivers (up to 4). It measures the
scanline to scanline noise on each in turn, and then picks the one
with lowest noise. In the future we may simply switch among antennas
feeding a single receiver.
Our decade old digitizer extracts a picture 280 samples wide
by 240 high by 4 bits/sample from broadcast standard video. Four bits
is only sixteen grey levels, and gives a poor quality image. Also,
sometimes the noise level over the link is high, and the picture has a
lot of "snow". These problems can be alleviated by adding up or
averaging a number of frames, to increase the number of bits and
average out noise. But this introduces problems of its own. The sync
detector on the digitizer sometimes fails, especially when the signal
is noisy, and the picture jitters or "rolls". Adding together
pictures that have rolled gives us a big blur. The problem has been
solved by a program that digitizes a large number of frames (about
30) and, by efficiently sampling the pixels, intercompares them to
find the subset of pictures that are most nearly alike. These are then
combined. Since most images have not suffered sync loss, the result is
almost always a good six bit picture. New digitizing hardware that the
lab is in the process of ordering may soon make these gyrations
obsolete.
Picture Reduction
The cart vision programs make extensive use of reduced
versions of input images. Digitized pictures are compressed repeatedly
by linear factors of two, four pixels being added to become one, until
the whole image has shrunk to a single pixel. The original image and
all of its reductions are available to all subsequent processing
steps.
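The reduction scheme is simple enough to state in a few lines of code.
The sketch below, in Python with numpy (not the original
implementation), builds the whole pyramid by averaging 2 by 2 blocks
(the report speaks of adding the four pixels; the two differ only by a
constant scale), and assumes the image sides are powers of two so every
level halves cleanly.

```python
import numpy as np

def reduction_pyramid(image):
    """Return the image and all its successive reductions, each level one
    quarter the area of the one before, down to a single pixel."""
    levels = [image.astype(float)]
    while levels[-1].shape[0] > 1 and levels[-1].shape[1] > 1:
        a = levels[-1]
        # each 2x2 block of pixels becomes one pixel in the next level
        a = (a[0::2, 0::2] + a[1::2, 0::2] + a[0::2, 1::2] + a[1::2, 1::2]) / 4.0
        levels.append(a)
    return levels
```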
Calibration
The cart camera, like most vidicons, has peculiar geometric
properties. Its precision has been enhanced by an automatic focal
length and distortion determining program.
The cart is parked a precise distance in front of a wall of
many spots and one cross (Figure 3). The program digitizes an image of
the spot array, locates the spots and the cross, and constructs a
two dimensional polynomial that relates the position of the spots in
the image to their position in an ideal unity focal length camera, and
another polynomial that converts points from the ideal camera to
points in the image. These polynomials are used to correct the
positions of perceived objects in later scenes.
The program tolerates a wide range of spot parameters (about 3
to 12 spots across), arbitrary image rotation, and is very
robust. After being intensely fiddled with to work successfully on an
initial set of 20 widely varying images, it has worked without error
on 50 successive images. The test pattern for the cart is a 3 meter
square painted on a wall, with 5 cm spots at 30 cm intervals. The
program has also been used successfully with a small array (22 x 28
cm) to calibrate cameras other than the cart's.
The algorithm reads in an image of such an array, and begins
by determining its approximate spacing and orientation. It trims the
picture to make it square, reduces it by averaging to 64 by 64,
calculates the Fourier transform of the reduced image and takes its
power spectrum, arriving at a 2D transform symmetric about the origin,
and having strong peaks at frequencies corresponding to the horizontal
and vertical spacing and half the diagonal spacing, with weaker peaks
at the harmonics. It multiplies each point [i,j] in this transform by
point [-j,i] and points [j-i,j+i] and [i+j,j-i], effectively folding
the primary peaks onto one another. The strongest peak in the 90
degree wedge around the y axis gives the spacing and orientation
information needed by the next part of the process.
The directional variance interest operator described later is
applied to roughly locate a spot near the center of the image. A
special operator examines a window surrounding this position,
generates a histogram of intensity values within the window, decides a
threshold for separating the black spot from the white background, and
calculates the centroid and first and second moment of the spot. This
operator is again applied at a displacement from the first centroid
indicated by the orientation and spacing of the grid, and so on, the
region of found spots growing outward from the seed.
A template for the expected appearance of the cross in the
middle of the array is constructed from the orientation/spacing
determined by the Fourier transform step. Each of the
found spots is matched to this cross template, and the closest match
in the central portion of the picture is declared to be the origin.
Least squares polynomials of about third degree in two
variables, relating the actual positions of the spots to the ideal
positions, are then generated and written into a file.
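As an illustration of this final fitting step, the sketch below sets up
such a least squares polynomial in two variables using Python and
numpy. It is a sketch only: the report does not give the exact monomial
basis, and the function and variable names here are invented.

```python
import numpy as np

def fit_correction_polynomial(found, ideal, degree=3):
    """Least squares fit of two polynomials in (x, y), one per output
    co-ordinate, mapping found spot positions to ideal-camera positions.

    found, ideal: N x 2 arrays of corresponding positions.
    Returns a coefficient array usable by apply_correction below."""
    x, y = found[:, 0], found[:, 1]
    # all monomials x^i * y^j with i + j <= degree
    basis = [x**i * y**j for i in range(degree + 1)
             for j in range(degree + 1 - i)]
    A = np.stack(basis, axis=1)
    coeffs, *_ = np.linalg.lstsq(A, ideal, rcond=None)
    return coeffs

def apply_correction(coeffs, px, py, degree=3):
    """Map one image point through the fitted polynomials."""
    row = np.array([px**i * py**j for i in range(degree + 1)
                    for j in range(degree + 1 - i)])
    return row @ coeffs   # corrected (x, y)
```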
Choosing Features
Figure 6. A cart's eye view of the room from the starting position of
the mapping run, and features picked out by the interest operator. They
are labelled in order of decreasing interest measure.
The cart vision code deals with very simple primitive
entities, localized regions called features. A feature is conceptually
a point in the three dimensional world, but it is found by examining
localities larger than points in pictures. A feature is good if it
can be located unambiguously in different views of a scene. A
uniformly colored region or a simple edge does not make a good
feature, because its parts are indistinguishable from one another. Regions, such as
corners, with high contrast in orthogonal directions are best.
New features in images are picked by a subroutine called the
interest operator, an example of whose operation is displayed in
Figure 6. It tries to select a relatively uniform scattering, to
maximize the probability that a few features will be picked on every
visible object, and tries to choose areas that can be easily found in
other images. Both goals are achieved by returning regions that are
local maxima of a directional variance measure.
Directional variance is measured over small square windows.
Sums of squares of differences of pixels adjacent in each of four
directions (horizontal, vertical and two diagonals) over each window
are calculated, and the window's interest measure is the minimum of
these four sums.
Features are chosen where the interest measure has local
maxima. The feature is conceptually the point at the center of the
window with this locally maximal value.
The effects of noise are alleviated and the processing time is
shortened by applying the operator to a reduced image. In the current
program original images are 240 lines high by 256 pixels wide. The
interest operator is applied to the 120 by 128 version, on windows 3
pixels square.
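A minimal sketch of the operator, in Python with numpy (not the
original implementation), is given below. It computes the four
directional sums of squared differences over every small window and
keeps the windows where the minimum of the four is a local maximum;
thresholds and the exact scan order of the real program are not
reproduced.

```python
import numpy as np

def interest_measure(window):
    """Directional variance of one small window: the minimum over four
    directions of the sum of squared differences of adjacent pixels."""
    w = window.astype(float)
    sums = [
        np.sum((w[:, 1:] - w[:, :-1]) ** 2),     # horizontal neighbours
        np.sum((w[1:, :] - w[:-1, :]) ** 2),     # vertical neighbours
        np.sum((w[1:, 1:] - w[:-1, :-1]) ** 2),  # one diagonal
        np.sum((w[1:, :-1] - w[:-1, 1:]) ** 2),  # the other diagonal
    ]
    return min(sums)

def pick_features(image, wsize=3):
    """Return (row, col, measure) of windows whose interest measure is a
    local maximum, sorted by decreasing interest."""
    rows = image.shape[0] - wsize + 1
    cols = image.shape[1] - wsize + 1
    m = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            m[r, c] = interest_measure(image[r:r + wsize, c:c + wsize])
    features = []
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            if m[r, c] > 0 and m[r, c] == m[r - 1:r + 2, c - 1:c + 2].max():
                features.append((r + wsize // 2, c + wsize // 2, m[r, c]))
    return sorted(features, key=lambda f: -f[2])
```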
Once a feature is chosen, its appearance is recorded as a series
of excerpts from the reduced image sequence. A window (6 by 6 in the
current implementation) is excised around the feature's location from
each of the variously reduced pictures. Only a tiny fraction of the
area of the original (unreduced) image is extracted. Four times as
much of the x2 reduced image is stored, sixteen times as much of the
x4 reduction, and so on until at some level we have the whole
image. The final result is a series of 6 by 6 pictures, beginning with
a very blurry rendition of the whole picture, gradually zooming in
linear expansions of two to a sharp closeup of the feature. Of course,
it records the appearance correctly from only one point of view.
Finding Features Again
Deducing the 3D location of features from their projections in
2D images requires that we know their position in two or more such
images.
The correlator is a subroutine that, given a description of a
feature as produced by the interest operator from one image, finds the
best match in a different, but similar, image. Its search area can be
the entire new picture, or a rectangular sub-window.
Figure 7. Areas searched for in a successive refinement correlation.
The largest rectangle begins the process.
Figure 8. Best matches to the areas outlined in Figure 7, found by the
correlator in a different image.
The search uses a coarse to fine strategy, illustrated in Figures
7 and 8, that begins in reduced versions of the pictures. Typically the
first step takes place at the x16 (linear) reduction level. The 6 by 6
window at that level in the feature description, which covers about one
seventh of the total area of the original picture, is convolved with the
search area in the correspondingly reduced version of the second picture.
The 6 by 6 description patch is moved pixel by pixel over the
approximately 15 by 16 destination picture, and a correlation coefficient
is calculated for each trial position.
The position with the best match is recorded. The 6x6 area it
occupies in the second picture is mapped to the x8 reduction level,
where the corresponding region is 12 pixels by 12. The 6 by 6 window
in the x8 reduced level of the feature description is then convolved
with this 12 by 12 area, and the position of best match is recorded
and used as a search area for the x4 level.
The process continues, matching smaller and smaller, but more
and more detailed windows until a 6 by 6 area is selected in the
unreduced picture.
The work at each level is about the same, finding a 6 by 6
window in a 12 by 12 search area. It involves 49 summations of 36
quantities. In our example there were 5 such levels. The correlation
measure used is

$$ \frac{2\sum ab}{\sum a^2 + \sum b^2} $$

where $a$ and $b$ are the values of pixels in the two windows being
compared, with the mean of each window subtracted out, and the sums are
taken over the 36 elements of a 6 by 6 window. The measure has limited
tolerance to contrast differences.
The window sizes and other parameters are sometimes different
from the ones used in this example.
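The sketch below shows one level of this search in Python with numpy:
the quoted match measure, and a brute-force scan of a small patch over
a search area. The real program repeats this at each reduction level,
mapping the winning position down to the next finer level; parameter
values and names here are illustrative only.

```python
import numpy as np

def match_measure(a, b):
    """Pseudo-normalized correlation 2*sum(ab) / (sum(a^2) + sum(b^2)),
    with each window's mean removed first, as in the measure above."""
    a = a.astype(float) - a.mean()
    b = b.astype(float) - b.mean()
    denom = np.sum(a * a) + np.sum(b * b)
    return 2.0 * np.sum(a * b) / denom if denom > 0 else 0.0

def best_match(patch, search_area):
    """Slide a small patch over a search area and return the offset of
    the best match and its score: one level of the coarse-to-fine search."""
    ph, pw = patch.shape
    best_pos, best_score = None, -np.inf
    for r in range(search_area.shape[0] - ph + 1):
        for c in range(search_area.shape[1] - pw + 1):
            score = match_measure(patch, search_area[r:r + ph, c:c + pw])
            if score > best_score:
                best_pos, best_score = (r, c), score
    return best_pos, best_score
```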
Slider Stereo
At each pause on its computer controlled itinerary the cart
slides its camera from left to right on the 52 cm track, taking 9
pictures at precise 6.4 cm intervals.
Points are chosen in the fifth (middle) of these 9 images,
either by the correlator to match features from previous positions, or
by the interest operator.
The camera slides parallel to the horizontal axis of the
(distortion corrected) camera co-ordinate system, so the parallax
induced apparent displacement of features from frame to frame in the 9
pictures is purely in the X direction.
The correlator looks for the points chosen in the central
image in each of the eight other pictures. The search is restricted to
a narrow horizontal band. This has little effect on the computation
time, but it reduces the probability of incorrect matches.
In the case of correct matches, the distance to the feature is
inversely proportional to its displacement from one image to another.
The uncertainty in such a measurement is the difference in distance a
shift one pixel in the image would make. The uncertainty varies
inversely with the physical separation of the camera positions where
the pictures were taken (the stereo baseline). Long baselines give
more accurate distance measurements.
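In the ideal unity focal length camera that the calibration step
provides, these two statements can be made explicit. For a purely
sideways camera translation of baseline $b$ producing an image
disparity $d$ (in ideal image units), the range $r$ and the effect of a
one pixel disparity error $\Delta d$ are approximately

$$ r \;=\; \frac{b}{d}, \qquad
   \Delta r \;\approx\; \frac{b}{d^2}\,\Delta d \;=\; \frac{r^2}{b}\,\Delta d $$

so the range is inversely proportional to the disparity, and the range
uncertainty from a fixed pixel error shrinks as the baseline grows.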
After the correlation step the program knows a feature's
position in nine images. It considers each of the 36 possible image
pairings as a stereo baseline, and records the estimated distance to
the feature (actually inverse distance) in a histogram. Each
measurement adds a little normal curve to the histogram, with mean at
the estimated distance, and standard deviation inversely proportional
to the baseline, reflecting the uncertainty. The area under each
curve is made proportional to the product of the correlation
coefficients of the matches in the two images (for the central image this
coefficient is taken as unity), reflecting the confidence that the
correlations were correct. The area is also scaled by the normalized
dot product of the X axis with the shift of the feature in each of the
two baseline images from the central image. That is, a distance
measurement is penalized if there is significant motion of the feature
in the Y direction.
The distance to the feature is indicated by the largest peak
in the resulting histogram, if this peak is above a certain
threshold. If below, the feature is forgotten about.
The correlator frequently matches features incorrectly. The
distance measurements from incorrect matches in different pictures are
usually inconsistent. When the normal curves from the 36 picture pairs
are added up, the correct matches agree with each other, and build up
a large peak in the histogram, while incorrect matches spread
themselves more thinly. Two or three correct correlations out of the
eight will usually build a peak sufficient to offset a larger number
of errors.
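A sketch of this accumulation, in Python with numpy, is shown below.
Each image pair contributes one weighted normal curve on an inverse
distance axis, and the peak of the sum is the accepted range. The
triple per pair (mean, sigma, weight) stands for the estimated inverse
distance, the baseline-dependent uncertainty, and the correlation
quality weighting described above; the names and the sampling grid are
invented for this sketch.

```python
import numpy as np

def range_histogram(pair_estimates, grid):
    """Sum one weighted normal curve per image pair on an inverse
    distance axis.

    pair_estimates: list of (mean, sigma, weight) triples, one per pair.
    grid: 1-D array of inverse distances at which to sample the sum."""
    curve = np.zeros_like(grid, dtype=float)
    for mean, sigma, weight in pair_estimates:
        curve += (weight / (sigma * np.sqrt(2 * np.pi))) * \
                 np.exp(-0.5 * ((grid - mean) / sigma) ** 2)
    return curve

# usage: best inverse distance = grid[np.argmax(curve)], accepted only
# if the peak exceeds a confidence threshold.
```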
Figure 9. A typical nine-eyed stereo feature ranging. The pictures around
the periphery are nine images acquired from a slider scan. The feature
being sought is marked by an asterisk in the central image. Its locations
in the other eight images, as determined by the correlator, are marked by
X's. Each image pair is used to calculate a probable distance to the
feature. The range so determined is recorded as a normal curve whose area
is proportional to the quality of the correlations of the pair used, and
whose standard deviation is the uncertainty in the ranging, based on the
one pixel resolution of the correlation. Picture pairs from long baselines
have a smaller deviation, and hence a narrower and taller curve, than short
baseline pairs. The curves from the 36 possible pairings are summed, and
the peak in the resulting curve gives the distance. In the graph, the
inverse distance scale is graduated in meters, the individual normal
curves are drawn in plain lines and their sum, scaled to fit the diagram,
is the beaded curve. It has a peak at 7.1 meters. In the picture the
ranged object is the top of a box.
In this way eight applications of a mildly reliable operator
interact to make a very reliable distance measurement. Figure 9 shows
a typical ranging. The small curves are measurements from individual
picture pairs, the beaded curve is the final histogram.
Motion Stereo
The cart navigates exclusively by vision. It deduces its own
motion from the apparent 3D shift of the features around it.
After having determined the 3D location of objects at one
position, the computer drives the cart about a meter forward.
At the new position it slides the camera and takes nine
pictures. The correlator is applied in an attempt to find all the
features successfully located at the previous position. Feature
descriptions extracted from the central image at the last position are
searched for in the central image at the new stopping place.
Slider stereo then determines the distance of the features so
found from the cart's new position. The program now knows the 3D
position of the features relative to its camera at the old and the new
locations. It can figure out its own movement by finding the 3D
co-ordinate transform that relates the two.
There can be mis-matches in the correlations between the
central images at two positions and, in spite of the eight way
redundancy, the slider distance measurements are sometimes in
error. Before the cart motion is deduced, the feature positions are
checked for consistency. Although it doesn't yet have the co-ordinate
transform between the old and new camera systems, the program knows
the distance between pairs of positions should be the same in both.
It makes a matrix in which element [i,j] is the absolute value of the
difference in distances between points i and j in the first and
second co-ordinate systems divided by the expected error (based on the
one pixel uncertainty of the ranging).
Each row of this matrix is summed, giving an indication of how
much each point disagrees with the other points. The idea is that
while points in error disagree with virtually all points, correct
positions agree with all the other correct ones, and disagree only
with the bad ones.
The worst point is deleted, and its effect is removed from the
remaining points in the row sums. This pruning is repeated until the
worst error is within the error expected from the ranging uncertainty.
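The sketch below implements this pruning in Python with numpy. The
expected-error matrix and the termination threshold stand in for the
report's "error expected from the ranging uncertainty"; the function
and variable names are invented here.

```python
import numpy as np

def prune_inconsistent(p1, p2, err, threshold=1.0):
    """Prune features whose inter-feature distances disagree between the
    two co-ordinate systems.

    p1, p2: N x 3 arrays of the same features seen from the old and new
            stops.
    err:    N x N array of expected errors for each pairwise distance.
    Repeatedly deletes the point that disagrees most with the others
    until the largest remaining normalized disagreement is below the
    threshold.  Returns the indices of the surviving points."""
    d1 = np.linalg.norm(p1[:, None, :] - p1[None, :, :], axis=2)
    d2 = np.linalg.norm(p2[:, None, :] - p2[None, :, :], axis=2)
    safe_err = np.where(err > 0, err, 1.0)      # avoid dividing by zero
    disagreement = np.abs(d1 - d2) / safe_err
    keep = list(range(len(p1)))
    while len(keep) > 2:
        sub = disagreement[np.ix_(keep, keep)]
        if sub.max() <= threshold:
            break
        worst = sub.sum(axis=1).argmax()        # point disagreeing most overall
        del keep[worst]
    return keep
```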
After the pruning, the program has a number of points,
typically 10 to 20, whose position error is small and pretty well
known. The program trusts these, and records them in its world model,
unless it had already done so at a previous position. The pruned
points are forgotten forevermore.
Now comes the co-ordinate transform determining step. We need
to find a three dimensional rotation and translation that, if applied
to the co-ordinates of the features at the first position, minimizes
the sum of the squares of the distances between the transformed first
co-ordinates and the raw co-ordinates of the corresponding points at
the second position. Actually the quantity that's minimized is the
foregoing sum, but with each term divided by the square of the
uncertainty in the 3D position of the points involved, as deduced from
the one pixel shift rule. This weighting does not make the solution
more difficult.
Figure 10. Overhead view of the cart's world model from position to
position in the sample run, as deduced by the program purely from
perceived images. The world model starts out empty, and gets gradually
more cluttered as the program drives the cart forward and acquires and
confirms new features. The grid cells are two meters on a side. The
cart's position is indicated by the dark square. Each stopping point
on its path is marked by a dash (note that the cart failed to move
between the seventh and the eighth position; this was due to a bug in
the controlling routine, and was detected by the visual mapper). The
two lines radiating from the cart are the horizontal limits of the
camera's field of view. Each feature is marked by a circle that
approximates the range of uncertainty in the projection of its
position on the ground. The 45 degree line growing to the upper right
out of each circle indicates the feature's height above the floor, in
the same scale as the grid. Features whose height is less than or
equal to their position uncertainty are assumed to be marks on the
ground. The pictures below each map show the Cart's view from that
position. Each is the central image of a nine-eyed set.
Figure 11. The world map at the final position. The graph on the left
(note the calibration in centimeters) is the program's determination
of the cart's height above the ground. Since the cart didn't actually
bounce up and down, it provides a measure of the ultimate accuracy of
the self model. The long vertical line through the circle of
uncertainty around each feature position is the distance uncertainty,
which is higher than the average uncertainty shown by the circle. The
picture is the central view from the final position.
The error expression is expanded. It becomes a function of the
rotation and translation, with parameters that are the weighted averages
of the x, y and z co-ordinates of the features at the two positions, and
averages of their various cross-products. These averages need to be
determined only once, at the beginning of the transform finding process.
To minimize the error expression, its partial derivative with
respect to each variable is set to zero. It is relatively easy to
solve simultaneously the three linear equations that result for
the vector offset, obtaining the optimal offset values for a general
rotation. This gives symbolic expressions (linear combinations of the
rotation matrix coefficients) for each of the three vector
components. Substituting these values into the error expression makes
it a function of the rotation alone. This new, translation determined,
error expression is used in all the subsequent steps.
Minimizing the error expression under rotation is surprisingly
difficult, mainly because of the non-linear constraints in the 3D
rotation matrix. The next six paragraphs outline the struggle. Each
step was forced by the inadequacies of the previous one.
The program begins by ignoring the non-linearities. It solves
for the general 3D linear transformation, nine elements of a matrix,
that minimizes the least square error. The derivatives of the error
expression with respect to each of the matrix coefficients are equated
to zero, and the nine resulting simultaneous linear equations are
solved for the nine coefficients. If the points had undergone an
error-free rigid rotation and translation between the two positions,
the result would be the desired rotation matrix, and the problem would
be solved.
Because there are errors in the determined position of the
features, the resulting matrix is usually not simply a rotation, but
involves stretching and skewing. The program ortho-normalizes the
matrix. If the position errors were sufficiently small, this new
matrix would be our answer.
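A compact sketch of these first two steps, in Python with numpy, is
given below. The weighted unconstrained fit is done on centred
co-ordinates (equivalent to solving for the translation jointly), and
the orthonormalization projects onto the nearest rotation with an SVD.
The report does not name its orthonormalization method, so that choice
and all the names here are this sketch's assumptions.

```python
import numpy as np

def fit_rigid_first_pass(p1, p2, w):
    """Unconstrained 3x3 linear least squares fit, then its nearest
    orthonormal matrix as the starting rotation.

    p1, p2: N x 3 feature positions at the old and new stops; rows are
            points, transformed as p2 @ M + t to approximate p1.
    w:      per-feature weights (1 / U_i^2)."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    c1 = w @ p1                                 # weighted centroids
    c2 = w @ p2
    q1 = (p1 - c1) * np.sqrt(w)[:, None]        # weighted, centred points
    q2 = (p2 - c2) * np.sqrt(w)[:, None]
    M, *_ = np.linalg.lstsq(q2, q1, rcond=None) # unconstrained 3x3 fit
    U, _, Vt = np.linalg.svd(M)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R0 = U @ D @ Vt                             # nearest proper rotation
    t0 = c1 - c2 @ R0                           # translation for that rotation
    return R0, t0
```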
The errors are high enough to warrant adding the rigid
rotation constraints in the least squares minimization. The error
expression is converted from a linear expression in nine matrix
coefficients into an unavoidably non-linear function in three
parameters that uniquely characterize a rotation.
This new error expression is differentiated with respect to
each of the three rotation parameters, and the resulting expressions
are equated to zero, giving us three non-linear equations in three
unknowns. A strenuous attempt at an analytic solution of this
simultaneous non-linear system failed, so the program contains code to
solve the problem iteratively, by Newton's method.
The rotation expressed by the ortho-normalized matrix from the
previous step becomes the initial approximation. Newton's method for a
multi-variate system involves finding the partial derivative of each
expression whose root is sought with respect to each variable. In our
case there are three variables and three equations, and consequently
nine such derivatives. The nine derivatives, each a closed form
expression of the rotation variables, are the coefficients of a 3 by 3
matrix (the Jacobian of the system) that characterizes the first order
changes in the expressions whose roots are sought as the parameters change. The next
Newton's method approximation is found by multiplying the inverse of
this matrix by the value of the root expressions, and subtracting the
resulting values (which will be 0 at the root) from the parameter
values of the previous approximation.
Four or five iterations usually brings the parameters to
within our floating point accuracy of the correct values.
Occasionally, when the errors in the determined feature locations are
high, the process does not converge. The program detects this by
noting the change in the original error expression from iteration to
iteration. In case of non-convergence, the program picks a random
rotation as a new starting point, and tries again. It is willing to
try up to several hundred times. The rotation with the smallest error
expression ever encountered during such a search (including the
initial approximation) is returned as the answer.
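In outline, the iteration with its random restarts looks like the
following Python sketch. The three functions grad, hess and error stand
for the machine-generated closed form expressions (in the real program
they came from Macsyma); the convergence test and the restart
distribution here are simplifications of what the text describes.

```python
import numpy as np

def newton_with_restarts(grad, hess, error, p0, tries=300, iters=10, tol=1e-10):
    """Drive the three rotation parameters to a root of grad by Newton's
    method, restarting from a random rotation when an attempt fails to
    converge.  Returns the parameters with the smallest error ever seen."""
    best_p = np.array(p0, dtype=float)
    best_e = error(best_p)
    p = best_p.copy()
    for _ in range(tries):
        converged = False
        for _ in range(iters):
            step = np.linalg.solve(hess(p), grad(p))  # one 3x3 solve per step
            p = p - step
            e = error(p)
            if e < best_e:
                best_p, best_e = p.copy(), e
            if np.max(np.abs(step)) < tol:
                converged = True
                break
        if converged:
            break
        # non-convergence: restart from a random point inside the unit
        # ball, i.e. an arbitrary rotation
        p = np.random.uniform(-1.0, 1.0, 3)
        while np.dot(p, p) >= 1.0:
            p = np.random.uniform(-1.0, 1.0, 3)
    return best_p
```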
Since the summations over the co-ordinate cross-products are
done once and for all at the beginning of the transformation
determination, each iteration, involving evaluation of about a dozen
moderately large expressions and a 3 by 3 matrix inversion, is
relatively fast. The whole solving process, even in cases of
pathological non-convergence, takes one or two seconds of computer
time.
Motion Stereo Mathematics
Going over the previous ground in gory detail, the least
squares expression to be minimized is
$$ \mathrm{Error} \;=\; \sum_{i=1}^{N}
   \frac{(X_{1,i}-X'_{2,i})^2 + (Y_{1,i}-Y'_{2,i})^2 + (Z_{1,i}-Z'_{2,i})^2}{U_i^2} $$

where $X_{1,i}$, $Y_{1,i}$ and $Z_{1,i}$ are the co-ordinates of the
features as observed from the first cart position, and $X_{2,i}$,
$Y_{2,i}$ and $Z_{2,i}$ are the co-ordinates of the points as seen from
the second position. $X'_{2,i}$, $Y'_{2,i}$ and $Z'_{2,i}$ are the
points from the second position subjected to a co-ordinate
transformation in the attempt to make them match the first set.
$U_i$ is the combined uncertainty in the position of feature $i$ in
both frames. To get the most out of our data, we need to weight
features in proportion to their accuracy. To simplify the subsequent
formulas, the $U_i$ are normalized to make the sum of their inverse
squares unity.
For a general linear transformation, these values are defined by

$$ [\,X'_{2,i}\;\;Y'_{2,i}\;\;Z'_{2,i}\,] \;=\;
   [\,X_{2,i}\;\;Y_{2,i}\;\;Z_{2,i}\,]
   \begin{bmatrix} AA & AB & AC \\ BA & BB & BC \\ CA & CB & CC \end{bmatrix}
   \;+\; [\,X_c\;\;Y_c\;\;Z_c\,] $$

where $AA$ through $CC$ are the "rotation" part of the transform and
$X_c$ through $Z_c$ are the translation.
Error can be expanded out and the summation can be distributed
over the terms, giving us

$$
\begin{aligned}
\mathrm{Error} \;=\;& X_c^2 + Y_c^2 + Z_c^2
 \;+\; \overline{X_1^2} + \overline{Y_1^2} + \overline{Z_1^2} \\
&+ (AA^2+AB^2+AC^2)\,\overline{X_2^2}
 + (BA^2+BB^2+BC^2)\,\overline{Y_2^2}
 + (CA^2+CB^2+CC^2)\,\overline{Z_2^2} \\
&- 2\Big(\; AA\,\overline{X_1X_2} + AB\,\overline{Y_1X_2} + AC\,\overline{Z_1X_2}
 + BA\,\overline{X_1Y_2} + BB\,\overline{Y_1Y_2} + BC\,\overline{Z_1Y_2} \\
&\qquad + CA\,\overline{X_1Z_2} + CB\,\overline{Y_1Z_2} + CC\,\overline{Z_1Z_2} \\
&\qquad - (AA\,BA+AB\,BB+AC\,BC)\,\overline{X_2Y_2}
         - (AA\,CA+AB\,CB+AC\,CC)\,\overline{X_2Z_2}
         - (BA\,CA+BB\,CB+BC\,CC)\,\overline{Y_2Z_2} \\
&\qquad + (\overline{X_1}-AA\,\overline{X_2}-BA\,\overline{Y_2}-CA\,\overline{Z_2})\,X_c
         + (\overline{Y_1}-AB\,\overline{X_2}-BB\,\overline{Y_2}-CB\,\overline{Z_2})\,Y_c
         + (\overline{Z_1}-AC\,\overline{X_2}-BC\,\overline{Y_2}-CC\,\overline{Z_2})\,Z_c \;\Big)
\end{aligned}
$$

where $\overline{X_1}$ is $\sum_i X_{1,i}/U_i^2$, $\overline{X_1Z_2}$ is
$\sum_i X_{1,i}Z_{2,i}/U_i^2$, and similarly for the other barred
quantities. With the expression in this form, the barred quantities can
be evaluated once and for all at the start of the solving.
Error can be differentiated partially with respect to $X_c$, $Y_c$
and $Z_c$. If the resulting linear expressions are equated to zero and
solved, we obtain

$$
\begin{aligned}
X_c &= \overline{X_1} - AA\,\overline{X_2} - BA\,\overline{Y_2} - CA\,\overline{Z_2} \\
Y_c &= \overline{Y_1} - AB\,\overline{X_2} - BB\,\overline{Y_2} - CB\,\overline{Z_2} \\
Z_c &= \overline{Z_1} - AC\,\overline{X_2} - BC\,\overline{Y_2} - CC\,\overline{Z_2}
\end{aligned}
$$

These are the components of the translation vector that minimizes Error
for a general "rotation" matrix. They can be back substituted into Error,
making it a function of the rotation alone.
$$
\begin{aligned}
\mathrm{Error} \;=\;& \overline{X_1^2} + \overline{Y_1^2} + \overline{Z_1^2}
 + (AA^2+AB^2+AC^2)\,\overline{X_2^2}
 + (BA^2+BB^2+BC^2)\,\overline{Y_2^2}
 + (CA^2+CB^2+CC^2)\,\overline{Z_2^2} \\
&- 2\Big(\; AA\,\overline{X_1X_2} + AB\,\overline{Y_1X_2} + AC\,\overline{Z_1X_2}
 + BA\,\overline{X_1Y_2} + BB\,\overline{Y_1Y_2} + BC\,\overline{Z_1Y_2} \\
&\qquad + CA\,\overline{X_1Z_2} + CB\,\overline{Y_1Z_2} + CC\,\overline{Z_1Z_2} \\
&\qquad - (AA\,BA+AB\,BB+AC\,BC)\,\overline{X_2Y_2}
         - (AA\,CA+AB\,CB+AC\,CC)\,\overline{X_2Z_2}
         - (BA\,CA+BB\,CB+BC\,CC)\,\overline{Y_2Z_2} \;\Big) \\
&- (\overline{X_1}-AA\,\overline{X_2}-BA\,\overline{Y_2}-CA\,\overline{Z_2})^2
 - (\overline{Y_1}-AB\,\overline{X_2}-BB\,\overline{Y_2}-CB\,\overline{Z_2})^2
 - (\overline{Z_1}-AC\,\overline{X_2}-BC\,\overline{Y_2}-CC\,\overline{Z_2})^2
\end{aligned}
$$
The linear first approximation is now found by partially
differentiating this new Error with respect to AA through CC, setting
these derivatives to zero and simultaneously solving the system of
nine equations (which happen to be linear) so obtained for AA through
CC.
Now we get to the hard part. The matrix obtained in the
preceding step is a general 3x3 transform, and not necessarily a true
rotation. A true 3D rotation has only three free parameters, not
nine. However these three parameters are chosen, some or all of the
elements of a rotation matrix that characterizes the same rotation
will be non-linear functions of the three parameters.
We have chosen to represent rotations by a three element
subset of a quaternion. A quaternion is a four element
parameterization of a rotation, but any of the four elements may be
derived from the other three (the sum of their squares is unity).
Imagine an arbitrary axis through the origin, described by its
angles $\alpha$, $\beta$ and $\gamma$ with the co-ordinate axes. A
general rotation can be specified as a twist of an angle $\theta$
around such an axis. The quaternion for this rotation is the four
element list

$$ [\,P_0\;\;P_x\;\;P_y\;\;P_z\,] \;=\;
   [\,\cos\tfrac{\theta}{2}\;\;\;
      \cos\alpha\,\sin\tfrac{\theta}{2}\;\;\;
      \cos\beta\,\sin\tfrac{\theta}{2}\;\;\;
      \cos\gamma\,\sin\tfrac{\theta}{2}\,] $$
The half angle components may seem a little peculiar, but the
manipulations are simplified because of them. The rotation matrix
corresponding to this quaternion is
$$
\begin{bmatrix}
-1+2P_0^2+2P_x^2 & 2P_xP_y-2P_0P_z & 2P_0P_y+2P_xP_z \\[2pt]
2P_xP_y+2P_0P_z & -1+2P_0^2+2P_y^2 & -2P_0P_x+2P_yP_z \\[2pt]
-2P_0P_y+2P_xP_z & 2P_0P_x+2P_yP_z & -1+2P_0^2+2P_z^2
\end{bmatrix}
$$
The $P_0$ element can be defined positive and equal to
$\sqrt{1-P_x^2-P_y^2-P_z^2}$. With this last substitution, the rotation
is defined by three independent variables.
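For reference, the matrix above transcribes directly into code. The
following Python sketch builds it from the three independent parameters
and can be used to check that the result is always orthonormal with
determinant +1 (the names are this sketch's, not the program's).

```python
import numpy as np

def quaternion_rotation_matrix(px, py, pz):
    """Rotation matrix for the three-element quaternion parameterization
    above; p0 is recovered as the positive root of 1 - px^2 - py^2 - pz^2."""
    p0 = np.sqrt(1.0 - px * px - py * py - pz * pz)
    return np.array([
        [-1 + 2 * (p0 * p0 + px * px), 2 * (px * py - p0 * pz), 2 * (p0 * py + px * pz)],
        [2 * (px * py + p0 * pz), -1 + 2 * (p0 * p0 + py * py), 2 * (-p0 * px + py * pz)],
        [2 * (-p0 * py + px * pz), 2 * (p0 * px + py * pz), -1 + 2 * (p0 * p0 + pz * pz)],
    ])

# e.g. a 90 degree twist about the z axis: px = py = 0, pz = sin(45 deg);
# quaternion_rotation_matrix(0, 0, np.sin(np.pi / 4)) gives approximately
# [[0, -1, 0], [1, 0, 0], [0, 0, 1]].
```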
Substituting these matrix elements (now involving square roots
for $P_0$) in place of $AA$ through $CC$ in Error creates an
expression only a barbarian would inflict upon a reader.
Suffice it to say that the resulting expression is partially
differentiated with respect to $P_x$, $P_y$ and $P_z$, and the
resulting, very non-linear, expressions are equated to zero. Let $E$
represent the barbarous Error parameterized by the three quaternion
sine elements. Name the three expressions as follows:

$$ E_x = \frac{\partial E}{\partial P_x}, \qquad
   E_y = \frac{\partial E}{\partial P_y}, \qquad
   E_z = \frac{\partial E}{\partial P_z} $$
The Newton's method iteration is defined by

$$ [\,P_x\;\;P_y\;\;P_z\,]' \;=\; [\,P_x\;\;P_y\;\;P_z\,]
   \;-\; [\,E_x\;\;E_y\;\;E_z\,]
   \begin{bmatrix}
   \dfrac{\partial E_x}{\partial P_x} & \dfrac{\partial E_x}{\partial P_y} & \dfrac{\partial E_x}{\partial P_z} \\[8pt]
   \dfrac{\partial E_y}{\partial P_x} & \dfrac{\partial E_y}{\partial P_y} & \dfrac{\partial E_y}{\partial P_z} \\[8pt]
   \dfrac{\partial E_z}{\partial P_x} & \dfrac{\partial E_z}{\partial P_y} & \dfrac{\partial E_z}{\partial P_z}
   \end{bmatrix}^{-1} $$

The derivatives in the matrix, as well as $E_x$ through $E_z$, are
messy closed form expressions of $P_x$, $P_y$ and $P_z$.
The bulk of the code in this part of the program, about a page
of solid formulas, was produced automatically by MIT's Macsyma
symbolic mathematics system.
Work in progress
The mapping program is being extended into its ultimate form,
a system that can navigate the cart to a desired location in an
unrestricted environment, avoiding obstacles along the way.
The key element is a routine that finds a short clear path
between two points in an obstacle field. This routine now exists as a
stand-alone program, but has not yet been incorporated into the
mapper.
Since the features located by the mapper are simple points in
the world, we must somehow add size and substance a posteriori. Each
located feature is assumed to be part of a cylinder whose axis is
vertical and passes through the feature, whose radius is the
feature's position uncertainty, and whose height is the feature's
distance from the ground minus the position uncertainty. Cylinders so
defined are considered obstacles only if they protrude above the
ground.
In the current program the ground plane is defined a priori,
from the known height of the cart camera above the floor, and the
angle of the camera with respect to the horizontal (measured before a
run by a protractor/level). In later versions the cart will
dynamically update its ground plane orientation model by observing its
own motion as it drives forward. The endpoints of each meter-long
lurch define a straight line that is parallel to the local ground. The
vector component of the ground plane model in the direction of the
lurch will be tilted to match the observed cart motion, the component
perpendicular to that will be left unchanged. After moving in two
non-collinear lurches, all ground-plane orientation parameters will be
so updated. This process will allow the cart to keep its sanity while
traversing hilly terrain. Because the motion determination has short
term inaccuracies, the tilt model will be updated only fractionally at
each move, in the manner of exponential smoothing.
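A sketch of one such fractional update, in Python with numpy, is given
below. It represents the ground plane by its unit normal; the smoothing
fraction alpha is a hypothetical value, since the report only says the
update is fractional, in the manner of exponential smoothing, and the
names are invented here.

```python
import numpy as np

def update_ground_normal(normal, lurch, alpha=0.25):
    """Fractionally tilt the ground-plane normal so that the observed
    lurch direction becomes more nearly parallel to the modelled ground.

    normal: current unit normal of the ground-plane model.
    lurch:  3-vector from the last stop to the current one, assumed
            parallel to the local ground.
    alpha:  smoothing fraction (hypothetical)."""
    normal = np.asarray(normal, dtype=float)
    lurch_dir = np.asarray(lurch, dtype=float)
    lurch_dir = lurch_dir / np.linalg.norm(lurch_dir)
    # the normal exactly consistent with this lurch: remove the component
    # of the current normal along the direction of travel, leaving the
    # perpendicular component unchanged
    target = normal - np.dot(normal, lurch_dir) * lurch_dir
    target = target / np.linalg.norm(target)
    # exponential smoothing between the old model and the new evidence
    new_normal = (1 - alpha) * normal + alpha * target
    return new_normal / np.linalg.norm(new_normal)
```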
The cart itself has a radius of about 2 meters. It is
convenient to add this radius to each and every obstacle, and
conceptually shrink the cart to a point. The path finding problem then
becomes one of drawing a line (or pulling a string) between two points
through a maze of fattened cylindrical obstructions. Since the cart
moves in a plane, the problem is really two dimensional.
Figure 12. The path finder at work in a randomly generated obstacle
field. The cart's radius has been moved from the cart to every
obstacle. It took about 15 compute seconds to find this route.
An optimum path in such an environment will consist of either
a straight run between start and finish, or a series of tangential
segments between the cylinders and contacting arcs (imagine loosely
laying a string from start to finish between the cylinders, then
pulling it tight).
The number of tangent paths between n obstacles is
2n(n-1), because each obstacle may be circumnavigated in either a
clockwise or a counter-clockwise direction. We can set up a table of
tangential distances between pairs of obstacles, with two entries for
each obstacle, one for each direction. If we set to infinity entries
for tangents that are blocked by some other obstacle, the resulting
table can be thought of as a distance matrix of a graph. The cost of
determining this table is on the order of n^3 because each
of the O(n^2) paths must be checked for blockage against at least
a fraction of the other obstacles.
There are algorithms that take O(n^2) time to find the
shortest path between two points in a graph. The current program uses
one first described by Dijkstra [10]. Unfortunately, our tangent space
is not exactly a graph. Going from a to b via an obstacle c
involves not only the straight line tangential runs, but also a
circular segment at c. The program adds in the cost of this at a
point in the algorithm where the cost of going via an intermediate
point is calculated. This addition keeps the cost estimates realistic,
but destroys one of the pre-conditions of Dijkstra's algorithm. The
result is that it always finds a good path, but sometimes it overlooks
the optimal one.
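The shortest-path step can be sketched as below, in Python. The
tangent-length table and the arc-cost function are assumed to have been
built beforehand from the fattened circles; the arc cost is added when
relaxing edges out of a node, using the tangent by which that node was
reached, which is exactly the modification that breaks Dijkstra's
optimality guarantee mentioned above.

```python
import heapq
import math

def shortest_tangent_path(tangent, arc, start, goal):
    """Dijkstra-like search over the tangent graph, with circular-arc
    costs added when passing through an intermediate obstacle node.

    tangent[a][b]: length of the unblocked tangent from node a to node b
                   (math.inf if blocked); nodes are (obstacle, direction)
                   pairs plus the start and goal points.
    arc(prev, node, nxt): arc length traversed on node's circle when
                   entering from prev and leaving toward nxt (0 at the
                   start and goal)."""
    n = len(tangent)
    best = [math.inf] * n
    best[start] = 0.0
    prev = [None] * n
    heap = [(0.0, start)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > best[node]:
            continue                      # stale heap entry
        if node == goal:
            break
        for nxt in range(n):
            seg = tangent[node][nxt]
            if math.isinf(seg):
                continue
            # cost of wrapping around `node` between the incoming and
            # outgoing tangents is charged here, as the text describes
            new_cost = cost + arc(prev[node], node, nxt) + seg
            if new_cost < best[nxt]:
                best[nxt] = new_cost
                prev[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))
    return best[goal], prev
```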
Figure 12 is a sample demonstration of the path finder. The
circular obstacles were created by a random number generator. The
originating point is the lower left corner, the destination is the
upper right.
References
1. H.P. Moravec, "Towards Automatic Visual Obstacle Avoidance",
Proceedings of the 5th IJCAI, MIT, Cambridge, Mass., 1977, p. 584.
2. D.B. Gennery, "A Stereo Vision System for an Autonomous Vehicle",
Proceedings of the 5th IJCAI, MIT, August 1977.
3. R. Nevatia & T. O. Binford, "Description and Recognition of Curved
Objects," Artificial Intelligence, Vol. 8, pp.77-98, February 1977.
4. S. W. Zucker, "Relaxation Labelling and the Reduction of Local
Ambiguities," Third International Joint Conference on Pattern Recognition,
San Diego, November 1976.
5. R. Duda & P. Hart, Pattern Recognition and Scene Analysis, Wiley, 1973.
6. Y. Bard, Nonlinear Parameter Estimation, Academic Press, 1974.
7. F. A. Graybill, An Introduction to Linear Statistical Models
Volume I, McGraw-Hill Book Company, 1961.
8. R.K. Nevatia, "Depth Perception by Motion Stereo", Computer
Graphics and Image Processing, 1975.
9. R.A. Schmidt, "A Study of the Real-Time Control of a
Computer Driven Vehicle", Stanford AI Memo 149, August 1971.
10. E.W. Dijkstra, "A Note on Two Problems in Connection with Graphs,"
Numerische Mathematik, Vol. 1, pp. 269-271, 1959.