Thesis Status for Martin C. Martin
Carnegie Mellon University Robotics Institute
for April 2, 2000

(The original thesis proposal is Breaking Out of the Black Box: A New Approach to Robot Perception, Martin, M. C., January 1998.
26 pages, available as an abstract or in full as a 2.1 MB gzipped PostScript file.)

Introduction: The entire system at a glance

The goal of this thesis work is to circumvent certain problems that arise when people design robot architectures. Specifically, people divide robots into rather isolated subsystems or "black boxes," resulting in brittle systems. It is hoped that when architectures are "learned" through automated trial and error, the components will have more subtle interactions.
          To test this idea, I've chosen one particular embodiment of it: the task of visual obstacle avoidance (using a video camera to move around without hitting anything). In this framework the robot looks at the current image and perhaps previous images, dead reckoning and previous perceptions, and decides which direction to move in order to avoid any obstacles.
          To both bootstrap the process and provide ground truth, the system predicts sonar data from vision data. First, the robot is run in an office environment while using its sonar to avoid obstacles. During this run, it records camera images, sonar data and dead reckoning. During the off-line learning procedure, it learns to predict the sonar readings from the visual information. Finally, the robot is run using the learned algorithm to produce "virtual" sonar readings, and navigates from those.
          The significant issues I encountered while implementing the above are discussed below.

Reworking the Representation of Learned Programs

The Representation According to the Proposal

As described in my Thesis Proposal, the learning system would be provided with a number of depth cues, and would learn how to combine these into depth estimates. Learning would be performed off line; however, the resulting algorithms would be evaluated on a physical robot. None of that has changed.
          However, many of the details have. In this document I'll describe those details, then the problems I've encountered, and finally I'll describe and justify the new system.
          I proposed to estimate the image depth at 1200 points in the image, arranged in a 40 by 30 grid. At each point the system was to produce three estimates, each with a confidence. For ground truth, rather than create a 3D model of the environment and reliably locate the robot in it (a rather substantial task), I proposed to predict the next image, given the current image, the depth map and dead reckoning. As a second form of feedback, I planned to predict the sonar readings as well.
          To predict the depth at a given pixel, I was planning to look at a 24x24 neighbourhood only. In an area with little visual texture (e.g. a blank wall) this is often insufficient. To counter this, the depth map at the previous time step was to be fed back as input, allowing depth information to spread over larger distances through local interactions over time.

Problems with the Above Representation

As I implemented the above framework, certain problems became apparent.
          When the next image is predicted from the current image and the depth map, it must be compared to the image the camera actually sees. A straightforward way would be to simply subtract intensity values pixel by pixel, but that measures the wrong thing. What's important for depth perception is to compare the predicted and actual locations of each object. This could be done in the image, with displacement measured in number of pixels, or converted to 3D estimates and measured in meters. It's not clear which is the better metric; however, they both suffer from the same problem. They're both similar to optical flow, and optical flow is known to have problems with rotation. When there is rotation between the two images, rotational errors completely swamp translational errors. This is true even with good dead reckoning, better than what I have available. That's why many optical flow systems use a mechanical device to rotationally stabilize the camera.
          Without stabilization (which would add a significant amount of time and complexity to the thesis), a better approach is stereo vision. Stereo vision can be used to create the ground truth in the form of a depth map. This depth map could even be corrected by hand. However, even stereo has a problem which ultimately affects optical flow as well: areas of low visual texture. In these areas noise dominates, and both stereo and optical flow give no feedback (at best) or incorrect feedback (at worst). In experiments with typical data, the vast majority of matches were of low or very low confidence.
          This is a general problem with using existing computer vision techniques. They can't provide an accurate, dense depth map. The alternatives (hand correcting those depth maps, hand constructing a model, generating ground truth using a different sensor) are all time consuming and perhaps miss the point. A dense 3D depth map is most likely overkill for navigation. Robots navigate very well from sonar, and some of the visual navigation systems listed under "Previous Work" work quite well while producing only a very low resolution depth map. If we force the learning to produce such a map, it will never find these "low resolution" approaches.
          And finally, implementing non-local interactions through iterated local interactions is somewhat indirect and unnatural. That may make it difficult for the learning to discover them. It's time to set aside the dense depth map as the internal representation, step back, and look at the big picture in search of alternatives.

A Modicum of Structure

In this thesis I want the learning process to get away from the architectures traditionally used in Computer Vision. However, we need a representation for algorithms, and any representation makes certain programs easier to find, and others harder. Among traditional programming languages, functional, procedural and object oriented styles lead to different approaches to coding. What's more, given the current limited power of genetic algorithms and computer hardware, we must "help it along" by providing a representation close to the problem domain. How, then, can we provide a representation without forcing an architecture? Once again, the crucial difference is between hard and soft. We choose building blocks which each do a significant amount of work, but can be put together in a large number of ways, some quite subtle. For a non-computational example, LEGO building blocks simplify the process of mechanical design and construction, and can be put together in many interesting and creative ways. Thus the architectures that arise in practice cover a limited but interesting space.
          In creating this representation, one interesting goal is to make it easy to represent most existing, successful algorithms. This has several advantages. It's "safe," in that we're likely to do at least as well as existing algorithms. Also, if the commonalities among all these programs capture some structure of the problem domain, the current system will likely exploit them. However, it is a tenet of this work that commonalities can also reflect the affordances of the minds that produce them. Generalizing from existing work limits how far the system can get away from our own mind sets.
          To summarize, given the current power of genetic algorithms and computer hardware, we must choose a representation that's close to the problem domain. I have chosen to generalize from existing work in a relatively conservative way, to increase the chance of basic competency. If this work succeeds, it is hoped that others will make more adventurous forays in representation.

Searching for Structure: A Summary of Existing Work

Here is a summary of existing, successful visual obstacle avoidance systems. These are from my Thesis Proposal, collected here in the hope of finding common elements to use as a basis for the representation.

          Larry Matthies and company at JPL
          This group uses stereo vision, post processed using consistency checks and filters. Stereo vision compares a window in one image against a series of same-sized windows in the other image. The series of windows are typically evenly spaced along a horizontal line.
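As a concrete sketch of this window-matching step, here is a minimal sum-of-absolute-differences (SAD) matcher over grayscale NumPy arrays. The window size, search range, and SAD score are illustrative assumptions, not the JPL group's actual choices:

```python
import numpy as np

def disparity_at(left, right, row, col, win=5, max_disp=16):
    """Compare a window in the left image against a series of same-sized
    windows shifted along the same row of the right image, and return
    the shift (disparity) with the lowest sum of absolute differences."""
    h = win // 2
    patch = left[row - h:row + h + 1, col - h:col + h + 1].astype(float)
    best_d, best_sad = 0, float("inf")
    for d in range(max_disp + 1):
        c = col - d                      # candidate window center in the right image
        if c - h < 0:
            break                        # ran off the left edge of the image
        cand = right[row - h:row + h + 1, c - h:c + h + 1].astype(float)
        sad = np.abs(patch - cand).sum()
        if sad < best_sad:
            best_sad, best_d = sad, d
    return best_d
```

Nearer objects produce larger disparities, so the consistency checks and filters mentioned above would then clean these raw matches before converting them to depth.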

          Liana Lorigo at MIT
          A histogram is computed over a series of windows. The windows are evenly spaced along a vertical line. The bottom most window is assumed to represent the ground, and the first window above that with a significantly different histogram is said to represent an obstacle. The further up the image this obstacle is found, the further away it is assumed to be.
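A minimal sketch of that window walk, assuming a grayscale NumPy image; the histogram bin count, distance measure (total variation), and threshold are my own illustrative choices, not Lorigo's:

```python
import numpy as np

def first_obstacle_row(image, win_h=10, threshold=0.3, bins=16):
    """Walk same-height windows up a grayscale image strip, comparing each
    window's intensity histogram to the bottom ("ground") window; return
    the top row of the first window whose total-variation distance from
    the ground exceeds `threshold`, or None if every window matches."""
    rows = image.shape[0]

    def hist(win):
        h, _ = np.histogram(win, bins=bins, range=(0, 256))
        return h / h.sum()

    ground = hist(image[rows - win_h:rows])
    top = rows - 2 * win_h
    while top >= 0:
        if np.abs(hist(image[top:top + win_h]) - ground).sum() / 2 > threshold:
            return top                   # first window that differs: obstacle
        top -= win_h
    return None
```

A smaller return value means the first differing window is higher in the image, which under the ground-plane assumption means a more distant obstacle.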

          Ian Horswill and Polly the Robot
          A predecessor to Liana Lorigo's work, Ian Horswill's method differs only by assuming the ground is untextured, and searching for the first textured window while moving up the image.

          Rattler at CMU
          The Rattler trials used stereo vision, for our purposes similar to the JPL work.

          David Coombs and company at NIST
          This group uses optical flow. Each rectangle in one image is compared with a collection of rectangles in previous images to find the most similar. The robot steers away from the side with the most optical flow.

          Illah Nourbakhsh at CMU
          Three relatively narrow depth of field cameras are mounted on a robot and focused at three different distances. The three images are divided into a number of rectangles, and in each rectangle the sharpness of the image is computed.
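The text doesn't say which sharpness measure Nourbakhsh uses; one standard focus measure that fits this per-rectangle scheme is the mean squared response of a discrete Laplacian:

```python
import numpy as np

def sharpness(rect):
    """Sharpness of an image rectangle as the mean squared response of a
    discrete Laplacian; in-focus (high-frequency) content scores higher."""
    rect = rect.astype(float)
    lap = (-4 * rect[1:-1, 1:-1]
           + rect[:-2, 1:-1] + rect[2:, 1:-1]
           + rect[1:-1, :-2] + rect[1:-1, 2:])
    return (lap ** 2).mean()
```

Comparing this score for the same rectangle across the three differently-focused cameras then indicates roughly which focus distance, and hence which depth, that rectangle's content is at.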

Representation Fundamentals: The Iterated Rectangle

The common element I've chosen from the existing algorithms is the "iterated rectangle." All the existing algorithms perform some local analysis around each pixel in the image. Those are then combined over a rectangle to produce a handful of numbers. The process is repeated as the rectangle is moved horizontally or vertically.
          Specifically, for a given frame, each evolved program is invoked once for each sonar sensor. The single real number it returns is interpreted as an estimate of the sonar reading in that direction. The programs can make use of one or more "rectangle functions" (see Figure below). These functions consist of a rectangle size, a direction to iterate (either horizontal or vertical), a starting location for the rectangle (the x, y position of its center), an ending location, and an expression evaluated for each position of the rectangle. The expression has access to summary statistics of various operators such as a Sobel edge detector, a median filter and the Moravec interest operator.

          For example, a simple version of Liana Lorigo's algorithm is shown above. The root of the tree (iterate-vert) says we'll be moving a rectangle vertically. The first four arguments mean we're using a 20x20 rectangle, whose center starts at the coordinate (angle, image_max_y) and whose end is at (angle, image_max_y + 1). "Angle" is the x coordinate in the image corresponding to the current sonar direction, and the ending coordinate is equivalent to (angle, 0), i.e. the top of the image. Therefore, this rectangle will move from the bottom up, and at every location it will execute its fifth argument, the rest of the tree. The value returned by this argument is discarded.
          The "prog2" executes each of it's arguments in order. In this case, it first sets register "d" to the y position, then checks to see if 0.5 <= first-rect. Since first-rect is one the first time through (i.e. at the bottom of the image) and zero thereafter, the left branch is executed the first time, and the right branch all subsequent times.
          The left branch simply sets registers a and b to values representing summaries of the image over the 20 x 20 rectangle. For example, register a is set to the average of the Sobel gradient magnitude over those 400 pixels.
          Finally, on subsequent iterations, the remaining branch compares the stored values to freshly computed ones at the new location of the rectangle. If the difference is small (if (a - sobel_mag)^2 <= 3.2 and (b - (median_corner + moravec_horiz))^2 < 6.7) we keep iterating; otherwise we execute a "break" which stops the iteration. Note that the value in register "d" will be the y coordinate of the rectangle at the break, or the y coordinate of the highest rectangle if no break occurred.
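To make the control flow concrete, the tree just described can be transliterated into straight-line Python. The three statistics arguments are stand-ins for the real rectangle operators (each maps a window center to that operator's average over the 20x20 window), and the vertical step of one window height is my assumption:

```python
def lorigo_like(angle, sobel_mag, median_corner, moravec_horiz, image_max_y):
    """Transliteration of the evolved tree described above.  Each statistic
    argument is a function from a window center (x, y) to the average of
    that operator over the 20x20 window -- stand-ins for the real
    rectangle operators."""
    a = b = d = 0.0
    first_rect = 1.0
    y = image_max_y
    while y >= 0:                               # iterate-vert, bottom to top
        d = y                                   # prog2: remember current height
        if 0.5 <= first_rect:                   # true only for the bottom window
            a = sobel_mag(angle, y)             # summarize the "ground" window
            b = median_corner(angle, y) + moravec_horiz(angle, y)
        elif not ((a - sobel_mag(angle, y)) ** 2 <= 3.2 and
                  (b - (median_corner(angle, y) + moravec_horiz(angle, y))) ** 2 < 6.7):
            break                               # window differs from the ground
        first_rect = 0.0
        y -= 20                                 # step by one window height (assumed)
    return d                                    # y of the break, or of the top window
```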

The Result Producing Branch

In general, an individual may have up to three iteration loops, and each loop may modify up to five registers. That leaves up to fifteen values that must be combined and scaled to produce the predicted sonar reading. That is the job of the result producing branch.
          The result producing branch takes as inputs (i.e. leaves of its tree) only the final register values and fixed numerical constants. Its functions are standard mathematical functions and Automatically Defined Functions, as described in [Koza 1994, Genetic Programming II: Automatic Discovery of Reusable Programs].
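For illustration only, here is a hypothetical result-producing branch over three final registers; the coefficients, the choice of registers, and the 0-3 meter sonar range are all invented here, not taken from any evolved individual:

```python
def result_branch(regs):
    """A hypothetical result-producing branch: an arithmetic expression
    over final register values and numeric constants, clamped into an
    assumed 0-3 meter sonar range."""
    raw = 0.8 * regs["d"] + 0.1 * (regs["a"] - regs["b"]) + 0.5
    return min(max(raw, 0.0), 3.0)   # scale/clamp to the sensor's range
```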

Adventures in Navigation: The Sonar Baseline

Abstract: The learning framework in this thesis endeavors to evolve algorithms that predict sonar readings from vision and dead reckoning. Therefore, we need to collect video, sonar and position information to use as training data, and we want to compare performance to sonar based navigation. To collect a stream that's representative of what the cameras might see during visual navigation, the robot collects data while navigating under sonar.
          While obstacle avoidance under sonar is easier than under vision, it still took many attempts to get a working system. The method that proved most successful determines speed based on proximity to the nearest object, and determines direction of travel by fitting lines to points on the left and right sides of the robot.


When embarking on this thesis, it was pointed out to me that robot navigation using vision performs worse than navigation from sonar. (In this thesis, "navigation" simply means "moving while avoiding obstacles." Also, all motion has some forward component; currently, the robot can't get out of a dead end.) As such, I'm using navigation from sonar as my baseline. This was reason enough to implement sonar navigation, but there was another.
          With any learning technique, it's important for the training data to be as representative as possible. For example, if the training data for this thesis were collected while moving straight down the middle of a corridor, then the center of the image would almost always be far away. Simply assuming that the forward direction is always far away would get you pretty far. However, when the robot is navigating on its own, as soon as it turns away from straight forward, this assumption will be wrong. It will be in a circumstance it has never encountered during training, and will most likely screw up.
          As such, I wanted to collect data while it was moving and avoiding obstacles using vision. Of course, this is a catch-22: I want it to learn to navigate before collecting data, but it needs the data to learn to navigate. Instead, I chose the next best thing: to collect data while navigating using sonar.

The Selected Algorithm

          The method that proved most successful looked only at objects near the robot (see Figure below). The union of all the colored areas is called the "area of attention"; any objects outside of it are completely ignored. The selected algorithm starts by dividing the environment into six regions, "front" and "sides" of "near," "middle" and "far." For this "case-based" approach I am indebted to Illah Nourbakhsh [Nourbakhsh 2000, Property Mapping: A Simple Technique for Mobile Robot Programming]. The regions have the following semantics:

          If the robot is panic halted for more than a second, it picks a direction to turn (left or right), then keeps turning in this direction until the panic halt clears, i.e. until there are no obstacles in the "near" region. If the robot isn't panic halted, it looks for readings in "middle sides." If it finds any, then for each side, it chooses the two closest to its center line and fits a line to them (see Figure below). If both lines are diverging from the robot, it faces straight ahead. If one line is diverging and the other is converging, the robot faces parallel to the converging line. Finally, if both lines are converging, the robot aims for the intersection point. A side with no readings counts as a diverging line.
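A minimal sketch of the line-fitting rule just described, in robot coordinates with x forward and y lateral (positive to the left); the coordinate conventions and degenerate-case handling are my assumptions, not the thesis implementation:

```python
import math

def fit_line(pts):
    """Fit a line y = m*x + b through the two readings closest to the
    robot's centerline (smallest |y|); returns None with fewer than two
    readings, which the rule above treats as a diverging line."""
    if len(pts) < 2:
        return None
    (x1, y1), (x2, y2) = sorted(pts, key=lambda p: abs(p[1]))[:2]
    if x1 == x2:
        return None                      # degenerate vertical line: ignore
    m = (y2 - y1) / (x2 - x1)
    return m, y1 - m * x1

def converging(line, side):
    """A wall converges if its lateral distance shrinks with forward
    distance x: negative slope on the left (y > 0), positive on the right."""
    if line is None:
        return False
    m, _ = line
    return m < 0 if side == "left" else m > 0

def choose_heading(left_pts, right_pts):
    """Heading in radians (0 = straight ahead, positive = left), applying
    the rule in the text: both diverging -> straight ahead; one converging
    -> parallel to it; both converging -> aim at their intersection."""
    L, R = fit_line(left_pts), fit_line(right_pts)
    cl, cr = converging(L, "left"), converging(R, "right")
    if not cl and not cr:
        return 0.0
    if cl != cr:
        m, _ = L if cl else R
        return math.atan(m)              # face parallel to the converging wall
    (m1, b1), (m2, b2) = L, R
    xi = (b2 - b1) / (m1 - m2)           # where the two walls meet
    return math.atan2(m1 * xi + b1, xi)
```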

          The Uranus robot has a three-degree-of-freedom base; that is, it doesn't have to move in the direction it's facing, but can move sideways or at an angle. In fact, the direction of motion can be chosen independently of the direction the robot faces. In every other case the robot always moves in the direction it faces, but when there are readings in "middle sides" it uses a potential field to help guide it. This is a great help when navigating through doorways.
          If the middle sides are clear but there's an object either far or middle front, the robot classifies the readings within the area of attention as "obstacles," and those outside as "non-obstacles." It faces the largest gap, i.e. the largest number of contiguous non-obstacle sonars. If the only sonar returns are from the "far sides" area, it fits lines to both sides, finds the closest line, and faces parallel to that.
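The "largest gap" step reduces to finding the longest run of non-obstacle sonars around the ring of sensors; a minimal sketch (the obstacle classification itself and any tie-breaking are not shown):

```python
def largest_gap(obstacle):
    """Given a circular list of booleans (True = sonar classified as an
    obstacle), return (start_index, length) of the longest contiguous run
    of non-obstacle sonars, treating the list as a ring; None if every
    sonar sees an obstacle."""
    n = len(obstacle)
    if all(obstacle):
        return None
    best_start, best_len = 0, 0
    run_start, run_len = None, 0
    for i in range(2 * n):               # two passes to catch wraparound runs
        if not obstacle[i % n]:
            if run_start is None:
                run_start = i % n
            run_len = min(run_len + 1, n)  # a gap can't exceed the ring
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_start, run_len = None, 0
    return best_start, best_len
```

The robot would then face the bearing of the middle of the returned run.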


This "straight line model" of the environment works well when navigating down hallways, quickly aligning itself with the walls, yet moving away from them if it becomes too close (i.e. in the middle region.) It also turns corners well, the biggest gap usually corresponding to the direction that the hallway bends. Occasionally it turns the wrong way for a second or so before "changing it's mind," but never gets stuck in a corner.
          This straight line model also works well in our cluttered lab or when navigating through doorways, even the doorway out of the lab with a cluttered table on one side. Because the line is fit to only two readings, it seems to work more like a derivative, a local tendency in the area of most danger. The robot can usually navigate out of the lab and into the hallway beyond. It should be kept in mind that this is particularly difficult for our robot, since it's only a little narrower than the door, and when turned at even a slight angle its rectangular base will bump the doorframe.

Initial Results

The initial results aren't very interesting. Although I haven't analyzed them in depth, the first runs seem to converge on a constant that approximates the average sonar reading. This is expected, as there are many options to tweak. Most individuals don't use the results of the iterations at all, and often those that do use registers that haven't been touched. The most obvious way to change that is to increase both the population size and the size of the initially created individuals.
          The system hasn't been fully tested, so it's quite possible that bugs remain. My next step will be to implement some existing algorithms, such as the one in Figure 1. It will be interesting to compare this with what the learning discovers, or to seed the learning with it. The number of possible experiments is large, and it's hard to say where they go or what it will take to make this system a success. This is without a doubt the most interesting phase of my thesis work, and I'm very much looking forward to it.


References

          Koza, J.R. (1994). Genetic Programming II: Automatic Discovery of Reusable Programs. Cambridge: MIT Press.
          Nourbakhsh, I. (2000). Property Mapping: A Simple Technique for Mobile Robot Programming. Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000).