Obstacle Avoidance and Navigation in the Real World by a Seeing Robot Rover, Hans Moravec, 1980
<-- Previous  Next -->

## Chapter 6: Correlation

Deducing the 3D location of features from their projections in 2D images requires that we know their position in two or more such images.

The correlator is a subroutine that, given a description of a feature as produced by the interest operator from one image, finds the best match in a different, but similar, image. Its search area can be the entire new picture, or a rectangular sub-window. Figure 6.1: Areas matched in a binary search correlation. Picture at top contains originally chosen feature. The outlined areas in it are the prototypes which are searched for in the bottom picture. The largest rectangle is matched first, and the area of best match in the second picture becomes the search area for the next smaller rectangle. The larger the rectangle, the lower the resolution of the pictures in which the matching is done.  Figure 6.2: The “conventional” representation of a feature used in documents such as this one, and a more realistic version which graphically demonstrates the reduced resolution of the larger windows. The bottom picture was reconstructed entirely from the window sequence used with a binary search correlation. The coarse outer windows were interpolated to reduce quantization artifacts.

The search uses a coarse to fine strategy, illustrated in Figure 6-1, that begins in reduced versions of the pictures. Typically the first step takes place at the $\times 16$ (linear) reduction level. The $6 \times 6$ window at that level in the feature description, that covers about one seventh of the total area of the original picture, is convolved with the search area in the correspondingly reduced version of the second picture. The $6 \times 6$ description patch is moved pixel by pixel over the approximately $15$ by $16$ destination picture, and a correlation coefficient is calculated for each trial position.

The position with the best match is recorded. The $6 \times 6$ area it occupies in the second picture is mapped to the $\times 8$ reduction level, where the corresponding region is $12$ pixels by $12$. The $6 \times 6$ window in the $\times 8$ reduced level of the feature description is then convolved with this $12$ by $12$ area, and the position of best match is recorded and used as a search area for the $\times 4$ level.

The process continues, matching smaller and smaller, but more and more detailed windows until a $6 \times 6$ area is selected in the unreduced picture.

The work at each level is about the same, finding a $6 \times 6$ window in a $12$ by $12$ search area. It involves 49 summations of 36 quantities. In our example there were 5 such levels. The correlation measure used is ${2\sum ab}/({\sum a^2}+{\sum b^2})$, where $a$ and $b$ are the values of pixels in the two windows being compared, with the mean of windows subtracted out, and the sums are taken over the $36$ elements of a $6 \times 6$ window. The measure has limited tolerance to contrast differences.

The window sizes and other parameters are sometimes different from the ones used in this example.

In general, the program thus locates a huge general area around the feature in a very coarse version of the images, and successively refines the position, finding smaller and smaller areas in finer and finer representations. For windows of size $n$, the work at each level is approximately that of finding an $n$ by $n$ window in a $2n$ by $2n$ area, and there are $\log_2(w/n)$ levels, where $w$ is the smaller dimension of the search rectangle, in unreduced picture pixels.

This approach has many advantages over a simple pass of of a correlation coefficient computation over the search window. The most obvious is speed. A scan of an $8 \times 8$ window over a $256$ by $256$ picture would require $249 \times 249 \times 8 \times 8$ comparisons of individual pixels. The binary method needs only about $5 \times 81 \times 8 \times 8$, about $150$ times fewer. The advantage is lower for smaller search areas. Perhaps more important is the fact that the simple method exhibits a serious jigsaw puzzle effect. The $8 \times 8$ patch is matched without any reference to context, and a match is often found in totally unrelated parts of the picture. The binary search technique uses the general context to guide the high resolution comparisons. This makes possible yet another speedup, because smaller windows can be used. Window sizes as small as $2 \times 2$ work reasonably well. The searches at very coarse levels rarely return mismatches, possibly because noise is averaged out in the reduction process, causing comparisons to be more stable. Reduced images are also more tolerant of geometric distortions. Figure 6.3: Example of the correlator's performance on a difficult example. The interest operator has chosen features in the upper image, and the correlator has attempted to find corresponding regions in the lower one. The cart moved about one and a half meters forward between the images. Some mistakes are evident. The correlator had no a-priori knowledge about the relationship of the two images and the entire second image was searched for each feature. Figure 6.4: An outdoor application of the binary search correlator

The current routine uses a measure for the measure for the cross correlation which I call pseudo normalized, given by the formula $${2 \sum{ab} \over \sum{a^2} + \sum{b^2}}$$ that has limited contrast sensitivity, avoids the degeneracies of normalized correlation on informationless windows, and is slightly cheaper to compute. A description of its derivation may be found in Appendix 6.

### Timing

The formula above is expressed in terms of $A$ and $B$ with the means subtracted out. It can be translated into an expression involving $\sum{A}$, $\sum{A^2}$, $\sum{B}$, $\sum{B^2}$ and $\sum{(A-B)^2}$. By evaluating the terms involving only $A$, the source window, outside of the main correlation loop, the work in the inner loop can be reduced to evaluating $\sum{B}$, $\sum{B^2}$ and $\sum{(A-B)^2}$. This is done in three PDP-10 machine instructions per point by using a table in which entry $i$ contains both $i$ and $i^2$ in subfields, and by generating in-line code representing the source window, three instructions per pixel, eliminating the need for inner loop end tests and enabling the $A-B$ computation to be done during indexing.

Each pixel comparison takes about one microsecond. The time required to locate an $8 \times 8$ window in a $16$ by $16$ search area is about $10$ milliseconds. A single feature requires $5$ such searches, for a total per feature time of $50$ ms.

One of the three instructions could be eliminated if $\sum{B}$ and $\sum{B^2}$ were precomputed for every position in the picture. This can be done incrementally, involving examination of each pixel only twice, and would result in an overall speedup if many features are to be searched for in the same general area.

The correlator has approximately a 10% error rate on features selected by the interest operator in our sample pictures. Typical image pairs are generally taken about two feet apart with a $60$° field of view camera.

<-- Previous  Next -->