Edge Detection
Edge detection is used in computer vision, typically as a first step in feature detection and feature extraction. It refers to a body of algorithms that identify pixels in an image where the brightness changes sharply, indicating an edge. Edge detection has a very long history in computer vision, and many methods have been developed. This section will skim the surface and cover only the most widely discussed approaches. Further reading can be found at Wikipedia.
What Are Edges and Why Do We Care?
Intuitively, edges are areas of an image where there is a rapid change in local brightness. These rapid changes can be due to many factors: changes in material properties (e.g. color), changes in lighting (e.g. a shadow), or underlying geometry (e.g. occlusions). Edges may appear as a step edge, a ramp edge, or a roof edge, to name a few common forms.
It is important to realize that what we might want to detect as an edge can vary widely depending on our ultimate objectives and the likely range of imaging conditions. As usual, part of the art of engineering a vision application is to define what is actually needed, and to adapt techniques to that need.
Edge Scale Space
In our previous chapter on filtering, we already saw the notion of scale space. You should review the material on scale space.
To reiterate what was said above, the notion of which edges are important in an image is highly subjective. Suppose, for example, we are looking at a picture of a house sitting on a sea of grass. At a fine scale, we could detect edges corresponding to every blade of grass. If we zoom out a bit, we might start to focus on the edges formed by the shingles and the bricks. Finally, at a very large scale, we would see the edges formed by the walls of the building.
Canny Edge Detection
There is a good overview of Canny edge detection at Wikipedia.
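As a concrete illustration, here is a minimal sketch that runs an existing Canny implementation, assuming the OpenCV Python bindings are available; the filename and threshold values are placeholders, not anything referenced in this text.

```python
# Minimal Canny edge detection sketch using OpenCV (assumes opencv-python is installed).
# "house.png" is a placeholder filename, not an image referenced in this text.
import cv2

img = cv2.imread("house.png", cv2.IMREAD_GRAYSCALE)

# Smooth slightly to suppress noise before differentiation.
blurred = cv2.GaussianBlur(img, (5, 5), 1.0)

# Low and high hysteresis thresholds; typical starting values, tune for your images.
edges = cv2.Canny(blurred, 50, 150)

cv2.imwrite("house_edges.png", edges)
```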
Texture Features
As noted above, the idea of detecting features is to isolate and represent areas of an image that carry high information content, however we choose to define that term. Edges are an example of extracting geometric information from images: the process of edge detection starts with a set of pixels and ends with a curve in the plane. However, one might ask whether it is wise to throw away all of the information contained in the pixels themselves. Here, we explore the process of instead capturing that local pixel information.
Templates and template matching
Suppose our goal is to be able to determine if two images have a common subregion — that is, they both contain an area with similar appearance. Good solutions to this problem would have profound implications for computer vision, as it would mean we could use the matching of subregions as a basis for many applications, including computational stereo, image mosaicking, and object recognition, to name three examples.
Our goal in this subsection is to explore several common methods for performing region matching in images, and also to illustrate some of the challenges in doing so. This discussion will set the stage for later chapters, where we take feature detection and matching as a given.
To begin, let us consider two distinct images <math>I_1 </math> and <math>I_2 </math>. We will assume that some region <math>R_1 = \{q_1, q_2, \ldots q_n\} </math> is identified in <math>I_1 </math>, and our goal is to determine whether there is a displacement <math> d\in \Re^2 </math> defining <math>R_2 = R_1 + d = \{q_1+d, q_2+d, \ldots q_n+d\}</math> such that <math>R_1 </math> in <math>I_1 </math> represents an image of the same patch of surface as <math>R_2 </math> in <math>I_2. </math> This is a fairly simplistic and somewhat imprecise formulation of the problem, but it will do for now to illustrate several important methods and related issues.
For now, we take the images in question to be gray-scale — that is, each pixel is a single scalar brightness value. Furthermore, suppose that both <math>I_1 </math> and <math>I_2 </math> follow the additive noise model described in the previous section, that is
<math>I_x(q) = I(q,t_x) = I^*(q, t_x) + n(q,t_x),\ x=1,2 </math>
Restating our problem, we posit that, if there is a matching region in <math>I_2 </math>, then there is a value <math>d^* </math> so that
<math>I^*(q,t_1) = I^*(q+d^*,t_2)\ \forall q \in R_1 </math>
Of course, we don’t have access to <math>I^*, </math> but only its noise-contaminated observation <math>I. </math> But, if <math> n(q,t) </math> is zero-mean white Gaussian noise with variance <math>\sigma^2 </math> (that is, spatially and temporally i.i.d. <math>N(0,\sigma^2) </math> Gaussian), then it follows that
<math> I_1(q) - I_2(q+d^*) \sim N(0, 2\sigma^2) </math>
Of course, we don’t know <math>d^*</math>. But, if we sum over the region, we can define
<math>SSD(d) = \sum_{q\in R_1} (I_1(q) - I_2(q+d))^2/n </math>
The SSD here stands for Sum of Squared-Differences. This is a widely used measure of similarity. It is not hard to show that
<math>E(SSD(d)) \ge 2 \sigma^2 \forall d </math>
and most importantly that
<math>E(SSD(d)) = 2 \sigma^2 </math> if <math>d = d^*. </math>
As a result, a good procedure for finding the location of a matching region is simply to find a value <math>\hat{d} </math> (we use the notation <math>\hat{d} </math> to represent an estimate of the true value <math>d^* </math>) such that
<math>SSD(\hat{d}) = \min_d SSD(d) </math>
If our assumptions are true, then the odds that an incorrect value of <math>\hat{d} </math> would achieve a better SSD measure than the correct value <math>d^* </math> seem small. In fact, under suitable assumptions about the image, we can even put bounds on this value. However, we leave this as an exercise.
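To make the search concrete, here is a minimal brute-force sketch in NumPy; the function names and the exhaustive search over every candidate position are illustrative choices, not part of the text.

```python
import numpy as np

def ssd(patch1, patch2):
    """Mean of squared differences between two equally sized patches (SSD/n above)."""
    diff = patch1.astype(np.float64) - patch2.astype(np.float64)
    return np.mean(diff ** 2)

def match_region_ssd(I1, I2, top, left, h, w):
    """Exhaustively search I2 for the displacement d minimizing SSD against the
    h-by-w region of I1 whose upper-left corner is (top, left)."""
    template = I1[top:top + h, left:left + w]
    best_d, best_score = None, np.inf
    H2, W2 = I2.shape
    for r in range(H2 - h + 1):          # candidate row of the region in I2
        for c in range(W2 - w + 1):      # candidate column of the region in I2
            score = ssd(template, I2[r:r + h, c:c + w])
            if score < best_score:
                best_score, best_d = score, (r - top, c - left)
    return best_d, best_score
```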
So, it seems our problem is solved. Well, not quite so fast. Recall our discussion of lighting earlier. Suppose the surface being imaged is planar, Lambertian, and viewed under a point light source. Then, the measured brightness of the patch will vary, depending on the location of the source, by a constant gain factor <math>c </math> (can you prove this?). Furthermore, recall that cameras often modify the shutter, exposure, and brightness of the acquired image. So, in fact, a better model for our problem is that images are formed by
<math>I_x(q) = I(q,t_x) = c_x I^*(q, t_x) + a_x + n(q,t_x),\ x=1,2 </math>
This will create a potential problem: it is quite possible that the correct image region will have a larger SSD value (due to its particular choices of gain and brightness offset) than some incorrect choice.
We can fix this problem as follows. First, compute
<math>\bar{I}_1 = \sum_{q\in R_1} I_1(q)/n </math> and <math>\bar{I}_2(d) = \sum_{q\in R_1} I_2(q+d)/n </math>.
and define
<math>\tilde{I}_1(q) = I_1(q) - \bar{I}_1 </math> and <math> \tilde{I}_2(q + d) = I_2(q+d) - \bar{I}_2(d) </math>
Notice that now, any constant offset in <math>I_1 </math> or <math>I_2 </math> will be removed in the corresponding <math>\tilde{I} </math>.
Now, compute
<math>\bar{\sigma}^2_1 = \sum_{q\in R_1} \tilde{I}_1(q)^2/n </math> and <math>\bar{\sigma}^2_2(d) = \sum_{q\in R_1} \tilde{I}_2(q+d)^2/n </math>.
Finally, we modify our SSD measure to be
<math>SSD(d) = \sum_{q\in R_1} \left(\frac{\tilde{I}_1(q)}{\bar{\sigma}_1} - \frac{\tilde{I}_2(q+d)}{\bar{\sigma}_2(d)}\right)^2/n </math>
We leave it as an exercise to show that this version of SSD is now invariant to (uniform) changes in brightness and contrast.
If we expand the expression, we find that we have
<math>SSD(d) = \sum_{q\in R_1} \frac{\tilde{I}_1(q)^2/n}{\bar{\sigma}_1^2} + \sum_{q\in R_1} \frac{\tilde{I}_2(q+d)^2/n}{\bar{\sigma}_2^2(d)} - \frac{2}{n}\sum_{q\in R_1}\frac{\tilde{I}_1(q)\tilde{I}_2(q+d)}{\bar{\sigma}_1\bar{\sigma}_2(d)} = 2 - \frac{2}{n}\sum_{q\in R_1}\frac{\tilde{I}_1(q)\tilde{I}_2(q+d)}{\bar{\sigma}_1\bar{\sigma}_2(d)} </math>
where the first two sums each equal 1 by the definitions of <math>\bar{\sigma}_1^2 </math> and <math>\bar{\sigma}_2^2(d). </math>
Note that this is equivalent to maximizing
<math>ZNCC(d) = \sum_{q\in R_1}\frac{\tilde{I}_1(q)\tilde{I}_2(q+d)}{\bar{\sigma}_1\bar{\sigma}_2(d)} </math>
We have seen this type of expression before. Recall that we considered convolution as a pattern-matching operation, by thinking of the kernel and the neighborhood of a pixel each as vectors. Here, we have done the same: we are effectively computing the dot product between two vectors, each normalized by its <math>\bar{\sigma} </math>. The unnormalized version of this quantity is referred to as the correlation between the two signals, and the normalized version as the normalized cross-correlation (NCC). Furthermore, we operate on zero-mean signals (due to the mean subtraction earlier), and so we call this the zero-mean normalized cross-correlation (ZNCC).
These variations on correlation — pure correlation, zero-mean cross-correlation, and zero-mean normalized cross-correlation — comprise a new family of similarity measures.
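As a sketch, the ZNCC of two equally sized patches can be computed as below (a minimal NumPy version). Note that it includes a 1/n factor so the score lies in [-1, 1]; since n is fixed, it ranks displacements identically to the ZNCC expression above.

```python
import numpy as np

def zncc(patch1, patch2, eps=1e-9):
    """Zero-mean normalized cross-correlation of two equally sized patches.
    Invariant to uniform brightness (offset) and contrast (gain) changes."""
    a = patch1.astype(np.float64)
    b = patch2.astype(np.float64)
    a -= a.mean()                       # remove brightness offset
    b -= b.mean()
    sigma_a = np.sqrt(np.mean(a ** 2))  # sigma-bar as defined above
    sigma_b = np.sqrt(np.mean(b ** 2))
    return np.mean(a * b) / (sigma_a * sigma_b + eps)
```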
Finally, we note that we still had to make quite strong assumptions about the imaged regions to derive these measures. In practice, regions are often specular and non-planar, and so they exhibit strikingly non-uniform brightness variation. It is not hard to show that squaring the difference of image values is very sensitive to this type of variation. Thus, a slightly more robust variant is the sum of absolute differences (SAD), given by
<math>SAD(d) = \sum_{q\in R_1} | I_1(q) - I_2(q+d) |/n </math>
It is not hard to derive the equivalent zero-mean SAD (ZSAD) and zero-mean normalized SAD (ZNSAD) measures. Of these, ZSAD is probably the most widely used variant.
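For completeness, a minimal ZSAD sketch in the same style as the earlier snippets; the function name and averaging by the patch size are illustrative choices.

```python
import numpy as np

def zsad(patch1, patch2):
    """Zero-mean sum of absolute differences (ZSAD), averaged over the region."""
    a = patch1.astype(np.float64)
    b = patch2.astype(np.float64)
    diff = (a - a.mean()) - (b - b.mean())  # remove per-patch brightness offsets
    return np.mean(np.abs(diff))
```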
Texture Feature Selection
Read about the Harris & Stephens corner detection algorithm here and also about the multi-scale Harris operator here.
Describing Features using Histograms
One of the challenges in feature description is the fact that almost any motion in the world will tend to change the geometry and intensity of the pixels in a region. As a result, most of the template matching methods described previously will fail quickly. There have been a variety of attempts to make features robust to this type of variation. A popular method is to use either histograms or spatially weighted histograms of feature values. Here, we briefly describe this approach to representation, and some of the matching methods that can be applied to histograms of values.
To start with something simple, suppose we simply use an intensity histogram over a region <math> R </math> containing <math> n </math> pixels. Let <math> \tau </math> be a bin radius and <math> a_1, \ldots, a_k </math> be <math> k </math> bin centers. We can then define the binning function <math> B_a(x) = 1 </math> if <math> a - \tau < x \le a+\tau </math>, and <math> B_a(x) = 0 </math> otherwise. Then the value of the histogram at bin <math> a </math> is
<math> h(a; R) = \sum_{q \in R} B_a(I(q))/n </math>
Assuming that our bins cover the entire intensity space, it is easy to see that <math> \sum_i h(a_i) = 1 </math>
A slightly more general form of this is to assume that there is a spatial weighting kernel <math> K </math> centered at some location <math> c </math>. For example, <math> K </math> might be a Gaussian function with mean <math> c </math>. Let <math> L </math> denote the set of all locations in an image. Then we can write
<math> h(a; c) = \sum_{q \in L} K(q-c) B_a(I(q))/N(c) </math> where <math> N(c) = \sum_{q \in L} K(q-c) </math>
Note our previous formulation is a special case where the kernel function is a constant-valued mask that defines the region of interest. However, the more general formulation allows us to weight values that are nearer the center of the region more heavily than those further away, for example.
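A minimal sketch of the kernel-weighted histogram, assuming a Gaussian kernel for <math> K </math> and uniform intensity bins; the bin layout, sigma, and function name are illustrative assumptions.

```python
import numpy as np

def weighted_intensity_histogram(I, center, bin_edges, sigma=5.0):
    """Gaussian-weighted intensity histogram h(a; c) around a center pixel c.
    The Gaussian kernel and the bin layout are illustrative choices."""
    H, W = I.shape
    rows, cols = np.mgrid[0:H, 0:W]
    d2 = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    K = np.exp(-d2 / (2.0 * sigma ** 2))            # spatial weighting kernel K(q - c)
    w = K.ravel()
    bins = np.digitize(I.ravel(), bin_edges) - 1    # bin index of each pixel's intensity
    hist = np.zeros(len(bin_edges) - 1)
    for i in range(len(hist)):
        hist[i] = w[bins == i].sum()                # accumulate K(q - c) B_a(I(q))
    return hist / w.sum()                           # normalize by N(c)

# Example: 16 uniform bins over an 8-bit intensity range, centered at pixel (40, 60).
# h = weighted_intensity_histogram(I, center=(40, 60), bin_edges=np.linspace(0, 256, 17))
```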
Next, we consider a histogram of the gradients in a region. Recall that each pixel location has a gradient value that can be computed using a derivative of Gaussian filter. Let <math>g(q) \in \Re^2</math> denote the gradient vector at a pixel location <math>q. </math>
Note that taking the gradient immediately removes any constant brightness offset in the signal. However, the magnitude of the gradient will still vary with the contrast in the image.
Now, let <math>\theta(g(q)) </math> represent the angle that <math>g(q) </math> makes with the image <math>u </math> axis. Let us then construct a histogram on a region <math>R </math> of <math>n </math> pixel locations by computing
<math>h(a) = \sum_{q\in R} B_a(\theta(g(q)))/n </math>
where <math>B_a(\theta) </math> is again a binning function for the histogram value <math> a </math>.
We compute this for all values <math>a. </math> We then do one other thing. We find the value of <math>a </math> for which <math>h </math> is largest, and we perform a circular rotation of the vector to place <math>a </math> at the beginning of the vector.
The resulting vector <math>h </math> has the following properties:
- It is invariant to spatial reorderings of pixels
- It is invariant to rotation and reflection
- It is insensitive to corruption of small groups of pixels
- It can be made invariant to both brightness and contrast changes using the same normalization tricks as before
In short, it is a very robust descriptor.
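Here is a minimal sketch of such an orientation-histogram descriptor, using derivative-of-Gaussian filters from SciPy; the bin count, sigma, and function name are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def orientation_histogram(I, n_bins=16, sigma=1.0):
    """Histogram of gradient orientations over a region, circularly shifted so
    that the dominant orientation bin comes first (rough rotation invariance)."""
    I = I.astype(np.float64)
    gu = gaussian_filter(I, sigma, order=(0, 1))   # derivative of Gaussian along u (columns)
    gv = gaussian_filter(I, sigma, order=(1, 0))   # derivative of Gaussian along v (rows)
    theta = np.arctan2(gv, gu)                     # angle of g(q) with the u axis
    hist, _ = np.histogram(theta, bins=n_bins, range=(-np.pi, np.pi))
    hist = hist / max(hist.sum(), 1)               # normalize so the bins sum to 1
    return np.roll(hist, -int(np.argmax(hist)))    # rotate the dominant bin to the front
```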
If we have two histograms, there are a variety of metrics for comparing them. Let <math>A </math> and <math>B </math> be two histograms with <math>n </math> bins. Three common metrics are:
<math> SSD(A,B) = \sum_{i=1}^{n} (A_i - B_i)^2 </math>
<math> NCC(A,B) = \sum_{i=1}^{n} A_i \cdot B_i </math>
<math> INTERSECTION(A,B) = \sum_{i=1}^{n} \min(A_i, B_i) </math>
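All three metrics are one-liners for histograms stored as arrays; a minimal sketch (the function names are illustrative):

```python
import numpy as np

def hist_ssd(A, B):
    """Sum of squared differences between two histograms (smaller is more similar)."""
    return np.sum((A - B) ** 2)

def hist_ncc(A, B):
    """Correlation-style score: sum of bin-wise products (larger is more similar)."""
    return np.sum(A * B)

def hist_intersection(A, B):
    """Histogram intersection (larger is more similar)."""
    return np.sum(np.minimum(A, B))
```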
SIFT Features
SIFT (Scale-Invariant Feature Transform) is an method developed by David Lowe at the University of British Columbia. It brings together several of the ideas discussed above, and develops them into what has become the state of the art in feature detection and representation. A good description of SIFT is found here.
Image Registration Methods Using Feature Matching
Image registration is the process of locating a geometric transformation that brings images into alignment (i.e. they appear similar under the chosen transformation). The literature on image registration is enormous and is impossible to treat in this brief introduction. A brief overview of feature matching methods for registration is found here. In this section, our main goal is to introduce the problem, and to describe a couple of common techniques for solving it.
Image Registration Formally Defined
Given two images <math> I_1 </math> and <math> I_2 </math>, a family of spatial transformations <math> T </math> with parameters <math> \mu </math>, and an image similarity measure <math> S </math>, find the value <math> \mu^* </math> such that
<math> S(I_1,T(I_2,\mu^*)) = \max_\mu S(I_1,T(I_2,\mu)) </math>
For example, <math> T </math> might be translation and rotation of the image, and <math> S </math> might be normalized cross-correlation.
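As a minimal illustration of this formulation, the sketch below restricts <math> T </math> to integer translations and uses ZNCC as <math> S </math>; the translation-only family, equal image sizes, and the search range are simplifying assumptions for illustration.

```python
import numpy as np

def zncc_score(A, B, eps=1e-9):
    """ZNCC of two equally sized windows (see the similarity measures above)."""
    a = A - A.mean()
    b = B - B.mean()
    return np.mean(a * b) / (a.std() * b.std() + eps)

def register_translation(I1, I2, max_shift=20):
    """Exhaustively search integer translations mu = (dr, dc) of I2 and return
    the one maximizing ZNCC with I1 over the overlapping window.
    Assumes I1 and I2 have the same shape."""
    H, W = I1.shape
    best_mu, best_s = (0, 0), -np.inf
    for dr in range(-max_shift, max_shift + 1):
        for dc in range(-max_shift, max_shift + 1):
            h, w = H - abs(dr), W - abs(dc)
            if h <= 0 or w <= 0:
                continue
            r1, c1 = max(0, dr), max(0, dc)     # overlap window in I1
            r2, c2 = max(0, -dr), max(0, -dc)   # overlap window in the shifted I2
            s = zncc_score(I1[r1:r1 + h, c1:c1 + w].astype(float),
                           I2[r2:r2 + h, c2:c2 + w].astype(float))
            if s > best_s:
                best_s, best_mu = s, (dr, dc)
    return best_mu, best_s
```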
Voting Methods for Solving Image Registration
One of the major problems in feature-based computer vision is the fact that feature matches are almost always prone to two types of error: 1) localization error due to noise or similar corruption of the image; and 2) match error due to noise, image corruption, similar-appearing regions, and so forth. The former is readily dealt with using traditional estimation methods. The latter is a more difficult problem. A typical approach is to make use of a voting method to find solutions that are supported consistently by the data. We will look at two common voting methods: the Hough transform (now mostly of historical interest) and RANSAC (widely used and constantly improved).
Read about the RANSAC method.
Read about the Hough transform method.
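To make the RANSAC idea concrete, here is a minimal sketch for the simplest case: estimating a pure 2D translation from putative feature matches, where some matches are gross mismatches. The function name, parameters, and the translation-only model are illustrative assumptions.

```python
import numpy as np

def ransac_translation(p1, p2, n_iters=200, tol=3.0, seed=None):
    """RANSAC estimate of a 2D translation d mapping points p1 (N x 2) onto
    their putative matches p2 (N x 2), tolerating mismatches (outliers)."""
    rng = np.random.default_rng(seed)
    n = len(p1)
    best_d, best_inliers = None, np.zeros(n, dtype=bool)
    for _ in range(n_iters):
        i = rng.integers(n)                 # minimal sample: one match fixes d
        d = p2[i] - p1[i]
        residuals = np.linalg.norm((p1 + d) - p2, axis=1)
        inliers = residuals < tol           # consensus set for this hypothesis
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_d = inliers, d
    # Refine the winning hypothesis on its consensus set.
    if best_d is not None and best_inliers.any():
        best_d = (p2[best_inliers] - p1[best_inliers]).mean(axis=0)
    return best_d, best_inliers
```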
An example: object detection using SIFT
We can now put together everything we’ve learned to attack an interesting problem: recognizing objects in different poses using SIFT features. This is a simplification of the ideas used in David Lowe’s paper (and in general in applications using SIFT), but will serve to illustrate the basic ideas. Here is our problem statement:
Prior to beginning, we are presented with a set of <math> n </math> objects with labels <math> \Omega = \{O_1, O_2, \ldots O_n\}. </math> From each of these objects, we have images <math>I_{i,j} </math> with <math>i=1,2\ldots n </math> and <math>j = 1,2,\ldots m. </math> We perform SIFT feature detection in each of these images, yielding feature sets <math>\Gamma = \{ F_{i,j} \}. </math>
In operation, we are presented with a new image <math>I^* </math> with features <math>F^*. </math> In this image, we do not know the identity of the object(s) in question, nor their position, orientation, or scale. Let <math>\mu </math> denote the pose parameters (two for position, one for orientation, one for scale). For now, we assume there is exactly one object in the image (we’ll return to generalizations to zero, one, or more objects in a moment). Then, our problem is to determine the pair <math> (O_i,\mu) </math> from <math>F^*. </math>
We can attack this problem in a couple of different ways. In the original work by David Lowe, a Hough transform was used. In this case, we find one (or more) matches for each feature in our new image. Each of these matches carries with it a location, scale, orientation, and object identity. We can then use a discrete grid over this 5-d space, and look for bins receiving large numbers of votes.
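The following sketch shows the coarse voting step, assuming each putative feature match has already been converted into the pose it implies for its object; the dictionary layout, bin widths, and function name are illustrative assumptions, not Lowe’s actual parameters.

```python
import numpy as np
from collections import defaultdict

def hough_pose_votes(matches, pos_bin=32.0, scale_bin=2.0, angle_bin=np.pi / 6):
    """Coarse Hough-style voting over (object, x, y, scale, orientation) bins.
    Each match is assumed to be a dict giving the object it came from and the
    pose (x, y, scale, angle) it implies for that object in the new image."""
    votes = defaultdict(int)
    for m in matches:
        key = (m["object"],
               int(m["x"] // pos_bin),                                 # coarse position bins
               int(m["y"] // pos_bin),
               int(round(np.log2(m["scale"]) / np.log2(scale_bin))),   # octave-sized scale bins
               int((m["angle"] % (2 * np.pi)) // angle_bin))           # orientation bins
        votes[key] += 1
    # The bin with the most votes is the (object, pose) hypothesis best
    # supported by the feature matches.
    return max(votes.items(), key=lambda kv: kv[1])
```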
A second approach would be to use RANSAC. In this case, we could take pairs of matches arising from the same object and view. Each pair allows a computation of position, orientation, and scale using only the match locations. We can then use this registration to map the entire feature set of that view into the image space, and see how many other features align with the image.
It is worth pointing out that there are several variations on this theme. For example, the limitation to rotation and scale means that our method will be sensitive to out-of-plane rotation. We will return to this idea later on, once we’ve established a bit more background in projective geometry. We may also want to use some other similarity measure than SIFT features for our final decision.