Image Processing

The first steps in almost any computer vision algorithm usually involve some transformation of the raw image pixels to enhance certain desirable properties, and to suppress undesirable properties. This is referred to as image “filtering. This chapter describes several of the more common filtering operations used in computer vision.

Linear Filtering

To begin, consider a single pixel in an image. In this chapter and in all that follows, we will use <math> I(u,v,t) </math> to denote the value of an image at location <math> (u,v) </math> at the time <math> t </math>. When <math>t </math> is not essential to the discussion, it will be suppressed. The indices <math> (u,v) </math> are to be interpreted as an index into the image array. We will use C language conventions, and so <math> (0,0) </math> will be the upper left corner of the image. Unless otherwise stated, it is assumed that the value of a pixel is scalar (obviously, this is not the case if we are dealing with color images, but that will be the exception).

It is important to realize the the value that is stored in the image array does not necessarily exactly reflect the intensity of light that hit the sensor. Even absent such obvious irregularities such as the blooming described earlier, there are random and systematic variations in the signal. Thus, if we were to model what is going on, we could write

where <math>I* </math> is some ideal signal, and <math>r </math> is a random process. For the purposes of the present discussion, we will assume that <math>r </math> as the following properties:

<math>r(u,v,t)</math> and <math>r(u’,v’,t’) </math> are independent random variables so long as at least one of the arguments differ.
<math>r(u,v,t)</math> and <math>r(u’,v’,t’) </math> are identically distributed.
<math>r(u,v,t) </math> follows a Gaussian distribution with zero mean and variance <math>\sigma^2 </math>for all values of <math>u </math>, <math>v </math>, and <math>t. </math>

For the moment, let us further assume that <math>I* </math> is stationary (that is, unchanging) with respect to time. It follows from basic statistics that we could reduce the effect of noise (i.e. of the random signal <math>r </math>) by averaging. In fact, we if we compute

then it follows that the variance of <math>\hat{I} </math> is <math> \sigma^2/(2n+1). </math> In short, in principle, we could get as close to the ideal signal as we like, simply by averaging values. Note, by the way, that this often true even if the noise is not Gaussian, as we have assumed.

Unfortunately, it is not usually the case that the world is willing to stand still while we gather several observations of the same signal. However, we could instead assume that nearby pixels have about the same ideal value, and so instead we could average over space. That is, we could instead compute

What is the resulting variance in this case?

Of course, the idea that all of the ideal values for pixels near a given pixel are exactly the same is not really reasonable; indeed if it were exactly the case, then the only images possible would be monotone! However, note that there are spatially non-stationary processes for which this would still be a reasonable strategy. For example, suppose the values of <math>I </math> were varying linearly with <math>u </math> and <math>v. </math> Indeed, one might argue that, in general, for small neighborhoods a linear variation is likely to be a good approximation. So, maybe this is not such a bad idea after all!

Another idea that might make things even better would be to perhaps weight the pixels nearer to the central pixel (i.e, the one of interest) more highly than those further away. That is, we might decide to compute

Here, we have made several changes. First, we are no longer interested in time-variation, so we have dropped <math>t </math>. Second, we have redone the indexing in terms of variables <math>n </math> and <math>m </math>. You might want to verify, though, that what we have written is exactly equivalent to our previous expression, provided we set

The operation we have defined above is so fundamental that it has its own name: convolution and its own operator, *. Thus we define now

<math> (I*K)(u,v) = \sum_{n = – \infty}^{\infty} \sum_{m = -\infty}^{\infty} K(n, m) I(u-n,v-m) </math>

in the discrete domain and

<math> (I*K)(u,v) = \int_{ – \infty}^{\infty} \int_{-\infty}^{\infty} K(n, m) I(u-n,v-m) dm dn </math>

in the continuous domain. We henceforth refer to <math>K </math> as the kernel of the convolution.

Note we have extended the summation (respectively integration) to the entire integer (respectively real) line. We have done so with the understanding that we can just put in zeros if we have a kernel over a finite range, as we had previously.

Usually, we consider the resulting of convolving entire images by some kernel, so we will often write, for example <math> H = I*K </math> to denote the convolution of the image <math>I </math> by the kernel <math>K. </math> Convolution has several useful properties that we state here (and which you may wish to verify for yourself):

Convolution is commutative: <math> I*K = K*I </math> (Note this also implicitly assumes that the image in question has infinite support).
Convolution is associative: <math> I * (K * J) = (I * K) * J. </math>
Convolution is a linear operator: <math> K * (\alpha I + \beta J) = \alpha K*I + \beta K*J. </math>

Another useful fact is the following: the result of convolving an impulse with a kernel is the kernel itself. That is, if <math> U(0,0) = 1</math> and <math>U(u,v) = 0 </math> otherwise, then

This is a useful property, since it means that if someone just hands you a convolution operation as a black box, you can discover the form of the convolution just by convolving it with an impulse. Indeed, this is call the impulse response of the filter. Furthermore, note that there was nothing special about placing the impulse at zero. If we, for example, move the impulse one location to the left, all we do is to also move the response one location to the left. However, we haven’t otherwise changed the response. Thus, convolution is also translation invariant.

In fact, suppose now we do the following. Let’s take the image <math>I </math> and create a set of “impulse images” <math>U_{u,v}</math> where <math>U_{u,v} </math> puts a value of 1 at location <math>(u,v) </math> and zero otherwise. Then it is not hard to show (using the linearity of convolution) that

In short, the result of convolution can also be viewed as the summation of the impulse responses at each image location scaled by the value at that location.

With this as a basis, let’s now consider of what we can do with convolution. First, it is important to point out that if we want to`compute’ a convolution, we will generally only care about the non-zero values of the kernel. Second, for convenience, we will usually assume the width and height of the convolution kernel are odd. However, there are still some issues to consider. In particular, our images will always be of finite size. As a result, if we consider a kernel of size <math>2n+1 </math> by <math>2m+1 </math>, there will always be <math>2n rows </math> and <math>2m </math> columns on the border of the image for which the value of the convolution is not well-defined. In general, there are three possibilities:

Use only those values which have well-defined values. As a result, images will shrink in size as we apply convolutions to them.
Pad the outside of the image with enough zeros (n rows top and bottom and m columns left and right) so that the result of the convolution is the same size as the image we started with. This is often convenient, but we must realize the values on the boundary are suspect.
Pad the outside of the image with enough zeros (2n rows top and bottom and 2m columns left and right) so that anyboundary pixel that would have elicits some response to the image is computed.

In Matlab, convolution is implemented using the conv2 operator. It accepts an argument indicating what type of padding to do. These are valid, same, and full, respectively.

Models for Convolution

There are many ways to view the effect of a convolution operation on an image. We have already introduced one: convolution can be viewed as a way of averaging the image intensity values in a neighborhood.

Here is another idea. Note that we can view the value produced by a particular kernel applied at a given location as the inner product between two vectors: One vector is the kernel itself; the second vector are the intensity values in the neighborhood about the location in question flipped about both axes. Why the latter? Have a close look at the convolution formula and see if you can figure out why? Even better, if you figure it out, then try to see why its important to have the formula the way it is (hint, think about some of the properties of convolution we’ve discussed along the way).

In any case, suppose these two vectors are written as <math>k </math> for the values of the kernel, and <math>i_{u,v} </math> for the intensity values about the location <math>(u,v). </math> Then we can rewrite convolution as

Now, let us write <math>\vec{i}(u,v) = i(u,v)/\|i(u,v)\| </math> and <math> \vec{k} = k/\|k\|. </math> That is, anything with an arrow on top is a unit vector. Plugging this in, we have

<math>I*K(u,v) \propto \|i(u,v) (\vec{k} \cdot \vec{i}(u,v)) </math>

Here, since <math>k </math> is fixed, we have suppressed its magnitude (this will just rescale the result of the convolution).

As a result, it is clear that the response of a convolution depends on two factors:

The magnitude image values in the region of interest, and
The inner product between the image and kernel considered as unit vectors

The latter will of course achieve a maximum of 1 only if <math>\vec{k} = \vec{i}(u,v).</math> In short, this term is large when the image vector closely resembles the convolution itself. Thus, convolution can be considered as a type of pattern matchingoperation.

In order to understand our final interpretation of convolution, we need to delve a bit deeper into signal processing. At some point, you have probably been exposed to the idea of approximating a function (e.g. a two-dimensional signal like an image) using some type of basis. For example, one can approximate a scalar function using the well-known Taylor series expansion.

In signal processing, it is useful to think about decomposing a signal in terms of its frequency structure. We will represent this by using sine functions as approximation basis — that is, we will try to express an image as a sum of just the right combinations of sine waves. For a scalar function (let’s keep it simple for now), we could write

<math>f(x) = \sum_{i=1}^{\infty} \alpha_i sin(\beta_i x + \gamma_i) </math>

for the right choices of <math>\alpha_i, </math> <math>\beta_i </math> and <math>\gamma_i. </math> Indeed, one can show that for almost all reasonable functions, there is a choice which, in the limit, converges exactly to <math>f. </math>

The method of finding these coefficients is the Fourier transform (continuous case) and the Discrete Fourier transform (discrete case). Both make use of the following trick. We can write sines and cosines using [http://en.wikipedia.org/wiki/Euler%27s_formula Euler’s formula] which states that

The continuous Fourier transform in one dimension is then defined as

<math>F(\eta) = \int_{-\infty}^{\infty} f(x) e^{-2\pi i x \eta} d x </math>

It closely resembles the inverse transform which is given by

<math>f(x) = \int_{-\infty}^{\infty} F(\eta) e^{2\pi i x \eta} d \eta </math>

If we expand the Fourier transform using Euler’s formula, we have

<math>F(\eta) = \int_{-\infty}^{\infty} f(x) \cos(-2\pi x \eta) + i

f(x) \sin(-2\pi x \eta)  d x </math>

Note that this looks much like our convolution operator above. If we rely on the intuition we developed there, we can see the following general properties:

The Fourier transform of a real function is a complex function.
The value of <math>\eta </math> governs the frequency of the underlying sine and cosine functions.
Based on our previous discussion of convolution, we see that the response of each of the terms will be proportional to the similarity of the sin or cosine function at a given frequency.
Because sin and cosine are <math>\pi/2 </math> radians out of phase, the difference in the components of the response will tell us something about the phase (<math>previously, our \gamma_i) </math> at which the signal responds.

Given the Fourier representation of signal, there are two common quantities of interest. The first is the power spectrum, defined as

Intuitively, the power at a given value frequency <math>\eta </math> is the relative strength of that frequency in the signal,independent of phase. The second quantity is the phase spectrum given by

<math> \angle F(\eta) = \pm \arctan (Im(F(\eta))/Re(F(\eta)) </math>

An intuitive interpretation of phase is much harder to come by. Suffice to say that we can interpret this quantity as indicatingaverage phase at a given frequency. It is interesting to note that, although the power spectrum is much easier to understand intuitively, there is strong evidence that the phase spectrum of an image actually carries much more of the content we interpret as form.

To extend these ideas to images, we make two steps at once. First, we move from the continuous domain to the discrete. Here, it is useful to note that there is a maximum frequency that any discretely sampled signal can represent (the so-called Nyquist frequency), and so the values of <math>\eta </math> will be both bounded and also themselves discrete. Second, we extend the DFT to two dimensions. The result is:

<math> DF(\eta, \gamma) = \sum_{u=0}^{N-1} \sum_{v=0}{N-1} I(u,v) e^{-2\pi i (u \eta + v \gamma)}

</math>

and the inverse is <math> I(u, v) = \frac{1}{N^2} \sum_{\eta=0}^{N-1} \sum_{\gamma=0}{N-1} DF(\eta,\gamma) e^{-2\pi i (u \eta + v \gamma)}

</math>

The easiest way to visual this is to think of <math>(\eta, \gamma) </math> as a vector in the plane, the direction of which defines the direction of the periodic function, and the magnitude of which defines the frequency (phase is, as usual, encoded in the resulting complex value). Thus, by sweeping out the entire plane, we cover frequencies from 0 (D.C.) to the maximum possible and also all possible directions.

In earlier days, the DFT was often used as a way of characterizing texture. Indeed, it is quite interesting to see what the DFT of many common scenes is. We do so in the accompanying notes.

So, why the huge detour — what does this have to do with convolution? Well, there is a particularly elegant statement we can now make. Namely, given two signals <math>f </math> and <math>g </math> (scalar sound recordings, two-dimensional images and kernals, or whatever), their convolution can be expressed as follows:

<math> \mathcal{F}\{f * g\} = \mathcal{F}\{f\}\cdot \mathcal{F}\{g\}</math>

In short, the Fourier transform of the convolution of two signals is equal to the product of their Fourier transform. If we consider the power spectrum, we see that convolving a signal <math>f </math> with <math>g </math> accentuates or suppresses the signal frequencies in <math>f </math> corresponding to high or low values of frequencies in the filter <math>g </math>.

Details at Wikipedia and Math World.

Kernels for Smoothing

Given what we have established above, we can now think about what types of kernels would be most appropriate for smoothing. In our introduction, we have seen that one possible kernel is one with constant values, which corresponds to computing a moving average of the image. This is also referred to as the box filter (from the fact that when you plot the function in 1-D it looks like a box.

Averaging with a box filter takes out small perturbations by averaging across a pixel with all the neighborhood around it. However, if we consider the FT of the box function, it is possible to show it is the sync function

This implies that smoothing with a box filter has the undesirable effect of accentuating a series of higher order frequencies (this is referred to as ringing.). In short, some higher order frequencies of our noise will also creep into the signal.

Gaussian smoothing is a type of averaging which uses a 2D Gaussian as its kernel. The 2D Gaussian is given by

<math>G(x,y; \sigma) = \frac{1}{2 \pi \sigma} e^{- (x^2 + y^2)/\sigma^2} </math>

The Fourier transform of a Gaussian is still a Gaussian and has no secondary lobes. Thus, Gaussian kernels are better low-pass filters than the averaging filter.

Computational Issues

The complexity of computing the convolution of an N by M image with a kernel of n by m is O(NMnm). For example, on 1000 by 1000 image, computing a convolution with an 11 by 11 kernel would require 121,000,000 multiplies and an equal number of additions. Thus, convolution is expensive and thus it is important to achieve time savings when possible. Here, we note two important “tricks.”

Separability of Kernels

If we return to our simple averaging kernel, we note that we can play the following “game:”

<math> (I*B_{h,w})(u,v) = \sum_{i=-h}^{h} \sum_{j=-w}^{w} B(i,j) I(u-i,v-j) = 1/(2h+1) \sum_i (1/(2w+1) \sum_j I(u-i,v-j) ) = (I*B_{1,w})*B_{h,1}.

</math>

That is, rather than performing one convolution with <math>(2w+1)(2h+1) </math> elements, we instead perform two smaller convolutions of size <math>2w+1 </math> and <math>2h+1 </math>. The complexity is thus now O(NM (2w+1 + 2h+1)). We can also prove this using the properties of convolution, namely

by the associativity of convolution. But, it is easy to show that <math>B_{1,w}*B_{h,1} </math> results in exactly <math>B_{h,w} </math> and so <math> I*(B_{1,w}*B_{h,1}) = I*B_{h,w} </math> which is what we started with to begin.

This so-called separability property cannot always be used. However, it is not hard to show that the widely used Gaussian filter is also separable.

Collapsing Convolutions vs. Not

The previous example illustrates a second useful computational optimization of convolution. Namely, suppose we have a large image I of size N by M and smaller kernels J and K, of size 2n+1 by 2n+1 and 2m+1 by 2m+1, respectively. Then if we compute

the complexity is O(NM((2n+1)^2 + (2m+1)^2)).

But, if we instead refactor and compute

the complexity is <math>O(NM(2n+2m+1)^2 + ((2n+1)(2m+1))^2). </math>

For example, for a 1K by 1K image and two 11 by 11 kernels, the difference is approx 500M floating point operations <math> (2*2*11^2*10^6) </math> versus nearly 1Gflops <math> (2*21^2*10^6) </math> for the collapsed version.

Derivative Kernels

Until now we have considered only kernels that perform an averaging operation. That is, kernels that have only positive values and sum to 1. However, we can easily design kernels that perform other functions.

In particular, consider the following simple kernel: [1, -1]. If we apply this to a scalar signal, e.g. [1,2,3,4,3,2,1], a little hand work (or a Matlab call) shows that the result is [1,1,1,-1,-1,-1]. In short, this has computed the first differences of the signal. Equivalently, it is a local approximation to the slope of the signal. For this reason, this kernel is referred to as a derivative filter.

If we consider a 2D image, we see that there are really two kernels to consider: <math>D_u =[1, -1] </math> (derivatives along a row) and <math> D_v = [1; -1] </math> (derivatives along a column). This makes sense since, if we think of an image as abrightness surface, then we consider the 2-dimensional gradient vector of the surface as characterizing the locate slope of the surface.

There are several possible derivative operators. A two of the most common are given below:

Sobel: [1 0; 0 -1] and [0 1; -1 0]

Prewitt: [1 0 -1; 1 0 -1; 1 0 -1] and [1 1 1; 0 0 0; -1 -1 -1]

One important thing to realize, however, is that derivative filters are high pass filters. This is easy to see if you take the FT of a derivative filter. Also, if you consider a slowly varying signal (i.e. low frequency), the derivatives are small, whereas a rapidly varying signal produces large magnitude derivatives.

The problem is that image noise is also concentrated in the high frequencies, so taking derivatives tends to be noise amplifying. As a consequence, it is often desirable to reduce the effect of noise when taking derivatives. Thus, a common approach is to combine smoothing with derivatives. Thus, we might for example compute:

<math>g_u = I * G(\sigma) * D_u </math> <math>g_v = I * G(\sigma) * D_v </math>

Again, using convolution arithmetic, we can combine the convolutions and write <math>GD(\sigma)_u = G(\sigma) * D_u </math> and <math>GD(\sigma)_v = G(\sigma) * D_v. </math> These two kernels, the derivative of Gaussian kernels, are quite common. Note that they each have a free parameter, <math>\sigma </math> which is referred to as the scale of the operator. The name fits, since the choice of <math> \sigma </math> directly determines what scale of change the operator responds to. That is, very high frequency (i.e. rapidly changing) image structure will only be detected if a small value of <math>\sigma </math> is used; a large value will suppress it. Indeed, it is quite common to perform this convolution at several scales, leading to the notion of a scale space of images.

Given the gradient value <math> \nabla I(u,v) </math>, we can now compute three useful scalar quantities:

The gradient magnitude: <math> \| \nabla I(u,v) \| </math>
The gradient direction: <math> atan2(I_u(u,v),I_v(u,v)) </math>
The directional derivative: <math> n \cdot \nabla I(u,v) </math> where <math> n </math> is the desired direction.

The latter is an example of a steerable filter — that is a filter for which the directional version can be produced through a linear combination of some fixed components.

While we are talking about derivatives, it is worth noting we can go further, for example, we could compute:

<math>h_{uu} = I * GD_u(\sigma) * GD_u(\sigma) </math> <math>h_{uv} = I * GD_u(\sigma) * GD_v(\sigma) </math> <math>h_{vv} = I * GD_v(\sigma) * GD_v(\sigma) </math>

Note <math>h_{vu} = h_{vu} </math> (can you prove this?) so there is no need to compute it.

We have just computed the local ‘Hessian (or second derivative) of the image. Notice we did it simply by taking the derivative of the derivative — i.e. by applying the derivative convolution twice. Of course, we could also have combined the two convolutions into a single kernel, too.

An second derivative operator that is sometimes used is the Laplacian of a Gaussian. The Laplacian of a function <math>f(x,y) </math> is:

<math>L(f) = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} </math>

Of course, we can mirror this exactly; we just compute for any smoothing kernel G (e.g. a Gaussian)

If we now compute <math>I*L(G), </math> it is not hard to show we have computed a discrete approximation to the Laplacian. Also, just as with Gaussian-smoothed derivatives, we can compute a Laplacian scale space.

Texture

Two general types of visual phenomena are often referred to as texture, those consisting of regular, repeating patterns, as well as patterns that are consistent, but more stochastic in nature. From these different schools of thought about texture emerge different widely-used descriptors. Though each targets a particular type of texture, they have varying degrees of success in distinguishing other types of texture as well. All of these methods were developed for application to grey-scale images, but can also be used on the various channels of any desired color space.

Gabor filters take the approach of frequency analysis to characterize textures. They are a set of image filters consisting of the convolution of a Gaussian with even and odd sinusoids of various frequencies, scales, and orientations. The response is intended to be strong in the presence of repeating patterns with similar parameters. These filters have the advantage of characterizing orientation, allowing one to design a filter bank that produces a feature vector which is invariant to, or at least normalized by rotation. They are also efficiently computable in batches. Though they perform well on repeating patterns, though, they are less able to characterize stochastic textures. Matlab code for generating log-Gabor filters is available from Peter Kovesi.

Another very popular set of filters was developed by Leung and Malik and is known as the LM filter-bank (Leung & Malik, 2001). It consists of a set of edge and bar filters at 3 scales and 6 orientations, as well as several Gaussian and Laplacian of Gaussian filters. Schmid introduced a set of rotationally-invariant filters in Schmid, 2001, consisting of Gaussians modulated by cosine functions of varying frequencies in the radial direction. Varma and Zisserman introduced another set of filters, called maximum response (MR) filters, based on LM, but modified to add some invariance to rotations to the response. They record only the maximum response of the oriented filters at each scale, and only use one scale of Gaussian and LoG, for a total of 8 features. Illustrations of the LM, Schmid, and MR8 filter banks, as well as Matlab code for generating them, are available from the Oxford Visual Geometry Group page on texture classification.

The most commonly-used statistical texture descriptors are Haralick features, also known as grey-level co-occurrence matrices (GLCMs), or simply co-occurrence matrices. They consist, as their name suggests, of 2-dimensional histograms of the occurrence of one image intensity value (or range of values) with respect to another, at a given relative position, such as comparing each pixel with the pixel immediately to its right. In order to capture information about texture at different orientations or ranges, multiple matrices are usually constructed, each with a different relative position, and the appropriate resolution is largely heuristic. The matrices themselves are not generally used as a descriptor, though. Instead, a common set of several features of the distributions have been derived. One typically chooses some subset of these features, based on the usual trade-off of discriptiveness and speed of computation. The most commonly used features correspond to concepts of energy, correlation, contrast, variance, etc. These features are designed for and perform well on stochastic textures. They tend to have good performance on regular repeating patters as well, though, and are available in Matlab via the functions graycomatrix andgraycoprops.

Finally, in 2003, Varma and Zisserman showed surprisingly good performance using image patches directly as feature vectors, rather than the output of filter-banks. They demonstrate classification accuracy surpassing LM, Schmid, and MR filters when using image patches of the same or smaller support size in the same (albeit non-trivial) classification algorithm.

Nonlinear Filters

De-noising is a common filtering task since many forms of data corruption can be viewed as noise applied to a signal. Though some types of noise are handled well by smoothing (e.g. zero-mean uniformly- or Gaussian-distributed additive random noise), other common corruption effect are non-linear, and correspondingly non-linear filters are needed to reasonably mitigate their effects.

Here are a couple examples:

Example of Denoising with Gaussian noise
Original		Gaussian-Corrupted
Mean-filtered	Gaussian-filtered	Median-filtered

Example of Denoising with Salt & Pepper noise
Original		Salt&Pepper-Corrupted
Mean-filtered	Gaussian-filtered	Median-filtered

With the image pixel values scaled 0 to 1, the Gaussian-corrupted image was subjected to Gaussian-distributed noise of magnitude 0.1 with standard deviation 0.01, independently at each pixel. The salt and pepper noise had 5% of pixels randomly switched to 0 or 1. In both cases, Gaussian filtering was accomplished with a 3×3 filter with standard deviation 0.5, and mean- and median-filtering with 3×3 averaging and median filters respectively.

Median Filtering

Median filtering is particularly simple-to-implement form of non-linear filter where the value of each pixel in an image is replaced with the median of those in a neighborhood around that pixel. It is particularly well-adapted to dealing with salt and pepper orshot noise, though it can also be applied to additive noise.

Wikipedia

Morphology

Wikipedia

Top-hat/Bottom-hat

The top/bottom-hat operators refer to a couple separate filters in different domains.

In signal processing, the top-hat function is 1 in a small neighborhood about the origin and 0 elsewhere. When applied in the spatial domain, it acts as a mean filter. If instead applied in the frequency domain (ie, Fourier domain), it acts as a band-pass filter. The bottom-hat filter is essentially the inverse, one minus the top-hat filter. In the frequency domain, it suppresses a band of frequencies, and it is not generally used in the spatial domain.

Wikipedia

In morphology, however, the top-hat operator is defined as the difference between the input signal and the result of an open operation on it. Similarly, the bottom-hat is the difference between the input and the result of a close operation.

Wikipedia