## Segmentation and Grouping

One of the classic problems in computer vision is the association, or grouping, of image locations that share some common attribute. This process is usually referred to as *segmentation* or *grouping*, depending on the structure of the underlying data: segmentation usually refers to images, whereas grouping can be performed on any type of data. We will cover both in this unit.

To begin, we will generalize the notion of a feature from the previous chapter. Our definition is as follows. Let <math>F </math> denote an abstract *feature space* and let <math> \Pi </math> denote the set of all (finite) bags of feature values taken from <math>F. </math> Let <math> \Zeta </math> denote the space of images, and consider a function <math>M: \Zeta \rightarrow \Pi. </math> We refer to <math>M </math> as a *feature transformation.*

At this level of generality, there are many interpretations of feature space and features. For example, we can allow <math>M </math> to be:

- the mapping of each pixel to its location and intensity — the feature space is the set of intensity values in the image (effectively the image pixels themselves)
- a Fourier transform — the feature space is now the FT of the image
- an edge detector — the feature space is now the space of all edge segments
- a histogram construction — the feature space is now the space of all histograms of color or intensity
- an operator that takes the local neighborhood of every pixel and creates a vector from it — the feature space is now the set of all templates in the image.

We can now impose a bit more structure on our feature space and feature transformation (note: these are not standard definitions in the field, but rather my own way of thinking about this). Let the images in question consist of <math>N </math> pixels, and let <math>R </math> denote a neighborhood in an image, where <math> |R| << N. </math> We refer to a feature transformation as *local* (and the resulting features as *local features*) if we can write

<math>M(I) = \{m(I(R_1)), m(I(R_2)) \ldots m(I(R_k))\}, \ k\le N </math>

for some set of neighborhoods <math>R_i. </math> We refer to the feature transformation as *sparse* if <math>k << N, </math> *dense* if <math>k \approx N, </math> and *semi-dense* otherwise.

Thus, the set of edges in an image would be a sparse set of features and the set of all regions of size <math> l << N </math> would be dense. Semi-dense features are usually the result of sampling an otherwise dense set of features; for all practical purposes they can be treated as dense.

Another question is whether the feature values preserve their spatial proximity in the image. We refer to features as *spatial* if they carry their image coordinates with them; and *free* if not.
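To make these definitions concrete, here is a small Python sketch (the function name and setup are my own, not standard) of a local feature transformation: each k×k neighborhood of a grayscale image becomes a flattened patch vector. Stride 1 gives a dense transformation, a larger stride gives a semi-dense one, and the `spatial` flag decides whether features carry their image coordinates (spatial) or not (free).

```python
import numpy as np

def local_patch_features(image, k=3, stride=1, spatial=True):
    """Local feature transform: each k x k neighborhood becomes one
    flattened feature vector. stride=1 is dense; stride>1 is semi-dense."""
    h, w = image.shape
    features = []
    for i in range(0, h - k + 1, stride):
        for j in range(0, w - k + 1, stride):
            patch = image[i:i + k, j:j + k].ravel()
            if spatial:
                features.append(((i, j), patch))   # spatial: keep coordinates
            else:
                features.append(patch)             # free: discard location
    return features

# A 5x5 image yields 9 patches densely (stride 1) and 4 semi-densely (stride 2).
img = np.arange(25, dtype=float).reshape(5, 5)
dense = local_patch_features(img, k=3, stride=1)
semi_dense = local_patch_features(img, k=3, stride=2)
```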

### A Warm-up: Basic Thresholding Methods

Basic thresholding and related image processing operations are a natural starting point in our discussion of segmentation. To be concrete, suppose we are interested in locating faces in images based on color. How might we do this? Well, suppose we are able to determine a specific hue <math> h </math> that corresponds to flesh tones. Given an image <math> I </math>, we can use the techniques described in the previous section to convert <math> I </math> to its HSI equivalent, and from that pull out the hue image, which we will call <math> H </math>.

Of course, we don’t expect the hue of every face to match <math> h </math> exactly, so we need to choose some value <math> \tau </math> that represents the variation we are willing to accept. With this, we can now compute a new binary image <math> B </math> by the following rule:

<math> B(l) = 1 </math> if <math> |H(l) - h| < \tau </math>

<math> B(l) = 0 </math> otherwise.
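In code this rule is a single vectorized comparison. The sketch below (NumPy; the toy hue values are made up for illustration) builds the binary image <math> B </math>:

```python
import numpy as np

def threshold_hue(H, h, tau):
    """B(l) = 1 where |H(l) - h| < tau, and 0 otherwise."""
    return (np.abs(H - h) < tau).astype(np.uint8)

# Toy hue image; pixels within tau of the target hue h are marked 1.
H = np.array([[0.10, 0.12, 0.50],
              [0.11, 0.90, 0.13]])
B = threshold_hue(H, h=0.11, tau=0.05)
```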

Note that we have accomplished something here: faces are now generally bright (i.e. their pixels have value 1), and other areas are dark. However, we have not yet identified any particular set of pixels as belonging together. To do this, we need to compute the *connected components* of the image, which first requires defining what we mean by connected. Generally, there are two logical definitions in an image array: 4-connected and 8-connected. A pixel is *4-connected* to the pixels directly above, below, and to its sides. A pixel is *8-connected* if it is 4-connected and additionally connected to its immediate diagonal neighbors.

With this, we can write a simple connected components method as follows:

    int LabelComponent(location, currentRegion, binaryImage, labelImage)
    {
        count = 0;
        stack = EmptyStack;    // This will hold our agenda of things to do
        stack.Push( location );
        while (!stack.Empty()) {
            loc = stack.Pop();
            if (binaryImage(loc) == 1) {
                if (labelImage(loc) != currentRegion) {
                    labelImage(loc) = currentRegion;
                    stack.Push( Neighborhood( loc ) );
                    count++;
                }
            } else {
                labelImage(loc) = 0;
            }
        }
        return count;
    }

    int ConnectedComponents(binaryImage, labelImage, minRegionSize)
    {
        labelImage = -1;    // This means ``not yet visited''
        currentRegion = 1;
        locations = ImageLocations(binaryImage);
        while (!locations.Empty()) {
            loc = locations.Pop();
            if (labelImage(loc) < 0) {
                labelImage(loc) = 0;
                if (binaryImage(loc) == 1) {
                    n = LabelComponent(loc, currentRegion, binaryImage, labelImage);
                    if (n < minRegionSize)
                        n = LabelComponent(loc, 0, binaryImage, labelImage);
                    else
                        currentRegion++;
                }
            }
        }
        return currentRegion - 1;
    }
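For readers who want something directly executable, here is a Python translation of the same flood-fill idea (a sketch using NumPy; 4-connectivity is hard-coded, and too-small regions are marked with -1 internally so they are not revisited):

```python
import numpy as np

def connected_components(binary, min_region_size=1):
    """Label 4-connected foreground regions 1..n; components smaller than
    min_region_size are discarded (relabeled as background)."""
    h, w = binary.shape
    labels = np.zeros((h, w), dtype=int)
    current = 0
    for si in range(h):
        for sj in range(w):
            if binary[si, sj] != 1 or labels[si, sj] != 0:
                continue                      # background, or already visited
            current += 1
            stack, component = [(si, sj)], []
            while stack:                      # flood fill from the seed pixel
                i, j = stack.pop()
                if 0 <= i < h and 0 <= j < w \
                        and binary[i, j] == 1 and labels[i, j] == 0:
                    labels[i, j] = current
                    component.append((i, j))
                    stack += [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
            if len(component) < min_region_size:
                for i, j in component:
                    labels[i, j] = -1         # too small: mark as discarded
                current -= 1
    labels[labels < 0] = 0
    return labels, current

binary = np.array([[1, 1, 0, 0],
                   [0, 0, 0, 1],
                   [0, 0, 0, 1]])
labels, n = connected_components(binary)
```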

Notice that this labeling has now created a logical association among pixels that are connected with one another. So, for example, we can now answer questions such as:

- How many unique face candidates are there in the image?
- What is the centroid (the average x and y coordinate) of each face candidate?
- What is the size (in pixels) of each candidate?

and so forth.

At the same time, note that there are several potential deficiencies in this method. For example, a small “gap” in a face may split it into two faces. Or, a tiny bridge of just one pixel can link two logically distinct face regions. Note also that it is quite possible for a face to have small “islands” of pixels that are just enough off-color to be erroneously discarded, even though their color is very close to that of their neighbors.

Unsurprisingly, we can remedy many of these problems. Indeed, we have already seen the idea of hysteresis thresholding in a previous chapter; a similar idea could be applied here to allow for a gradual change in color. We will leave that as an exercise (think about how you would modify connected components to do this).

There are also a set of methods in the area of “image morphology” that allow us to fill small gaps and remove small bridges. A brief introduction to binary morphology can be found here.
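As a taste of what morphology provides, here is a minimal sketch (plain NumPy with a 3×3 square structuring element; real implementations in image processing libraries are far more general): dilation followed by erosion (*closing*) fills small gaps, and erosion followed by dilation (*opening*) removes small bridges and islands.

```python
import numpy as np

def dilate(b):
    """Binary dilation with a 3x3 square structuring element."""
    p = np.pad(b, 1)
    out = np.zeros_like(b)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            out |= p[1 + di:1 + di + b.shape[0], 1 + dj:1 + dj + b.shape[1]]
    return out

def erode(b):
    """Binary erosion, via the complement of dilating the complement."""
    return 1 - dilate(1 - b)

def closing(b):     # dilate then erode: fills small gaps
    return erode(dilate(b))

def opening(b):     # erode then dilate: removes small bridges and islands
    return dilate(erode(b))

# A horizontal strip with a one-pixel gap: closing fills the gap,
# while opening removes the thin strip entirely.
b = np.array([[0, 0, 0, 0, 0],
              [1, 1, 0, 1, 1],
              [0, 0, 0, 0, 0]])
```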

### Grouping Similar Regions

One thing that may have occurred to you is that we could have dispensed with the idea of having a fixed model for a face altogether. Rather, we could instead rely on the fact that a face has a homogeneous color that is generally different from the colors around it. By grouping similar colors, we could achieve an “unsupervised” segmentation, that is, one obtained without us picking a face model. In a later chapter on learning, we’ll talk more about supervised vs. unsupervised methods of classification.

Since we no longer have an absolute value against which to measure similarity, we will instead have to evaluate the similarity (or affinity) of pixels against each other. For two feature values <math>f </math> and <math>g </math>, an affinity is a mapping

<math>A : F \times F \rightarrow \Re </math>

with the following properties:

- <math>A(f,f) = A(g,g) < \infty \ \forall f, g </math> (all self-affinities are equal and finite)
- <math>A(f,f) > A(f,g) \ \forall g \neq f </math> (a feature has maximal affinity for itself)
- <math>A(f,g) \geq 0 \ \forall f, g </math> (affinities are non-negative)
- <math>A(f,g) = A(g,f) \ \forall f, g </math> (affinities are symmetric)

One common similarity measure is based on a Gaussian function:

<math> a(p_1, p_2; \sigma) = \exp\left(-\frac{(p_1 - p_2)^2}{2\sigma^2}\right) </math>
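A quick numeric check of this affinity (a throwaway sketch for scalar features) confirms the properties above: self-affinity is exactly 1, the measure is symmetric, strictly positive, and decays with distance.

```python
import math

def affinity(p1, p2, sigma=1.0):
    """Gaussian affinity: equals 1 when p1 == p2 and decays with distance."""
    return math.exp(-0.5 * (p1 - p2) ** 2 / sigma ** 2)
```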

We also state the following definitions:

A *grouping* of a set of features <math>\Pi </math> is a set of subsets <math>F_1, F_2, \ldots F_n, \ n << |\Pi|, </math> that form a covering of <math>\Pi. </math> A *segmentation* requires that the covering be a partition.
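The covering-versus-partition distinction is easy to express in set terms: a covering's subsets may overlap, while a partition's must be disjoint. A tiny Python sketch (plain sets standing in for feature subsets):

```python
def is_covering(universe, groups):
    """Every feature appears in at least one group."""
    return set().union(*groups) == set(universe)

def is_partition(universe, groups):
    """A covering whose groups are pairwise disjoint."""
    return is_covering(universe, groups) and \
        sum(len(g) for g in groups) == len(set(universe))

features = {1, 2, 3, 4}
grouping = [{1, 2}, {2, 3}, {4}]        # subsets overlap: a grouping
segmentation = [{1, 2}, {3}, {4}]       # disjoint subsets: a segmentation
```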

The remaining question is how to relate affinities to groupings or segmentations. That is the topic of the remainder of this chapter.

#### K-Means and EM

At this point, it is worth stepping back from the pure image-based setting and thinking a bit more generally in terms of our feature space. We can define any particular set of affinities we like on feature values. For example, we could simply decide we want to measure Euclidean distance between feature values as a way of deciding similarity. The *K-Means* algorithm is a very well-known method for clustering feature vectors using such a metric. The algorithm is described here.
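Since the text defers to an external description, here is a minimal K-Means sketch (NumPy; for brevity it seeds the centers with the first k points, an illustrative simplification rather than a recommended initialization):

```python
import numpy as np

def kmeans(X, k, iters=100):
    """Minimal K-Means: hard-assign each point to its nearest center,
    then recompute each center as the mean of its assigned points."""
    centers = X[:k].astype(float).copy()          # naive init: first k points
    assign = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # squared Euclidean distance from every point to every center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        new_centers = np.array([X[assign == j].mean(axis=0)
                                if np.any(assign == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign

# Two well-separated clusters in the plane.
X = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
centers, assign = kmeans(X, k=2)
```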

What is interesting is that K-Means is trying to optimize a “global” measure of similarity. However, K-Means makes a “hard” assignment of each data item to a cluster. In some cases, this might not make sense. Imagine that we have a data value that is exactly in between two clusters: to which should we assign it? Now, suppose it is far away from either center: how much should we pay attention to it?

The Expectation Maximization (EM) algorithm is a way to solve both of these problems. Rather than making a firm assignment of data to a cluster, it maintains a set of weights (like affinities) for each data item and each cluster. The resulting weights are used in the calculations of the cluster centers, and also the extent of the cluster.

The EM algorithm is described here.
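Likewise, a compact sketch of EM for a 1-D mixture of two Gaussians (NumPy; equal mixing weights are held fixed to keep the example short) makes the soft weights explicit:

```python
import numpy as np

def em_1d(x, mu, sigma, iters=50):
    """EM for a 1-D mixture of Gaussians with equal, fixed mixing weights.
    Returns the fitted means, standard deviations, and soft weights."""
    mu, sigma = np.array(mu, float), np.array(sigma, float)
    w = None
    for _ in range(iters):
        # E-step: responsibility of each cluster for each data point
        p = np.exp(-0.5 * ((x[:, None] - mu[None, :]) / sigma[None, :]) ** 2) \
            / sigma[None, :]
        w = p / p.sum(axis=1, keepdims=True)
        # M-step: weighted mean and spread of each cluster
        nk = w.sum(axis=0)
        mu = (w * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((w * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / nk)
    return mu, sigma, w

x = np.array([0.0, 0.1, -0.1, 5.0, 5.1, 4.9])
mu, sigma, w = em_1d(x, mu=[0.0, 4.0], sigma=[1.0, 1.0])
```

Notice that every data point contributes to every cluster's mean, weighted by its responsibility, rather than being assigned outright as in K-Means.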

#### Segmentation

Unfortunately, it is hard to write good segmentation algorithms for images using affinities and EM or K-Means. If the feature space is free, then any grouping of pixels is unlikely to respect the spatial structure of the image. If the features are spatial and the affinity function varies with those values, then clusters will tend to be biased toward spatial compactness, which may not be quite what we want. So, instead we will try to manipulate the spatial structure a bit more directly. Again, we’ll first start with a relatively simple algorithm, just to get some idea of how this might work. Here is a simple agglomerative segmentation algorithm:

<math> R' = \{ < l, h(l) > \ | \ l \in Locations \} </math>

<math> R = \emptyset </math>

While ( <math> R \neq R' </math> ) {

<math> R = R' </math>; <math> R' = \emptyset </math>

For each <math> < l, v > \in R </math> {

<math> < l', v' > = \arg\max_{< l'', v'' > \in N(l)} a(v, v''; \sigma) </math>

if ( <math> a(v, v'; \sigma) > \tau </math> ) <math> R' = R' \cup \{ < l \cup l', (v + v')/2 > \} </math>

else <math> R' = R' \cup \{ < l, v > \} </math>

}

}

Once you get past the notation, the idea is quite simple. We start with each pixel as a region. Each region gets to look at its neighbors and decide which neighbor looks most like it. If that neighbor is similar enough (note the threshold <math> \tau </math>), the regions are merged and the appearance values are updated (here we chose the simplest possible update; there are many other possibilities). Once we have gone through the entire image, we check to see whether this process changed any regions. If so, we repeat the process until no more changes occur.
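The same merging idea is easy to demonstrate on a 1-D signal (a sketch; regions are runs of adjacent samples, each carrying a mean appearance value, merged greedily left to right):

```python
import math

def agglomerate_1d(values, sigma=1.0, tau=0.5):
    """Greedy agglomerative grouping of a 1-D signal: repeatedly merge a
    region with an adjacent region when their Gaussian affinity exceeds tau."""
    regions = [([i], float(v)) for i, v in enumerate(values)]  # (indices, mean)

    def aff(a, b):
        return math.exp(-0.5 * (a - b) ** 2 / sigma ** 2)

    changed = True
    while changed:
        changed = False
        for i in range(len(regions) - 1):
            (li, vi), (lj, vj) = regions[i], regions[i + 1]
            if aff(vi, vj) > tau:
                # merge the two regions; average their appearance values
                regions[i:i + 2] = [(li + lj, (vi + vj) / 2)]
                changed = True
                break
    return regions

# Two homogeneous runs separated by a jump: merging stops at the boundary.
values = [0.0, 0.1, 0.05, 5.0, 5.1]
regions = agglomerate_1d(values)
```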

A [http://people.cs.uchicago.edu/~pff/papers/seg-ijcv.pdf recent paper by Felzenszwalb and Huttenlocher] formalizes this idea nicely, and results in an efficient segmentation method.

### Energy Optimization-Based Approaches

All of the algorithms in the previous section are trying, in some sense, to choose a segmentation that maximizes affinity over a data set (e.g. an image). Let’s think about what that means mathematically for a moment, without worrying about how we might compute the results. To start with, let’s just posit that a segmentation <math> S </math> is some partition of <math> \Pi </math>. Let <math> \Kappa </math> represent the set of all possible segmentations. This is quite a large set, of course: absent any other constraints, for an <math> N </math>-pixel image, the number of possibilities grows faster than exponentially in <math> N </math>. Some of these segmentations are rather ludicrous; for example, <math> \Kappa </math> contains the segmentation in which each pixel is its own unique region!
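In fact, the exact number of partitions of an <math> N </math>-element set is the Bell number <math> B_N </math>, which can be computed with the Bell triangle; a quick sketch shows how fast the space of segmentations grows:

```python
def bell(n):
    """n-th Bell number (the number of partitions of an n-element set),
    computed with the Bell triangle."""
    row = [1]
    for _ in range(n - 1):
        new_row = [row[-1]]
        for x in row:
            new_row.append(new_row[-1] + x)
        row = new_row
    return row[-1]

# Even a tiny 4x4 "image" of 16 pixels admits over 10 billion segmentations.
```

So even before we decide how to score a segmentation, the size of the search space rules out exhaustive enumeration for all but the smallest inputs.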