Background

Digital Imaging Basics

Before we launch into a discussion of computer vision, it is important to have at least a rudimentary understanding of the imaging devices we will be dealing with. In general, these devices are composed of (at least) two discrete components: the optical system and the image sensor. We defer discussion of optics-related issues to later, and focus here on the imaging sensor itself.

Imaging Electronics

The electrical imaging system of a camera is responsible for converting light energy to a digital representation of that energy. There are several aspects to the imaging process: the image sensing technology itself, the conversion to a digital bit stream, the protocol used for transfer, and the method by which color images are formed. We will touch on each of these briefly.

First, the basic unit of imaging is the “pixel” (for picture element). There are two major families of imaging elements based on the underlying circuit used: CCD and CMOS. The latter is also known as an “active-pixel sensor.” Although both CCD and CMOS sensors ultimately perform the same function, they have very different attributes. A detailed comparison of the two is found here.

Here are a few of the major points of comparison:

CCD sensors are usually read out using a destructive row-shift operation; CMOS sensors can be non-destructively read on an individual basis. As a result, CMOS sensors have much greater flexibility in terms of addressing and processing.
CMOS sensors require far less power than CCDs. CCDs, however, are generally more sensitive (with quantum efficiencies reaching 70%), whereas CMOS sensors are generally noisier (or require brighter light to function well).
The signal from a CCD chip is analog (voltage); CMOS produces a digital signal.
CMOS sensors are less subject to blooming.

The last point is worth talking about for a moment, as blooming is a significant artifact in digital imaging. A CCD sensor can be thought of as an electron “well.” As photons hit the detector, they are converted to electric charges and stored in this well. However, if the well fills up, it tends to “spill over” into adjacent pixels. As a result, a small, very bright spot will be smeared out into a diffuse patch of brightness.

The output of a CCD sensor is an analog signal (the accumulated charges on each pixel in a row, read row by row). There is thus an A/D conversion that must take place before the signal can be read by a computer. Historically, this process took place on a dedicated board inside the host computer. However, almost all modern cameras now incorporate this process in the camera head itself, thus making almost all modern digital cameras “plug and play.”

Most modern cameras make use of either USB or IEEE-1394 (Firewire) protocols. USB 2.0 has a maximum bandwidth of 480Mb/sec, whereas IEEE-1394a is rated at 400Mb/sec and IEEE-1394b is rated at 800Mb/sec. All are capable of carrying a color image signal, provided that the resolution and frame rate of the camera are not too high. We will return to this issue later when we discuss image representations.
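To make the bandwidth constraint concrete, here is a minimal sketch that compares the raw data rate of an uncompressed video stream against the nominal bus ratings quoted above. The function names are illustrative, and the calculation ignores protocol overhead, so real usable bandwidth will be somewhat lower.

```python
# Nominal bus ratings quoted above, in megabits per second.
BUS_LIMITS_MBPS = {"USB 2.0": 480, "IEEE-1394a": 400, "IEEE-1394b": 800}


def required_bandwidth_mbps(width, height, bits_per_pixel, frames_per_second):
    """Raw (uncompressed) video bandwidth in megabits per second."""
    return width * height * bits_per_pixel * frames_per_second / 1e6


def buses_that_fit(width, height, bits_per_pixel, frames_per_second):
    """Names of the buses whose nominal rating covers the raw video stream."""
    need = required_bandwidth_mbps(width, height, bits_per_pixel, frames_per_second)
    return [name for name, limit in BUS_LIMITS_MBPS.items() if need <= limit]


if __name__ == "__main__":
    # 640x480 RGB (24 bits/pixel) at 30 frames/s needs about 221 Mb/s.
    print(required_bandwidth_mbps(640, 480, 24, 30))
    # 1600x1200 RGB at 30 frames/s needs about 1382 Mb/s, more than any bus listed.
    print(buses_that_fit(1600, 1200, 24, 30))
```

A 640 by 480 color stream at 30 frames per second fits comfortably on all three buses, while a 1600 by 1200 stream at the same rate does not fit on any of them without compression or a reduced frame rate.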

Resolution and Color Imaging

The resolution of CCD cameras has also evolved significantly over the past years. Originally, cameras were normed to television standards, which stipulated a size of 480 lines by 640 columns. A single image frame was composed of two fields consisting of alternating lines. These fields were refreshed at a rate of 60 Hz, thus providing a complete image at 30 Hz (or to be precise, 29.97 Hz). However, modern cameras typically operate in progressive scan mode, and have resolutions up to 1200 by 1600 and beyond. Many cameras can provide images at several hundred frames per second.

Digital video cameras are now predominantly color. There are two general families of color imaging systems: 1 chip CCD and 3 chip CCD systems. One chip CCD systems are probably now the most common form of color imaging. They function by creating a mosaic of color tiles, called a Bayer pattern or Bayer filter. Each element of the mosaic is a 2×2 pattern of pixels, with two diagonal elements covered by a green filter, and the remaining elements covered by a red and a blue filter. If we think of each 2×2 grid as a “meta-pixel,” then clearly there is enough information to compute red, green and blue color components. In fact, what is done is to define filters that interpolate each of the color values onto every pixel. Typically, the resulting color data is encoded using a so-called YUV422 format (see the YUV color representation below) which provides greater spatial detail at the cost of color fidelity.
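The “meta-pixel” idea can be sketched in a few lines. The example below assumes an RGGB layout (red at the top left of each 2×2 tile, blue at the bottom right, green on the other diagonal); actual sensors may order the tile differently. It produces one RGB triple per tile, so the output has half the resolution in each direction. This is deliberately simpler than the interpolating filters used in practice, which produce a color value at every pixel.

```python
def bayer_to_rgb_metapixels(raw):
    """Collapse an RGGB Bayer mosaic into one RGB triple per 2x2 tile.

    `raw` is a list of rows of sensor values with even dimensions.
    The RGGB ordering is an assumption of this sketch.
    """
    rows, cols = len(raw), len(raw[0])
    out = []
    for y in range(0, rows - 1, 2):
        out_row = []
        for x in range(0, cols - 1, 2):
            red = raw[y][x]
            # Two green samples per tile: average them.
            green = (raw[y][x + 1] + raw[y + 1][x]) / 2
            blue = raw[y + 1][x + 1]
            out_row.append((red, green, blue))
        out.append(out_row)
    return out


if __name__ == "__main__":
    mosaic = [
        [10, 20, 30, 40],
        [60, 50, 80, 70],
    ]
    print(bayer_to_rgb_metapixels(mosaic))
```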

Three-chip CCDs, as the name suggests, use a CCD for each color band. A special color-separating prism is used to send separate color channels to each of these CCDs. Such systems have higher color fidelity and resolution, but are more difficult to manufacture and maintain as the three CCDs must be accurately aligned. More information can be found here.

Color Representations

Technically, the following are color models, as opposed to (more complete) color spaces. An absolute color space requires the addition of a gamut (usually a property of the display device, described for example by an ICC profile) to map between a color model and absolute reference points. EasyRGB provides a nice color visualizer to get a better idea of how these models are related, as well as the underlying conversion formulas used.

RGB

The red-green-blue color model is by far the most commonly used because it maps directly to the display format used for most computer monitors and general digital displays. The corresponding digital image capture mechanisms also usually follow this division. The representation is based on the additive properties of light: the image is divided into 3 channels formed by a basis of the primary colors of light, one channel each representing the red, green, and blue content of the image, giving rise to a 3-dimensional color space.

The resulting color for an individual pixel can be visualized as a point in a unit cube. Black is at the origin (0,0,0), with white at the opposite corner (1,1,1). At the corners directly connected to the origin are pure red (1,0,0), green (0,1,0), and blue (0,0,1), with the remaining corners containing yellow (1,1,0), magenta/fuchsia (1,0,1), and cyan/aqua (0,1,1). All intermediate colors are obtained through linear interpolation.

Details at Wikipedia

YUV/YIQ

The YUV and YIQ color models were primarily motivated by the conversion from black-and-white to color television: they separate the brightness information from the color content of images in such a way that older black-and-white televisions could still receive and display new color signals. It was also desirable for the transformation from the standard RGB color model to be simple, for hardware implementation. The following linear transformations were settled upon:

<math> \begin{bmatrix} Y \\ U \\ V \\ I \\ Q \end{bmatrix} = \begin{bmatrix}
 0.299    &  0.587    &  0.114 \\
-0.14713  & -0.28886  &  0.436 \\
 0.615    & -0.51499  & -0.10001 \\
 0.595716 & -0.274453 & -0.321263 \\
 0.211456 & -0.522591 &  0.311135
\end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix} </math>

where Y encodes the brightness information according to human sensitivity to the various component wavelengths. Then the color information is encoded either in U and V or in I and Q, depending on the standard chosen. Although the UV/IQ channels have little intuitive interpretation, this color model remains a fast and simple way to separate brightness and color information.
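As a small illustration, the transformation above can be applied directly as a matrix-vector product. The sketch below uses the coefficients from the matrix and assumes RGB values in the range 0 to 1; the function names are illustrative.

```python
# Rows of the matrix above: Y, U, V (YUV) and Y, I, Q (YIQ).
Y_ROW = (0.299, 0.587, 0.114)
RGB_TO_YUV = (Y_ROW, (-0.14713, -0.28886, 0.436), (0.615, -0.51499, -0.10001))
RGB_TO_YIQ = (Y_ROW, (0.595716, -0.274453, -0.321263), (0.211456, -0.522591, 0.311135))


def apply_matrix(matrix, rgb):
    """Multiply a 3x3 matrix (tuple of rows) by an (R, G, B) triple."""
    return tuple(sum(c * v for c, v in zip(row, rgb)) for row in matrix)


def rgb_to_yuv(rgb):
    return apply_matrix(RGB_TO_YUV, rgb)


def rgb_to_yiq(rgb):
    return apply_matrix(RGB_TO_YIQ, rgb)


if __name__ == "__main__":
    # A gray pixel carries (almost) no color information: U and V are near zero.
    print(rgb_to_yuv((0.5, 0.5, 0.5)))
    # Pure red has luminance 0.299 and non-zero chroma.
    print(rgb_to_yiq((1.0, 0.0, 0.0)))
```

Note that any gray input (R = G = B) maps to chroma components that are zero up to the rounding of the published coefficients, which is exactly the separation of brightness from color that the model was designed for.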

Wikipedia: YUV / YIQ

CMYK

CMYK is not a model typically used in computer vision, but is worth mentioning because it may be commonly encountered. CMYK is a subtractive color model, where the combination of colors follows the behavior of combining pigments, rather than colored light. It is useful to think in terms of the set of colors that would be absorbed by a pigment with the given color. Adding together two pigments results in a combination of this absorbed light, such that the combination of all pigments results in black, as opposed to white in an additive color model. Correspondingly, the absence of any pigmentation produces white, instead of black in an additive color model.

CMYK uses 3 pigment colors: cyan, magenta, and yellow. The final letter denotes black (it is often explained as the final “k” in “black,” though it is also said to stand for the “key” printing plate), whose use is best understood in the context of the application of this color model. CMYK is used for publishing, especially in color printing. Since the other 3 colors combine to black and black ink is cheaper than colored ink, it is useful to be able to decompose a color you’d like to represent so as to make as economical use of the colored inks as possible (though there are further motivations for using pure black ink as well).
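The decomposition described above can be illustrated with a commonly quoted naive conversion, which moves as much of the color as possible into the black channel. This formula is an assumption of the sketch and is not taken from the text; real printing workflows rely on device profiles rather than a closed-form rule.

```python
def rgb_to_cmyk(r, g, b):
    """Naive RGB -> CMYK for r, g, b in [0, 1] (no ink or paper profile)."""
    k = 1.0 - max(r, g, b)          # Black replaces the shared dark component.
    if k >= 1.0:
        return (0.0, 0.0, 0.0, 1.0)  # Pure black: no colored ink at all.
    c = (1.0 - r - k) / (1.0 - k)
    m = (1.0 - g - k) / (1.0 - k)
    y = (1.0 - b - k) / (1.0 - k)
    return (c, m, y, k)


if __name__ == "__main__":
    print(rgb_to_cmyk(1.0, 0.0, 0.0))  # Red: magenta plus yellow, no black.
    print(rgb_to_cmyk(0.5, 0.5, 0.5))  # Gray: black ink only.
```

Under this rule, any gray is printed with black ink alone, which is the economical use of colored ink that the paragraph describes.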

Details at Wikipedia

rg Chromaticity

It is often useful to have representations of an object’s color that are invariant to changes in the intensity of light that illuminates the object. For instance, when tracking an object based on its color under artificial illumination, the overall intensity of the scene’s lighting can be expected to vary slightly from frame to frame. The introduction of shadowing will result in more dramatic effects, which will change the observed RGB values drastically.

rg chromaticity space is a simple modification of RGB to represent colors consistently across lighting intensity changes. Its elements are defined by

<math> r = \frac{R}{R+G+B}</math> and <math> g = \frac{G}{R+G+B}</math>,

with the lower-case letters “r” and “g” used to distinguish them from their uppercase counterparts in RGB. The implied third variable, <math> b = B/(R+G+B)</math>, can be omitted from the representation since <math>r+g+b=1</math>, so the blue portion of the color can also be recovered from just <math>r</math> and <math>g</math>.

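The lighting model under which invariance is achieved assumes that changes in the lighting of an object will result in multiplication of its RGB values by a constant. For a (non-zero) constant <math>c</math>, we have <math> \frac{cR}{cR+cG+cB} = \frac{R}{R+G+B} = r</math>, and likewise for <math>g</math>, so the rg values are unchanged.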

As implied above, there is a singularity at the origin, where <math>0/0</math> is undefined. Consequently, close to the origin, very small changes in RGB values, such as those resulting from sensor noise, can produce large changes in rg values, so one should be careful when using rg on very dark regions.
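A minimal sketch of the conversion follows, including a guard for the singularity at the origin. The threshold `eps` and the choice to return `None` for near-black pixels are assumptions of this example, not a prescribed convention.

```python
def rgb_to_rg(r, g, b, eps=1e-6):
    """Return (r, g) chromaticity, or None when the pixel is too dark.

    `eps` is an illustrative threshold below which R+G+B is treated as zero.
    """
    total = r + g + b
    if total < eps:
        return None  # 0/0 is undefined at the origin.
    return (r / total, g / total)


if __name__ == "__main__":
    bright = rgb_to_rg(50, 100, 100)
    dim = rgb_to_rg(25, 50, 50)      # Same surface under half the light.
    print(bright, dim)               # Both give the same chromaticity.

    # Near the origin, a one-count change flips the result completely.
    print(rgb_to_rg(1, 0, 0), rgb_to_rg(0, 1, 0))
```

The last line illustrates the caution above: two very dark pixels that differ by a single sensor count land at opposite corners of rg space.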

HSV

Hue, Saturation, and Value define a color space very commonly used in color pickers in drawing programs, due to its perceptual intuitiveness and relative simplicity. Value represents the level of total illumination. Saturation can be thought of as the vividness of the color, ranging from an “unsaturated” gray-tone at zero to a “fully-saturated” color at its maximum. Hue, finally, represents the underlying color (e.g. red, orange, yellow) whose saturation and brightness are modulated by the other variables. Hue is often interpreted as an angle, as its maximum and minimum values are perceptually nearly identical. Its range, however, is variously defined as zero to one, 360, 2<math>\pi</math>, 255, or other values depending on the convention. (1)

There is no simple formula for conversion from RGB to HSV, primarily due to the hue component.

<math>H</math> depends on which of <math>R</math>, <math>G</math>, or <math>B</math> is maximum:

<math> H = \begin{cases} 0, & \mbox{if } \max = \min \\ (60^\circ \times \frac{G - B}{\max - \min} + 360^\circ)\;\bmod\;360^\circ, & \mbox{if } \max = R \\ 60^\circ \times \frac{B - R}{\max - \min} + 120^\circ, & \mbox{if } \max = G \\ 60^\circ \times \frac{R - G}{\max - \min} + 240^\circ, & \mbox{if } \max = B \end{cases} </math>(2)

Hue is not defined when <math>\max = \min</math>, i.e. for shades of gray, so it is assigned an arbitrary value by convention. In this case, that value is 0, which corresponds to red.
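The piecewise definition of hue translates almost line for line into code. The sketch below returns hue in degrees and uses the same convention of 0 for grays; the function name is illustrative, and when two channels tie for the maximum the cases are tested in the order R, G, B.

```python
def hue_degrees(r, g, b):
    """Hue in degrees [0, 360) following the piecewise formula above."""
    mx, mn = max(r, g, b), min(r, g, b)
    if mx == mn:
        return 0.0  # Gray: hue is undefined, 0 by convention.
    delta = mx - mn
    if mx == r:
        return (60.0 * (g - b) / delta + 360.0) % 360.0
    if mx == g:
        return 60.0 * (b - r) / delta + 120.0
    return 60.0 * (r - g) / delta + 240.0


if __name__ == "__main__":
    for name, rgb in [("red", (1, 0, 0)), ("yellow", (1, 1, 0)),
                      ("green", (0, 1, 0)), ("cyan", (0, 1, 1)),
                      ("blue", (0, 0, 1)), ("magenta", (1, 0, 1))]:
        print(name, hue_degrees(*rgb))
```

The six corner colors of the RGB cube come out evenly spaced at 60 degree intervals around the hue circle.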

The space is best visualized as a cone, with the central axis corresponding to value, the radius along a circular cross-section to saturation, and the angle around that circle to hue. Cylindrical visualizations are more common, but the conical representation better displays the fact that the number of representable colors decreases with the value.

The <math>H</math> and <math>S</math> components can be used alone to achieve a degree of illumination-invariance, as with rg, but the nonlinear definition of the <math>V</math> component translates less directly to real illumination changes. Additionally, although distances taken within the space represented as a cone have fairly perceptually meaningful values, the <math>H</math>, <math>S</math>, and <math>V</math> components are not suitable for direct use with Euclidean distance. First, they do not take into account the narrowing effect at the point of the cone. Second, the hue component should wrap around on itself, such that a maximum value should have close to a zero distance from a minimum value (both are red).
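Both caveats can be handled with a little extra work. The sketch below shows a wrap-around hue distance and one possible embedding of HSV into a cone, in which the radius is taken to be saturation times value so that all hues collapse to a single point at black. This particular parameterization is an assumption of the example; other cone mappings are possible.

```python
import math


def hue_distance(h1, h2, period=360.0):
    """Shortest angular distance between two hues on a circle."""
    d = abs(h1 - h2) % period
    return min(d, period - d)


def hsv_to_cone(h_degrees, s, v):
    """Map HSV (hue in degrees, s and v in [0, 1]) to a 3-D point in a cone."""
    angle = math.radians(h_degrees)
    radius = s * v  # The cone narrows to a point as value goes to zero.
    return (radius * math.cos(angle), radius * math.sin(angle), v)


def cone_distance(hsv1, hsv2):
    """Euclidean distance between two HSV colors after the cone mapping."""
    p, q = hsv_to_cone(*hsv1), hsv_to_cone(*hsv2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


if __name__ == "__main__":
    print(hue_distance(350.0, 10.0))                  # 20 degrees, not 340.
    print(cone_distance((0, 1, 0), (180, 1, 0)))      # Two blacks coincide.
    print(cone_distance((359, 1, 1), (1, 1, 1)))      # Nearly identical reds.
```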

Details at Wikipedia

Lab/Luv

The Commission Internationale de l’Éclairage (CIE, French for International Commission on Illumination) simultaneously introduced two slightly different color spaces in 1976, Lab and Luv, expressly designed so that Euclidean distances within these spaces correspond well with human perceptions of differences between colors. These are full color spaces (as opposed to color models), and as such they cannot be directly converted from RGB without absolute references for the primary color components. Lab and Luv (also known as CIELAB and CIELUV) were originally defined with reference to an older CIE color space, XYZ, and most definitions still use it as an intermediary for conversion from other color spaces.

In both of these spaces, the <math>L^*</math> component represents lightness, the perceived brightness of a color, and the remaining two components (<math>a^*</math> and <math>b^*</math> in Lab or <math>u^*</math> and <math>v^*</math> in Luv) describe the chromaticity. At the time of introduction, there was no consensus within the commission behind a single standard. Lab seems to be the more popular choice now, but the two have very similar properties. Both involve nonlinear transformations from RGB, so they are not commonly used unless perceptual similarity is the main focus.

Wikipedia details on CIELAB and CIELUV

Image Formats

A number of formats for storing digital images have been developed over the years. What follows is a sampling of the most common ones. For the purposes of computer vision, it is important to realize that many common formats, e.g. JPEG, are “lossy” formats, which introduce artifacts into images. Such formats are to be avoided if at all possible in computer vision work.

BMP

Of the common formats, the bitmap file format, which uses the file extension “BMP,” follows most directly from the image capture mechanisms of typical digital (CCD or CMOS) cameras. The image is divided into pixels, each of which describes the color of a region with a number of bits depending on the color depth being used. Typical choices are 24- and 32-bit color depths, with 8 bits each being used to describe the red, green, and blue content of the pixels, and an optional additional 8 bits to describe “alpha,” the transparency.

As with all of these formats, the file consists of a header followed by the main data in a block. Some bitmaps may be color-indexed, in which case a color palette lies at the end of the header. In color-indexed images, the palette lists all of the colors which are represented in the image. Then, each pixel’s color is described by a code referencing one of the colors in the palette, rather than by explicitly describing the color. Most bitmaps are not indexed, though, and indexing does not provide very much compression on natural images, so they tend to be quite large as a result. They are, however, a lossless format, containing essentially raw data, and are therefore relatively easy to read and write.
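The color-indexing scheme is easy to sketch independently of any file format. The example below builds a palette and a list of per-pixel codes from a flat list of RGB triples; it illustrates the idea only and does not reproduce the BMP header or palette layout.

```python
def index_colors(pixels):
    """Split a list of RGB triples into (palette, indices)."""
    palette = []   # Each distinct color, in order of first appearance.
    lookup = {}    # Color -> position in the palette.
    indices = []   # One palette code per pixel.
    for color in pixels:
        if color not in lookup:
            lookup[color] = len(palette)
            palette.append(color)
        indices.append(lookup[color])
    return palette, indices


def expand_indices(palette, indices):
    """Recover the original pixel list from a palette and its codes."""
    return [palette[i] for i in indices]


if __name__ == "__main__":
    image = [(255, 0, 0), (0, 0, 255), (255, 0, 0), (255, 0, 0)]
    palette, indices = index_colors(image)
    print(palette, indices)
    print(expand_indices(palette, indices) == image)
```

The saving comes from storing a small code per pixel instead of a full color. On natural images nearly every pixel has a distinct color, so the palette grows almost as large as the image, which is why indexing helps little there.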

Details at Wikipedia. Also see the PNM format for even easier file reading and writing.

GIF

The Graphics Interchange Format (.GIF) uses a color indexing scheme similar to that optionally provided by BMPs. An individual GIF is capable of representing up to 256 individual colors (including transparency information), so conversion to this format will require a loss of information if the original image contains more colors than that. The data is further compressed with lossless LZW encoding.

The most distinguishing characteristic of GIF is its support for animation. Several images may be combined into an individual GIF file, and they will be displayed in flip-book fashion at a specified rate, allowing the creation of simple videos. No compression is performed based on the information repeated between frames, however, so animated GIFs are generally much larger than videos encoded using standard video codecs, which make heavy use of such information.

Details at Wikipedia

PNG

The Portable Network Graphics format (.PNG) was originally developed as an alternative to GIF because of patent issues (the patents involved have since expired). PNG has become quite different from GIF, though, and has largely displaced it. PNG’s main features are transparency and a lossless compression scheme that works well on real images.

Unlike GIF, PNG does not support animation. Although a couple of ventures have attempted to add such support as an extension or derived file type, they have not yet garnered wide support or usage.

Details at Wikipedia

JPEG

The Joint Photographic Experts Group came up with the now-ubiquitous .JPG/.JPEG standard. The most important thing to be aware of about JPEG is that it is a lossy compression format. It supports several levels of compression, with increasing levels of information loss that result in artifacts sometimes known as the “jaggies”.

JPEG (usually) makes use of the discrete cosine transform (DCT) to perform this compression, resulting in a spatial frequency-domain representation of the image. Information can then be selectively discarded, starting with high-frequency information that tends to be least perceptually significant, particularly in very large images. JPEG also supports the option of using a version of Huffman coding for the compression.
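The effect of discarding high-frequency information can be seen on a single row of eight samples. The sketch below uses a one-dimensional orthonormal DCT and simply zeroes the highest coefficients. This is a simplification made for illustration: JPEG itself works on two-dimensional 8×8 blocks and quantizes coefficients rather than truncating them outright.

```python
import math


def dct(samples):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(samples)
    coeffs = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(samples))
        coeffs.append(scale * total)
    return coeffs


def idct(coeffs):
    """Inverse of dct(): reconstruct samples from DCT coefficients."""
    n = len(coeffs)
    samples = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        samples.append(total)
    return samples


def discard_high_frequencies(samples, keep):
    """Keep only the `keep` lowest-frequency coefficients, then reconstruct."""
    coeffs = dct(samples)
    truncated = [c if k < keep else 0.0 for k, c in enumerate(coeffs)]
    return idct(truncated)


if __name__ == "__main__":
    row = [52, 55, 61, 66, 70, 61, 64, 73]
    for keep in (8, 4, 2):
        approx = discard_high_frequencies(row, keep)
        print(keep, [round(v, 1) for v in approx])
```

Keeping all eight coefficients reproduces the row exactly, up to rounding; keeping fewer yields a progressively smoother approximation of it.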

Details at Wikipedia

TIFF

The “Tagged Image File Format” uses the file extension .TIFF or .TIF and was originally introduced to support publishing. It was similar to bitmaps when first introduced, but supported a form of compression known as PackBits (a form of run-length encoding). In this scheme, the image is considered one long stream of pixels, and consecutive pixels with the same color are encoded as the number of times that color is repeated. This can greatly reduce the size of files with uniform regions, but is less effective on natural images. More recently, the format has been extended to support LZW or JPEG compression as alternatives as well. It is important to note, however, that TIFF is lossless only when stored uncompressed or with a lossless scheme such as PackBits or LZW; a TIFF that uses JPEG compression is lossy, just like a JPEG file. Because of the complexity of the TIFF specification and the wide variety of options (e.g. for compression), TIFF is a diverse standard, and many TIFF viewers support only a subset of it.
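The run-length idea described above can be sketched generically. The example below encodes a stream of pixel values as (count, value) pairs; it illustrates the principle only and is not the exact byte layout that TIFF uses.

```python
def rle_encode(pixels):
    """Encode a sequence as a list of (run_length, value) pairs."""
    runs = []
    for value in pixels:
        if runs and runs[-1][1] == value:
            runs[-1][0] += 1          # Extend the current run.
        else:
            runs.append([1, value])   # Start a new run.
    return [(count, value) for count, value in runs]


def rle_decode(runs):
    """Expand (run_length, value) pairs back into the original sequence."""
    return [value for count, value in runs for _ in range(count)]


if __name__ == "__main__":
    uniform = [7, 7, 7, 7, 7, 7, 3, 3]
    noisy = [7, 8, 7, 9, 6, 7, 8, 7]
    print(rle_encode(uniform))     # Two runs describe eight pixels.
    print(len(rle_encode(noisy)))  # Eight runs: no saving at all.
```

The second case shows why the scheme is less effective on natural images: when neighboring pixels rarely repeat exactly, every pixel becomes its own run.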

Details at Wikipedia

Light Transport

Although we do not intend to spend a great deal of time on the physical properties of image formation, there are a few concepts that are useful to have in mind as we progress through our study of computer vision. A fundamental concept for understanding how light is reflected from surfaces is the Bidirectional Reflectance Distribution Function (BRDF). We can write this as

<math> f(x, n_i, n_o) = \frac{dL_o(x, n_o)}{L_i(x, n_i)\,(n_i \cdot n(x))\,d\omega_i} </math>

Here <math> x </math> is a point on the surface of interest with normal <math> n(x).</math> <math> n_i </math> and <math> n_o </math> are unit vectors indicating input and output directions, <math> L_i </math> is the radiance arriving from direction <math> n_i </math> over the solid angle <math> d\omega_i </math>, and <math> L_o </math> is the radiance leaving in direction <math> n_o </math>.

This is the ratio of the reflected radiance to the incident irradiance that produces it. More details can be found at Wikipedia.

There are several special cases of the BRDF that are used in computer vision and graphics. A particularly important one is the Lambertian model, where it is assumed that reflected light is scattered equally in all directions. If we multiply through, integrate over the hemisphere <math> \Omega </math> of incoming directions, and include a surface albedo term <math> \rho </math> to account for attenuation of the light due to pigmentation of the surface, we have

<math> L_o(x) = \rho(x) \int_{\Omega} L_i(x, n_i)\,(n_i \cdot n(x))\,d\omega_i </math>

This simplifies further if we assume a single point source with intensity <math> L_s </math> in direction <math> n_s </math>, in which case we have

<math> L_o(x) = \rho(x)\, L_s\, (n_s \cdot n(x)) </math>

At this point, it is worth noting that if we now observe the surface when illuminated by three independent sources, it is possible to calculate both surface normal and albedo from the image information.
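This recovery can be sketched for a single pixel. The example assumes the Lambertian point-source model, in which each observed intensity is the albedo times the source intensity times the dot product of the source direction and the surface normal, and it assumes the three source directions are known, linearly independent, and all illuminate the point (no shadows). Stacking the three observations gives a 3×3 linear system whose solution is the normal scaled by the albedo. Function names and the use of Cramer's rule are choices made for this illustration.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(a, b):
    """Solve the 3x3 system a @ x = b by Cramer's rule."""
    d = det3(a)
    if abs(d) < 1e-12:
        raise ValueError("source directions are not linearly independent")
    solution = []
    for col in range(3):
        m = [list(row) for row in a]
        for r in range(3):
            m[r][col] = b[r]  # Replace one column with the right-hand side.
        solution.append(det3(m) / d)
    return solution


def normal_and_albedo(source_dirs, intensities, source_intensity=1.0):
    """Recover (unit normal, albedo) from three Lambertian observations."""
    scaled = [i / source_intensity for i in intensities]
    g = solve3(source_dirs, scaled)          # g = albedo * normal
    albedo = math.sqrt(sum(c * c for c in g))
    if albedo == 0.0:
        raise ValueError("zero albedo: the normal cannot be recovered")
    return [c / albedo for c in g], albedo


if __name__ == "__main__":
    sources = [[0.0, 0.0, 1.0], [0.6, 0.0, 0.8], [0.0, 0.6, 0.8]]
    # Synthetic observations of a surface with normal (0, 0, 1), albedo 0.5.
    observed = [0.5, 0.4, 0.4]
    print(normal_and_albedo(sources, observed))
```

Because the normal has unit length, the magnitude of the solution vector gives the albedo and its direction gives the normal.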