Two Camera Geometry
We begin with a brief discussion of the geometry of two cameras. By carefully constructing a two-camera system, we can exploit this geometry to develop efficient algorithms for recovering three-dimensional structure.
Two Camera Geometry from Calibration
In the last chapter, we saw how to calibrate a camera using a special target. The result of the calibration process includes both the intrinsic parameters of the camera (pixel scale, image center, and distortion coefficients) as well as the extrinsic parameters (the location of the camera relative to the external target).
Let us denote the transformation from coordinates of the external reference frame <math>F </math> to the camera <math>c1 </math> by the homogeneous transform <math> \ ^{c1} H_F</math>. Suppose now that we have a second camera <math>c2 </math>. If we simultaneously observe the same target with both cameras during calibration, we will also compute a corresponding <math>\ ^{c2} H_F </math>. By transform algebra, we can thus compute the relationship between the two cameras as
<math> \ ^{c1} H_{c2} = \ ^{c1} H_F (\ ^{c2} H_F)^{-1} </math>
This allows us to relate points in the second camera frame to the first.
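As a concrete illustration, here is a minimal Matlab sketch of this computation; the transform values are placeholders and the variable names are only illustrative.

% Hypothetical outputs of calibrating both cameras against the same target:
% H_c1_F maps target (frame F) coordinates into camera 1 coordinates,
% H_c2_F does the same for camera 2.  The values below are placeholders.
H_c1_F = [eye(3), [ 0.1; 0.0; 0.5]; 0 0 0 1];
H_c2_F = [eye(3), [-0.1; 0.0; 0.5]; 0 0 0 1];

% ^{c1}H_{c2} = ^{c1}H_F * (^{c2}H_F)^{-1}
H_c1_c2 = H_c1_F / H_c2_F;

% A point known in camera 2 coordinates, expressed in camera 1 coordinates:
p_c2 = [0; 0; 1; 1];          % homogeneous coordinates
p_c1 = H_c1_c2 * p_c2;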
With two cameras, it is now possible to perform triangulation to compute the 3D coordinates of points observed in the images. We will first consider a special case, a non-verged camera system. In this case, the spatial transform relating the two cameras consists of a translation of a distance <math> b </math> (the baseline) along the camera <math>x </math> axis. If we consider now a point <math>\ ^{c1}p = (x,y,z)^t </math>, the equivalent point in <math>c2 </math> coordinates is <math>\ ^{c2}p = (x - b, y, z)^t</math>. Consider now the projections of these two points for a camera with focal length <math>f </math>:
<math>\begin{bmatrix} u_1 \\ v_1 \\ u_2 \\ v_2
\end{bmatrix} =
\begin{bmatrix} f x/z \\ f y/z \\ f (x-b)/z \\ f y/z \end{bmatrix}
</math>
From this, we see the two <math>v </math> coordinates are the same, but the <math> u </math> coordinates differ. If we take their difference, we have
<math> d = u_1 - u_2 = f b/z \Rightarrow z = b f/d </math>
Given a value for <math>z, </math> we can plug back into the original perspective equations to get <math>x </math> and <math>y </math>.
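As a small worked sketch in Matlab, assuming image coordinates measured relative to the image center and a known focal length <math>f </math> and baseline <math>b </math> (all numbers below are only illustrative):

% Triangulation for a non-verged pair.  f, b and the measurements
% (u1, v1, u2) are assumed given; the values here are only examples.
f  = 600;                 % focal length (pixels)
b  = 0.12;                % baseline (meters)
u1 = 320; v1 = 50; u2 = 290;

d = u1 - u2;              % disparity
z = b * f / d;            % from d = f*b/z
x = u1 * z / f;           % invert u1 = f*x/z
y = v1 * z / f;           % invert v1 = f*y/z
p_c1 = [x; y; z];         % the reconstructed point in c1 coordinates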
This same idea can be extended to general transformations between the cameras. We leave it as an exercise to compute this relationship.
Two Camera Geometry from Image Information
In the previous example, we noted that the <math> v </math> values of the projected points were identical. This was an artifact of our specially chosen imaging geometry. Notice that a small rotation of the camera about the <math>x </math> axis would change the value of <math>v </math> while leaving <math>u </math> largely unaffected. Thus, this “extra” coordinate seems to carry some information about the spatial relationship between the cameras. This is in fact the case.
Using the spatial relationship between the two cameras, we can write <math>\ ^{c2}p = R \ ^{c1}p + T </math> with <math>R\in SO(3) </math> denoting a rotation matrix and <math>T\in \Re(3) </math> denoting a translation vector.
Note that the vectors <math>\ ^{c2}p, </math> <math>R \ ^{c1} p </math> and <math>T </math> all lie in a common plane — it is called an epipolar plane. This is a consequence of the fact that the optical centers of the two cameras, together with any point in space not collinear with the optical centers (that is, 3 points in general position), define a plane in space. We can compute the normal to the plane by taking the cross product of two of these vectors. Here, we will use a trick and write the cross product operation as a skew-symmetric matrix <math>sk(T) </math>. Multiplying both sides of our equation above by <math>sk(T) </math>, we have
<math>sk(T) \ ^{c2}p = sk(T) R \ ^{c1}p </math>
(note that <math>sk(T) T = 0 </math>)
Now, we can take the dot product with <math>\ ^{c2}p </math>. Since <math>\ ^{c2}p </math> lies in the epipolar plane and <math>sk(T) \ ^{c2}p </math> is normal to that plane, the result must vanish. Indeed, we find that
<math>\ ^{c2}p \cdot sk(T) \ ^{c2}p = 0 = \ ^{c2}p \cdot sk(T) R \ ^{c1}p </math>
We see we can write this all as
<math> \ ^{c2}p^t E \ ^{c1}p = 0 </math>
where
<math>E = sk(T) R </math>
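The construction of <math>E </math> is easy to check numerically. The following Matlab sketch uses an illustrative rotation and translation and verifies the constraint <math> \ ^{c2}p^t E \ ^{c1}p = 0 </math>:

% Skew-symmetric matrix implementing the cross product: sk(T)*v equals cross(T, v)
sk = @(T) [0 -T(3) T(2); T(3) 0 -T(1); -T(2) T(1) 0];

% An illustrative relative pose (rotation about y plus a translation)
theta = 0.1;
R = [cos(theta) 0 sin(theta); 0 1 0; -sin(theta) 0 cos(theta)];
T = [0.12; 0; 0.01];

E = sk(T) * R;                 % the essential matrix

% Any point satisfies the epipolar constraint
p_c1 = [0.3; -0.2; 2.0];
p_c2 = R * p_c1 + T;
residual = p_c2' * E * p_c1;   % zero up to round-off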
As a last step, we note that for a unit focal length camera, the projective coordinates of the projection of <math>p </math> form a vector <math>q\in\Re(3) </math> with <math>q = \lambda p </math>. Thus, for the projections <math>q_1 </math> of <math>p </math> in <math>c1 </math> and <math>q_2 </math> of <math>p </math> in <math>c2 </math>, we have
<math> q_2^t E q_1 = 0 </math>
If we include the intrinsic coordinate transformation <math> r = K q </math> into pixel coordinates, we have
<math> r_2^t K^{-t} E K^{-1} r_1 = r_2^t F r_1 = 0 </math>
There are a number of important properties of both of these matrices:
- Both <math>E </math> and <math>F </math> are <math>3\times 3 </math> matrices defined up to scale.
- Both are singular (can you show why?) and thus <math>det(E) = det(F) = 0 </math>.
- It follows that <math>F </math> has 7 degrees of freedom: its 9 entries are only defined up to scale, and the singularity constraint removes one more.
- We can see that <math>E </math> has 5 degrees of freedom: it is constructed from 6 parameters (3 for the rotation and 3 for the translation), but is only defined up to scale.
- Given a point <math>q_1 </math> (respectively <math>r_1 </math>), we have <math>l = E q_1 </math> (respectively <math>l = F r_1 </math>). The vector <math> l \in \Re(3) </math> defines the epipolar line upon which the corresponding point <math>q_2 </math> (respectively <math>r_2 </math>) must fall.
- The unit vector <math>e </math> such that <math>E e = 0 </math> (or <math>F e = 0 </math>) is the left epipole; all epipolar lines in that image pass through it, and it corresponds to the projection of the optical center of <math>c2 </math> in <math>c1 </math>. There is a corresponding right epipole obtained by solving <math>e^t E = 0 </math>.
There has been a great deal of literature devoted to computing the <math>E </math> and <math>F </math> matrices. Here, we describe the most straightforward approach, the so-called 8 point algorithm due to Longuet-Higgins. A good and complete description is available on Wikipedia.
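For reference, here is a sketch of the core linear step of the 8 point algorithm in Matlab. It omits the coordinate normalization that a practical implementation should include; q1 and q2 are assumed to be 3xN matrices of corresponding homogeneous image points.

% Core of the 8 point algorithm (no normalization).  q1 and q2 are 3xN
% matrices of corresponding homogeneous image points, N >= 8.
N = size(q1, 2);
A = zeros(N, 9);
for i = 1:N
    % each correspondence gives one linear equation q2'*E*q1 = 0
    A(i, :) = reshape(q2(:, i) * q1(:, i)', 1, 9);
end

% The entries of E (up to scale) form the null vector of A
[~, ~, V] = svd(A);
E = reshape(V(:, end), 3, 3);

% Enforce the singularity (rank 2) constraint
[U, S, V] = svd(E);
S(3, 3) = 0;          % (for a true essential matrix, one would
E = U * S * V';       %  also set S(1,1) = S(2,2))

The same construction applied to pixel coordinates <math>r_1 </math> and <math>r_2 </math> yields <math>F </math>.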
Image Rectification
In a previous section, we showed that a non-verged camera setup has a very simple image relationship: each horizontal scanline (or row) of the <math>c1 </math> image is in correspondence with the same row in <math>c2 </math>.
If we consider the geometry of the cameras themselves, we see that we defined a geometry where:
- The coordinate axes of the cameras were aligned (that is, R = I)
- The only translation between the cameras was along the <math>x </math> axis
Another way to think of this is that the image planes of a non-verged camera pair are always themselves co-planar.
We now show that it is possible to take the images from any pair of cameras and create a new pair of images which correspond to what a non-verged pair of cameras sited at exactly the same locations would have seen.
We again assume we have <math>\ ^{c1}p = R \ ^{c2}p + T </math> (written here in the direction that makes <math>T </math> the location of the <math>c2 </math> optical center expressed in <math>c1 </math> coordinates). For <math>c1 </math>, our goal is to produce a coordinate system with the following properties:
- <math>T </math> is the new <math>x </math> axis
- <math> z </math> points in roughly the same direction as the <math>z </math> axes of the original cameras
- <math>y </math> will then follow as <math>z \times x </math>
Let <math>R_{new} </math> denote this rotation.
The construction of <math>R_{new} </math> is as follows:
- Set <math>x_{new} = T/||T|| </math>
- Choose <math>z_{tmp} = x_{new} \times (0,1,0)^t </math>
- Set <math>z_{new} = z_{tmp}/||z_{tmp}|| </math>
- Set <math>y_{new} = z_{new} \times x_{new} </math>
- Set <math>R_{new} = \begin{bmatrix}x_{new}^t\\ y_{new}^t \\ z_{new}^t \end{bmatrix} </math>
Note that, by construction <math>R_{new} T = (||T||,0,0)^t </math> as we desire — <math>c2 </math> is along the x axis in the rotated coordinate system.
Given <math>R_{new}, </math> we can define a second matrix <math>R_{new'} = R_{new} R </math>. We claim that applying these rotations (<math>R_{new} </math> to <math>c1 </math> and <math>R_{new'} </math> to <math>c2 </math>) produces a non-verged pair of cameras.
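In Matlab, the construction above can be transcribed directly (with <math>R </math> and <math>T </math> the relative pose just defined):

% Construction of the rectifying rotations; R and T are the relative
% pose from ^{c1}p = R ^{c2}p + T.
x_new = T / norm(T);
z_tmp = cross(x_new, [0; 1; 0]);
z_new = z_tmp / norm(z_tmp);
y_new = cross(z_new, x_new);

R_new       = [x_new'; y_new'; z_new'];   % rows are the new axes
R_new_prime = R_new * R;                  % rotation applied to the second camera

% Sanity check: R_new * T should equal (norm(T), 0, 0)'
T_rot = R_new * T;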
The final step of the procedure is now to compute the new “virtual” image from the existing image. The key idea is to note that there is a homography that relates the two images. Again letting <math>K </math> denote the mapping from internal to pixel coordinates, we claim the following homography does the trick:
<math>H_{new} = K_{new} R_{new} K_{c1}^{-1} </math>
Namely, H takes points in pixel coordinates of the current image, converts them to internal coordinates, applies the rotation into the new system, and then converts them back to pixel coordinates using a new internal parameter matrix <math>K_{new}. </math> The latter is usually chosen to achieve some “nice” properties of the resulting image. In particular, notice that applying <math>R_{new} </math> to an image will tend to move the image to one side or another. In order to create a pixel grid that captures as much of the image as possible, it is usual to perform a recentering operation by adjusting the location of the image center in <math>K_{new} </math>. We leave this as an exercise.
With this, we can now implement a rectification algorithm. The key observations we use are the following:
- We can think of the elements of the pixel grid as having spatial coordinates <math>(u,v,1) </math> which we interpret now as a 3D point in space (think of the CCD elements as living one unit in front of the optical center of the camera).
- We can apply <math>H </math> to these coordinates; this effectively transforms the pixels to internal coordinates (unitless 3D points), rotates the points, and then converts the rotated points back to pixels.
- In order to make this convenient, it is usual to perform the operation in the reverse direction — that is, we choose points in the target image, and go backwards to find the locations in the source image. Typically, bilinear interpolation is needed to perform the sampling, since the source point will generally not fall at an integer pixel location.
- We apply the same procedure to the second image using the homography <math>H_{new'} = K_{new'} R_{new'} K_{c2}^{-1} </math>
- We will usually apply an additional step of distortion correction prior to the homography.
- For most stereo pairs, the complete resampling chain is stored as a lookup table and can thus be implemented extremely efficiently.
The resulting pair of images now appears as a non-verged pair with a baseline of <math>||T|| </math>.
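A minimal Matlab sketch of the resampling step for the first image is given below. It assumes the matrices K_c1, K_new and R_new from above, ignores lens distortion, and writes out the bilinear interpolation explicitly; a real implementation would fold all of this into the precomputed lookup table mentioned above.

% Inverse warping of image I1 into the rectified image I1_rect.
% K_c1, K_new, R_new are assumed given; distortion correction is omitted.
Hinv = K_c1 * R_new' / K_new;     % inverse of H_new = K_new * R_new * inv(K_c1)

I1_rect = zeros(nRows, nCols);
for v = 1:nRows
    for u = 1:nCols
        % map the rectified pixel back into the source image
        s  = Hinv * [u; v; 1];
        us = s(1) / s(3);
        vs = s(2) / s(3);

        % bilinear interpolation at the (non-integer) source location
        u0 = floor(us);  v0 = floor(vs);
        du = us - u0;    dv = vs - v0;
        if u0 >= 1 && v0 >= 1 && u0 < nCols && v0 < nRows
            I1_rect(v, u) = (1-du)*(1-dv)*I1(v0,u0) + du*(1-dv)*I1(v0,u0+1) ...
                          + (1-du)*dv*I1(v0+1,u0)   + du*dv*I1(v0+1,u0+1);
        end
    end
end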
As a final note, we point out that the same ideas can be applied to a three-camera system. In this case, since we have three points, the rotation is fully determined by the cameras; we are left with no choices. Also, in this case, we in fact use a general affine transformation rather than a rotation, thus allowing us to orthogonalize the axes defined by the camera systems. We can also use the scale factors in the “new” internal parameter matrices to ensure that the images have a common disparity step.
As a “final final” note, we cannot extend this idea further to more than three cameras — three points in space define a plane; if we add a fourth point (except in the case it is coplanar with the first three), there is no common image plane to which we can rectify.
Computational Stereo
Now, both of the threads of discussion come together. We know how to take any pair of cameras and make them a non-verged pair through rectification, and we know how to perform triangulation for non-verged pairs. However, the latter assumes that we know the corresponding pair of projected locations for a point in space. Thus, we must solve this so-called correspondence problem to be able to implement computational stereo.
Dense Matching Methods for Stereo Correspondence
We can define the correspondence problem as follows:
<i> Given a pair of images I1 and I2 that have been acquired simultaneously from a non-verged camera pair, for every location (u,v) in I1, compute the corresponding location (u,v') in I2 such that (u,v) and (u,v') are the projections of the same physical point in space, if such a point exists. </i>
There are several important points about this formulation:
- We have made explicit use of the non-verged geometry of the images by requiring corresponding pairs to share the same v coordinate.
- We define the problem as computing a match for every image location in I1, but we could have chosen some smaller set of feature points instead. However, this will result in a sparse (as opposed to dense) reconstruction.
- We chose to match points in I1 to those in I2; we could have just as well started with I2 and matched to I1. The only difference is the choice of coordinate system in which the final points will be expressed.
- It is most likely the case that the exact match in I2 is not at an integer pixel location.
We will solve this problem using a region-based matching algorithm, with regions matched using any of the metrics defined in Features. For the purposes of discussion, we will assume we are working with the SAD metric since, in practice, that is one of the most commonly applied methods. We begin with a naive implementation of stereo matching:
for i=1:nRows
  for j=1:nCols
    minValue = INF
    thisDisparity = -1
    for k=minDisparity:maxDisparity
      val = SAD(I1, i, j, I2, i, j+k, w, h)
      if (val < minValue)
        minValue = val
        thisDisparity = k
      end
    end
    disparities(i,j) = thisDisparity
  end
end
Here, we have assumed the existence of a function SAD that computes the SAD value between two regions of size w by h in the two images, where the supplied coordinates define the center of the region.
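For completeness, one possible Matlab implementation of such a helper is sketched below; it assumes odd window dimensions and that the window fits inside both images (real code would clamp or skip border pixels).

function val = SAD(I1, i1, j1, I2, i2, j2, w, h)
    % Sum of absolute differences between the w-by-h windows centered
    % at (i1,j1) in I1 and (i2,j2) in I2.  No bounds checking; w, h odd.
    hw = floor(w/2);
    hh = floor(h/2);
    W1 = I1(i1-hh:i1+hh, j1-hw:j1+hw);
    W2 = I2(i2-hh:i2+hh, j2-hw:j2+hw);
    val = sum(abs(W1(:) - W2(:)));
end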
If we add up the number of operations, we see that we perform w*h difference and absolute value operations for nRows*nCols*nDisparities tests. Quite a lot of computation!
We can do a little better than this. Note that in the current algorithm we in fact do some redundant work. Namely, suppose 0 disparity is one possibility and consider two cases: (i,j) vs (i,j) and (i, j+1) vs (i,j+1). These pairs of windows have (w-1)*h values in common! We are repeating computations — how can we avoid this?
Here is a better algorithm, written in Matlab notation since we are going to deal with several parallel image-related operations:
%% This is all setup
nDisparities = maxDisparity - minDisparity + 1;
firstCol = max(1, 1-minDisparity);
lastCol = min(nCols, nCols-maxDisparity);
cols = firstCol:lastCol;
nComputedColumns = length(cols);

%% These are the buffers for intermediate values
minValues = Inf*ones(nRows, nComputedColumns);
disparities = -1*ones(nRows, nComputedColumns);

%% And here is the entire stereo algorithm
for k = minDisparity:maxDisparity
  vals = conv2(abs(I1(:,cols) - I2(:,cols+k)), fspecial('average',[h w]), 'same');
  disparities(vals < minValues) = k;
  minValues = min(minValues, vals);
end
Consider now how many operations this algorithm performs. We perform nDisparities convolutions, each of which involves nRows*nCols differences and absolute values, for a total of nRows*nCols*nDisparities operations — a savings of a factor of w*h! Note also that the convolution can be easily implemented using separability and the running sums trick we talked about in our discussion of convolution. So, each of the convolutions is also only 4*nRows*nCols operations, carried out nDisparities times. So, the overall algorithm is O(nRows*nCols*nDisparities).
Other Extensions to Stereo
The algorithm above computes a dense disparity map for a pair of cameras. There are several variations on this basic theme that have been developed. Here are a few.
Multi-Camera Stereo
If we have three cameras forming roughly a right angle, it is still possible to use the same basic techniques for two cameras, except we now compute match values for all three cameras. This so-called trinocular stereo (unfortunately a well-established contradiction in terms) method has the advantage that it can make use of images that have variation along both rows and columns. For example, an image of horizontal lines with a non-verged pair will produce garbage — there is nothing to match along scanlines. However, a trinocular system will perform quite well with such a setup.
For more than three cameras, rectification is not possible. In this case, one possibility is to perform the sweep in depth values rather than in disparity values. In this method, a grid (or other set) of points in a plane is defined. The plane is then swept through a set of candidate depths, where each test involves projecting each of the grid points into the camera images and computing a match value at the projected locations. The computations of this plane sweep algorithm are quite similar to those of the basic stereo algorithm, except that each point must be projected and sampled. Thus, the computational cost is far higher, but now arbitrary numbers of observing cameras can be used.
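A schematic Matlab version of the plane sweep is sketched below. Everything here is an illustrative assumption: the reference camera is taken to have focal length f and image center (cu, cv), P{c} is the 3x4 projection matrix of camera c, zValues is a vector of candidate depths, and sample() is a hypothetical bilinear lookup helper. For brevity, single pixel values are compared rather than aggregating a window around each projection.

% Schematic plane sweep over fronto-parallel planes in the reference camera.
bestCost  = Inf(nRows, nCols);
bestDepth = zeros(nRows, nCols);
for z = zValues                          % candidate depths
    cost = zeros(nRows, nCols);
    for v = 1:nRows
        for u = 1:nCols
            % back-project the reference pixel to depth z
            X = [(u - cu) * z / f; (v - cv) * z / f; z; 1];
            c = 0;
            for cam = 2:nCams
                q  = P{cam} * X;         % project into camera 'cam'
                uq = q(1) / q(3);
                vq = q(2) / q(3);
                c  = c + abs(sample(I{cam}, uq, vq) - I{1}(v, u));
            end
            cost(v, u) = c;
        end
    end
    better = cost < bestCost;
    bestDepth(better) = z;
    bestCost = min(bestCost, cost);
end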
Left-Right Check
We noted earlier that we could choose which camera to use as the base for computations. In fact, it is advantageous to perform matching in both directions and test the consistency of the resulting disparity maps. For good matches, the disparities will typically agree, whereas for bad matches (e.g. a point which is occluded in one of the camera views), the disparities will often be inconsistent. With care, the left-right check can be made quite efficient, as the computations needed to perform matching in one direction are also those needed to perform it in the other direction.
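A sketch of the check in Matlab, assuming we have computed dispL (matching I1 to I2) and dispR (matching I2 to I1) with the algorithm above; the signs depend on the disparity convention used, and tol is a small tolerance such as one pixel.

% Left-right consistency check.  dispL(i,j) gives the column offset of the
% match in I2 for pixel (i,j) of I1, and dispR the reverse direction.
valid = false(nRows, nCols);
for i = 1:nRows
    for j = 1:nCols
        jr = j + dispL(i, j);                  % matched column in I2
        if jr >= 1 && jr <= nCols
            % consistent if the reverse match points (nearly) back
            valid(i, j) = abs(dispL(i, j) + dispR(i, jr)) <= tol;
        end
    end
end
dispL(~valid) = -1;                            % mark inconsistent matches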
Global Matching
The basic stereo algorithm can, in principle, produce a depth image that has large variations in depth between adjacent pixels. However, we expect, in general, that the world is composed of continuous and often smooth surfaces. For such surfaces, we expect small variation in depth from pixel to pixel except at a few points of discontinuity. There are several methods for incorporating this smoothness assumption. The two dominant methods are Dynamic Programming and Graph Cuts. These algorithms tend to produce higher quality stereo than almost any other algorithm.
Structure and Motion
You may not have noticed, but we have derived another interesting result. Suppose we have only one camera that is viewing a static scene. Further, suppose we know the camera intrinsic parameters. It follows that we can do the following for any pair of images:
- We can first compute the E matrix as described above
- From the E matrix, we can recover the translation and rotation between the two cameras. This is not trivial, but it is possible (a sketch is given below).
- Given this, we can now apply the stereo techniques described above
- Further, we can in fact now make use of multiple frames, and perform multi-camera stereo.
There is one hitch to this: we do not know the scale of the reconstruction. This follows from the fact that E is only known up to scale, and so the translation between the viewing locations is also only known up to scale. Hence, reconstruction is up to scale.
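The decomposition of E mentioned in the second step is usually done with an SVD. A Matlab sketch, which recovers the translation only as a direction (reflecting the scale ambiguity), is:

% Recovering R and the translation direction from E via the SVD.
% There is a four-fold ambiguity; a full implementation selects the
% candidate that places reconstructed points in front of both cameras.
[U, ~, V] = svd(E);
if det(U) < 0, U = -U; end
if det(V) < 0, V = -V; end
W = [0 -1 0; 1 0 0; 0 0 1];

R1 = U * W  * V';
R2 = U * W' * V';
t  = U(:, 3);        % direction only; the scale is unrecoverable

% The four candidate poses are (R1, t), (R1, -t), (R2, t), (R2, -t).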
Another variation on this is to use feature matching rather than dense matching. In this case, we can consider jointly optimizing the camera locations as well as the locations of the observed points (all up to scale). The workhorse of these methods is a family of so-called bundle adjustment algorithms.