
Pattern Recognition Letters 28 (2007) 384–393

On the view synthesis of man-made scenes using uncalibrated cameras

Geetika Sharma a, Santanu Chaudhury b,*, J.B. Srivastava a

a Department of Mathematics, Indian Institute of Technology Delhi, New Delhi, India
b Department of Electrical Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110 016, India
* Corresponding author. Fax: +91 11 6966606. E-mail addresses: geetikas@gmail.com (G. Sharma), santanuc@ee.iitd.ernet.in (S. Chaudhury), jbsrivas@maths.iitd.ernet.in (J.B. Srivastava).

Available online 9 June 2006. doi:10.1016/j.patrec.2006.04.010

Abstract

We have attempted the problem of novel view synthesis of scenes containing man-made objects from images taken by arbitrary, uncalibrated cameras. Under the assumption of availability of the correspondence of three vanishing points, in general position, we propose two techniques. The first is a transfer-based scheme which synthesizes new views with only a translation of the virtual camera and computes z-buffer values for handling occlusions in synthesized views. The second is a reconstruction-based scheme which synthesizes arbitrary new views in which the camera can undergo rotation as well as translation. We present experimental results to establish the validity of both formulations.
© 2006 Published by Elsevier B.V.

Keywords: Image-based rendering; Novel view synthesis; Uncalibrated cameras

1. Introduction

In this paper we have addressed the problem of synthesizing new views of a scene using multiple images taken by uncalibrated cameras. This problem is a part of the broader field of image-based rendering (IBR). There are three main approaches to problems in IBR. The first is the transfer-based approach, in which a transfer equation is used to predict the novel view. The second is the reconstruction-based approach, in which a model of the scene is constructed and new views are obtained by reprojecting the model. The third approach is based on the light field or plenoptic function, which captures the intensity distribution of all light rays incident at a point in the scene. New views are synthesized by estimating the plenoptic function from given views and interpolating.

We work with the assumption that the correspondence of three vanishing points, in general position, is available in the given views and propose two techniques – a transfer-based scheme and a reconstruction-based scheme. While we make no assumptions on the motion of the cameras, the only restriction on the internal parameters is zero skew. In fact, our scheme can be used to synthesize novel views using frames extracted from a motion picture, and we provide examples of the same.

Our transfer-based scheme uses the infinite homography between the two given views to compute z-buffer values. These values are used to handle changes in visibility between objects in the scene due to camera motion in the synthesized view. However, the synthesized views are limited to those with only a translation of the virtual camera. While this scheme can be used to synthesize translational pan video sequences, we were motivated to derive a reconstruction-based scheme which can synthesize views with translation as well as rotation of the virtual camera. In this scheme, the three vanishing points are used to set up an affine coordinate system, similar to Criminisi (1999), and the scene is reconstructed in this coordinate system. We then describe how to specify a new camera in this coordinate system to synthesize arbitrary novel views. Preliminary results for both schemes with only a translation of the virtual camera have appeared in (Sharma et al., 2004). We have proposed a technique for view synthesis under the simpler assumption of translating cameras in (Sharma et al., 2005).

In (Criminisi, 1999), three vanishing points in general position are used to set up a coordinate system in the world. Two points in a single image are said to be corresponding points if one is on one of the coordinate planes and the other is on the line through the first point parallel to the coordinate axis perpendicular to the plane. Reconstruction from a single image can be done, but only for those points for which at least one of the corresponding points is known. Our technique uses two or more views to reconstruct each point visible in at least two of the given views. Also, Criminisi (1999) uses two views to make certain measurements, for example, the affine distance of the camera centres to a chosen world plane. However, they require the ratio of the distances of two reference points from the world plane to be known.

Other work in view synthesis includes Avidan and Shashua (1998), which uses the trifocal tensor between three views taken by uncalibrated cameras. However, they cannot handle occlusions in the new view. A novel representation consisting of multiple depth and intensity levels for new view synthesis is proposed in (Chang and Zakhor, 2001); however, it requires calibrated cameras. In (Genc and Ponce, 2001), the constraints imposed by weak-perspective and para-perspective cameras are used for view synthesis, while we work with perspective cameras. In (Inamoto and Saito, 2002; Saito et al., 2002; Yaguchi and Saito, 2002), a set of views of a scene taken by uncalibrated cameras is used to reconstruct the scene in a projective grid space defined by the epipolar geometry between two chosen basis cameras. This reconstruction provides a volumetric projective model and new views are synthesized using interpolation. However, changes in visibility cannot be resolved. In (Lhuillier and Quan, 2003), a method for obtaining a quasi-dense disparity map and a novel representation of a pair of images taken by uncalibrated cameras is proposed. They handle occlusions by rendering new views in a heuristically determined order using disparity values. Computing disparity from image information alone requires rectified images, which correspond to a translation of the camera in the direction of the u-axis of the image. Our technique can handle occlusions using z-buffer values and arbitrary orientation and location of the input cameras. View Morphing (Seitz and Dyer, 1996) also works with uncalibrated cameras. They first rectify and then interpolate the given views to generate a new view. A projective transformation of the interpolated new view changes the direction of gaze. However, they can synthesize novel views only along the line joining the centres of projection of the given views and use disparity to handle occlusions. Recently, a probabilistic approach to view synthesis has been proposed by Strecha et al. (2004) and Fitzgibbon et al. (2003); however, they require the camera matrices to be known.

Our reconstruction-based scheme can be used in augmented reality applications to reconstruct an object in a local affine coordinate system. The object can then be inserted into different scenes, which may be synthetic or reconstructed from images themselves. Also, in the case of orthogonal vanishing points, since the ambiguity in the reconstruction is of scale only, the object can be reprojected at different scales provided by the user. Thus videos may be synthesized in which the inserted object grows or shrinks in size.

This paper is organised as follows. We describe our transfer-based scheme in Section 2 and the reconstruction-based scheme in Section 3. In Section 4 we describe our technique for specifying a new camera in an affine coordinate system. The results for the transfer-based scheme are presented in Section 2, while those for the reconstruction-based scheme are presented at the end of Section 4. We conclude in Section 5.

2. Transfer-based scheme

In this section we describe our transfer-based scheme, which uses knowledge of the infinite homography to synthesize new views. Although the assumption of the infinite homography is equivalent to an affine reconstruction, we refer to this scheme as transfer-based because the new view is predicted using an equation in which the camera parameters do not occur.

We use normal face, uppercase letters, e.g. R, to denote matrices. Boldface letters are used to denote vectors, with uppercase letters, e.g. P, denoting points in the 3D world and lowercase letters, e.g. p^i, denoting points in the images. The superscript denotes the image number in which the point lies. Images are denoted by italicised uppercase normal face letters like I^1.

2.1. Synthesis from two views

Let P = (X, Y, Z)^T be a point in the scene and p^i = (u^i, v^i, 1)^T its image in the i-th view, i = 1, 2. Let I^1 and I^2 be the two given views. Then, the well-known projection equations (see e.g. Hartley and Zisserman, 2000) for I^1 and I^2 are

λ^1 p^1 = K_1 (R_1 P + t_1)    (1)
λ^2 p^2 = K_2 (R_2 P + t_2)    (2)

where

K_i = [ f^i s^i_u   0           x^i_0
        0           f^i s^i_v   y^i_0
        0           0           1     ],   i = 1, 2

are the matrices of internal parameters, with f^i denoting the focal length, s^i_u and s^i_v the scale factors along the u and v axes, and (x^i_0, y^i_0) the principal point in the i-th image. Also, R_i are the rotation matrices and t_i = (x_i, y_i, z_i)^T, i = 1, 2, are the translation vectors. Note that λ^i is the depth of the point P in the coordinate system of the i-th camera.


Without any loss of generality, we may assume that the origin of the world coordinate system is visible in both the views. Under this representation, neither camera is of the form K_i[I 0], where I is the identity matrix and 0 is the 3 × 1 zero vector. Given n point correspondences between I^1 and I^2, we may fix one of them as the origin of the world coordinate system. Let p^i_0 denote its image in the i-th view. Then, λ^i_0 p^i_0 = K_i t_i, i = 1, 2.

Eliminating P from (1) and (2), we get

λ^2 p^2 = K_2 R_2 R_1^{-1} K_1^{-1} (λ^1 p^1 - λ^1_0 p^1_0) + λ^2_0 p^2_0
        = H^{12}_∞ (λ^1 p^1 - λ^1_0 p^1_0) + λ^2_0 p^2_0    (3)

where H^{12}_∞ is the infinite homography from I^1 to I^2. If H^{12}_∞ is known, the only unknowns in (3) are λ^1_0, λ^2_0, which are fixed, and λ^1, λ^2 for each point correspondence. Since the equations are linear, we can solve for all the unknowns, given n ≥ 2 point correspondences.

Note that in (3) the camera parameters cannot be eliminated and are replaced by the infinite homography H^{12}_∞, which models them up to an unknown affine transformation. Computing the infinite homography amounts to an affine reconstruction of the scene, and the λ's that we compute are the depths for this reconstruction, which is an affine transformation of the true Euclidean scene. Our technique for synthesizing new frames then produces correct perspective views of the affinely transformed scene. Thus, when compared with the given views, the synthesized views suffer from affine distortion, which may be removed using metric rectification techniques (Liebowitz and Zisserman, 1998). Note, however, that we do not do an explicit reconstruction of the scene. We only compute λ's, which act as z-buffer values required for occlusion handling.
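The linear system implied by (3) is small and easy to assemble. The following is a minimal NumPy sketch under our own conventions: the function name, the ordering of the unknowns and the normalisation λ^1_0 = 1 are illustrative choices, not code from the paper. Each correspondence contributes three homogeneous equations, and the stacked system is solved for its null vector by SVD.

```python
import numpy as np

def transfer_depths(H_inf, p1, p2, p1_0, p2_0):
    """Recover the depths in Eq. (3):
    lambda^2 p^2 = H_inf (lambda^1 p^1 - lambda^1_0 p^1_0) + lambda^2_0 p^2_0.

    H_inf      : (3,3) infinite homography from view 1 to view 2.
    p1, p2     : (n,3) homogeneous corresponding points in the two views.
    p1_0, p2_0 : (3,) homogeneous images of the chosen world origin.
    Returns (lambda^1_0, lambda^2_0, lambda^1's, lambda^2's), defined up
    to one common scale, fixed here by lambda^1_0 = 1.
    """
    n = len(p1)
    A = np.zeros((3 * n, 2 + 2 * n))
    Hp0 = H_inf @ p1_0
    for i in range(n):
        r = slice(3 * i, 3 * i + 3)
        A[r, 0] = -Hp0                    # coefficient of lambda^1_0
        A[r, 1] = p2_0                    # coefficient of lambda^2_0
        A[r, 2 + 2 * i] = H_inf @ p1[i]   # coefficient of lambda^1_i
        A[r, 3 + 2 * i] = -p2[i]          # coefficient of lambda^2_i
    # Homogeneous system A x = 0: the (least-squares) null vector is the
    # right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    x = Vt[-1] / Vt[-1, 0]
    return x[0], x[1], x[2::2], x[3::2]
```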

2.1.1. Synthesis of virtual views

Suppose I^1 is translated to obtain the new view I^s. Then, I^s and I^1 have the same orientation and internal parameters but different positions, i.e., the matrix of camera internals for I^s is K_1 and the rotation matrix is R_1. Then, the projection equation for the novel view I^s is

λ^s p^s = K_1 (R_1 P + t_1 + t_s)

where t_s = (x_s, y_s, z_s)^T. Using (1), we get

λ^s p^s = λ^1 p^1 + K_1 t_s    (4)

Writing the above equation for the origin,

λ^s_0 (u^s_0, v^s_0, 1)^T = λ^1_0 (u^1_0, v^1_0, 1)^T + K_1 (x_s, y_s, z_s)^T    (5)

Equating the last coordinate on both sides and rearranging, we get

z_s = λ^s_0 - λ^1_0

Since λ^1_0 is fixed by the first view, different choices of λ^s_0 correspond to different z translations of the camera. So, the translation of the virtual camera in the z-direction may be chosen by giving a value to λ^s_0. Now, equating the first coordinate on both sides and rearranging, we get

x_s = (λ^s_0 u^s_0 - λ^1_0 u^1_0 - x^1_0 z_s) / (f^1 s^1_u)

Again, since λ^1_0, u^1_0, f^1, s^1_u and x^1_0 are fixed by the first view and λ^s_0 and z_s have been fixed in the previous step, different values of u^s_0 correspond to different translations x_s in the x-direction. Thus, the x translation of the virtual camera may be chosen, interactively, by giving a value to u^s_0. Similarly, the y translation of the virtual camera may be chosen by giving a value to v^s_0. It follows that a choice of p^s_0 and λ^s_0 fixes the translation of the virtual camera and specifies the new view. Substituting K_1 t_s from (5) in (4), we get

λ^s p^s = λ^1 p^1 + λ^s_0 p^s_0 - λ^1_0 p^1_0    (6)

The positions and z-buffer values λ^s of point correspondences in the new view can be obtained from the above equation. Since we do not assume dense correspondences, the new view and the given views are triangulated using corresponding points as vertices. Triangles in the new view are then texture mapped by combining textures from corresponding triangles in the given views. The computed λ^s's act as z-buffer values for corresponding points and they can be used to handle occlusions while rendering the new view. Thus, given n point correspondences, the algorithm for synthesis of new views is as follows:

1. Set up a system of equations using (3) and compute λ^i's, i = 1, 2.
2. Specify a new view by giving values to u^s_0, v^s_0 and λ^s_0.
3. Render the position of corresponding points in the new view using (6). Resolve any visibility issues using λ^s's as z-buffer values (see the sketch below).
4. Triangulate the new view using the rendered points as vertices and texture map the triangles from the given views.

This technique can be easily extended to work with three or more input views. We would only require that the correspondence of vanishing points be available in all the given views and that there be at least one point that is visible in all the views which can act as the origin.
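As referenced in step 3, here is a short sketch of how (6) yields both the new image positions and the z-buffer values. The function name is our illustrative choice; the triangulation and texture-mapping steps are left to standard tooling.

```python
import numpy as np

def synthesize_points(p1, lam1, p1_0, lam1_0, ps_0, lams_0):
    """Apply Eq. (6):
    lambda^s p^s = lambda^1 p^1 + lambda^s_0 p^s_0 - lambda^1_0 p^1_0.

    p1           : (n,3) homogeneous points in the first view.
    lam1         : (n,) their depths computed from Eq. (3).
    ps_0, lams_0 : user-chosen image of the origin in the new view and
                   its depth; these fix the virtual camera translation.
    Returns (ps, lams): (n,3) points in the new view and their depths
    lambda^s, used as z-buffer values (smaller = nearer the camera).
    """
    rhs = lam1[:, None] * p1 + lams_0 * ps_0 - lam1_0 * p1_0
    lams = rhs[:, 2]          # last coordinate of p^s is 1, so it carries lambda^s
    ps = rhs / lams[:, None]  # dehomogenise to get (u^s, v^s, 1)
    return ps, lams
```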

Fig. 1 shows the results for this section. (a) and (f) are the input views. (b)–(e) are the synthesized views, with (b) and (c) parallel to (a) and (d) and (e) parallel to (f). (g)–(i) are three views synthesized using the reconstruction-based scheme. Observe that (b)–(e) suffer from an obvious affine distortion, while the distortion is much less in (g)–(i). An explanation for this difference will be given in Section 4.

2.1.2. Characterisation of viewpoints

We now give a characterisation of those viewpoints for which it is guaranteed that all points within the fields of view of the given cameras are also in the field of view of the virtual camera. This is required because it is possible that the chosen translation of the virtual camera is so large that none of the corresponding points are within its field of view (FOV) and, consequently, none are visible in the image plane. We assume that the motion of the camera is a translation in the x–z plane and a rotation about the y-axis.

Fig. 1. (a) and (f) are the input views, (b) and (c) are the synthesized views parallel to (a), and (d) and (e) are synthesized views parallel to (f). (g)–(i) are three views synthesized using the reconstruction-based scheme.

Without loss of generality, we may assume that the synthesized view is parallel to the first view. The characterisation of the viewpoints can be done in the coordinate system of the first camera, since the new view is specified with respect to the position of the first camera. The first camera is then given by K_1[I|0] and the second camera by K_2[R_1|t_1]. Also, we assume that R_1, the rotation between the second and first cameras, is

R_1 = [  cos θ   0   sin θ
         0       1   0
        -sin θ   0   cos θ ]

and that t_1 = (x_1, 0, z_1)^T is the translation of C_2.

The view-volume of a camera is defined by four planes which intersect at the centre of projection and pass through the boundaries of the image plane. In order to ensure that all points in the fields of view of the given views are also in the FOV of the novel view, we will intersect the view-volumes of the given cameras and require that the view-volume of the new camera contain the intersection. For the first view, let the planes defining the view-volume be π_1 (left), π_2 (bottom), π_3 (right) and π_4 (top). Let a^1 ≤ u^1 ≤ b^1 and c^1 ≤ v^1 ≤ d^1 be the bounds for the first image plane, and a^2 ≤ u^2 ≤ b^2 and c^2 ≤ v^2 ≤ d^2 be the bounds for the second image plane. Then, the equations of these planes are

π_1: f^1 X - a^1 Z = 0,    π_2: f^1 Y - d^1 Z = 0
π_3: f^1 X - b^1 Z = 0,    π_4: f^1 Y - c^1 Z = 0

Let N_i denote the normal of π_i, i = 1, 2, 3, 4, that points outside the view-volume. Let P = (X, Y, Z)^T be any point inside the view-volume. Then, the dot product of N_i, i = 1, 2, 3, 4, with the vector joining any point on the plane π_i and P is less than zero. Choosing the centre of projection as the point on the plane, P will lie inside the view-volume if N_i · P < 0. Now, N_1 = (-f^1, 0, a^1)^T and N_3 = (f^1, 0, -b^1)^T. So, N_1 · P = -f^1 X + a^1 Z < 0

⇒ X > (a^1/f^1) Z    (7)

Similarly, N_3 · P = f^1 X - b^1 Z < 0

⇒ X < (b^1/f^1) Z    (8)

Combining (7) and (8) and writing similar inequalities for π_2 and π_4, we get

(a^1/f^1) Z < X < (b^1/f^1) Z    (9)
(c^1/f^1) Z < Y < (d^1/f^1) Z    (10)

Similarly, if π'_1, π'_2, π'_3 and π'_4 are the planes defining the second view-volume and N'_i are the outward pointing normals, then


π'_1: (f^2 cos θ + a^2 sin θ) X + (f^2 sin θ - a^2 cos θ) Z + f^2 x_1 - a^2 z_1 = 0
π'_2: d^2 sin θ X + f^2 Y - d^2 cos θ Z - d^2 z_1 = 0
π'_3: (f^2 cos θ + b^2 sin θ) X + (f^2 sin θ - b^2 cos θ) Z + f^2 x_1 - b^2 z_1 = 0
π'_4: c^2 sin θ X + f^2 Y - c^2 cos θ Z - c^2 z_1 = 0

Again requiring N'_i · P < 0, a point P will lie in the view-volume of the second camera if the following conditions are satisfied:

[(-f^2 sin θ + a^2 cos θ)/(f^2 cos θ + a^2 sin θ)](Z + z_1) - x_1 < X < [(-f^2 sin θ + b^2 cos θ)/(f^2 cos θ + b^2 sin θ)](Z + z_1) - x_1    (11)

(c^2 cos θ/f^2)(Z + z_1) - (c^2 sin θ/f^2)(X + x_1) < Y < -(d^2 cos θ/f^2)(Z + z_1) + (d^2 sin θ/f^2)(X + x_1)    (12)

A point P which lies in both the view-volumes will be visible in both the images. Thus, for corresponding points, inequalities (9)–(12) hold simultaneously.

A new view is parallel to the first view and has camera matrix K_1[I|t_s], where t_s = (x_s, y_s, z_s)^T. For a point P, visible in the view-volumes of the given images, to be visible in the view-volume of the new view, we must have

(a^1/f^1)(Z + z_s) - x_s < X < (b^1/f^1)(Z + z_s) - x_s    (13)
(c^1/f^1)(Z + z_s) - y_s < Y < (d^1/f^1)(Z + z_s) - y_s    (14)

For the new view to be such that every point visible in the given views is also visible in the new view, (9)–(12) must imply (13) and (14). It follows that the translation x_s should be such that

x_s > (a^1/f^1) z_s - max{0, [(-f^2 sin θ + a^2 cos θ)/(f^2 cos θ + a^2 sin θ)](Z + z_1) - (a^1/f^1) Z - x_1}    (15)

x_s < (b^1/f^1) z_s - min{0, [(-f^2 sin θ + b^2 cos θ)/(f^2 cos θ + b^2 sin θ)](Z + z_1) - (b^1/f^1) Z - x_1}    (16)

For, inequality (15) implies

(a^1/f^1) z_s - x_s < max{0, [(-f^2 sin θ + a^2 cos θ)/(f^2 cos θ + a^2 sin θ)](Z + z_1) - (a^1/f^1) Z - x_1}

⇒ (a^1/f^1) z_s - x_s < 0, or
   (a^1/f^1) z_s - x_s < [(-f^2 sin θ + a^2 cos θ)/(f^2 cos θ + a^2 sin θ)](Z + z_1) - (a^1/f^1) Z - x_1

⇒ (a^1/f^1) z_s - x_s + (a^1/f^1) Z < (a^1/f^1) Z, or
   (a^1/f^1) z_s - x_s + (a^1/f^1) Z < [(-f^2 sin θ + a^2 cos θ)/(f^2 cos θ + a^2 sin θ)](Z + z_1) - x_1

(Z > 0, since all points are in front of the camera)

⇒ (a^1/f^1) z_s - x_s + (a^1/f^1) Z < X

which is the left half of inequality (13). Similarly, it can be proved that inequality (16) implies the right half of inequality (13). Note the presence of Z in the above inequalities. This Z is the depth of the point P in the coordinate system of the first camera. The λ^1 computed for each corresponding point in the synthesis algorithm is precisely this depth. Thus, we may replace Z by max(Z) and min(Z) by comparing the λ^1's. If the camera internal parameters and the motion parameters of the second camera are known, we can obtain a relation for x_s and y_s and thus restrict the translation t_s of the virtual camera so that all points in the field of view of I^1 and I^2 are also in the field of view of I^s.
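When those internal and motion parameters are known, the admissible x-translations given by (15) and (16) can be evaluated numerically over the computed depths. The following is a sketch under our own naming (xs_bounds is not from the paper), keeping the most restrictive bound over all visible points:

```python
import numpy as np

def xs_bounds(zs, Z, f1, a1, b1, f2, a2, b2, theta, x1, z1):
    """Admissible range of x_s from inequalities (15)-(16).

    Z is the array of first-camera depths (the lambda^1 of Section 2.1);
    the bound is evaluated per point and the tightest one is kept.
    All other arguments are the internal/motion parameters appearing in
    the derivation, which must be known for this test.
    """
    Z = np.asarray(Z, dtype=float)
    s, c = np.sin(theta), np.cos(theta)
    term_a = (-f2 * s + a2 * c) / (f2 * c + a2 * s) * (Z + z1) - a1 / f1 * Z - x1
    term_b = (-f2 * s + b2 * c) / (f2 * c + b2 * s) * (Z + z1) - b1 / f1 * Z - x1
    lower = (a1 / f1 * zs - np.maximum(0.0, term_a)).max()  # tightest bound (15)
    upper = (b1 / f1 * zs - np.minimum(0.0, term_b)).min()  # tightest bound (16)
    return lower, upper
```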

3. Reconstruction-based scheme

In this section we present our reconstruction-based technique for view synthesis from multiple uncalibrated cameras.

Suppose we are given two views, I^1 and I^2, of a scene such that the correspondence of three vanishing points, in general position, can be established between them. We may set up a world coordinate system with axes in the directions of these vanishing points and any world point, whose correspondence is available in the given views, as origin. Note that the coordinate system need not be orthogonal, since the vanishing points need not be in mutually orthogonal directions. Let v^1_1, v^1_2, v^1_3 be the vanishing points in I^1 and v^2_1, v^2_2, v^2_3 those in I^2. Then, the camera matrices of the two views may be parameterised as M_1 = [a v^1_1  b v^1_2  c v^1_3  p^1_0] and M_2 = [a' v^2_1  b' v^2_2  c' v^2_3  d' p^2_0], where p^1_0 and p^2_0 are the images of the origin of the world coordinate system in the two views.

We know from Criminisi (1999) that if P = (X, Y, 0, 1)^T is a point on the world XY plane and P' = (X, Y, Z, 1)^T is a point such that the line joining P and P' is parallel to the world Z-axis, then

cZ = ||p × p'|| / ((l̄ · p) ||v^1_3 × p'||)

where p and p' are the images of P and P' in the first view, respectively, and l̄ = l/(l · p^1_0), where l is the vanishing line in the first view of the world XY plane. Here, the origin p^1_0 is used to normalise l, so the origin must not lie on the vanishing line l. Note that here only one view is required to compute cZ. Also, similar formulae can be derived for computing aX and bY for appropriately chosen points P and P'. In the following, we extend this idea to two views and give an algorithm to compute the coordinates (aX, bY, cZ) for all points whose correspondences are available in the two views.

3.1. Relating the two cameras

We first derive a novel relation between the parameters a, a' and d'; b, b' and d'; and c, c' and d'.

Let Q_1 = (X, 0, 0, 1)^T be a point on the world X-axis. Let q^1_1 and q^2_1 be the images of Q_1 in the two views, respectively. Then,

λ^1 q^1_1 = a v^1_1 X + p^1_0
⇒ (λ^1 q^1_1) × q^1_1 = aX (v^1_1 × q^1_1) + (p^1_0 × q^1_1) = 0
⇒ aX = -||p^1_0 × q^1_1|| / ||v^1_1 × q^1_1||    (17)

Similarly, from the second view,

(a'/d') X = -||p^2_0 × q^2_1|| / ||v^2_1 × q^2_1||    (18)

Dividing (18) by (17), we get

a'/(a d') = (||p^2_0 × q^2_1|| ||v^1_1 × q^1_1||) / (||p^1_0 × q^1_1|| ||v^2_1 × q^2_1||)  ⇒  a' = C_1 a d'

where C_1 = (||p^2_0 × q^2_1|| ||v^1_1 × q^1_1||) / (||p^1_0 × q^1_1|| ||v^2_1 × q^2_1||). Similarly, choosing Q_2 = (0, Y, 0, 1)^T on the world Y-axis and Q_3 = (0, 0, Z, 1)^T on the world Z-axis, we get

b' = C_2 b d',    c' = C_3 c d'

where C_2 = (||p^2_0 × q^2_2|| ||v^1_2 × q^1_2||) / (||p^1_0 × q^1_2|| ||v^2_2 × q^2_2||) and C_3 = (||p^2_0 × q^2_3|| ||v^1_3 × q^1_3||) / (||p^1_0 × q^1_3|| ||v^2_3 × q^2_3||).

To be able to compute the constants C_i, i = 1, 2, 3, we need the correspondences of the points Q_1, Q_2 and Q_3. We describe here how to compute q^1_1 and q^2_1; q^1_i and q^2_i, i = 2, 3, can be computed in a similar manner. The image of the world X-axis in the first view is the line through p^1_0, the image of the origin, and v^1_1, the vanishing point in the X-direction. So, q^1_1 may be chosen to be any point on this line. The corresponding point q^2_1 will lie on the epipolar line, l^2_q, of q^1_1; l^2_q can be computed from the fundamental matrix, F_12, from the first to the second view. Also, q^2_1 must lie on the image of the X-axis in the second view, which is the line through p^2_0 and v^2_1. Thus, q^2_1 is the intersection of l^2_q and the image of the X-axis in the second view. So, each of the points q^1_i and q^2_i can be computed and so the constants C_i, i = 1, 2, 3, can be determined. The camera matrix M_2 may then be parameterised as M_2 = d'[C_1 a v^2_1  C_2 b v^2_2  C_3 c v^2_3  p^2_0].

3.2. Reconstructing the scene

Now, let P = (X, Y, Z, 1)^T be any point in the world whose correspondence, p^1 and p^2, in the two views is available. Then, we have

λ^1 p^1 = v^1_1 (aX) + v^1_2 (bY) + v^1_3 (cZ) + p^1_0
λ^2 p^2 = d'[C_1 v^2_1 (aX) + C_2 v^2_2 (bY) + C_3 v^2_3 (cZ) + p^2_0]
(λ^2/d') p^2 = C_1 v^2_1 (aX) + C_2 v^2_2 (bY) + C_3 v^2_3 (cZ) + p^2_0

We get six linear equations in the five unknowns λ^1, λ^2/d', aX, bY, cZ. These can be solved using singular value decomposition. Thus, we can compute the z-buffer value λ^1 and the coordinates (aX, bY, cZ) for each point whose correspondence is known. If the parameters a, b and c are known, then the exact value of (X, Y, Z) can be obtained.
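A sketch of the per-point solve follows, with our own function name and a least-squares solve (np.linalg.lstsq, which itself uses the SVD) standing in for the decomposition mentioned above; V1 and V2 stack the vanishing points as columns:

```python
import numpy as np

def reconstruct_point(p1, p2, V1, V2, p1_0, p2_0, C):
    """Solve the six equations of Section 3.2 for one correspondence.

    V1, V2 : (3,3) with the vanishing points v_1, v_2, v_3 as columns.
    C      : (C1, C2, C3) from Section 3.1.
    Unknowns x = (aX, bY, cZ, lambda^1, lambda^2/d').
    Returns the affine coordinates (aX, bY, cZ) and the z-buffer
    value lambda^1.
    """
    A = np.zeros((6, 5))
    b = np.zeros(6)
    A[0:3, 0:3] = V1                # V1 (aX,bY,cZ)^T - lambda^1 p^1 = -p^1_0
    A[0:3, 3] = -p1
    b[0:3] = -p1_0
    A[3:6, 0:3] = V2 @ np.diag(C)   # C-weighted columns of V2
    A[3:6, 4] = -p2                 # -(lambda^2/d') p^2
    b[3:6] = -p2_0
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x[:3], x[3]
```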

3.3. Synthesis from three or more views

Our method can be easily extended to handle three or more input views. We only require that the correspondence of the three vanishing points in different directions be available in all the given views and that the origin be visible in all the views. Once these conditions are satisfied, the constants C_i relating the parameters a, b, c of the first camera with those of the other cameras may be found. Consequently, affine coordinates for additional point correspondences may be obtained and new views can be synthesized.

3.4. An appraisal of the two schemes

The two view-synthesis schemes we have outlined work with the same basic assumption of the availability of the correspondence of three vanishing points in general position. While the transfer-based scheme is less computationally intensive and requires less user interaction than the reconstruction-based scheme, it produces only those views for which the virtual camera undergoes a translation with respect to the first camera. Also, since no explicit choice of the world coordinate system is made, the synthesized views suffer from an affine distortion which may include a shear as well as a scale distortion. The reconstruction-based scheme, on the other hand, requires more user interaction, in terms of choosing the points q^1_i, i = 1, 2, 3, to compute the constants C_i, i = 1, 2, 3, and also selecting the centre of projection and image plane for the new view. However, if the vanishing points are chosen corresponding to three mutually perpendicular directions in 3D, the synthesized views suffer only from a scale distortion.

4. Synthesis with an arbitrary virtual camera

The reconstruction scheme described above can be used to synthesize arbitrary new views, i.e., views in which the camera undergoes rotation as well as translation. However, since the reconstruction of the scene and cameras is in an affine coordinate system, we cannot directly specify a rotation and translation for the virtual camera. We now describe a method for specifying a new image plane and centre of projection for the virtual camera.

Fig. 2. Selecting a new image plane.

4.1. Specifying a new camera

We may obtain a centre of projection, C_s, for the virtual camera as in Section 2.1.1 by choosing the image of the origin, p^s_0, in a translated image plane. The camera matrix for this view is then given by M_trans = [a v^1_1  b v^1_2  c v^1_3  p^s_0], since the virtual image plane is parallel to the image plane of the first camera. The affine coordinates of C_s can be obtained by requiring that M_trans map it to zero. If a view of the scene with only translational motion of the camera is required, we can use M_trans to obtain the projections of the reconstructed scene points and render the complete image as in Section 2.1.1.

If a view with a general motion of the camera is required, we need to specify a new image plane. Since the centre of projection has been chosen, the back-projected rays to the reconstructed scene points are fixed and the new image plane must intersect them. Three points define a plane uniquely, so we choose three scene points, say P_1, P_2 and P_3, and obtain the equations of the rays C_sP_1, C_sP_2 and C_sP_3. As shown in Fig. 2, we choose a point Q_i, i = 1, 2, 3, on each of the three chosen back-projected rays, and the new image plane is taken to be the plane passing through these points.

Next, we need to set up a coordinate system on the image plane to express the coordinates of the projections of the scene points. For this we make use of the camera matrix M_trans. We have

M_trans = [a v^1_1  b v^1_2  c v^1_3  p^s_0] = [ π_y^T
                                                 π_x^T
                                                 π^T   ]

Fig. 3. (a) and (f) are two images of an outdoor scene. (b)–(e) are synthesized views. Observe the changes in visibility between the tree and the background.

It is well known (Hartley and Zisserman, 2000) that the first row of a camera matrix represents a plane π_y through the centre of projection and the y-axis of the image plane, the second row represents a plane π_x through the centre of projection and the x-axis, and the third row represents a plane π through the centre of projection parallel to the image plane. We take the intersections of the planes π_x and π_y with the new image plane. This gives two lines, l_x and l_y, on the new image plane and, although π_x and π_y are orthogonal, l_x and l_y need not be. However, we may choose one of the lines, say l_y, as the y-axis and the point of intersection of l_x and l_y as the origin of the image coordinate system. The x-axis is then the line passing through the origin and perpendicular to the y-axis. Once the coordinate axes have been computed, we may obtain the camera matrix, M_arb, as follows. The first row of M_arb is the plane through C_s and the new y-axis, and the second row is the plane through C_s and the new x-axis. The third row is a plane through C_s parallel to the image plane. Since the equation of the image plane is known, the equation of this plane can also be obtained. Thus, we can compute all three rows of the camera matrix M_arb.
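Since every row of a camera matrix is a plane through the centre of projection, M_arb can be assembled directly from plane fits. The following is a minimal sketch with our own helper names, assuming homogeneous 3D coordinates for C_s and two points chosen on each of the new image axes:

```python
import numpy as np

def plane_through(points):
    """Homogeneous plane through three or more homogeneous 3D points:
    the null vector of the stacked point matrix."""
    _, _, Vt = np.linalg.svd(np.asarray(points, dtype=float))
    return Vt[-1]

def arbitrary_camera(Cs, image_plane, y_axis_pts, x_axis_pts):
    """Assemble M_arb row by row as in Section 4.1.

    Cs          : (4,) homogeneous centre of projection.
    image_plane : (4,) homogeneous new image plane.
    y_axis_pts, x_axis_pts : two homogeneous 3D points on the chosen
    image y- and x-axes, respectively.
    """
    row1 = plane_through([Cs, *y_axis_pts])  # plane through Cs and the y-axis
    row2 = plane_through([Cs, *x_axis_pts])  # plane through Cs and the x-axis
    n = image_plane[:3]                      # normal of the image plane
    c = Cs[:3] / Cs[3]
    row3 = np.append(n, -n @ c)              # parallel plane through Cs
    return np.vstack([row1, row2, row3])
```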

Fig. 4. (a) and (b) are two images of an outdoor scene. (c) is a new view parallel to (a), and (d) is a new view parallel to (b).

Fig. 5. (a) and (f) are the input views, (b) and (c) are synthesized views parallel to (a), and (d) and (e) are synthesized views parallel to (f).


Fig. 6. (a)–(c) are the input views, (d)–(f) are synthesized views parallel to (a).

The images of all scene points can be obtained by projecting them using M_arb. Next, a window is chosen on the image plane which defines the finite portion of the image plane to which the scene projects. This window is scaled to the desired image size using a transformation similar to the window-to-viewport transformation of the graphics pipeline. Finally, the new image is rendered by triangulation and texture mapping from the given images.

Although the choice of the points Q_1, Q_2 and Q_3 defining the image plane is arbitrary, in practice it is beneficial to restrict the set of possible image planes by imposing additional constraints, like fixing the distance of Q_i, i = 1, 2, 3, from the principal plane.
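The scaling mentioned above mirrors the standard window-to-viewport transformation; a small sketch (window_to_viewport is our name, not the paper's):

```python
import numpy as np

def window_to_viewport(pts, window, size):
    """Map points from a chosen window on the image plane to pixels.

    pts    : (n,2) projected points on the new image plane.
    window : (xmin, ymin, xmax, ymax) on the image plane.
    size   : (width, height) of the output image in pixels.
    """
    xmin, ymin, xmax, ymax = window
    w, h = size
    scale = np.array([w / (xmax - xmin), h / (ymax - ymin)])
    return (pts - np.array([xmin, ymin])) * scale
```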

Fig. 7. (a) and (h) are two frames from the motion picture Harry Potter and the Sorcerer's Stone. (b)–(d) are synthesized views parallel to (a), and (e)–(g) are synthesized views parallel to (h).

Fig. 8. (a) and (d) are two images of an indoor scene. (b)–(d) are arbitrary synthesized views.

Fig. 9. (a) and (f) are two images of an outdoor scene. (b)–(e) are arbitrary synthesized views.


4.2. Results

Fig. 3 shows two images of an outdoor scene with a large displacement of the tree between the two views. (b)–(e) are synthesized views in which the changes in visibility between the background and the tree have been rendered correctly. Fig. 4(a) and (b) are two images of an outdoor scene. (c) is a novel view parallel to (a), while (d) is a novel view parallel to (b).

Fig. 5 shows the results using the reconstruction-based scheme. (a) and (f) are two input images from which virtual views (b) and (c) parallel to (a) are synthesized and virtual views (d) and (e) parallel to (f) are synthesized. Fig. 1(g)–(i) are three views synthesized using this formulation. Note that the affine distortion is much less in this case. In fact, if the vanishing points are chosen so that they correspond to three mutually orthogonal directions in the 3D world, there is only a scale distortion in the synthesized views. However, in the transfer-based scheme, even with orthogonal vanishing points there is an affine distortion, because we do not make an explicit choice of the 3D coordinate axes in the orthogonal directions.

Fig. 6 shows the results using three images taken by arbitrary cameras. Using three views, we can get correspondences on the outlined box which would not be available had we used only views (a) and (c). Fig. 7(a) and (h) are two frames taken from the motion picture Harry Potter and the Sorcerer's Stone, and (b)–(d) are virtual views parallel to (a), while (e)–(g) are virtual views parallel to (h). Note that in all cases visibility changes have been correctly handled.

Fig. 10. (a) and (b) are the input views. (c) is a third view of the same scene, and (d) is the synthesized view obtained using the same camera as that of (c).

Figs. 8 and 9 show two sets of input images on which we tested our technique for synthesis of arbitrary views. The image planes for the synthesized views were chosen by selecting different sets of points P_1, P_2 and P_3 from the reconstructed scene points. The choice of these points, along with constraints on the Q_i, i = 1, 2, 3, is instrumental in specifying a proper position for the image plane.

In order to evaluate the performance of our algorithm, we took three synthetic images generated using the software 3D Studio Max. The images are shown in Fig. 10; of these, (a) and (b) were chosen as the input images and a reconstruction of corresponding points was obtained. Then, we estimated the camera corresponding to view (c) in the affine space of the reconstruction using the correspondence of vanishing points and the image of the origin. The synthesized view is shown in (d). To measure the similarity between the synthesized and the actual view, we used the root mean squared distance between corresponding points in the two views. The root mean squared distance between views (c) and (d) was computed to be 4.5 pixels; however, a visual examination of the images shows only a slight difference between them, due to the scale ambiguity of the affine reconstruction. We also found the accuracy of the vanishing points to play an important role in the reconstruction and subsequent reprojection. The maximum likelihood estimation of vanishing points outlined in (Liebowitz and Zisserman, 1998) may be used to get a good estimate of the vanishing points.
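For completeness, the error measure used above is simply the root mean squared distance between corresponding points; a one-function sketch:

```python
import numpy as np

def rms_distance(p_synth, p_actual):
    """Root mean squared distance (in pixels) between corresponding
    points of the synthesized and the actual view."""
    d = np.linalg.norm(np.asarray(p_synth) - np.asarray(p_actual), axis=1)
    return float(np.sqrt(np.mean(d ** 2)))
```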

5. Conclusion

We have presented two techniques for view synthesis from multiple uncalibrated cameras. Both schemes require the knowledge of three vanishing points in general position. While the first scheme can handle occlusion changes in the synthesized view, it can synthesize only those views in which the camera undergoes translational motion. Thus, it can be used to synthesize translational pan videos from two or more views of a static scene. The second scheme reconstructs the scene in an affine coordinate system. We have proposed a technique to specify a new camera in this coordinate system, and so we can synthesize arbitrary new views without building an explicit model of the scene.

Acknowledgment

This work was supported by the DST project E Video: Video Information Processing with Enhanced Functionality.

References

Avidan, S., Shashua, A., 1998. Novel view synthesis by cascading trilinear tensors. IEEE Trans. Visualization Comput. Graphics 4 (4), 293–306.

Chang, N.L., Zakhor, A., 2001. Constructing a multivalued representation for view synthesis. Internat. J. Comput. Vision 45 (2), 157–190.

Criminisi, A., 1999. Accurate Visual Metrology from Single and Multiple Uncalibrated Images. Ph.D. Thesis. Department of Engineering Science, University of Oxford.

Fitzgibbon, A., Wexler, Y., Zisserman, A., 2003. Image-based rendering using image-based priors. In: Proc. IEEE Internat. Conf. on Computer Vision, vol. 2.

Genc, Y., Ponce, J., 2001. Image-based rendering using parameterised image varieties. Internat. J. Comput. Vision 41 (3), 143–170.

Hartley, R., Zisserman, A., 2000. Multiple View Geometry in Computer Vision, first ed. Cambridge University Press.

Inamoto, N., Saito, H., 2002. Intermediate view generation of soccer scene from multiple views. In: Proc. ICPR.

Lhuillier, M., Quan, L., 2003. Image-based rendering by joint view triangulation. IEEE Trans. Circ. Systems Video Technol. 13 (11), 1051–1063.

Liebowitz, D., Zisserman, A., 1998. Metric rectification for perspective images of planes. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 482–488.

Saito, H., Kimura, M., Yaguchi, S., Inamoto, N., 2002. View interpolation of multiple cameras based on projective geometry. In: Internat. Workshop on Pattern Recognition and Understanding for Visual Information Media.

Seitz, S.M., Dyer, C.R., 1996. View morphing. In: Proc. SIGGRAPH, pp. 21–30.

Sharma, G., Chaudhury, S., Srivastava, J.B., 2004. View-synthesis of scenes with man-made objects using uncalibrated cameras. In: Proc. Indian Conf. on Computer Vision, Graphics and Image Processing, pp. 283–289.

Sharma, G., Kumar, A., Kamal, S., Chaudhury, S., Srivastava, J.B., 2005. Novel view synthesis using a translating camera. Pattern Recognition Lett. 26, 482–493.

Strecha, C., Fransens, R., Van Gool, L., 2004. Wide-baseline stereo from multiple views: A probabilistic account. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition.

Yaguchi, S., Saito, H., 2002. Arbitrary viewpoint video synthesis from uncalibrated multiple cameras. In: Proc. WSCG.