

Signal Processing: Image Communication 24 (2009) 49–64


Real-time stereo-based view synthesis algorithms: A unified framework and evaluation on commodity GPUs

Sammy Rogmans a,b,*, Jiangbo Lu a,c, Philippe Bekaert b, Gauthier Lafruit a

a Multimedia Group, IMEC, Kapeldreef 75, 3001 Leuven, Belgium
b Hasselt University – tUL – IBBT, Expertise Centre for Digital Media, Wetenschapspark 2, 3590 Diepenbeek, Belgium
c Department of Electrical Engineering, Katholieke Universiteit Leuven, Belgium

* Corresponding author at: Multimedia Group, IMEC, Kapeldreef 75, 3001 Leuven, Belgium. Tel.: +32 16 287756; fax: +32 16 281515. E-mail addresses: [email protected] (S. Rogmans), [email protected] (J. Lu), [email protected] (P. Bekaert), [email protected] (G. Lafruit).

Article info

Article history:

Received 10 October 2008

Accepted 19 October 2008

Keywords:

Stereo correspondence

View synthesis

Image-based rendering

Performance evaluation

GPGPU


Abstract

Novel view synthesis based on dense stereo correspondence is an active research problem. Although many algorithms have been proposed recently, this flourishing, cross-area research field still remains less structured than its front-end constituent part, stereo correspondence. Moreover, little work has so far been done to assess different stereo-based view synthesis algorithms, particularly when real-time execution is enforced as a hard application constraint. In this paper, we first propose a unified framework that seamlessly connects stereo correspondence and view synthesis. The proposed framework dissects the typical algorithms into a common set of individual functional modules, allowing the comparison of various design decisions. Aligned with this algorithmic framework, we have developed a flexible GPU-accelerated software model, which contains optimized implementations of several recent real-time algorithms, specifically focusing on the local cost aggregation and image warping modules. Based on this common software model running on graphics hardware, we evaluate the relative performance of various design combinations in terms of both view synthesis quality and real-time processing speed. This comparative evaluation leads to a number of observations, and hence offers useful guidance for the future design of real-time stereo-based view synthesis algorithms.

© 2008 Elsevier B.V. All rights reserved.

1. Introduction

In recent years, there has been significant research interest in stereo-based view synthesis. Stereo-based view synthesis generally estimates a depth map from a given input image pair, and subsequently uses this information as a geometrical proxy to generate any desired novel intermediate view. This technology draws a lot of attention for several reasons. Synthesizing multiple novel intermediate views can drive future 3DTV applications such as autostereoscopic displays, which allow a viewer to perceive natural 3D imagery without the need for special glasses. More practically, the depth-based representation of the scene geometry is consistent with MPEG technology [19]. Furthermore, stereo-based view synthesis approaches can lead to appealing synthetic images with high perceptual quality [32], while still maintaining real-time performance [29]. Therefore, the scope of our research is focused on real-time stereo-based view synthesis algorithms using two-frame input.

Dense depth maps from two-frame input are obtained by stereo correspondence algorithms, which have been intensively studied in recent years. The high-impact work of Scharstein and Szeliski [17] boosted this research with a well-defined taxonomy, as well as a systematic evaluation of these algorithms. However, their work only concerns stereo matching, whereas these fundamental tasks—i.e. stereo correspondence and view synthesis—should be treated in a unified framework when the ultimate goal is photorealistic view synthesis. It therefore still remains unclear how to dissect the view synthesis back-end of this system into standard components that seamlessly connect to the front-end stereo correspondence modules (see Fig. 1). Hence, it is also not clear how well the existing stereo correspondence algorithms perform in such a view synthesis system, especially under the real-time constraint. Due to the necessity of real-time performance, local stereo correspondence algorithms are still the dominant design choice, where the emphasis is put on cost aggregation. Fig. 1 represents the design space of a stereo-based view synthesis system, wherein previous related work is accurately pinpointed. Gong et al. [7] compared and evaluated different cost aggregation approaches—as an extension of the preliminary work in [26]—when these approaches are integrated into a common real-time platform. Although many observations have been obtained from this research, offering useful guidance for the design of real-time stereo matching algorithms, the relative performance of these algorithms for novel view synthesis remains to be investigated. From the perspective of GPU-accelerated software frameworks, OpenVIDIA [4] and GPUCV [3] provide optimized GPU implementations for elementary vision tasks, but they are not specialized in stereo correspondence and view synthesis specifically. To this extent, the framework of Gong et al. [7] does not include any view synthesis related processing modules whatsoever.


Fig. 1. A schematic annotation of the notable research work (Scharstein and Szeliski, 2002; Fung et al., 2005; Farrugia et al., 2006; Wang et al., 2006; Gong et al., 2007) in the design space of a stereo-based view synthesis system. The relatively unstructured view synthesis part is indicated with question marks.

Fig. 2. The unified stereo-based view synthesis framework proposed in this paper. Connected to the identified stereo correspondence modules [17] (the front-end) are the typical modules for view synthesis (the back-end): forward warping (path 1), or inverse warping with hole detection (path 2), followed by a common hole handling module that produces the intermediate view.

The main research goals and contributions of this paper can be summarized as follows:

(1) To propose a unified framework that dissects typical stereo-based view synthesis systems into a set of chained algorithmic building blocks, as an extension of the ideas inspired by Scharstein and Szeliski [17] (cf. Section 2).

(2) To develop a flexible, GPU-accelerated software model with implementations of recent real-time algorithmic components (cf. Section 3). Furthermore, we optimize the implementations specifically in the context of stereo correspondence and view synthesis, in contrast to other more generic GPU frameworks [3,4] (cf. Section 4).

(3) To perform a comparative evaluation of these different real-time stereo-based view synthesis algorithms on a common GPU platform (cf. Section 5).

The paper ultimately concludes and discusses possible future work in Section 6.

2. A unified framework for stereo-based view synthesis algorithms

Regarding stereo matching and view synthesis as parts of a complete system, we present a unified framework for typical stereo-based view synthesis algorithms in this section. Extending the established taxonomy of dense two-frame stereo correspondence algorithms [17], the proposed algorithmic framework focuses on the dissection and comparison of the back-end part of the system. Similar to the stereo algorithm taxonomy, our objective is to structure and assess the typical design decisions for individual algorithmic components in the view synthesis part.

As depicted in Fig. 2, the unified framework further extends the building blocks of stereo correspondence algorithms by adding a consecutive chain of processing modules for view synthesis. There are several advantages associated with this unified framework. First, it can lead to a better understanding of the diversified technologies in this stereo-based view synthesis field. Second, more insights can be gained by examining various implementations of one particular functional component, while fixing the design decisions made for the rest of the modules. Lastly, as with the Middlebury stereo software model, we implemented a GPU-accelerated, module-based software model, which is closely tied to this unified framework. This model also facilitates including other algorithms in the future. Before pinpointing the new functional modules for the view synthesis part, we will first briefly review the existing stereo correspondence modules.

Fig. 3. High-level processing procedures of (a) the forward warping mode, which uses the left (right) disparity map $D_0$ ($D_1$), and (b) the inverse warping mode, which uses a view-dependent disparity map $D_s$.

2.1. Stereo correspondence

Following the taxonomy of [17], stereo correspondence is partitioned into four steps. The matching cost computation is a function $F$ that takes as input an RGB-colored left/right image pair $I^*_0(x, y)$ and $I^*_1(x, y)$, along the $x$ and $y$ spatial dimensions. Based on a given disparity search range $S$, the cost function $F$ generates a 3D cost volume $C(x, y, d)$ according to

$$C(x, y, d) = F\left(I^*_0(x, y),\ I^*_1(x - d, y)\right), \qquad (1)$$

where the disparity hypothesis $d \in S$. Various cost functions have already been proposed.

The next step is cost aggregation, where the raw cost $C(x, y, d)$ is aggregated in a local area—called the support region—of the 3D cost volume. In the most general case, the support area is a 3D domain inside the cost volume, so the aggregation is best described as a 3D convolution with kernel $w(u, v, o)$:

$$A(x, y, d) = w(u, v, o) \ast_u \ast_v \ast_o C(x + u, y + v, d + o), \qquad (2)$$

where each $\ast$ denotes one dimension of the convolution. However, this general case is often simplified to a 2D convolution at a fixed disparity hypothesis, favoring fronto-parallel over slanted surfaces.

After the cost is aggregated, and thus more robust, the disparity selection step selects the best disparity $D(x, y)$ from all hypotheses in the disparity search range $S$. In contrast to global algorithms, local algorithms rely heavily on cost aggregation for good stereo estimation performance. A local winner-takes-all (WTA) strategy that selects the best disparity is then usually applied according to

$$D(x, y) = \arg\min_{d \in S} A(x, y, d). \qquad (3)$$

This makes selecting the best disparity pixel-wise independent, which can be significantly accelerated by the data-parallel architecture of the GPU.
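To make the chain of Eqs. (1)-(3) concrete, the following minimal NumPy sketch builds a cost volume from a simple absolute-difference cost, aggregates it with the 2D simplification of Eq. (2), and applies the WTA selection of Eq. (3). This is an illustrative CPU sketch, not the paper's HLSL implementation; the function name, the grayscale inputs and the off-image border cost are our own assumptions.

    import numpy as np
    from scipy.ndimage import uniform_filter

    def local_stereo(I0, I1, S, r=5):
        # I0, I1: grayscale left/right images (H x W float arrays); S: search range.
        H, W = I0.shape
        agg = np.empty((H, W, S), dtype=np.float32)
        for d in range(S):
            # Eq. (1): absolute-difference matching cost at hypothesis d
            c = np.full((H, W), 255.0, dtype=np.float32)  # large cost where off-image
            c[:, d:] = np.abs(I0[:, d:] - I1[:, :W - d])
            # Eq. (2), simplified to a 2D box aggregation at fixed d
            agg[:, :, d] = uniform_filter(c, size=2 * r + 1)
        # Eq. (3): winner-takes-all disparity selection
        return np.argmin(agg, axis=2)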

As a last step, disparity refinement is an optional post-processing stage that does just that—refining the generated disparity map $D(x, y)$. Similar to the other processing modules, this step comes in a variety of forms. In real-time implementations, however, low-complexity filtering or no refinement at all is generally favored.

2.2. View synthesis

At the highest level of abstraction, view synthesis can be seen as generating a novel intermediate image from the input images, based on an estimated depth map, and filling in the occlusions and holes that inevitably arise during view interpolation [17,23]. Most typically, for two-frame stereo input, a novel view can be synthesized in two different ways, following either the path of forward warping or the path of inverse warping in Fig. 2.

As illustrated in Fig. 3, the key difference between the forward warping and the inverse warping mode lies in whether a view-dependent disparity map $D_s$ is estimated for each requested viewpoint $s$, or a fixed disparity map $D_0$ (or $D_1$) is reused. Though either approach has its pros and cons [17], the two view warping modes share a common hole handling module in the final stage, which is indispensable for photorealistic view synthesis. For the inverse warping mode, an additional hole detection module also exists, explicitly accounting for the visibility difference between the two input stereo images.

2.2.1. Forward warping

Like any other processing module that builds up a stereo-based view synthesis algorithm, forward warping comes in different varieties. Given an image (either left or right) and its associated disparity map, a collection of tiny triangles can be created, and the novel image is then rendered with the constructed 3D surface and the original image as a texture map [14,22]. Splatting [28] is another popular rendering algorithm, which projects each pixel to its location in the novel image. In a more generalized form, the projected area of the warped pixel (or splat size) can be properly computed [18]. Although a line segment appears to be a better splat unit to address gaps [17], a single-pixel splat is more commonly used, especially because of its efficient implementation on GPUs [12].

No matter which variation is implemented, the forward warping module always exhibits the same input and output characteristics, and the same processing concept. As input, the warp uses one of the input images—e.g. the left image $I^*_0(x, y)$—and the associated disparity map $D_0(x, y)$. At an abstract level, forward warping processes this information by projecting the input image color to the correct location inside the novel image plane. Since this operation is in general not a one-to-one mapping, it unavoidably leaves uncolored holes inside the synthesized image output.

One of the greatest advantages of the forward warp is that it always uses either $D_0(x, y)$ or $D_1(x, y)$ when projecting from the left or right view, respectively. The warping operation is therefore able to reuse the disparity map when multiple novel viewpoints need to be synthesized. On the other hand, the synthesis quality of forward warping may suffer from discretization effects. For plausible (artifact-free) view rendering, non-uniform interpolation [13], which affects real-time speed, is required. In addition, a pixel-dependent back-to-front depth ordering has to be enforced.

2.2.2. Inverse warping

Different from the forward warping approach, inverse warping does not need to transfer the estimated geometry (i.e. $D_0$ or $D_1$) from the edge viewpoints to the novel viewpoint $s$. Instead, a dense disparity map $D_s$ is constructed with its image lattice aligned to the desired novel view $I_s$. To this extent, the matching cost function $F$ specified in Eq. (1) is adapted to a more generalized form,

$$C_s(x, y, d) = F\left(I^*_0(x + sd, y),\ I^*_1(x - (1 - s)d, y)\right), \qquad (4)$$

yielding a projected disparity map $D_s(x, y)$ for an intermediate viewpoint $s$. In essence, the inverse warping mode can be regarded as a two-frame variant of the more generalized multiple-frame plane-sweep techniques [24,20], or a simplified version of layered depth images for the novel view geometry representation [18].

Particularly for view synthesis tasks, inverse warping is common practice [5,12,13,29]. One advantage of this approach is that it enables linear resampling of the input stereo images, so the interpolation quality is improved owing to the sampling precision. Another important advantage is that the processing is highly data-parallel and very friendly toward streaming architectures such as GPUs. However, the challenge for the inverse warping approach is to appropriately decide whether a pixel in the novel view is visible or not in the input images $I^*_0$ and $I^*_1$. Additionally, the generated projected disparity map $D_s$ can normally only be applied to synthesize the intermediate image $I_s$; otherwise, the image quality degrades because of two resampling steps [12,18]. As a result, the synthesis algorithm is forced to recompute a disparity map for each intermediate image when multiple novel viewpoints are requested.

2.2.3. Hole detection and hole handling

For high-quality view generation, an image synthesis algorithm should tackle all disturbing visual artifacts resulting from the preceding image warping step. This specifically means that, for the image synthesized along the forward warping path, the pixels exposed as projection holes in the novel view have to be appropriately filled with color. In contrast, for inverse warping, the 'ghosted' regions [17] have to be coped with, which are caused by blending the projections of two different scene points seen by the left and right images. However, unlike the forward warping holes, these problematic ghosted regions are not explicitly exposed in the inverse warping process. For this reason, hole detection is attached to the inverse warping path as an additional functional module. In this paper, we opt for the more generalized term holes to denote a variety of visual artifacts for both view warping approaches. In general, geometrical occlusion is the major cause that gives rise to the holes. Except for some simple view interpolation algorithms [29], various implementations of hole detection exist to enhance the synthesis quality of inverse warping. For instance, Strecha et al. [20] formulated a set of visibility maps in a Bayesian framework, while Lu et al. [12] proposed a lightweight hole detection method for GPUs, based on photometric cross-checking.

Up to this point, both warping approaches export conceptually the same synthesized image with holes to the next module, i.e. hole handling. Ideally, only the right (or left) image would be used to fill the holes in the forward warping output. However, because forward warping can suffer from erroneous projections due to inaccurate geometry information, and occlusions are not the only source of visual artifacts, a unified hole handling module exploiting both source images is applicable to both approaches. The key tasks in the hole handling module are basically two-fold. First, the missing disparity values for the identified holes have to be inferred. Second, the correct source image should be decided, from which the colors for the holes are fetched using the inferred disparity values. Different methods exist for hole handling. Recently, several accurate but complex occlusion modeling and handling schemes have been proposed [2,21], which are typically integrated into a joint optimization framework with stereo matching. In this case, hole handling becomes a trivial post-processing module, where the estimated occlusion maps can simply be loaded for proper handling. Instead, other view synthesis algorithms explicitly tackle the holes as a stand-alone post-processing step, e.g. Zhang et al.'s work [31] and Lu et al.'s work [12]. The advantage of these approaches is that they are oriented toward optimizing the ultimate view synthesis quality, without directly concerning themselves with occlusion modeling accuracy, so the real-time execution speed of the stereo correspondence part remains feasible.

3. Implementation

Following the previously defined unified framework, we implemented a software model covering the entire design space of stereo-based view synthesis algorithms on commodity GPUs. Since real-time stereo correspondence—more specifically, cost aggregation—is a critical part of this processing chain, we mainly focus on the impact of different variations of the aggregation module with regard to the end result of view synthesis. For the proper evaluation of these aggregation modules in the context of view synthesis, a set of well-considered baseline implementations for both the forward warping and inverse warping method is presented.

3.1. Generic software model

The software model we have developed is able to implement and evaluate stream-centric processing in general, although we present its structure in the context of real-time stereo-based view synthesis. As depicted in Fig. 4, the model is composed of three hierarchical layers. Described in bottom-up order, these are:

(1) The computational core, which contains all specific low-level GPU-optimized kernel code in the high-level shading language (HLSL).

(2) The functional units, collected in a separate layer to manage the interface between abstract high-level code and architecture-specific low-level code. They have a one-to-one mapping with the processing modules proposed in the unified framework, but can instantiate different implementations.

(3) The functional chain, which ultimately links and coordinates the functional units to form a complete stream-centric application, in this case a real-time stereo-based view synthesis algorithm.

Fig. 4. The generic software model consists of three layers. Each functional unit can instantiate different approaches, exposing different low-level kernel code.

We present the implementations of all modules involved in the application, giving special attention to the cost aggregation module and the baseline implementations for forward and inverse warping.

3.2. Matching cost computation, disparity selection and disparity refinement

For the matching cost, we have adopted one of the most common approaches, the truncated absolute difference (TAD). The absolute difference of each color channel is first computed and integrated according to the dot product

$$C_{AD}(x, y, d) = \mathbf{g} \cdot \left| I^*_0(x + sd, y) - I^*_1(x - (1 - s)d, y) \right|, \qquad (5)$$

with a grey-scaling vector $\mathbf{g} = \langle 0.299, 0.587, 0.114 \rangle$. This form generically supports both warping methods; $s$ is set to zero when the forward warping method is active. The TAD cost function truncates the absolute difference cost to a maximum value $t$, where $0 \le t \le 255$ (corresponding to 8-bit intensity images), and is given by

$$C_{TAD}(x, y, d) = \frac{255}{t} \cdot \min\left(C_{AD}(x, y, d),\ t\right). \qquad (6)$$

For the disparity selection, we currently only use WTA as described in Eq. (3). Thanks to its data-independent nature, it can be highly optimized on the GPU architecture. We plan to further investigate global optimization methods such as scanline optimization, belief propagation and dynamic programming, since research such as [27] proves the high potential and real-time capabilities of these approaches as well.
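As a concrete illustration of Eqs. (5)-(6), the sketch below computes one TAD cost slice in NumPy; the sub-pixel lookup positions are rounded to integers here, whereas the GPU implementation samples them bilinearly (the function name and the edge clamping are our own assumptions).

    import numpy as np

    def tad_cost(I0, I1, d, s=0.5, t=32.0):
        # I0, I1: H x W x 3 float RGB images; d: disparity hypothesis; s: viewpoint.
        g = np.array([0.299, 0.587, 0.114])            # grey-scaling vector of Eq. (5)
        H, W, _ = I0.shape
        x = np.arange(W)
        x0 = np.clip(x + int(round(s * d)), 0, W - 1)          # left-view lookup
        x1 = np.clip(x - int(round((1 - s) * d)), 0, W - 1)    # right-view lookup
        c_ad = np.abs(I0[:, x0, :] - I1[:, x1, :]) @ g         # Eq. (5)
        return 255.0 / t * np.minimum(c_ad, t)                 # Eq. (6): truncation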

As an optional post-processing step, we have implemented the disparity refinement as a $3 \times 3$ median filter that can be either enabled or disabled, similar to [7].

3.3. Cost aggregation modules

As the most critical part of real-time stereo correspondence algorithms, cost aggregation exists in various flavors and has been intensively studied by Gong et al. [7] in terms of resulting depth quality. Based on this research, we have selected the cost aggregation approaches with the highest potential to generate good quality-complexity trade-offs for real-time stereo-based view synthesis algorithms. The most promising modules are the square windows [17], shiftable windows [17], boundary-guided aggregation (BO) [6], and adaptive weights (AW) [30] approaches. This leaves out the GPU implementations of oriented-rod [9] and adaptive window [25], due to their high complexity and relatively low depth quality output. Additionally, we include a newly proposed truncated windows (TR) scheme [10].

3.3.1. Square windows

The square windows approach computes an aggregation cost $A_{SQ}$, commonly implemented with a box filter over a square window of support size $2r + 1$. The box filter can be seen as a separable 2D convolution with the kernel $w(u, v, o)$ in Eq. (2) all set to one, leading to

$$A'_{SQ}(x, y, d) = \frac{1}{2r + 1} \sum_{u = -r}^{r} C(x + u, y, d), \qquad (7)$$

$$A_{SQ}(x, y, d) = \frac{1}{2r + 1} \sum_{v = -r}^{r} A'_{SQ}(x, y + v, d), \qquad (8)$$

where $A'_{SQ}$ is a temporary horizontal cost.
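The separability of Eqs. (7)-(8) is what makes the box filter cheap: two 1D passes instead of one 2D pass. A minimal sketch using SciPy's 1D uniform filter (the library choice is ours):

    from scipy.ndimage import uniform_filter1d

    def box_aggregate(C, r):
        # C: one H x W cost slice; two 1D mean filters realize Eqs. (7) and (8).
        h = uniform_filter1d(C, size=2 * r + 1, axis=1)     # Eq. (7): horizontal pass
        return uniform_filter1d(h, size=2 * r + 1, axis=0)  # Eq. (8): vertical pass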

3.3.2. Shiftable windows

Shiftable windows generally use a constant window size, but anchored at different places instead of at the center pixel. The effect of shifting windows over a distance $t$ in each direction can be achieved by subsequently running a minimum filter on the square windows output. Since minimum filtering is also separable, it is implemented as

$$A'_{SH}(x, y, d) = \min_{u \in [-t, t]} A_{SQ}(x + u, y, d), \qquad (9)$$

$$A_{SH}(x, y, d) = \min_{v \in [-t, t]} A'_{SH}(x, y + v, d), \qquad (10)$$

with $A'_{SH}$ storing the temporary result. Similar to the implementation of [7], we set $t = r$.
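Analogously, the shiftable-window minimum filter of Eqs. (9)-(10) separates into two 1D passes over the square-window output (a sketch under the same assumptions as above):

    from scipy.ndimage import minimum_filter1d

    def shiftable_aggregate(A_sq, t):
        # A_sq: box-aggregated cost slice; two 1D min filters realize Eqs. (9)-(10).
        h = minimum_filter1d(A_sq, size=2 * t + 1, axis=1)    # Eq. (9)
        return minimum_filter1d(h, size=2 * t + 1, axis=0)    # Eq. (10)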

3.3.3. Boundary-guided aggregation

The aggregation of this approach [6] can be guided by a color-segmented map of the left view, or by first preprocessing the image to obtain boundary information. Since color segmentation is very hard to obtain in real time, the second option is favored for real-time solutions [7]. The preprocessing can be summarized in four steps:

(1) A $3 \times 3$ Gaussian smoothing is performed.

(2) Subsequently, a symmetric neighborhood filter [8] is iterated $n$ times. Equal to the implementation used by [7], we set $n = 4$.

(3) The image gradients of the filtered output in both the horizontal and vertical direction are generated.

(4) A non-maximal suppression is performed to thin the edges and enhance what remains.

The preprocessing only has to be executed once; the output for Teddy is shown in Fig. 5b. We inverted the black background to white for clarity, while the red and green lines indicate horizontal and vertical edges, respectively.

Fig. 5. (a) The left view $I^*_0$ of the Teddy scene. (b) The boundary map of the guided aggregation, with red (green) indicating the horizontal (vertical) intensity boundaries. (c) The selection map of the truncated windows, with black, gray and white indicating a full, rectangular and elementary window, respectively. (d) The first set of horizontal adaptive weights, where dark pixels indicate a high similarity and vice versa.

Based on the boundary map produced by the preprocessing, the aggregation can be efficiently guided to avoid crossing any edges, significantly reducing the possibility of foreground fattening. The aggregation is performed by iterating a $5 \times 5$ separable convolution kernel. The 1D horizontal kernel comes in three different forms: $w_l = \frac{1}{5}\langle 0, 0, 1, 2, 2 \rangle$, $w_r = \frac{1}{5}\langle 2, 2, 1, 0, 0 \rangle$ and $w_f = \frac{1}{5}\langle 1, 1, 1, 1, 1 \rangle$. If there are edges at the left or right side of the center pixel, kernel $w_l$, respectively $w_r$, is used. If there are no detected edges, or edges appear on both sides, kernel $w_f$ is used. Using $w_f$ in the case of edges on both sides helps to remove artifacts that can occur due to over-segmentation in the preprocessing. The vertical aggregation is performed in a similar manner. If these aggregation passes are iterated $k$ times, the resulting window has a size of $4k + 1$ pixels.
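The per-pixel kernel switch can be sketched as follows, assuming precomputed boolean maps edge_l and edge_r that mark a detected boundary immediately to the left or right of each pixel (the map names and explicit loops are our own; the GPU version does this per fragment):

    import numpy as np

    W_L = np.array([0, 0, 1, 2, 2]) / 5.0   # edge on the left: ignore left taps
    W_R = np.array([2, 2, 1, 0, 0]) / 5.0   # edge on the right: ignore right taps
    W_F = np.array([1, 1, 1, 1, 1]) / 5.0   # no edge, or edges on both sides

    def guided_horizontal_pass(C, edge_l, edge_r):
        # One horizontal 5-tap pass of the boundary-guided aggregation.
        H, W = C.shape
        out = np.empty_like(C)
        for y in range(H):
            for x in range(W):
                taps = np.clip(np.arange(x - 2, x + 3), 0, W - 1)
                if edge_l[y, x] and not edge_r[y, x]:
                    w = W_L
                elif edge_r[y, x] and not edge_l[y, x]:
                    w = W_R
                else:
                    w = W_F
                out[y, x] = C[y, taps] @ w
        return out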

3.3.4. Truncated windows

Truncated windows [10] divide a square aggregation window of size $2r + 1$ into four elementary parts that each span one quadrant of the full aggregation window—i.e. an upper-left (UL), upper-right (UR), bottom-left (BL) and bottom-right (BR) part. Each elementary window is then combined with a convolution that uses a separable approximated Laplacian distribution function with fall-off rate $\gamma = 0.5$, according to

$$w(u, v) = e^{-(1/\gamma)|u|} \cdot e^{-(1/\gamma)|v|}, \qquad (11)$$

where $u$, $v$ are the relative horizontal and vertical distances to the center pixel. The convolution kernel is similarly truncated into four kernels, and the aggregation $A_{UL}(x, y, d)$ from the upper-left window can be computed according to

$$A_{UL}(x, y, d) = \frac{w_{UL}(u, v) \ast_u \ast_v C(x + u, y + v, d)}{\mu_{UL}}, \qquad (12)$$

where $\mu_{UL}$ is a constant that normalizes the aggregated window cost. The other costs $A_{UR}$, $A_{BL}$ and $A_{BR}$ are computed in a similar manner.

While these aggregated costs can be used in a four-window approach where the minimum cost out of four is selected, the aggregation scheme is usually extended to a nine-window approach using three different window sizes. Next to the four elementary windows, a left (L), upper (U), right (R) and bottom (B) rectangular window is computed. The complexity of computing these rectangular windows is very low, as follows:

$$A_L(x, y, d) = \tfrac{1}{2} \left( A_{UL}(x, y, d) + A_{BL}(x, y, d) \right), \qquad (13)$$

while $A_U$, $A_R$ and $A_B$ are all derived similarly. The third window size spans the full aggregation window, and its cost $M_F$ is computed in an identical manner, by summing and normalizing $A_U$ and $A_B$, or $A_L$ and $A_R$.

When the minimum of each window set is known, bias terms $\alpha = 0.04$ and $\beta = 0.02$ are applied to favor large windows, similar to [25]. Hence, the final aggregated cost used for the WTA disparity selection is described by

$$M_E(x, y, d) = \min\{A_{UL}, A_{UR}, A_{BL}, A_{BR}\}, \qquad (14)$$

$$M_R(x, y, d) = \min\{A_L, A_U, A_R, A_B\}, \qquad (15)$$

$$A_{TR}(x, y, d) = \min\{M_E + \alpha,\ M_R + \beta,\ M_F\}, \qquad (16)$$

where Eq. (16) selects the best window size. An example for the Teddy scene is given in Fig. 5c: black indicates that a full window, gray that a rectangular window, and white that an elementary window produces the minimum cost.
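Once the four normalized quadrant costs are available, the nine-window selection of Eqs. (13)-(16) reduces to element-wise minima; a NumPy sketch (the quadrant aggregation of Eq. (12) itself is omitted, and the function name is ours):

    import numpy as np

    def truncated_combine(A_ul, A_ur, A_bl, A_br, alpha=0.04, beta=0.02):
        # Eq. (13) and its analogues: rectangular windows from quadrant pairs.
        A_l = 0.5 * (A_ul + A_bl)
        A_r = 0.5 * (A_ur + A_br)
        A_u = 0.5 * (A_ul + A_ur)
        A_b = 0.5 * (A_bl + A_br)
        M_f = 0.5 * (A_u + A_b)   # full window, by summing and normalizing
        M_e = np.minimum(np.minimum(A_ul, A_ur), np.minimum(A_bl, A_br))  # Eq. (14)
        M_r = np.minimum(np.minimum(A_l, A_u), np.minimum(A_r, A_b))      # Eq. (15)
        return np.minimum(np.minimum(M_e + alpha, M_r + beta), M_f)       # Eq. (16)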

3.3.5. Adaptive weights

Aggregating with adaptive weights [30] performs a convolution with a kernel of size $2r + 1$, which uses a different convolution kernel per pixel—i.e. the adaptive weights. The original implementation of [30] uses a non-separable 2D convolution. Since this complexity is $O(r^2)$, Gong et al. [7] proposed to simplify the approach by using a separable convolution with complexity $O(r)$. We adopt this simplification, which starts by generating the adaptive weights in a preprocessing step. Since the adaptive weights are based on the gestalt principles of similarity and proximity, the per-pixel horizontal convolution kernel $w_H(x, y, u)$ is built out of two terms, according to

$$w_H(x, y, u) = e^{-(1/\gamma_s)|\Delta c_{xu}|} + e^{-(1/\gamma_p)|u|}, \qquad (17)$$

with $|\Delta c_{xu}|$ being the Euclidean color distance between $I^*_0(x, y)$ and $I^*_0(x + u, y)$, and $\gamma_s = 17.6$ and $\gamma_p = 40.0$ being the fall-off rates for the similarity and proximity term, respectively. The values of the first set of horizontal weights $w_H(x, y, 1)$ to $w_H(x, y, 4)$ are packed in the RGBA color channels, and visualized in Fig. 5d. The image was inverted for visual clarity.

The disadvantage of this approach is that a lot of memory is needed to store all the weights if the processing is to be performed in parallel. Once all weights are generated (including the vertical ones $w_V(x, y, v)$), the aggregation amounts to convolving according to

$$A'_{WE}(x, y, d) = \frac{w_H(x, y, u) \ast_u C(x + u, y, d)}{\mu_{H_{xy}}}, \qquad (18)$$

$$A_{WE}(x, y, d) = \frac{w_V(x, y, v) \ast_v A'_{WE}(x, y + v, d)}{\mu_{V_{xy}}}, \qquad (19)$$

where $\mu_{H_{xy}}$ and $\mu_{V_{xy}}$ are normalization constants, and $A'_{WE}$ stores the intermediate result of the horizontal aggregation.
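A sketch of the weight generation of Eq. (17) combined with the horizontal pass of Eq. (18), simplified to a grayscale guidance image so that the Euclidean color distance reduces to an intensity difference (that simplification, and the function name, are our own):

    import numpy as np

    def adaptive_weights_horizontal(C, I0, r, gamma_s=17.6, gamma_p=40.0):
        # C: H x W cost slice; I0: H x W grayscale guidance image.
        H, W = C.shape
        num = np.zeros_like(C)
        den = np.zeros_like(C)
        x = np.arange(W)
        for u in range(-r, r + 1):
            xu = np.clip(x + u, 0, W - 1)
            dc = np.abs(I0[:, xu] - I0)                            # color distance
            w = np.exp(-dc / gamma_s) + np.exp(-abs(u) / gamma_p)  # Eq. (17)
            num += w * C[:, xu]
            den += w                                # running normalization (mu_H)
        return num / den                            # Eq. (18): normalized aggregate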

3.4. View synthesis baseline solutions

Following the unified framework defined in Section 2, a set of baseline implementations is presented for both the forward and inverse warping method, as well as for the unified hole handling. These implementations are well considered, and both the forward and inverse warping method have consistent input and output characteristics. Given these circumstances, the baseline solutions enable a comparative evaluation of the various cost aggregation methods, as a critical part of real-time stereo-based view synthesis algorithms.

3.4.1. Forward warping method

Our baseline implementation of the forward warping module uses a single pixel as the projection unit, in contrast to other approaches that use e.g. line segments [17]. It takes in two input elements—i.e. the source view $I^*_0(x, y)$ and the generated depth map $D_0(x, y)$. Subsequently, the novel intermediate image is synthesized according to the following per-pixel pseudo code:

FOR ALL (x, y):
  for (int dm = 0; dm < s*S; dm++)
    if (D0(x + dm, y) * s == dm)
      Is(x, y) = I0(x + dm, y);

With $d_m$ being the estimated motion parallax (i.e. the scaled disparity), a pixel $(x, y)$ in the novel viewpoint $I_s(x, y)$ is filled with color from the source image $I^*_0(x + d_m, y)$ whenever $D_0(x + d_m, y) \cdot s = d_m$. This is equivalent to searching along a scanline in the disparity map and testing whether pixels map to the current location, given the position $s$.

As previously mentioned, it is possible for multiple pixels from the source image to map to the same pixel location in the novel image. With the presented pseudo code, in such a case, the previous color gets overwritten by the color that corresponds to a larger disparity. This approach is known as the occlusion-compatible warping order, since large disparities indicate foreground objects, which occlude background objects with a small disparity. To save computational complexity, the search for these mappings is restricted to the range $s \cdot S$, as it is of no use to search for larger disparities than those that can occur by the rules of motion parallax. As a result of this operation, holes still remain in the synthesized view (see Fig. 6a). As depicted in Fig. 6b, a hole map $H(x, y)$ can be automatically generated after the forward warping. Initially, we set $H(x, y) = 1$ for all $(x, y)$, and unmask it in case of a successful pixel mapping.

Fig. 6. A sample of the forward warping results. (a) The resulting forward warped Teddy image for the center viewpoint, with the holes shown in yellow. (b) These holes are further classified into different categories in the baseline hole handling approach. Red (blue) pixels: near a depth discontinuity, filled with the colors from the left (right) image. Yellow pixels: not obviously caused by geometric occlusions (e.g. due to sub-pixel mis-registration). Green pixels: belonging to the image border regions.

To allow the forward warping and inverse warping method to use a unified hole handling scheme, the disparities $D_0(x + d_m, y)$ are—with minimal additional effort—stored as $D_s(x, y) = D_0(x + d_m, y)$, to obtain a projected disparity map equivalent to the one needed by the inverse warping method. The estimated disparity map can subsequently be used in a unified manner to tackle the holes that remain in the novel synthesized image.
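A direct NumPy transcription of the pseudo code above, which additionally records the hole mask $H$ and the projected disparity map $D_s$ as just described (the array names and the integer rounding are our own):

    import numpy as np

    def forward_warp(I0, D0, s, S):
        # I0: H x W x 3 left image; D0: H x W left disparity map; S: search range.
        H_, W = D0.shape
        Is = np.zeros_like(I0)
        Ds = np.zeros(D0.shape, dtype=D0.dtype)
        Hmask = np.ones((H_, W), dtype=bool)          # initially everything is a hole
        for y in range(H_):
            for x in range(W):
                for dm in range(int(s * S)):          # occlusion-compatible order:
                    if x + dm < W and int(round(D0[y, x + dm] * s)) == dm:
                        Is[y, x] = I0[y, x + dm]      # larger-disparity hits overwrite
                        Ds[y, x] = D0[y, x + dm]      # projected disparity for holes
                        Hmask[y, x] = False           # unmask: successful mapping
        return Is, Ds, Hmask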

3.4.2. Inverse warping method

The path of the inverse warping method involves an inverse warping and a hole detection module, where the inverse warping module takes in three input elements—the source views $I^*_0(x, y)$ and $I^*_1(x, y)$, and the projected disparity map $D_s(x, y)$. Since forward warping can use either the left or the right input image, the input of both the forward and inverse warping method can be consistently defined as the input image pair and a generated disparity map $D(x, y)$. Whereas the forward warping module uses only one of the input images, it consequently has to use both views in the subsequent hole handling module.

Unlike forward warping, inverse warping is a one-to-one pixel mapping that performs a color lookup—for pixel $(x, y)$—from the left view $I^*_0(x_0, y)$ with $x_0 = x + s d_l$, or equivalently from the right view $I^*_1(x_1, y)$ with $x_1 = x - (1 - s) d_l$, where the estimated disparity is $d_l = D_s(x, y)$. To achieve the highest quality results, both color lookups are used in a disparity-compensated blending according to

$$I_s(x, y) = (1 - s)\, I^*_0(x_0, y) + s\, I^*_1(x_1, y). \qquad (20)$$

Although inverse warping generates a completely colored synthesized intermediate view, the result still contains a significant amount of disturbing artifacts (see Fig. 7a). This effect is mainly due to missing (or corrupted) depth information in the dense disparity map, or unresolved visibility differences between the two input views [12]. These areas are therefore best detected and handled properly in the hole detection. As a baseline solution for the hole detection module, we propose a photometric consistency check between the left and right color lookup, which integrates the individual absolute color differences according to the dot product

$$H(x, y) = \left( \mathbf{g} \cdot \left| I^*_0(x_0, y) - I^*_1(x_1, y) \right| > \lambda \right)\ ?\ 1 : 0, \qquad (21)$$

where $H(x, y)$ forms a binary mask—holding the results of the photometric consistency check—indicating the holes, and $\lambda = 20$ is the color threshold. Since this test is based on photometric consistency, various visual artifacts can be identified in $H(x, y)$, whether they are the result of genuine occlusions, or of mixed pixels near object boundaries [32].

Fig. 7. A sample of the inverse warping results. (a) The synthesized center view without hole detection and handling, resulting in notorious visual artifacts. (b) The artifacts detected by the hole detection module. The color coding scheme is the same as in Fig. 6b.
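A NumPy sketch of the disparity-compensated blending of Eq. (20) and the photometric hole detection of Eq. (21), again rounding the sub-pixel lookups that the GPU samples bilinearly (the function name is ours):

    import numpy as np

    def inverse_warp(I0, I1, Ds, s, lam=20.0):
        # I0, I1: H x W x 3 float images; Ds: projected disparity map at viewpoint s.
        g = np.array([0.299, 0.587, 0.114])
        H_, W, _ = I0.shape
        x = np.arange(W)[None, :]
        rows = np.arange(H_)[:, None]
        x0 = np.clip(np.rint(x + s * Ds).astype(int), 0, W - 1)        # left lookup
        x1 = np.clip(np.rint(x - (1 - s) * Ds).astype(int), 0, W - 1)  # right lookup
        c0, c1 = I0[rows, x0], I1[rows, x1]
        Is = (1 - s) * c0 + s * c1                 # Eq. (20): blended color
        Hmask = (np.abs(c0 - c1) @ g) > lam        # Eq. (21): photometric check
        return Is, Hmask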

3.4.3. Unified hole handling

After an intermediate view $I_s(x, y)$ is synthesized by the proposed baseline solutions, the image still contains holes, identified by the binary mask $H(x, y)$. The task of hole handling can be seen as tackling two problems—i.e. how to infer the missing depth information, and deciding which input image to use for the color lookup. Our proposed baseline implementation distinguishes between a left border area with a width of $(1 - s) \cdot B_w$ (where $B_w$ is an empirical border width in proportion to the search range $S$), a right border area with a width of $s \cdot B_w$, and the remaining center area, which covers the largest part of the synthesized image.

To infer the missing depth in the center area, three depth values on both the left and the right side of the hole are sampled. The confident values—i.e. those the hole mask $H(x, y)$ indicates as non-holes—are used to compute weighted averages $d_{w_L}$ and $d_{w_R}$, for the left and right side, respectively. If these two depth values pass a gap test $|d_{w_L} - d_{w_R}| > \delta$—where $\delta$ is set to 2, following the definition of a depth discontinuity [17]—this indicates that the hole originates from a genuine occlusion. In this case, the background disparity (i.e. the smallest value) is chosen to infer the missing depth. If the left disparity $d_{w_L}$ is the background, this indicates an occlusion from the right, so the left image should be chosen for the color lookup, and vice versa. If, however, $d_{w_L}$ and $d_{w_R}$ do not pass the depth gap test, the hole is filled with the depth value and joint color lookup that most closely matches a weighted color average—computed similarly to the inferred missing depth—around the left and right side of the hole.

In the border areas, the missing depth is inferred by padding confident depth values contiguous to the border, from the center area. For the left and right border area, the left and right input image, respectively, is used to perform the color lookup with the padded depth. The border width is empirically set for a given scene. For Tsukuba and Venus [17], we used $B_w = 20$, while for Teddy and Cones [17] we set $B_w = 50$.
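The center-area decision rule condenses to a few lines once the weighted averages on both sides of a hole are available; a sketch with hypothetical helper names (the fallback branch is deliberately abstracted away):

    def classify_hole(d_wl, d_wr, delta=2.0):
        # Gap test for one hole pixel: returns (inferred disparity, color source).
        if abs(d_wl - d_wr) > delta:
            # Genuine occlusion: keep the background (smaller) disparity and fetch
            # color from the side that still observes the background surface.
            return (d_wl, 'left') if d_wl < d_wr else (d_wr, 'right')
        # No depth gap: fall back to the lookup that best matches the local colors.
        return (0.5 * (d_wl + d_wr), 'best_color_match')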

4. Real-time optimization schemes

Real-time acceleration of stereo-based view synthesis algorithms is obtained by exploiting the GPU for general-purpose computations (GPGPU). We implement GPGPU optimizations that apply specifically to stereo correspondence and view synthesis algorithms, unlike more generic GPGPU frameworks such as OpenVIDIA [4] and GPUCV [3], which optimize elementary vision kernels. Furthermore, the applicability of the optimization schemes through next-generation GPGPU APIs, such as the Compute Unified Device Architecture (CUDA) or Brook+, is also discussed. Unfortunately, these frameworks do not support exploiting all hardware available on the graphics card, and are currently still vendor-specific—i.e. CUDA runs on NVIDIA and Brook+ on AMD ATI hardware. This paper therefore focuses on a uniform implementation using solely the graphics API, which is supported on all commodity GPUs. In contrast to conventional GPGPU [16], we drastically raise the utilization of the graphics hardware, and therefore outperform many state-of-the-art implementations in terms of execution speed. This also makes our software model more generically applicable, and allows us to perform more convincing speed tests.

4.1. Stereo correspondence optimization schemes

Following the taxonomy of [17], we propose optimization schemes for the cost computation, cost aggregation, and disparity selection. During the matching cost computation, four disparity estimations are packed into the RGBA channels of a video texture, to harness the powerful Single Instruction Multiple Data (SIMD) vector processing capabilities of the GPU. As a result, the subsequent modules can read four disparity estimations with a single texture operation, while performing all related computational operations in parallel. Although next-generation GPGPU APIs use individual scalar processors for significant general-purpose acceleration in the newest graphics hardware, SIMD is still transparently enforced on multiple 'threads' per multiprocessor. Accessing the hardware through a graphics API will therefore automatically batch 4-component wide SIMD operations on these multiprocessors. Hence, for traditional GPGPU through a graphics API such as Direct3D or OpenGL, this optimization scheme still leads to an approximately 4-fold speedup over the use of a single channel.
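The packing idea can be emulated on the CPU by evaluating four hypotheses along a trailing axis of width 4, mirroring the RGBA channels that the GPU processes with one vector instruction (a functional analogue only; names are ours, and d_base + 3 is assumed to stay within the image width):

    import numpy as np

    def cost_4pack(I0, I1, d_base):
        # I0, I1: H x W grayscale images; returns an H x W x 4 'RGBA-packed'
        # cost block for the hypotheses d_base .. d_base + 3.
        H, W = I0.shape
        packed = np.full((H, W, 4), 255.0)
        for c in range(4):                 # one 'channel' per disparity hypothesis
            d = d_base + c
            packed[:, d:, c] = np.abs(I0[:, d:] - I1[:, :W - d])
        return packed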

For the matching cost aggregation, two optimization schemes can be used. The first accelerates the convolution kernel coordinate calculations for the aggregation window. In our implementations, we have ported these computations from the fragment to the vertex processing stage, by setting them as texture coordinates. This allows us to exploit the linear interpolators in the rasterizer, which leads to a significant speedup, as indicated in our previous research [11]. Since the linear interpolators are graphics-specific hardware and do not run on the main shader (i.e. the scalar) processors, the utilization of these resources is more manageable through a graphics API. The second optimization scheme for the cost aggregation uses bilinear sampling to reduce the number of memory reads by half, thereby improving the arithmetic intensity of the kernel. As memory reads are often the bottleneck in data-parallel applications, the execution speed is drastically accelerated. Since next-generation GPGPU also incorporates a texture memory space, this optimization scheme can conveniently be applied in both the traditional and next-generation paradigms.

The disparity selection is optimized by exploiting the depth test that occurs at the end of the pipeline. The minimal matching cost is written to the depth buffer instead of being rendered to an off-screen video texture. In this way, the integrated depth test can be exploited to compare and save the minimal WTA matching cost and the joint disparity value. This scheme avoids a costly rendering pass, and avoids the need to store the entire cost volume in advance. The depth test hardware is embedded inside the raster operators, which is also graphics-specific hardware. Since this hardware is currently unexposed through next-generation GPGPU APIs, this optimization scheme can only be applied through traditional GPGPU.

4.2. Fast view synthesis using computational masks

The number of holes in a novel intermediate view is often very small compared to the total number of pixels in the image. We therefore propose an optimization scheme that invokes the hole handling kernels only for the required pixels, functioning as computational masks. This optimization scheme exploits the Early-Z mechanism of modern graphics hardware, which allows kernel invocations to be rejected based on depth values inside the depth buffer.

The optimization scheme is illustrated in Fig. 8: the binary mask $H(x, y)$ is written to the depth buffer $Z$ of the graphics hardware, simultaneously with the intermediate view warping. When the hole handling starts, the center area is processed by rendering a rectangle over this area only. By explicitly disabling manual depth writes inside the fragment processing, the Early-Z mechanism is activated. When the rectangle is rendered at $Z = 0.5$ (or any value between 0.0 and 1.0, for that matter), kernel invocations for the non-holes are automatically rejected because the graphics hardware is cleverly 'fooled': the hardware assumes that the kernel generates a pixel lying behind existing rendered geometry, preventing the pixel from being rendered to the screen. Hence, the hardware auto-rejects the kernel invocation, as it would have no impact on the final result. A similar approach is applied to the two rectangles that cover the left and right borders.

Fig. 8. By writing the binary mask $H(x, y)$ into the depth buffer during the image warping, the kernel that handles the holes of the center area can be efficiently invoked by rendering a centralized rectangle at $Z = 0.5$. Both image borders can be handled in a similar manner.

This scheme closely matches the proposed hole handling. In practice, the gained speedup is determined by the number of occluded pixels and, moreover, by the sparseness of these holes, since the GPU can only reject groups of $4 \times 4$ pixels. As this optimization scheme also makes use of the depth buffer, it can only be used in the context of traditional GPGPU.
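Functionally, the Early-Z mask boils down to invoking the hole handling kernel only where $H(x, y)$ is set; a CPU analogue of that behavior (the real hardware rejects whole $4 \times 4$ groups rather than single pixels, and fill_fn is a hypothetical per-pixel kernel):

    import numpy as np

    def masked_hole_fill(Is, Hmask, fill_fn):
        # Run the hole-filling 'kernel' fill_fn(y, x) only on masked pixels.
        for y, x in zip(*np.nonzero(Hmask)):
            Is[y, x] = fill_fn(y, x)
        return Is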

In the case of very sparse hole maps, using the CPU could potentially lead to a faster implementation, certainly when the holes are compacted and the processing is accelerated with the Intel Performance Primitives (IPP) or the Intel Streaming SIMD Extensions (SSE) in general. However, due to the data locality constraint—i.e. the data has to be sent from video to system memory and back—an optimal GPU implementation will generally lead to the maximum overall execution speed when all other kernels are also running on the graphics hardware.

5. Performance evaluation

Based on the developed real-time stereo-based view synthesis framework, we performed a number of experiments to investigate the overall view synthesis performance, using different combinations of the stereo correspondence algorithms and view synthesis modes presented in Section 3. To make the results widely reproducible, we use the well-established Middlebury stereo database [15] as input data. Without loss of generality, we evaluate the quality of the synthesized center-view image $I_{0.5}$ ($s = 0.5$), which is compared with the ground-truth center image $I^*_{0.5}$. More specifically, we use the second and sixth view as the input stereo pair $\{I^*_0, I^*_1\}$ for the Venus, Teddy, and Cones data sets, while the second and fourth image is used for Tsukuba.

We assign an optimal window size to each cost aggregation approach, based on the performance study of real-time stereo correspondence algorithms by Gong et al. [7]. We set the window size to $11 \times 11$ for both the square window and shiftable window approaches, $21 \times 21$ for the boundary-guided approach (BO), and $33 \times 33$ for both the truncated windows (TR) and adaptive weights (AW) approaches. Roughly in ascending order of overall stereo performance, Fig. 9 shows the quantitative disparity estimation error of these stereo algorithms, and Fig. 10 shows the depth map results of the various aggregation approaches for Teddy.

We have carefully optimized all stereo correspondence algorithms by applying the optimization schemes summarized in Section 4. All experiments are performed on an NVIDIA GeForce 8800GT graphics card with 512 MB of GDDR3 video memory.

5.1. Performance measure and region masks

For image interpolation, Baker et al. [1] used the root mean squared error (RMSE) and a gradient-normalized RMSE (NRMSE) as quality metrics. In practice, we observed that NRMSE is more consistent with human perceptual quality, as it avoids over-penalizing interpolation errors along highly textured areas. Therefore, we adopt NRMSE as the principal performance measure for the quality assessment in this paper. The NRMSE is computed over the grey-scaled images $\bar{I}(x, y) = \mathbf{g} \cdot I(x, y)$, according to

$$\mathrm{NRMSE} = \sqrt{ \sum_{x, y} \frac{ \left( \bar{I}_{0.5}(x, y) - \bar{I}^*_{0.5}(x, y) \right)^2 }{ \left\| \nabla \bar{I}^*_{0.5}(x, y) \right\|^2 + 1 } }. \qquad (22)$$
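A NumPy sketch of Eq. (22), using central differences for the ground-truth gradient (the discretization of the gradient and the function name are our choices):

    import numpy as np

    def nrmse(I_syn, I_gt):
        # I_syn, I_gt: grey-scaled synthesized and ground-truth images.
        gy, gx = np.gradient(I_gt)                  # central-difference gradient
        grad_sq = gx ** 2 + gy ** 2
        return np.sqrt(np.sum((I_syn - I_gt) ** 2 / (grad_sq + 1.0)))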

Fig. 9. The error rates (%) at non-occluded regions of the estimated disparity maps, using five different local stereo approaches (SQ, SH, BO, TR, AW) on the Tsukuba, Venus, Teddy and Cones scenes.

Fig. 10. The estimated disparity maps for the Teddy scene, using the (a) square windows, (b) shiftable windows, (c) boundary-guided, (d) truncated windows, and (e) adaptive weights approach.

At present, we compute the NRMSE of the interpolated image over two types of region masks: all and non-occluded (nonocc), for the center viewpoint. Different from the all region mask in Baker et al.'s work [1], we exclude neither the image border regions nor the half-occluded pixels, as our baseline hole handling module tackles these areas adequately. The nonocc region masks for Venus, Teddy, and Cones are shown in Fig. 11; they are warped from the left viewpoint using the ground-truth disparity maps. Since the Tsukuba data set does not provide a ground-truth disparity map for the second viewpoint, we are unable to build the occlusion map for this scene. As revealed in Baker et al.'s evaluation, most stereo correspondence algorithms lead to quite good interpolation quality in textureless regions, so we do not report the performance for these regions separately.

Fig. 11. The generated center-view occlusion masks for (a) Venus, (b) Teddy, and (c) Cones, based on the ground-truth disparity maps. Errors are only evaluated in the white regions.

5.2. Forward warping performance

This experiment uses the proposed baseline solution for forward warping, while switching between the square window, shiftable window, boundary-guided, truncated windows, and adaptive weights cost aggregation algorithms. We used a fixed truncation value of $t = 32$, which is discussed further below, and the generated disparity map is filtered by the $3 \times 3$ median filter. The results depicted in Fig. 12 indicate that:

• In terms of nonocc NRMSE (Fig. 12b), shiftable window clearly performs the worst. In particular, square window does a good job, which indicates that over-fattening is 'healthy' in this case, so it performs better than shiftable window in view synthesis.

• Square window, truncated windows, and adaptive weights are close to each other when considering nonocc NRMSE. Especially for Venus, all aggregation approaches perform almost equivalently, because of its large planar areas.

Fig. 12. Performance of the forward warping for different cost aggregation approaches. (a) All NRMSE, (b) nonocc NRMSE, (c) quality-complexity plot, and the comparison between all and nonocc NRMSE for the (d) Venus, (e) Teddy, and (f) Cones scenes.

Fig. 13. The forward warped view with holes in yellow for (a) the square windows and (b) the adaptive weights approach.

• In terms of all NRMSE (Fig. 12a), the adaptive weights approach yields the overall best quality, followed by truncated windows. This means that the hole handling in half-occluded and border regions benefits from accurate depth. Fig. 13 gives such an example, comparing the hole maps of the square window and adaptive weights approaches. Again, shiftable window performs worse than square window.

• From Fig. 12d–f, it is clear that view synthesis for all regions is consistently more difficult than view synthesis for nonocc regions, for all algorithms. This makes sense, because non-occluded regions are visible in both input images.

• It can be noticed from Fig. 12d that shiftable window shows the largest performance degradation when changing from nonocc to all NRMSE. This indicates that the baseline hole handling algorithm performs very well for the other algorithms, but that it cannot be regarded as a cure-all for stereo correspondence algorithms that perform poorly in all regions.

• Where view synthesis is concerned, it is observed from Fig. 12a that Cones is more difficult than Tsukuba, followed by Teddy, and finally Venus.

• Considering the quality-complexity plot in Fig. 12c, it is observed that all algorithms run in real time, for all data sets. Even for the adaptive weights approach running on Teddy and Cones, the execution speed exceeds 50 fps. Surprisingly, square window produces fair quality given its low complexity. The reasons are twofold: (1) over-fattening is sometimes good for view interpolation [1]; and (2) the proposed baseline hole handling scheme assists in improving the overall synthesis quality.
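As referenced above, here is a minimal CPU sketch of the forward warping step: every left-view pixel is splatted to the intermediate viewpoint alpha (0 = left view, 1 = right view) by shifting it with its scaled disparity, overlaps are resolved with a disparity z-test, and unreached pixels stay marked as holes for the hole handling module. The nearest-pixel rounding, single-channel input, and function name forward_warp are simplifications and naming of ours, not the literal GPU implementation.

#include <vector>
#include <cstddef>

// Forward-warp a grayscale left view to the viewpoint alpha in [0,1],
// using the left disparity map. Unwritten pixels stay flagged as holes.
void forward_warp(const std::vector<unsigned char>& left,
                  const std::vector<float>& disp,
                  float alpha, int width, int height,
                  std::vector<unsigned char>& out,
                  std::vector<unsigned char>& hole) // 1 = hole
{
    std::vector<float> zbuf(std::size_t(width) * height, -1.0f);
    out.assign(zbuf.size(), 0);
    hole.assign(zbuf.size(), 1);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            const float d = disp[std::size_t(y) * width + x];
            const int xt = int(float(x) - alpha * d + 0.5f); // shift toward target view
            if (xt < 0 || xt >= width) continue;
            const std::size_t idx = std::size_t(y) * width + xt;
            if (d > zbuf[idx]) {            // larger disparity = closer surface wins
                zbuf[idx] = d;
                out[idx]  = left[std::size_t(y) * width + x];
                hole[idx] = 0;
            }
        }
}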

5.3. Inverse warping performance

In the following experiment we use the baseline implementation of the inverse warping method to evaluate the influence of the different cost aggregation approaches on this view synthesis method. The truncation value is again set to t = 32, and median filtering is enabled. Our results depicted in Fig. 14 indicate the following (a minimal sketch of the inverse warping step is given after these observations):

Fig. 14. Performance of the inverse warping for different cost aggregation approaches. (a) All NRMSE, (b) nonocc NRMSE, (c) quality-complexity plot, and the comparison between all and nonocc NRMSE for the (d) Venus, (e) Teddy, and (f) Cones scenes.

• In terms of both nonocc NRMSE and all NRMSE, truncated windows is the winner, followed by adaptive weights. In the particular case of Cones, the adaptive weights approach is better than truncated windows, because this scene contains more subtle scene geometry.

• Shiftable window is no longer significantly worse than the others, as opposed to the observation from the forward warping method. This indicates that inverse warping is not as sensitive as forward warping to disparity errors, in all regions as well as nonocc regions. The reason is that the baseline hole handling algorithm performs better when holes are detected rather than projected. This is illustrated by the difference shown in Fig. 15, which gives an example of the shiftable window approach used with forward warping and with inverse warping.

• Although the boundary-guided aggregation performs similar to square window in forward warping, its performance is worse than square window when inverse warping is applied, considering Teddy and Cones. The reason is that these data sets have a relatively large baseline distance, while the boundary-guided aggregation is not directly designed for inverse warping. More specifically, the boundary-guided aggregation relies on the left edge map, which cannot be accurately reused for the center view.

• For nearly the same reason (but less severely, because of its soft weighting), the adaptive weights approach does not lead to the best view synthesis quality when used with inverse warping, as it does with forward warping, across all tested data sets.

• From Fig. 14d–f, it is clear that view synthesis for all regions is consistently more difficult than view synthesis for nonocc regions, for all algorithms.

• Considering the quality-complexity plot in Fig. 14c, it is clear that truncated windows is the optimal trade-off point, running two times faster than adaptive weights while yielding fewer interpolation errors. In addition, square window is a good candidate as well when the view synthesis is performed on a weak GPU. However, it does not generate an accurate depth map, should one be desired for other high-level vision tasks.

Fig. 15. Different distributions of holes (shown in yellow), when the shiftable window approach is used with (a) the forward warping mode, and (b) the inverse warping mode.
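As announced above, the following sketch illustrates the inverse warping idea under our assumptions: a disparity map has already been projected to the target viewpoint (e.g. with a forward_warp-style splat of disparities), and every non-hole target pixel then fetches its color from the left input image, so holes are detected where no disparity landed rather than projected into the output. The baseline described in this paper may additionally blend the left and right views; this single-image fetch and the name inverse_warp are simplifications of ours.

#include <vector>
#include <cstddef>

// Inverse warping: each non-hole target pixel samples the left input image
// at the position given by its (already projected) target-view disparity.
void inverse_warp(const std::vector<unsigned char>& left,
                  const std::vector<float>& disp_t,        // target-view disparities
                  const std::vector<unsigned char>& hole,  // 1 = no disparity landed
                  float alpha, int width, int height,
                  std::vector<unsigned char>& out)
{
    out.assign(std::size_t(width) * height, 0);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            const std::size_t i = std::size_t(y) * width + x;
            if (hole[i]) continue;                         // left to hole handling
            const int xs = int(float(x) + alpha * disp_t[i] + 0.5f); // back into left view
            if (xs >= 0 && xs < width)
                out[i] = left[std::size_t(y) * width + xs];
        }
}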

5.4. Cross comparing forward and inverse warping

The results of the previous two experiments, which evaluated the forward and inverse warping, are now organized in a common context: we perform a cross comparison between the two warping methods. The results depicted in Fig. 16 indicate the following:

• In terms of both nonocc NRMSE and all NRMSE, inverse warping performs consistently better than forward warping, for all tested stereo algorithms. The inverse warping curve also fluctuates less than the forward warping curve, which indicates that forward warping depends more strongly on the accuracy characteristics of the depth map.

• The average performance gap between the inverse and forward warping modes shrinks as the estimated depth map gets more accurate. This makes sense, as the two types of hole maps are then better aligned with the depth discontinuities, so the unified hole handling can treat these holes on a more solid basis (a simple hole filling sketch follows this list).

• The shiftable window approach gains the most by changing from the forward to the inverse warping mode, which is most visible from Fig. 16e, considering Teddy. Fig. 17a shows the gradient-normalized difference map for Teddy when the center view is synthesized in the forward warping mode, whereas Fig. 17b depicts the difference map when shiftable window is used with the inverse warping method. A significant decrease in edge and overall errors can be noticed, demonstrating the clear superiority of the inverse warping method.

• When a single view is desired from the given input image pair, inverse warping is therefore generally preferred over the forward warping mode: there is a noticeable quality difference, while the execution speeds remain very close to each other (see Fig. 18).

Fig. 16. Cross comparison of the forward and inverse warping mode, measuring all NRMSE for (a) Venus, (b) Teddy, (c) Cones, and nonocc NRMSE for (d) Venus, (e) Teddy, and (f) Cones.

Fig. 17. Different gradient-normalized difference maps, when the shiftable windows approach is used with (a) the forward warping mode, and (b) the inverse warping mode.

Fig. 18. Performance comparison of the forward and inverse warping method, using the most complex Cones scene.
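To illustrate what treating holes "on a solid basis" can look like, here is a deliberately simple hole filling sketch, not the paper's actual baseline module: each hole pixel is filled from its nearest non-hole neighbor on the same scan line, preferring the side with the smaller (background) disparity so that foreground colors do not bleed into disoccluded areas. The function name fill_holes and this exact strategy are ours.

#include <vector>
#include <cstddef>

// Scan-line hole filling: copy the color of the nearest non-hole neighbor,
// choosing the background side (smaller disparity) when both sides exist.
void fill_holes(std::vector<unsigned char>& img,
                const std::vector<float>& disp,
                const std::vector<unsigned char>& hole, // 1 = hole
                int width, int height)
{
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            const std::size_t i = std::size_t(y) * width + x;
            if (!hole[i]) continue;
            int xl = x, xr = x;                   // search both directions
            while (xl >= 0 && hole[std::size_t(y) * width + xl]) --xl;
            while (xr < width && hole[std::size_t(y) * width + xr]) ++xr;
            if (xl < 0 && xr >= width) continue;  // entire scan line is a hole
            std::size_t src;
            if (xl < 0)           src = std::size_t(y) * width + xr;
            else if (xr >= width) src = std::size_t(y) * width + xl;
            else {                                // prefer the background side
                const std::size_t il = std::size_t(y) * width + xl;
                const std::size_t ir = std::size_t(y) * width + xr;
                src = (disp[il] <= disp[ir]) ? il : ir;
            }
            img[i] = img[src];
        }
}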

5.5. Stereo sensitivity

In this experiment we investigate the sensitivity of the resulting view synthesis quality to the median filtering used for disparity refinement, as well as the impact of variations of the truncation value in the matching cost computation. Fig. 19a–c shows the quality impact of median filtering in terms of nonocc NRMSE, and indicates the following:

• The various cost aggregation approaches seem to be quite insensitive to median filtering, except for the adaptive weights approach, which always benefits from this type of disparity refinement. This finding is consistent with [7], which already noticed that the adaptive weights approach significantly benefits from median filtering in terms of the accuracy of the resulting depth maps.

• In planar regions, both the square window and the boundary-guided aggregation approach also show a significant increase in quality (Fig. 19a).

• Considering its low complexity, median filtering is in general best enabled, since no approach is really hurt in performance (a reference sketch of the 3×3 median filter follows this list).

Fig. 19. Impact of median filtering for the inverse warping method, measuring nonocc NRMSE for (a) Venus, (b) Teddy, and (c) Cones.
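For reference, the 3×3 median filter used as the disparity refinement module can be sketched as follows on the CPU; leaving border pixels unfiltered and the name median3x3 are simplifications of ours.

#include <vector>
#include <algorithm>
#include <cstddef>

// 3x3 median filter over a disparity map; borders are copied unchanged.
void median3x3(const std::vector<float>& in, std::vector<float>& out,
               int width, int height)
{
    out = in;
    for (int y = 1; y < height - 1; ++y)
        for (int x = 1; x < width - 1; ++x) {
            float w[9];
            int k = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    w[k++] = in[std::size_t(y + dy) * width + (x + dx)];
            std::nth_element(w, w + 4, w + 9); // the 5th of 9 samples is the median
            out[std::size_t(y) * width + x] = w[4];
        }
}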

Furthermore, Fig. 20a–c depicts the sensitivity of the view synthesis quality to the truncation value setting. These results indicate the following:

• The view synthesis quality is much less sensitive to truncation value variations than the resulting quality of the generated depth maps [7].

• Low truncation values in the matching cost lead to a significant decrease in view synthesis quality. From t = 32 up to 64, the quality remains quite stable, and the view synthesis is relatively insensitive to further variations. Nonetheless, we set t = 32 as the optimal value, since accurate disparity maps are often favored for hole handling, geometrically correct view synthesis, and possibly other high-level vision tasks (a one-line sketch of this truncated cost follows the list).

Fig. 20. View synthesis quality measured in nonocc NRMSE, as a function of the matching cost truncation value. Tested are the (a) square window, (b) truncated windows, and (c) adaptive weights approaches, in combination with the inverse warping method.
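The truncated matching cost referred to throughout these experiments can be written in essentially one line. This sketch assumes an absolute-difference base cost on 8-bit intensities; the base cost and the name truncated_cost are our assumptions.

#include <cmath>
#include <algorithm>

// Truncated absolute-difference matching cost; t = 32 in our experiments.
// Clamping at t keeps outliers from dominating the subsequent aggregation.
inline float truncated_cost(unsigned char left_px, unsigned char right_px,
                            float t = 32.0f)
{
    const float ad = std::fabs(float(left_px) - float(right_px));
    return std::min(ad, t);
}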

5.6. Complexity profiling

Although the inverse warping method is the obvious winner in the cross comparison, this does not render forward warping useless. As previously indicated, forward and inverse warping have near-equivalent execution speeds, and the workload distribution of both warping methods is also quite similar for single-viewpoint synthesis. Fig. 21a and b show the profiled workload distribution over all processing modules involved in the inverse warping method, using square window and adaptive weights, respectively (a sketch of such per-module timing is given after these observations). These profiling results clearly indicate the following:

• When using adaptive weights, the cost aggregation is by far the largest part of the complexity in stereo-based view synthesis. In contrast, the square window approach spends drastically less complexity on the aggregation.

• For an optimal end-to-end performance of stereo-based view synthesis algorithms, a good complexity allocation over these processing modules is desired, which again points to the optimal quality-complexity trade-off of the truncated windows approach.

• Since inverse warping needs to generate a projected disparity map for every requested view, repeating the most time-consuming cost aggregation step, multiple view synthesis for a given stereo pair can rapidly inflate the total algorithm complexity.

• For multiple view synthesis, the forward warping method therefore represents an optimal quality-complexity trade-off, since it only needs to generate a single disparity map for any requested intermediate view.

Fig. 21. Workload distribution for the inverse warping mode, using (a) the square windows and (b) the adaptive weights algorithm.
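As a hedged illustration of how a per-module workload breakdown like Fig. 21 can be collected, the small harness below times arbitrary module calls on the host. The module names in the usage comment are placeholders, and for asynchronous GPU kernels the device must be synchronized (or GPU timer queries used) before reading the clock, otherwise launch latency rather than kernel time is measured.

#include <chrono>
#include <utility>

// Time a callable and return the elapsed wall-clock time in milliseconds.
template <typename F>
double time_ms(F&& module)
{
    const auto t0 = std::chrono::steady_clock::now();
    std::forward<F>(module)();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Usage (module calls are placeholders):
//   double t_aggr = time_ms([&] { /* cost aggregation + selection */ });
//   double t_warp = time_ms([&] { /* inverse warp + hole detection */ });
//   double share  = t_aggr / (t_aggr + t_warp /* + ... */) * 100.0; // percent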

6. Conclusion and future work

We have proposed a framework that unifies stereo correspondence and view synthesis, dissecting typical stereo-based view synthesis algorithms into a set of chained algorithmic building blocks. On this basis, we have developed a flexible real-time software framework on the GPU, which contains various implementations of sample algorithmic building blocks, with specific interest in the cost aggregation and image warping modules. We implemented the square window, shiftable window, boundary-guided, truncated windows, and adaptive weights approaches in a common platform, specifically optimized in the context of view synthesis-driven GPU-based computing.

In combination with the baseline implementations of both the forward and inverse warping approaches, these local stereo methods have been comparatively evaluated in terms of view synthesis quality and processing speed. Performed on the same commodity GPU, our experiments have yielded a rich set of conclusions, which are useful guides for the future design of an optimal stereo-based view synthesis system. In particular, we have found that the forward warping mode is more sensitive to geometry accuracy, while inverse warping performs consistently better across all test data sets, given the same stereo matching approach. As there is little difference in execution speed, inverse warping ought to be favored over forward warping when one intermediate view (e.g. the center view) is desired from a given stereo pair. When considering multiple view synthesis, however, the forward warp proves its potential to strike an optimal trade-off, as it avoids regenerating a depth map for each intermediate view. Concerning the relative performance of the different cost aggregation approaches, the truncated windows approach achieves an optimal trade-off between synthesis quality and real-time speed, especially when the different stereo algorithms are tested with the inverse warping mode. Furthermore, we observed that all tested view synthesis combinations achieve a minimum of 50 fps on our mainstream GPU. Interestingly, the square window approach, although the fastest, yields quite reasonable view synthesis quality. This suggests that for high-speed view synthesis on weak GPUs, the square window approach combined with the baseline view synthesis implementations presented in this paper could be a good design choice.

For future work, we plan to perform an explicit measurement and comparison of the view synthesis performance along depth discontinuities. Furthermore, an interesting research option is to adapt the original boundary-guided and adaptive weights aggregation approaches so that they connect better to the inverse warping mode.

Since the recent introduction of full interoperability between Direct3D and CUDA v2.0, we will carefully investigate which kernels are most appropriate for either the traditional or the next-generation GPGPU paradigm. The graphics hardware can thereby always be exploited in its most efficient way, resulting in an optimally intertwined implementation. As the horsepower of modern GPUs keeps rising, and next-generation GPGPU paradigms exhibit more and more programming flexibility, we envision that more accurate yet computationally challenging stereo and synthesis implementations can be included in the present real-time software framework in the near future.

Acknowledgement

Sammy Rogmans would like to thank the IWT for its financial support, granted through Hasselt University under grant number SB071150.

References

[1] S. Baker, D. Scharstein, J. Lewis, S. Roth, M. Black, R. Szeliski, A database and evaluation methodology for optical flow, in: Proceedings of the IEEE International Conference on Computer Vision, Rio de Janeiro, Brazil, 2007.

[2] Y. Deng, Q. Yang, X. Lin, X. Tang, Stereo correspondence with occlusion handling in a symmetric patch-based graph-cuts model, IEEE Trans. Pattern Anal. Mach. Intell. 29 (6) (2007) 1068–1079.

[3] J.-P. Farrugia, P. Horain, GPUCV: a framework for image processing acceleration with graphics processors, in: International Conference on Multimedia and Expo, Toronto, Canada, 2006.

[4] J. Fung, S. Mann, OpenVIDIA: parallel GPU computer vision, in: Proceedings of the ACM International Conference on Multimedia, Hilton, Singapore, 2005.

[5] I. Geys, T.P. Koninckx, L.V. Gool, Fast interpolated cameras by combining a GPU based plane sweep with a max-flow regularisation algorithm, in: Proceedings of the 3DPVT, Thessaloniki, Greece, 2004.

[6] M. Gong, R. Yang, Image-gradient-guided real-time stereo on graphics hardware, in: Proceedings of the International Conference on 3-D Digital Imaging and Modeling, Washington, DC, USA, 2005.

[7] M. Gong, R. Yang, L. Wang, M. Gong, A performance study on different cost aggregation approaches used in real-time stereo matching, Int. J. Comput. Vision 75 (2) (2007) 283–296.

[8] D. Harwood, M. Subbarao, H. Hakalahti, L. Davis, A new class of edge-preserving smoothing filters, Pattern Recognition Lett. 6 (3) (1987) 155–162.

[9] J. Kim, K. Lee, B. Choi, S. Lee, A dense stereo matching using two-pass dynamic programming with generalized ground control points, in: IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, San Diego, USA, 2005.

[10] J. Lu, G. Lafruit, F. Catthoor, Fast variable center-biased windowing for high-speed stereo on programmable graphics hardware, in: Proceedings of the IEEE International Conference on Image Processing, San Antonio, TX, USA, 2007.

[11] J. Lu, S. Rogmans, G. Lafruit, F. Catthoor, High-speed dense stereo via directional center-biased support windows on programmable graphics hardware, in: IEEE Computer Society 3DTV-CON, Kos, Greece, 2007.

[12] J. Lu, S. Rogmans, G. Lafruit, F. Catthoor, High-speed stream-centric dense stereo and view synthesis on graphics hardware, in: Proceedings of the IEEE International Workshop on Multimedia Signal Processing, Crete, Greece, 2007.

[13] A. Mancini, J. Konrad, Robust quadtree-based disparity estimation for the reconstruction of intermediate stereoscopic images, in: Proceedings of the SPIE Stereoscopic Displays and Virtual Reality Systems, vol. 3295, 1998.

[14] L. Matthies, T. Kanade, R. Szeliski, Kalman filter-based algorithms for estimating depth from image sequences, Int. J. Comput. Vision 3 (3) (1989) 209–238.

[15] Middlebury Stereo Vision page, http://vision.middlebury.edu/stereo.

[16] J. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A. Lefohn, T. Purcell, A survey of general-purpose computation on graphics hardware, Comput. Graphics Forum 26 (1) (2007) 80–113.

[17] D. Scharstein, R. Szeliski, A taxonomy and evaluation of dense two-frame stereo correspondence algorithms, Int. J. Comput. Vision 47 (1–3) (2002) 7–42.

[18] J. Shade, S. Gortler, L.-W. He, R. Szeliski, Layered depth images, in: Proceedings of the ACM SIGGRAPH, New York, USA, 1998.

[19] A. Smolic, K. Mueller, N. Stefanoski, J. Ostermann, A. Gotchev, G. Akar, G. Triantafyllidis, A. Koz, Coding algorithms for 3DTV: a survey, IEEE Trans. Circuits Systems Video Technol. 17 (11) (2007) 1606–1621.

[20] C. Strecha, R. Fransens, L.V. Gool, Wide-baseline stereo from multiple views: a probabilistic account, in: IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, Los Alamitos, USA, 2004.

[21] J. Sun, Y. Li, S. Kang, H.-Y. Shum, Symmetric stereo matching for occlusion handling, in: IEEE Conference on Computer Vision and Pattern Recognition, San Diego, USA, 2005.

[22] R. Szeliski, Video mosaics for virtual environments, IEEE Comput. Graphics Appl. 16 (2) (1996) 22–30.

[23] R. Szeliski, Prediction error as a quality metric for motion and stereo, in: Proceedings of the International Conference on Computer Vision, vol. 2, Kerkyra, Greece, 1999.

[24] R. Szeliski, P. Golland, Stereo matching with transparency and matting, Int. J. Comput. Vision 32 (1) (1999) 45–61.

[25] O. Veksler, Fast variable window for stereo correspondence using integral images, in: IEEE Conference on Computer Vision and Pattern Recognition, Madison, USA, 2003.

[26] L. Wang, M. Gong, M. Gong, R. Yang, How far can we go with local optimization in real-time stereo matching, in: Proceedings of the 3DPVT, Chapel Hill, USA, 2006.

[27] L. Wang, M. Liao, M. Gong, R. Yang, D. Nister, High-quality real-time stereo using adaptive cost aggregation and dynamic programming, in: Proceedings of the 3DPVT, Chapel Hill, USA, 2006.

[28] G. Wolberg, Digital Image Warping, IEEE Computer Society Press, Silver Spring, MD.

[29] R. Yang, G. Welch, G. Bishop, Real-time consensus-based scene reconstruction using commodity graphics hardware, in: Proceedings of the Pacific Graphics, Beijing, China, 2002.

[30] K.-J. Yoon, I.-S. Kweon, Adaptive support-weight approach for correspondence search, IEEE Trans. Pattern Anal. Mach. Intell. 28 (2006) 650–656.

[31] L. Zhang, D. Wang, A. Vincent, Adaptive reconstruction of intermediate views from stereoscopic images, IEEE Trans. Circuits Systems Video Technol. 16 (1) (2006) 102–113.

[32] C.L. Zitnick, S.B. Kang, Stereo for image-based rendering using image over-segmentation, Int. J. Comput. Vision 75 (1) (2007) 49–65.