Title: Mutual information-based depth estimation and 3D reconstruction for image-based rendering systems
Advisor(s): Chan, SC; Chang, C
Author(s): Zhu, Zhenyu
Citation:
Issued Date: 2012
URL: http://hdl.handle.net/10722/173910
Rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
Mutual Information-based Depth Estimation
and 3D Reconstruction for Image-basedRendering Systems
by
ZHU Zhenyu
B.Eng.
Ph. D. Thesis
A thesis submitted in partial fulfillment of the requirements for
the Degree of Doctor of Philosophy
at the University of Hong Kong
July 2012
Abstract of thesis entitled
Mutual Information-based Depth Estimation
and 3D Reconstruction for Image-based
Rendering Systems
Submitted by
ZHU Zhenyu
for the degree of Doctor of Philosophy
at the University of Hong Kong
in July 2012
Image-based rendering (IBR) is an emerging technology for
rendering photo-realistic views of scenes from a collection of densely
sampled images or videos. It provides a framework for developing
revolutionary virtual reality and immersive viewing systems. There has
been considerable progress recently in the capturing, storage and
transmission of image-based representations. This thesis proposes two
image-based rendering (IBR) systems for improving the viewing
freedom and environmental modeling capability of conventional static
IBR systems. The first system consists of a circular array with 13 still
cameras (Canon 550D) for capturing ancient Chinese artifacts at high
resolution. The second one is constructed by mounting a linear array of 8
video cameras (Sony HDR-TGIE) on an electrically controllable wheel
chair whose motion can be controlled manually or remotely through a wireless local area network (LAN) by means of additional hardware circuitry.
Both systems support object-based rendering and 3D reconstruction
capability and consist of two main components. 1) A novel view
synthesis algorithm using a new segmentation and mutual information
(MI)-based algorithm for dense depth map estimation, which relies on segmentation, local polynomial regression (LPR)-based depth map smoothing and an MI-based matching algorithm to iteratively estimate the
depth map. The method is very flexible and both semi-automatic and
automatic segmentation methods can be employed. They rank fourth and
sixth, respectively, in the Middlebury comparison of existing depth
estimation methods. This allows high quality renderings of outdoor and
indoor scenes with improved mobility/freedom to be obtained. This
algorithm can also be extended to object tracking. Experimental results
also show that the proposed MI-based algorithms are applicable to
robust registration in noisy dynamic ultrasound images. 2) A new 3D
reconstruction algorithm which utilizes sequential-structure-from-motion
(S-SFM) technique and the dense depth maps estimated previously. It
relies on a new iterative point cloud refinement algorithm based on
Kalman filter (KF) for outlier removal and the segmentation-MI-based
algorithm to further refine the correspondences and the projection
matrices. The mobility of our system allows us to recover the 3D models of static objects more conveniently from the improved point cloud
using a new robust radial basis function (RBF)-based modeling
algorithm to further suppress possible outliers and generate smooth 3D
meshes of objects. Moreover, a new rendering technique named view
dependent texture mapping is used to enhance the final rendering effect.
Experimental results show that the proposed 3D reconstruction
algorithm significantly reduces the adverse effect of the outliers and
produces high quality renderings using view dependent texture mapping
and the model reconstructed.
Overall, this study provides a framework for designing IBR systems
with improved viewing freedom and ability to cope with moving and
static objects in indoor and outdoor environment.
An abstract of exactly 439 words
Declaration
I hereby declare that this dissertation, submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy and entitled
“Mutual Information-based Depth Estimation and 3D Reconstruction for
Image-based Rendering Systems” represents my own work except where
due acknowledgement is made, and has not been previously included in
a thesis, dissertation, or report submitted to this or any other institution
for a degree, diploma or other qualification.
Zhu Zhenyu
August 2012
Acknowledgement
First of all, I would like to extend my sincere gratitude to my supervisors, Dr. S. C. Chan and Dr. C. Q. Chang, for their instructive advice and useful suggestions on my thesis. Without their consistent and illuminating guidance, this thesis would not have been possible.
Besides, I highly appreciate all the postgraduate students and staff in
the Digital Signal Processing (DSP) Laboratory for their helpful
discussion and support. They are: Dr. K. T. Ng, Dr. Z. G Zhang, Dr. K.
M. Tsui, Mr. James Koo, Mr. B. Liao, Mr. C. Wang, Mr. S. Zhang, Mr.
H. C. Wu and Miss Y. J. Chu.
Last but not least, my thanks go to my beloved family for their patience and love all through these years.
Contents
DECLARATION .......................................................... IV
ACKNOWLEDGEMENT ...................................................... V
CONTENTS ............................................................. VI
LIST OF FIGURES ...................................................... X
LIST OF TABLES ....................................................... XV
LIST OF ABBREVIATIONS ................................................ XVI
CHAPTER 1 INTRODUCTION ............................................... 1
1.1 BACKGROUND ....................................................... 1
1.2 THESIS OUTLINE ................................................... 4
CHAPTER 2 REVIEW OF BASIC TOPICS IN IMAGE-BASED RENDERING ........... 9
2.1 INTRODUCTION ..................................................... 9
2.2 REVIEW OF PLENOPTIC FUNCTION .................................... 9
2.2.1 Basic Theory ................................................... 10
2.3 REVIEW OF LIGHT FIELD ............................................ 13
2.3.1 Creating/capturing light field ................................. 13
2.4 REVIEW OF RENDERING TECHNIQUES ................................... 14
2.5 SUMMARY .......................................................... 18
CHAPTER 3 THE PROPOSED IMAGE-BASED RENDERING SYSTEMS ................ 20
3.1 INTRODUCTION ..................................................... 20
3.2 CONSTRUCTION OF THE PROPOSED IBR SYSTEMS ........................ 23
3.2.1 Still Camera System ............................................ 23
3.2.2 Moveable Camera System ......................................... 26
3.3 PRE-PROCESSING ................................................... 30
3.3.1 Still Camera System ............................................ 30
3.3.1.1 Camera Calibration ........................................... 30
3.3.1.2 Color-Tensor-based Segmentation and Matting .................. 33
3.3.2 Moveable Camera System ......................................... 36
3.3.2.1 Video Stabilization .......................................... 36
3.4 SUMMARY .......................................................... 44
CHAPTER 4 A NEW COMBINED SEGMENTATION-MUTUAL-INFORMATION (MI)-BASED ALGORITHM FOR DENSE DEPTH MAP ESTIMATION ........ 45
4.1 INTRODUCTION ..................................................... 45
4.2 COMBINED SEGMENTATION-MI-BASED DEPTH ESTIMATION ................. 46
4.2.1 Object Segmentation Using Level-Set Method ..................... 47
4.2.2 Mutual Information Matching .................................... 49
4.3 DEPTH MAP REFINEMENT ............................................. 54
4.3.1 Occlusion Detection and Inpainting ............................. 56
4.3.2 Smoothing of Depth Maps ........................................ 56
4.4 MUTUAL INFORMATION (MI)-BASED OBJECT TRACKING ................... 63
4.5 MORE RESULTS AND COMPARISON ...................................... 65
4.6 SUMMARY .......................................................... 69
CHAPTER 5 3D RECONSTRUCTION AND MODELING ............................. 71
5.1 INTRODUCTION ..................................................... 71
5.2 HOMOGENEOUS GEOMETRY ............................................. 74
5.3 POINT MATCHING IN THE STILL IBR SYSTEM ........................... 76
5.3.1 Epipolar Geometry .............................................. 76
5.3.2 Finding Correspondent Points ................................... 77
5.4 VIEW DEPENDENT TEXTURE MAPPING ................................... 82
5.5 POINT MATCHING AND REFINEMENT IN THE MOVEABLE IBR SYSTEM ........ 88
5.5.1 Structure-from-motion .......................................... 88
5.5.2 Point Cloud Generation and Refinement: KF-based Outlier Detection and Point Cloud Fusion ........ 90
5.6 RBF MODELING AND MESH GENERATION ................................. 98
5.7 SUMMARY .......................................................... 102
CHAPTER 6 CONCLUSION AND FUTURE RESEARCH ............................. 103
6.1 CONCLUSION ....................................................... 103
6.2 FUTURE RESEARCH .................................................. 105
APPENDIX I PUBLICATIONS .............................................. 107
REFERENCES ........................................................... 109
List of Figures
Figure 1-1  Spectrum of IBR representations [Chan 2010] ........ 2
Figure 2-1  Light field describes the amount of light in radiance along light rays traveling in every direction through every point in empty space [Ikeu 2012] ........ 10
Figure 2-2  Forward mapping ........ 16
Figure 2-3  Example renderings using (a) forward mapping in point rendering [Chan 2005], (b) layered representation (with two layers – dancer and background) [Chan 2009], (c) monolithic rendering using 3D polygonal mesh (left) and rendering results (right) [Zhu 2010] ........ 19
Figure 3-1  Plenoptic videos: multiple linear camera arrays of a 4D simplified dynamic light field with viewpoints constrained along line segments. The camera arrays were developed in [Chan 2009]; each consists of 6 JVC video cameras ........ 21
Figure 3-2  Circular camera array constructed ........ 24
Figure 3-3  Snapshots: (a) Buddha, (b) Dragon Vase ........ 25
Figure 3-4  Block diagram of the proposed IBR system ........ 26
Figure 3-5  The proposed moveable image-based rendering system ........ 27
Figure 3-6  Snapshots of the plenoptic videos at a given time instance: (a) the "Podium" outdoor video from camera 1 to camera 4 and (b) the "Presentation" indoor video from camera 1 to camera 4 ........ 29
Figure 3-7  Block diagram of the proposed M-IBR system constructed ........ 30
Figure 3-8  Relationship between the world coordinate and the camera coordinate ........ 31
Figure 3-9  Planar pattern ........ 33
Figure 3-10 (a) Extraction results using the color-tensor-based method. Left: original; middle: hard segmentation; right: after matting. (b) Close-up of the segmentations in (a). Left: hard segmentation; right: after matting ........ 36
Figure 3-11 Motion smoothing results for the horizontal (Translation-x) and vertical (Translation-y) directions. The original motion path and the smoothed motion paths with different methods are shown. In (a)-(b), the blue dotted lines correspond to the shaky original motion path; green and black lines correspond to the smoothed motion paths using the method in [Mats 2005] with a small and a large kernel size respectively ........ 43
Figure 3-12 Video stabilization results. The first row shows the original images captured by our system; the second row shows the stabilized images without video completion; the third row shows the completed results ........ 43
Figure 4-1  Segmentation results using the level-set-based tracking method. (a) is the initial segmentation obtained by lazy snapping; (b) is the initial segmentation obtained by the graph cut method ........ 48
Figure 4-2  A regular grid for local transformation ........ 51
Figure 4-3  (a) is an example depth map obtained by using MI matching without segmentation information; (b) shows the depth map obtained by using automatic segmentation MI matching; (c) shows the depth map obtained by using semi-automatic segmentation MI matching. Green areas in (c) are the occlusion areas detected by our algorithm. (d)-(e) show the refined depth maps of (c) by inpainting and smoothing (c) using SK-LPR-R-ICI and a 25×25 ideal low-pass filter, respectively ........ 55
Figure 4-4  (a) and (b) show the renderings obtained by Figs. 4-2(d) and (b); (c) and (d) are the enlargements of the red boxes ........ 59
Figure 4-5  Rendering results obtained by the proposed algorithm. (a) shows the depth maps corresponding to the images in (b). The highlighted images in (b) show the rendered views from the adjacent views in (b) using the depth maps in (a). (c) shows depth maps at other positions ........ 62
Figure 4-6  Example rendering results. The first row shows the original images captured by our M-IBR system. The second and third rows show renderings with a step-in ratio of about 1.15 to 1.25 times ........ 63
Figure 4-7  Object tracking at different time instances ........ 65
Figure 4-8  Teddy test images [Scha 2002] and depth maps for comparison. (a) LEFT image; (b) RIGHT image; (c) ground truth depth map; (d) depth map calculated by semi-automatic segmentation-based MI matching; (e) depth map calculated by automatic segmentation-based MI matching ........ 67
Figure 4-9  Results for the "conference" sequence. (a) and (c) are two sample frames. (b) and (d) are the depth maps of (a) and (c), respectively ........ 68
Figure 4-10 Ultrasound images of the RF muscle under relaxed condition and at a 50% maximal voluntary contraction (MVC) level, and the corresponding images with outlined boundary contours. The tracked boundaries are highlighted in green ........ 69
Figure 5-1  Epipolar geometry ........ 76
Figure 5-2  Feature point detection. The red points in (a) and (b) are the feature points. (a) is from the first camera; (b) is from the second camera ........ 80
Figure 5-3  Epipolar line ........ 81
Figure 5-4  Rectified images. (a) is the rectified left image; (b) is the rectified right image. (c) is part of (a); (d) is part of (b) ........ 81
Figure 5-5  An initial point cloud extracted, with noise and outliers ........ 82
Figure 5-6  View dependent texture mapping ........ 83
Figure 5-7  View dependent texture. Left: blurred texture. Right: texture after the proposed view dependent texture mapping ........ 85
Figure 5-8  3D models of ancient Chinese artifacts. (a) Dragon Vase, (b) Buddha, (c) Green Bottle, (d) Bowl, (e) Brush Pot, (f) Tri-Pot, (g) Wine Glass ........ 87
Figure 5-9  Rendering results of ancient Chinese artifacts ........ 88
Figure 5-10 Iterative refinement of the point cloud: (a) initial point cloud; (b) point cloud after outlier detection and Kalman filtering; (c) point cloud after the proposed iteration method ........ 91
Figure 5-11 (a)-(b) show the 3D to 2D re-projection at frame 20 and frame 21, respectively. Blue points are inliers. Green points are outliers detected by the segmentation consistency check. Red points are the outliers detected by the intensity and location consistency checks. (c) shows the enlargement of the highlighted area in (a). The point cloud is down-sampled for better visualization ........ 97
Figure 5-12 Convergence behavior of the root mean square distance (RMSD) versus the number of iterations for the proposed iterative 3D reconstruction algorithm. The blue line shows the RMSD values with the KF-based outlier detection; the red line shows the RMSD values without KF-based outlier detection ........ 97
Figure 5-13 3D reconstruction results (a) without using RBF, (b) using RBF without outlier detection and (c) using RBF with outlier removal ........ 100
Figure 5-14 Object-based rendering results of the "Podium" sequence using the estimated 3D model and shadow field under different lighting conditions ........ 101
Figure 5-15 Object-based rendering results of the "conference" sequence. (a) and (b) are the 3D reconstruction results at two time instances; (c) and (d) are the rendering results of (a) and (b). Note that only partial geometry of the dynamic object is recovered, since it is only partially observable ........ 101
List of Tables
Table 2-1 A taxonomy of plenoptic functions…….................... 12
Table 4-1  Comparison of the ranks using the standard threshold of 1 pixel on the Middlebury test stereo images ........ 67
List of Abbreviations
BRDF Bidirectional Reflectance Distribution Function
BP Belief Propagation
CLF Circular Light Field
CPU Central Processing Unit
CSA Cross-Sectional Area
DCP Disparity Compensation Prediction
DCT Discrete Cosine Transform
DSCs Digital Still Cameras
DSP Digital Signal Processing
Fig. Figure
fps frames per second
GC Graph Cut
GPU Graphic Processing Unit
HD High Definition
IBR Image-Based Rendering
i.i.d. independent identically distributed
ISKR Iterative Steering Kernel Regression
JVT Joint View Triangulation
KF Kalman Filter
LAN Local Area Network
L-BFGS Limited-memory Broyden-Fletcher-Goldfarb-Shanno
LDIs Layered Depth Images
LPR Local Polynomial Regression
LS Least Square
MCU Micro-Controller Unit
MI Mutual Information
M-IBR Moveable Image-Based Rendering
MRF Markov Random Field
MVC Maximal Voluntary Contraction
PCA Principal Component Analysis
pdf probability density function
PRT Pre-computed Radiance Transfer
QPP Quadratic Programming Problem
RANSAC RANdom SAmple Consensus
RBF Radial Basis Function
RF Rectus Femoris
R-ICI Refined Intersection of Confidence Intervals
RMSD Root-Mean Squared Distance
SCLF Simplified Circular Light Field
SFM Structure-From-Motion
SPIHT Set Partitioning In Hierarchical Trees
S-SFM Sequential-Structure-From-Motion
Chapter 1 Introduction
1.1 Background
Image-based rendering/representation (IBR) [Chen 1995], [Debe
1996], [Gort1996], [Levo 1996], [McMi 1995], [Pele 1997], [Szel 1997],
[Shad 1998], [Shum 1999] is a promising technology for rendering new
views of scenes from a collection of densely sampled images or videos.
It has potential applications in virtual reality, immersive television and
visualization systems. Central to IBR is the plenoptic function [Adel 1991], which describes the intensity of each light ray in the world as a function of visual angle, wavelength, time, and viewing position. The plenoptic function is thus a 7-dimensional function of the viewing position $(V_x, V_y, V_z)$, the azimuth and elevation angles $(\theta, \phi)$, time $t$, and wavelength $\lambda$. Traditional images and videos are just 2D and 3D special cases of the plenoptic function. In principle, one can reconstruct any view in space and time if a sufficient number of samples of the plenoptic function are available. The rendering of novel views can
therefore be viewed as the reconstruction of the plenoptic function from
its samples. Image-based representations are usually densely sampled
high dimensional data with large data sizes, but their samples are highly
correlated. Because of the multidimensional nature of image-based
representations and scene geometry, much research has been devoted to
the efficient capturing, sampling, rendering and compression of IBR.
Depending on the functionality required, there is a spectrum of IBR representations, as shown in Fig. 1-1. They differ from each other in the amount of geometry information of the scenes/objects being used. At one end of the spectrum, as in traditional texture mapping, we have very accurate
geometric models of the scenes and objects, say generated by animation techniques, and only a few images are required to generate the textures. Given the 3-D models and the lighting conditions, novel views can be
rendered using conventional graphic techniques. Moreover, interactive
rendering with moveable objects and light sources can be supported
using advanced graphic hardware.
Figure 1-1 Spectrum of IBR representations [Chan 2010].
At the other extreme, light field or lumigraph rendering relies on dense sampling (by capturing more images/videos) with no or very little geometry information for rendering, without recovering the exact 3-D models. An important advantage of the latter is its superior image quality, compared with 3-D model building for complicated real world scenes.
Another important advantage is that it requires much less computational
resources for rendering regardless of the scene complexity, because most
of the quantities involved are pre-computed or recorded. This has
attracted considerable attention in the computer graphics community
recently in developing fast and efficient rendering algorithms for real-
time relighting and soft-shadow generation [Agra 2000], [Ng 2004],
[Sloa 2002], [Zhou 2005].
Broadly speaking, image-based representations can be classified
according to the geometry information used into three main categories: 1)
representations with no geometry, 2) representations with implicit
geometry and 3) representations with explicit geometry. 2-D Panoramas,
McMillan and Bishop's plenoptic modeling [McMi 1995], 3-D
concentric mosaics and light field/lumigraph belong to the first category
and they can be viewed as the direct interpolation of the plenoptic
function. Layer-based and object-based representations [Chan 2009] and pop-up light field [Shum 2004] using depth maps fall into the second. Finally,
conventional 3-D computer graphic models and other more sophisticated
representations [Deve 1998], [Wang 2005] belong to the last category.
Although these representations also sample the plenoptic function,
further processing of the plenoptic function has been performed to infer
the scene geometry or surface properties such as the bidirectional reflectance distribution function (BRDF) of objects. Such an image-based modeling approach has emerged as a more promising approach to enrich the photorealism and user interactivity of IBR. Moreover, since 3-D models of the scenes are unavailable, conventional image-based representations are limited to changes of viewpoint and sometimes a limited amount of
relighting. Recently, it was found that real-time relighting and soft-
shadow computation are feasible using the IBR concepts and the
associated 3-D models using pre-computed radiance transfer (PRT)
[Sloa 2002] and precomputed shadow fields [Zhou 2005].
For multiple camera arrays, the huge amount of data and vast
amount of viewpoints to be provided present one of the major challenges
to IBR. Advanced algorithms for processing and manipulation of the
high dimensional representation to achieve such functions as
segmentation, depth estimation, object tracking, 3D reconstruction, etc.
are all major challenges to be addressed. Finally, the efficient
transmission, compression and display of dynamic IBR and models are
also urgent issues awaiting satisfactory solutions in order for IBR to establish itself as an essential medium for communication and presentation.
All of these motivate us to study the design and construction of the
image-based rendering systems based on plenoptic videos. The system
can potentially provide improved viewing freedom to users and the ability to cope with moving and static objects for 3D reconstruction.
1.2 Thesis Outline
This thesis is devoted to the design of image-based rendering systems and their associated algorithms, so as to provide improved viewing freedom and modeling of stationary and moveable objects in outdoor and indoor environments. The major contributions of
this thesis are summarized as follows:
1) The construction of a high resolution IBR system for capturing and
rendering of ancient Chinese artifacts and a moveable IBR system
for capturing and rendering indoor and outdoor objects.
2) Development of a novel mutual information (MI)-based algorithm
combined with segmentation for dense depth map estimation and
object tracking.
3) A 3D reconstruction algorithm for objects, which employs the
estimated dense depth maps to obtain dense point correspondences
from multiple views for 3D reconstruction.
Details of these contributions are briefly described below:
1) The first prototype system uses a multiple still camera array to
capture ancient Chinese artifacts. Because of the high resolution of
the still camera (Canon 550D), we can obtain excellent rendering
quality. This system can be used for digital preservation and
dissemination of cultural artifacts with high digital quality. To avoid
possible damage to the artifacts and speed up the capturing process,
we propose to employ the image-based approach instead of using
traditional 3D laser scanners. A circular array consisting of multiple
digital still cameras was therefore constructed in this work. Using
this circular camera array, we developed novel techniques for
rendering new views of the artifacts from the images captured using
the object-based approach. The multiple views so synthesized enable
the ancient artifacts to be displayed in modern multi-view displays.
A number of ancient Chinese artifacts from the University Museum
and Art Gallery at the University of Hong Kong were captured and
excellent rendering results were obtained. The second prototype
system uses a linear camera array consisting of 8 video cameras
(Sony HDR-TGIE) mounted on an electrically controllable wheel
chair. Its motion can be controlled manually or remotely by means of
additional hardware circuitry. Unlike previous multiple-camera systems, which are not designed to be moveable, so that their viewpoints are somewhat limited and they usually cannot cope with moving objects or perform 3D reconstruction of objects in an open environment, our moveable image-based rendering system can be used to render large environments and moving objects.
2) A new combined segmentation-mutual-information (MI)-based
algorithm for dense depth map estimation is presented. It relies on
segmentation, local polynomial regression (LPR)-based depth map
smoothing and an MI-based matching algorithm to iteratively estimate
the depth map. The method is very flexible and both semi-automatic
and automatic segmentations can be used. The semi-automatic and automatic versions rank fourth and sixth, respectively, in the Middlebury
comparison of existing depth estimation methods. Using the depth
maps captured and the object-based approach, high quality
renderings of outdoor scenes along the trajectory can be obtained, which considerably improves the viewing freedom. The mutual
information-based matching algorithm is also extended to object
tracking algorithms. It can be used to track the boundary of an object
in a video sequence. Experimental results show that its performance
is reliable even for noisy videos such as dynamic ultrasound images.
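To make the MI matching criterion concrete, the following sketch estimates the mutual information between two grayscale patches from a joint intensity histogram; the 32-bin quantization and 8-bit intensity range are illustrative assumptions, not the exact settings used in the thesis.

```python
import numpy as np

def mutual_information(patch_a, patch_b, bins=32):
    # Joint histogram of corresponding pixel intensities (assumes 8-bit input)
    joint, _, _ = np.histogram2d(patch_a.ravel(), patch_b.ravel(),
                                 bins=bins, range=[[0, 256], [0, 256]])
    pxy = joint / joint.sum()          # joint pdf estimate
    px = pxy.sum(axis=1)               # marginal pdf of patch_a
    py = pxy.sum(axis=0)               # marginal pdf of patch_b
    nz = pxy > 0                       # skip empty bins to avoid log(0)
    # I(A;B) = sum_{a,b} p(a,b) * log( p(a,b) / (p(a) p(b)) )
    denom = (px[:, None] * py[None, :])[nz]
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / denom)))
```

A depth hypothesis for a pixel can then be scored by the MI between a window around it and the correspondingly displaced window in the adjacent view, keeping the hypothesis that maximizes the score.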
3) Using the IBR systems, correspondences from different views can be
integrated together for 3D reconstruction. For both of the systems,
camera calibration is first used to determine the values of the internal and external parameters of the cameras. For the still camera array, a major technique for finding corresponding points is epipolar geometry, which constrains corresponding points to lie on conjugate epipolar lines. By combining epipolar lines with scale-invariant feature transform (SIFT) [Lowe 2004] feature detection, accurate sparse corresponding points can be located. Then a Gabor filter, which is rather insensitive to noise, is used to obtain dense corresponding points. For the moveable image-based rendering
(M-IBR) system, the sequential-structure-from-motion (S-SFM)
technique is adopted to estimate the locations of the M-IBR system
so as to obtain an initial, fairly reliable 3D point cloud from the
2D correspondences. New iterative Kalman filter (KF)-based and
segmentation-MI-based algorithms are proposed to fuse the
correspondences from different views and remove possible outliers
to obtain an improved point cloud. More precisely, the proposed
algorithm relies on the KF to track the correspondences across
different views so as to suppress possible outliers while fusing
correspondences from different views. With these reliable matched
points, the camera parameters and hence the image correspondences
can be further refined by re-projecting the updated correspondences
to successive views to serve as prior features/correspondences for
MI-based matching. By iterating these processes, an improved point
cloud with reliable correspondences can be recovered. Simulation
results show that the proposed algorithm significantly reduces the
adverse effect of the outliers and generates a more reliable point
cloud. To recover the 3D model from the improved point cloud, a new robust RBF-based modeling algorithm is proposed to further
suppress possible outliers and generate smooth 3D surfaces from the
raw 3D point cloud. Compared with the conventional RBF-based
smoothing, it is more robust and reliable. Finally, view dependent
texture is incorporated to enhance the final rendering effect.
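As a rough illustration of the sparse matching step above, the sketch below uses OpenCV's SIFT detector, Lowe's ratio test and RANSAC fundamental-matrix estimation to keep only correspondences consistent with the epipolar constraint; the file names, ratio threshold and RANSAC parameters are placeholders rather than the thesis's actual settings.

```python
import cv2
import numpy as np

img1 = cv2.imread("view1.png", cv2.IMREAD_GRAYSCALE)  # placeholder inputs
img2 = cv2.imread("view2.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Ratio-test matching in the spirit of [Lowe 2004]
matcher = cv2.BFMatcher()
matches = [m for m, n in matcher.knnMatch(des1, des2, k=2)
           if m.distance < 0.75 * n.distance]

pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# RANSAC fundamental-matrix estimation rejects matches violating the
# epipolar constraint x2^T F x1 = 0.
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
inliers1 = pts1[mask.ravel() == 1]
inliers2 = pts2[mask.ravel() == 1]
```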
This thesis is divided into six chapters. In Chapter 2, some background materials on IBR are briefly reviewed. They include the plenoptic function, the light field and rendering techniques. In Chapter 3, the
design and construction of two IBR systems are presented. Some
pre-processing techniques for capturing the ancient Chinese artifacts
including camera calibration and color-tensor-based segmentation
are also introduced. Chapter 4 is devoted to a new combined segmentation-MI-based depth estimation algorithm. The 3D reconstruction and modeling algorithms are presented in Chapter 5. Two different point matching algorithms are studied first, and then an RBF modeling algorithm is proposed for mesh generation. After that, a view dependent texture mapping method for improving the rendering quality is presented. Finally, conclusions and future research topics are given in Chapter 6.
Chapter 2 Review of Basic Topics in Image-Based
Rendering
2.1 Introduction
In this chapter, the fundamental topics in image-based rendering are
briefly reviewed. In Section 2.2, the plenoptic function and its history are introduced. The theory of the light field is discussed in Section 2.3. Section 2.4 is devoted to the rendering techniques in IBR.
2.2 Review of Plenoptic Function
The plenoptic function was proposed by Adelson and Bergen [Adel 1991]. It is a function parameterized by visual angle, wavelength, time and viewing position that describes the intensity of each light ray in the world. All the information captured by an optical sensor can be depicted by this function. The plenoptic function is a 7-dimensional (7D) function consisting of a 3-dimensional (3D) position, a 2-dimensional visual angle, wavelength and time.
Sampling and processing of the plenoptic function were main research topics in early computer vision. For example, object motion can be described by the derivatives of the plenoptic function with respect to position and time. Because the wavelength is usually represented by the red, green and blue channels in digital image processing, the plenoptic functions of images and videos can be simplified into 2D and 3D special cases. Theoretically, if the sampling rate is sufficiently high, novel views at intermediate positions can be recovered from the samples. Algorithms that try to solve this problem are usually called image-based rendering.
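As a toy illustration of rendering as plenoptic-function reconstruction, the sketch below approximates a novel in-between view by blending the two nearest captured views; practical light field renderers interpolate per ray and use depth to correct parallax, rather than blending whole images.

```python
import numpy as np

def interpolate_view(img_left, img_right, alpha):
    # alpha in [0, 1] is the normalized position of the virtual viewpoint
    # between the two cameras. Ignoring parallax is only reasonable when
    # the views are densely sampled, which is exactly the IBR premise.
    return (1.0 - alpha) * img_left.astype(np.float32) \
         + alpha * img_right.astype(np.float32)
```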
Figure 2-1: Light field describes the amount of light in radiance along light rays
traveling in every direction through every point in empty space [Ikeu 2012].
Because the plenoptic function also describes the geometry and
surface properties, many algorithms are proposed to integrate the
geometry and surface information into image-based rendering to
improve user interaction and reduce the amount of samples required.
The capturing, sampling, rendering and processing of the plenoptic
function are all important research topics in IBR and related applications
such as computational photography, 3D/multiview videos and displays,
etc.
2.2.1 Basic Theory
The 7D plenoptic function is usually defined as $l(V_x, V_y, V_z, \theta, \phi, \lambda, t)$, where $(V_x, V_y, V_z)$ is the viewing position, $\theta$ and $\phi$ are the elevation and azimuth angles respectively, as shown in Fig. 2-1, and $\lambda$ and $t$ denote the wavelength and time respectively. By employing different parameterizations and simplifications, different image-based rendering algorithms can be derived from the plenoptic function.
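For exposition only, the dimensional hierarchy can be summarized with illustrative type signatures (a practical system stores sampled arrays rather than continuous functions; the names below are not from the thesis):

```python
from typing import Callable

# Full 7D plenoptic function l(Vx, Vy, Vz, theta, phi, lambda, t)
Plenoptic7D = Callable[[float, float, float, float, float, float, float], float]

# Static scene with wavelength folded into RGB channels: 5D plenoptic
# modeling l(Vx, Vy, Vz, theta, phi)
Plenoptic5D = Callable[[float, float, float, float, float], float]

# Static scene with radiance constant along rays: 4D light field l(u, v, s, t)
LightField4D = Callable[[float, float, float, float], float]

# Fixed viewpoint: 2D panorama l(theta, phi)
Panorama2D = Callable[[float, float], float]
```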
There are several camera systems that are usually used in image-based rendering for capturing. For a static scene, one camera can be rotated around the camera centre at a given position $V$ with different elevation and azimuth angles. The plenoptic function is then simplified to a panorama $l(\theta, \phi; V)$. A spherical camera array can provide another panoramic representation, because the captured images can be projected onto a cylinder. If a multiple video camera array is employed instead, a panoramic video can be obtained, and the plenoptic function is simplified to a 3D panorama $l(\theta, \phi, t; V)$ for dynamic scenes. The close relationship between the plenoptic function and image-based rendering is due to McMillan and Bishop [McMi 1995], who proposed plenoptic modeling using the 5D complete plenoptic function $l(V_x, V_y, V_z, \theta, \phi)$ for static scenes.
In a static scene, the radiance along rays is constant. Therefore the plenoptic function can be re-written as a 4D function, which is called the light field [Levo 1996] or lumigraph [Gort 1996] in computer graphics. The set of light rays in a 4D static light field can be parameterized in many ways; for example, the two-plane-based parameterization is usually used. By adding time to the static light field, a 5D plenoptic function can be obtained. The lumigraph incorporates depth maps into image-based rendering to improve the rendering quality, which can produce more accurate representations. In [Shum 1999], an outward facing camera moving on a circle was used to capture a series of densely sampled images.
A commonly used parameterization is the two-plane
parameterization, where a light ray in the light field is parameterized as
its intersections or coordinates with two parallel planes. These rays can
be captured by taking a series of pictures on a 2D rectangular plane, which results in an array of images. The light field concept can be similarly extended to time-varying or dynamic scenes, which results in a 5D function. The lumigraph is different from the light field, because geometry in the form of depth maps is used to improve the rendering quality, which can produce more sophisticated representations in image-based modeling. In [Shum 1999], a set of densely sampled images was captured by an outward facing camera moving on a circle; the resulting representation is called concentric mosaics. This system can render new views inside the circle. Some simplified systems have since been proposed to reduce the complexity, such as restricting the camera locations to line segments [Zitn 2004], [Chan 2005], [Chan 2009]. For time-varying or dynamic scenes, similar representations can be used. Because the light may change at different viewing positions and times, it has to be captured continuously, which can be done by video camera arrays. For static scenes, the light directions can be recorded first. Then one can relight the rendering
with arbitrary lightings. A brief summary of these plenoptic function representations is given in Table 2-1 [Ikeu 2012].

Table 2-1: A taxonomy of plenoptic functions.

Dimension   Year   View space        Name
7           1991   Free              Plenoptic function
5           1995   Free              Plenoptic modeling
4           1996   Bounding box      Light field / Lumigraph
3           1999   Bounding circle   Concentric Mosaics
2           1994   Fixed point       Cylindrical/Spherical panorama
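The following sketch shows how a stored two-plane light field might be sampled, assuming the data is a 4D array indexed by camera-plane coordinates (u, v) and image-plane coordinates (s, t); nearest-neighbour rounding stands in for the quadrilinear interpolation a real renderer would perform.

```python
import numpy as np

def sample_light_field(lf, u, v, s, t):
    # lf is assumed to be a 4D array indexed as lf[u, v, s, t]:
    # (u, v) selects the camera on the capture plane and (s, t) the
    # pixel on the image plane.
    ui, vi = int(round(u)), int(round(v))
    si, ti = int(round(s)), int(round(t))
    return lf[ui, vi, si, ti]

# A novel ray is rendered by intersecting it with both planes to obtain
# (u, v, s, t) and then sampling/interpolating the stored images there.
```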
2.3 Review of Light Field
The light field was first introduced in a paper by A. Gershun [Gers 1939] for studying surface illumination by artificial lighting. A similar concept was introduced to the computer graphics community as the light field in [Levo 1996] and the lumigraph in [Gort 1996]. The motivation is to render new views or images of objects or scenes from densely sampled images previously taken, so as to avoid building or capturing complicated 3D models. Light field or lumigraph rendering is a special representation in image-based rendering, requiring either no geometry [Levo 1996] or limited geometry in terms of depth maps [Gort 1996]. The light field and lumigraph are four-dimensional (4D) simplifications of the plenoptic function for static scenes.
2.3.1 Creating/capturing light field
A light field can be created by rendering 3D models with computer graphics or by capturing real objects with camera arrays. For real and static scenes, a light field can be captured by one still camera controlled by a mechanical arm, as in lumigraph rendering [Bueh 2001]. In [Adel 1991], [Ng 2005], a lenticular lens array was used to capture the light field. In [Veer 2007], [Lian 2008], a coded aperture, which can map rays from different directions to nearby pixels in the sensor array, was used to record images. Each of these images contains a set of pixels recording light from different directions. Novel views can be estimated by combining these 4D samples of the light field. In [Ng 2005], a microlens array was placed in front of the sensor of a handheld digital camera. The images
captured in this way can be refocused after they have been taken. A light field video can be obtained in a similar way.
Multiple camera systems are usually used to achieve large disparity in dynamic scenes, and much research effort has been devoted to the construction of 2D camera arrays. To simplify the capturing hardware, light fields captured on line segments and circular arcs, as mentioned in Section 2.2, have also been reported.
2.4 Review of Rendering Techniques
Rendering is the process of creating a new view from several images and other auxiliary information obtained from the representations. In the early stage of image-based rendering, no geometry information was employed. Image blending in panoramas [Chen 1995] and ray space interpolation in light fields [Levo 1996] were used for image rendering. In ray space interpolation, each ray that goes through a target pixel is mapped to nearby sampled rays. Since some sophisticated representations
use more geometry information such as layered depth images [Shad
1998], surface light field [Wood 2000], and pop-up light field [Shum
2004], graphics hardware has been exploited to accelerate the rendering
process. The geometry information can either be implicit, relying on positional correspondences, or explicit, in the form of depth along known lines-of-sight or 3D coordinates. Representations of the former
usually involve weakly calibrated cameras and rely on image
correspondences to render new views, say by triangulating two reference
images into patches according to the correspondences as in joint view
triangulation (JVT) [Lhui 2003]. These include view interpolation, view
morphing, JVT and transfer methods with fundamental matrices and
trifocal tensors. Representations employing explicit geometry include
sprites, relief textures, Layered Depth Images (LDIs), view-dependent
texture, surface light field, pop-up light field, shadow light field, etc.
In general, the rendering methods can be broadly classified into
three categories: 1) point-based, 2) layer-based, and 3) monolithic.
Point-based rendering works on 3D point clouds or point
correspondences and typically each point is rendered independently.
Points are mapped to the target image plane through forward mapping.
For the 3D point $X$ in Fig. 2-2, the mapping can be written as

$\lambda_t x_t = P_t (X - C_t), \quad \lambda_r x_r = P_r (X - C_r)$,   (2-4-1)

where $x_t$ and $x_r$ are homogeneous coordinates of the projection of $X$ on the target screen and reference image, respectively, $C$ and $P$ are the camera centre and projection matrix respectively, and $\lambda$ is a scale factor. Since $C_t$, $P_t$ and the focal length $f_t$ are known for the target image, $\lambda_t$ can be computed using the depth of $X$. Given $x_r$ and $\lambda_r$, one can compute the exact position of $x_t$ on the target screen and transfer the color accordingly. Gaps or holes may exist due to magnification, and disocclusion and splatting techniques have been proposed to solve this problem. The painter's algorithm is frequently used to avoid the problem that multiple pixels from the reference view are mapped to the same pixel in the target image.
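A minimal sketch of forward mapping under Eq. (2-4-1), assuming $P_t$ is the target camera's 3×3 projection matrix and $C_t$ its centre, both known from calibration:

```python
import numpy as np

def forward_map(X, P_t, C_t):
    # lambda_t * x_t = P_t (X - C_t); the third homogeneous component is
    # the scale factor lambda_t, proportional to the depth of X.
    x = P_t @ (X - C_t)
    lam = x[2]
    return x / lam, lam   # homogeneous pixel (u, v, 1) and the scale factor

# Point-based rendering forward-maps every reconstructed point this way;
# splatting then fills the gaps caused by magnification.
```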
Layered techniques usually separate the scene into a group of
planar layers consisting of a 3D plane with texture and optionally a
transparency map. The layers can be thought of as a continuous set of
polygonal models, which are amenable to conventional texture mapping
and view-dependent texture mapping. Usually, each layer is rendered
using either point-based rendering or polygon meshes as in monolithic rendering
techniques before being composed in the back-to-front order using the
painter’s algorithm to produce the final view. Layer-based rendering can
be implemented easily using a graphics processing unit (GPU). Since the rendering of IBR requires very low complexity, it is even possible to perform the calculation using the central processing unit (CPU) by working on individual layers or objects [Chan 2009].
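The back-to-front composition can be sketched as follows, assuming each layer has already been rendered into a color image and an alpha (transparency) map, and that the list is sorted from farthest to nearest:

```python
import numpy as np

def composite_layers(layers):
    # layers: list of (color, alpha) pairs, back to front; color is HxWx3
    # float in [0, 1] and alpha is HxW float in [0, 1].
    h, w, _ = layers[0][0].shape
    out = np.zeros((h, w, 3), dtype=np.float32)
    for color, alpha in layers:
        a = alpha[..., None]               # broadcast alpha over RGB
        out = a * color + (1.0 - a) * out  # standard "over" compositing
    return out
```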
Monolithic rendering usually represents the geometry as continuous
polygon meshes with textures, which can be readily rendered using
Figure 2-2: Forward mapping.
graphics hardware. The 3D model normally consists of vertices, normals
of vertices, faces, and texture mapping coordinates. The data can be
stored in a variety of data formats. The most popular formats
are .obj, .3ds, .max, .stl, .ply, .wrl, .dxf, etc.
Relighting, shadow generation and interactivity have played an
increasingly important role in 3D interactive rendering. The most
popular algorithms are shadow mapping, shadow volume, ray-tracing,
pre-computed radiance transfer, pre-computed shadow field, etc. Some
of them have better rendering quality, while others are more efficient for
real time rendering. Thanks to the development of GPUs, basic lighting
and shading algorithms like shadow mapping and shadow volume have
been realized on the fly. Modern GPUs can even offer programmable
rendering pipelines for customized rendering effects and “shader ” is a
set of software instructions running on these GPUs to control the
pipelines. Using shader programming, high quality shadow rendering
algorithms like the precomputed shadow field can be done in real time. Fig. 2-3 shows example renderings of the three techniques.
Though there has been substantial progress in capturing,
representing, rendering and modeling scenes, the ability to handle
general complex scenes remains challenging for IBR. A lot of work is
still required to ensure robustness in handling reflection, translucency, highlights, depth estimation, capturing complexity, object manipulation,
etc. Interacting with IBR representations remains challenging because
IBR uses images for rendering. Recent approaches have been focused on
using advanced computer vision techniques, such as stereo/multiview
vision and photometric stereo, and depth sensing devices to extract more
geometry information from the scene so as to enhance the functionalities
of IBR representations. While there has been considerable progress in
relighting and interactive rendering of individual real static objects, such
operations are still difficult for real and complicated scenes. For
dynamic scenes, the huge amount of data and vast amount of viewpoints
to be provided present one of the major challenges to IBR. Advanced
algorithms for processing and manipulation of the high dimensional
representation to achieve such functions as object extraction, model
completion, scene inpainting, etc. are all major challenges to be
addressed. Finally, the efficient transmission, compression and display
of dynamic IBR and models are also urgent issues awaiting satisfactory solutions in order for IBR to establish itself as an essential medium for communication and presentation. All of these motivate us to
study the design and construction of new image-based rendering systems
based on plenoptic videos. The system can potentially provide improved
viewing freedom to users and ability to cope with moving and static
objects and perform 3D reconstruction.
2.5 Summary
In this chapter, the basic topics in image-based rendering have been reviewed. The plenoptic function, which serves as an important concept for describing visual information in our world, was introduced. Then a brief review of the light field was given. How to achieve high quality rendering and display of light fields with a wide range of viewing positions in large-scale environments will be studied in Chapters 3 and 4. Finally, some rendering techniques, including point-based, layer-based, and monolithic methods, were discussed. An extension of these rendering techniques will be further studied in Chapter 5.
Figure 2-3: Example renderings using (a) forward mapping in point rendering [Chan
2005], (b) layered representation (with two layers – dancer and background) [Chan
2009], (c) monolithic rendering using 3D polygonal mesh (left) and rendering results
(right) [Zhu 2010].
Chapter 3 The Proposed Image-based Rendering
Systems
3.1 Introduction
Both IBR systems are based on the simplified light field. As
mentioned earlier, two IBR systems are constructed and studied in this
thesis, one for capturing and rendering ancient Chinese artifacts and the
other for environmental modeling. They belong to the general class of
image-based representations. Since capturing 3D models in real time is still a very difficult problem, light field- or lumigraph-based dynamic IBR representations with a small amount of geometry information have received considerable attention in immersive TV (also called 3D or multi-view TV) applications. Because of the multidimensional nature of
the plenoptic function and the scene geometry, much research has been
devoted to the efficient capturing, sampling, rendering and compression
of IBR. There has been considerable progress in these areas since the pioneering work on the lumigraph by Gortler et al. [Gort 1996] and the light field by Levoy and Hanrahan [Levo 1996]. Other IBR representations include the 2D panorama [Szel 1997], [Pele 1997], Chen and Williams' view interpolation [Chen 1993], McMillan and Bishop's plenoptic modeling [McMi 1995], layered depth images [Shad 1998] and the 3D concentric mosaics [Shum 1999], etc. Motivated by the light field and lumigraph, the
predecessors in the author's lab have developed a real-time system for
capturing and rendering a simplified dynamic light field called the
“plenoptic videos” [Chan 2003], [Chan 2004], [Chan 2005], [Chan
2009], [Gan2005] with four dimensions. It is a simplified dynamic light
field, where videos are taken along line segments as shown in Fig. 3-1,
instead of a 2D plane, to simplify the capturing hardware for dynamic
scenes.
Figure 3-1: Plenoptic videos: Multiple linear camera array of 4D simplified dynamic
light field with viewpoints constrained along line segments. The camera arrays
developed at [Chan 2009]. Each consists of 6 JVC video cameras.
Pioneering projects in the cultural heritage preservation of large-scale structures and sculptures include the Digital Michelangelo Project [Levo 2002], the 3D facial reconstruction and visualization of ancient Egyptian mummies [Atta 1999] and the Great Buddha Project [Ikeu 2003], to name just a few. To avoid possible damage to the ancient artifacts and speed up the capturing process, we propose to employ the image-based
approach instead of using 3D laser scanners. A circular array consisting
of multiple digital still cameras (DSCs) was therefore constructed in this
thesis to capture the simplified light field of the ancient artifacts along
circular arcs, which we shall call the simplified circular light field
(SCLF) or circular light field (CLF) in short. The circular array is chosen
to provide users with a better visual experience, because it supports fly-over effects and close-ups of the artifacts uniformly in the angular domain.
We also developed novel techniques for rendering new views of the
ancient artifacts from the images captured using the object-based
approach. The details will be discussed later in Chapters 4 and 5. A
number of ancient Chinese artifacts from the University Museum and
Art Gallery at The University of Hong Kong were captured and
excellent rendering results in ordinary as well as 3D/multiview displays
were achieved. The proposed IBR system and associated algorithms serve as a framework for the cultural preservation of medium-sized ancient artifacts.
While a considerable number of IBR systems have been proposed previously, few of them are moveable. Therefore, another objective of this thesis is to design a moveable IBR system for modeling objects in outdoor environments. The proposed moveable IBR system uses a linear camera array consisting of 8 video cameras mounted on an electrically controllable wheel chair. Its motion can be controlled manually or remotely by means of additional hardware circuitry. Unlike previous multiple-camera systems, which are not designed to be moveable, so that their viewpoints are somewhat limited and they usually cannot cope with moving objects or perform 3D reconstruction of objects in an open environment, our moveable image-based rendering system can be used to render large environments and moving objects. In particular, the
system supports object-based rendering and 3D reconstruction capability
and consists of two main components. 1) A novel view synthesis
algorithm using a new segmentation and mutual-information (MI)-based
algorithm for dense depth map estimation, which relies on segmentation,
LPR-based depth map smoothing and an MI-based matching algorithm to
iteratively estimate the depth map. The method is very flexible and both
semi-automatic and automatic segmentation methods can be employed.
They rank fourth and sixth, respectively, in the Middlebury comparison
of existing depth estimation methods. This allows high quality
renderings of outdoor scenes with improved mobility/freedom to be
obtained. 2) A new 3D reconstruction algorithm which utilizes
sequential-structure-from-motion (S-SFM) technique and the dense
depth maps estimated previously. It relies on a new iterative point cloud
refinement algorithm based on Kalman filter (KF) for outlier removal
and the segmentation-MI-based algorithm to further refine the
correspondences and the projection matrices. The mobility of our system allows us to recover the 3D models of static objects more conveniently from the improved point cloud using a new robust radial basis function
(RBF)-based modeling algorithm to further suppress possible outliers
and generate smooth 3D meshes of objects. Experimental results show
that the proposed 3D reconstruction algorithm significantly reduces the
adverse effect of the outliers and produces high quality renderings using
shadow light field and the model reconstructed. The details will be
discussed later in Chapters 4 and 5.
The rest of this chapter is devoted to the general design and
construction of the systems. More precisely, Section 3.2 is devoted to the
design and configuration of the IBR systems. Section 3.3 presents some
pre-processing including camera calibration, color-tensor-based
segmentation and matting. Finally, conclusions are drawn in Section 3.4.
3.2 Construction of the Proposed IBR systems
3.2.1 Still Camera System
As mentioned previously in Section 3.1, the first prototype system
consists of an array of 13 Canon 550D cameras mounted on a camera
stand. The images/videos are captured and then processed and viewed on multiview TVs. A circular array is chosen to provide users with a better visual experience, because it emulates "fly over" and "rotate" kinds of special effects. Fig. 3-2 shows the proposed capturing
system. Fig. 3-3 shows some snapshots captured by this system called
Buddha and Dragon Vase. The resolution of these images is 3465×2304.
The operation flow is illustrated in Fig. 3-4. First, the objects are captured by this system from different angles. Then the objects are segmented using the color tensor, which is insensitive to shadows and shading. Natural matting can be adopted to improve the rendering quality when objects are composited onto other backgrounds. From the segmented objects, approximate geometry information for each object can be estimated by point-based matching for rendering and 3D reconstruction. Finally, other rendering techniques such as shadow field relighting and view dependent texture mapping are added in the rendering. The details of these algorithms will be discussed in the rest of the current chapter and the next chapter.
Figure 3-2: Circular camera array constructed.
Figure 3-3: Snapshots: (a) Buddha, (b) Dragon Vase.
Figure 3-4: Block diagram of the proposed IBR system.
3.2.2 Moveable Camera System
The second moveable IBR (M-IBR) system consists of a linear
array of cameras mounted on an electrically controllable wheel chair so
as to cope with moving objects in large environments and hence improve
the viewing freedom of users. Fig. 3-5 shows the moveable IBR system
that we have constructed. It consists of a linear array of 8 Sony HDR-
TGIE high definition (HD) video cameras which is mounted on a
FS122LGC wheel chair.
The motion of the wheel chair is originally controlled manually through a VR2 joystick and power controller modules from PG Drives Technology [PGDT]. To make it electronically controllable, we
examined the output of the joystick and generated the (x-,y-) motion
control voltages to the power controller using a Devasys USB-I2C/IO
[USBI] micro-controller unit (MCU). By appropriately controlling these
voltages, we can control the motion of the wheel chair electronically.
Figure 3-5: The proposed moveable image-based rendering system.
Moreover, by using the wireless LAN of a portable notebook mounted
on the wheel chair, its motion can be controlled remotely. By improving
the mobility of the IBR capturing system, we are able to cope with
moving objects in large environments.
The HD videos are captured in real time onto the storage cards of the camcorders. They can be downloaded to a PC for further processing such as calibration, depth estimation, and rendering using the object-based approach. For real-time transmission, the camcorders are equipped with a composite video output, which can be further compressed and transmitted. To illustrate the concept of multiview conferencing, a ThinkSmart IVS-MV02 intelligent video surveillance system [IVS] was used to compress the (320×240) 30 frames/sec videos online, which can be retrieved remotely through the wireless LAN for viewing or further
processing. The system is built around an Analog Devices DSP and performs real-time compression at a bit rate of 400 kbps.
Before the cameras can be used for depth estimation, they must be calibrated to determine their intrinsic parameters as well as their extrinsic parameters, i.e. their relative positions and poses. This can be accomplished by using a sufficiently large checkerboard calibration pattern. We follow the plane-based calibration method [Zhan 2000] to determine the projective matrix of each camera, which connects the world coordinates and the image coordinates. The projection matrix of a camera allows a 3D point in the world coordinate system to be mapped to the corresponding 2D coordinate in the image captured by that camera. This will facilitate depth estimation. Fig. 3-6 shows snapshots of the outdoor and indoor videos captured by the proposed system, called "podium" and "presentation", respectively. The resolution of these real-scene videos is 1920×1080i at 25 frames per second (fps) in 24-bit RGB format. The system flow of the proposed moveable IBR system is summarized in Fig. 3-7. Firstly, we need to stabilize the videos to reduce the shaky motion frequently encountered in typical moveable IBR systems. Then, a novel view synthesis algorithm using a new segmentation and mutual-information (MI)-based algorithm for dense depth map estimation is used to iteratively estimate the depth map. Finally, we reconstruct the 3D model using a new 3D reconstruction algorithm which utilizes the sequential-structure-from-motion (S-SFM) technique and the dense depth maps estimated previously. A new robust radial basis function (RBF)-based modeling algorithm is used to further suppress possible outliers and generate smooth 3D meshes of objects.
(a)
(b)
Figure 3-6: Snapshots of the plenoptic videos at a given time instance: (a) the "Podium" outdoor video from camera 1 to camera 4; (b) the "Presentation" indoor video from camera 1 to camera 4.
[Figure 3-7 blocks: video stabilization → segmentation-MI-based depth estimation → depth map refinement → image-based rendering and 3D reconstruction.]
Figure 3-7: Block diagram of the proposed M-IBR system constructed.
3.3 Pre-Processing
3.3.1 Still Camera System
In order to speed up the whole processing procedure of the proposed still camera system, some pre-processing needs to be done at the start. First, all the cameras need to be calibrated. Because this system is static, the intrinsic and extrinsic parameters of the cameras can be obtained precisely by following the plane-based calibration method. The proposed still camera system focuses only on the objects of interest, which are segmented out of the images to reduce the noise from the background.
3.3.1.1 Camera Calibration
In computer vision, the link between 3D real-world points and image pixels is the camera parameters, which comprise the extrinsic parameters and the intrinsic parameters. Estimation of the extrinsic and intrinsic parameters is called camera calibration [Truc
1998]. The extrinsic parameters define the transformation between the camera reference frame and the world reference frame. A 3D translation vector T and a 3 × 3 rotation matrix R are used to represent the extrinsic parameters [Truc 1998]. The relationship (see Fig. 3-8) between a point in the world frame and the corresponding point in the camera frame is

$$\mathbf{P}_c = \mathbf{R}(\mathbf{P}_W - \mathbf{T}). \qquad (3\text{-}3\text{-}1)$$

Figure 3-8: Relationship between the world coordinate and the camera coordinate.

The intrinsic parameters are defined in the form of the camera matrix C:

$$\mathbf{C} = \begin{bmatrix} f_x & d & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}, \qquad (3\text{-}3\text{-}2)$$

where $f_x$ and $f_y$ represent the focal length of the camera in the x and y directions, $c_x$ and $c_y$ are the coordinate values of the principal point, and d is the skew parameter, which is zero for pinhole cameras. Since the camera type is uncertain, the skew parameter is retained in the model. By combining the extrinsic parameters and intrinsic parameters, the perspective projection equation becomes

$$\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} = \mathbf{C}\mathbf{R}\,[\,\mathbf{I} \mid -\mathbf{T}\,]\begin{bmatrix} X_W \\ Y_W \\ Z_W \\ 1 \end{bmatrix}, \qquad (3\text{-}3\text{-}3)$$

where $(x_c, y_c, z_c)$ is the point in the image coordinate system and $(X_W, Y_W, Z_W, 1)$ is the point in the world coordinate system; both coordinate systems are expressed in homogeneous coordinates. By defining the projective matrix P as

$$\mathbf{P} = \mathbf{C}\mathbf{R}\,[\,\mathbf{I} \mid -\mathbf{T}\,], \qquad (3\text{-}3\text{-}4)$$

where P is a 3 × 4 matrix, equation (3-3-3) can be rewritten as

$$\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} = \mathbf{P}\begin{bmatrix} X_W \\ Y_W \\ Z_W \\ 1 \end{bmatrix}. \qquad (3\text{-}3\text{-}5)$$

Camera calibration is to estimate the matrix P. Zhang has proposed an algorithm for camera calibration using planar patterns [Zhan 1999]. The planar pattern is usually chosen as a chessboard-like plane, as shown in Fig. 3-9, which is used in our system.
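To make the use of P concrete, the following minimal numpy sketch builds P via (3-3-4) and projects a world point to pixel coordinates via (3-3-5). The values of C, R, and T are purely illustrative, not the parameters of our actual cameras:

    import numpy as np

    # Illustrative intrinsics (3-3-2): focal lengths, principal point, zero skew.
    C = np.array([[1200.0,    0.0, 640.0],
                  [   0.0, 1200.0, 480.0],
                  [   0.0,    0.0,   1.0]])
    R = np.eye(3)                   # camera axes aligned with the world frame
    T = np.array([0.0, 0.0, -2.0])  # camera centre 2 units behind the origin

    # Projective matrix P = C R [I | -T]   (equation (3-3-4))
    P = C @ R @ np.hstack([np.eye(3), -T.reshape(3, 1)])

    X_W = np.array([0.1, 0.2, 1.0, 1.0])     # homogeneous world point
    x_c = P @ X_W                            # equation (3-3-5)
    u, v = x_c[0] / x_c[2], x_c[1] / x_c[2]  # pixel coordinates after dehomogenization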
Figure 3-9: Planar pattern.
The basic procedure of Zhang's algorithm is:
1. Take a few images of the test pattern in different orientations.
2. Detect the feature points in the test images (often the corners).
3. Estimate the five intrinsic parameters (no skew parameter) and the extrinsic parameters by using the closed-form solutions.
4. Estimate the radial distortion by solving a linear least-squares problem.
5. Refine all the parameters by minimizing error functions.
In this work, the plane-based algorithm is changed slightly to fit our situation. The skew parameter is added, and the distortion is not estimated at first; both the radial and the tangential distortion are then estimated.
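As an illustration of this plane-based procedure, the sketch below uses OpenCV's implementation of Zhang's method. The board size and image file pattern are placeholder assumptions, and the exact skew/distortion handling described above is not reproduced:

    import glob
    import cv2
    import numpy as np

    board = (9, 6)  # inner-corner count of the chessboard (placeholder)
    objp = np.zeros((board[0] * board[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:board[0], 0:board[1]].T.reshape(-1, 2)  # planar, Z = 0

    obj_pts, img_pts = [], []
    for fname in glob.glob('calib_*.png'):          # a few different orientations
        gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
        found, corners = cv2.findChessboardCorners(gray, board)
        if found:
            obj_pts.append(objp)
            img_pts.append(corners)

    # Closed-form initialization followed by nonlinear refinement; the default
    # distortion model includes both radial and tangential terms.
    rms, C, dist, rvecs, tvecs = cv2.calibrateCamera(
        obj_pts, img_pts, gray.shape[::-1], None, None)
    print('reprojection RMS error:', rms)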
3.3.1.2 Color-Tensor-based Segmentation and Matting
The first step in processing these images is to segment the objects from the background. In the still camera system, we employ the photometric invariant features [Weij 2006] to extract the foreground from the
monochromatic screen background. More precisely, the color tensor
describes the local orientation of a color vector f ( x, y) as:
$$\mathbf{T}(x, y) = \begin{bmatrix} \mathbf{f}_x^T \mathbf{f}_x & \mathbf{f}_x^T \mathbf{f}_y \\ \mathbf{f}_y^T \mathbf{f}_x & \mathbf{f}_y^T \mathbf{f}_y \end{bmatrix}, \qquad (3\text{-}3\text{-}6)$$
where $\mathbf{f}(x, y)$ is a vector which contains the color component values at position $(x, y)$, and the subscripts x and y in $\mathbf{f}_x(x, y)$ and $\mathbf{f}_y(x, y)$ denote respectively the derivatives of $\mathbf{f}(x, y)$ with respect to x and y, the image coordinates. According to [Weij 2006], the color vector can be seen as a weighted sum of two component vectors, $\mathbf{c} = [R, G, B]^T = e(m_b \mathbf{c}_b + m_i \mathbf{c}_i)$, where $\mathbf{c}_b$ is the color vector of the body reflectance, $\mathbf{c}_i$ is the color vector of the interface reflectance (i.e. specularities or highlights), $m_b$ and $m_i$ are scalars representing the corresponding magnitudes of reflection, and e is the intensity of the light source. Thus,

$$\mathbf{f}_x = [R_x, G_x, B_x]^T = e m_b \mathbf{c}_{b_x} + (e_x m_b + e m_{b_x})\,\mathbf{c}_b + (e_x m_i + e m_{i_x})\,\mathbf{c}_i, \qquad (3\text{-}3\text{-}7)$$

which suggests that the spatial derivative is a sum of three weighted vectors, successively caused by body reflectance, shading-shadow and specular changes. For matte surfaces, the intensity of the interface reflectance is zero (i.e. $m_i = 0$) and the projection of the spatial derivative $\mathbf{f}_x$ on the shadow-shading axis is the shadow-shading variant, containing all the energy which can be explained by changes due to shadow and shading. The shadow-shading axis direction is $\mathbf{c}_b$, which is parallel to $\mathbf{f} = e m_b \mathbf{c}_b$ for matte surfaces. So the projection $\mathbf{s}_1$ of the spatial derivative $\mathbf{f}_x$ on the shadow-shading axis is
$$\mathbf{s}_1 = \big(\mathbf{f}_x^T\, \mathbf{f}/\|\mathbf{f}\|\big)\, \mathbf{f}/\|\mathbf{f}\|. \qquad (3\text{-}3\text{-}8)$$

Subtraction of the shadow-shading variant $\mathbf{s}_1$ from the total derivative $\mathbf{f}_x$ results in the shadow-shading quasi-invariant $\mathbf{s}_2 = \mathbf{f}_x - \mathbf{s}_1$. In summary, the derivative of the color tensor can be separated into a shadow-shading variant part $\mathbf{s}_1$ and a shadow-shading invariant part $\mathbf{s}_2$.
The shadow-shading invariant part does not contain the derivative energy caused by shadows and shading. To construct a shadow-shading-specular quasi-invariant, this part is combined with the hue direction, which is perpendicular to the light source direction $\mathbf{c}_i$ and the shadow and shading direction $\mathbf{c}_b$. Therefore the hue direction is

$$\mathbf{h} = (\mathbf{c}_b \times \mathbf{c}_i)/\|\mathbf{c}_b \times \mathbf{c}_i\|. \qquad (3\text{-}3\text{-}9)$$

The projection of the derivative on the hue direction is the desired shadow-shading-specular quasi-invariant part:

$$\mathbf{H} = \big(\mathbf{f}_x^T\, \mathbf{h}/\|\mathbf{h}\|\big)\, \mathbf{h}/\|\mathbf{h}\|. \qquad (3\text{-}3\text{-}10)$$
By replacing $\mathbf{f}_x$ in the color tensor equation (3-3-6) by $\mathbf{s}_2$ or $\mathbf{H}$, we obtain the shadow-shading quasi-invariant color tensor and the shadow-shading-specular quasi-invariant color tensor, respectively. By setting a suitable threshold value on the color tensor, we can detect the boundary of the object. Fig. 3-10 shows some segmentation results that were obtained using the color tensor method, followed by Bayesian matting for extracting the foreground from the background. After segmentation, the hard boundary of the object is obtained. Matting can then be applied to obtain soft segmentation information, called the matte, of the object. The matte, which is an image containing the portion of foreground with respect to the background (from 0 to 1) at a particular
location, greatly improves the visual quality of mixing the objects onto
other backgrounds.
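A minimal numpy sketch of this shadow-shading quasi-invariant computation is given below; it forms $\mathbf{s}_2 = \mathbf{f}_x - \mathbf{s}_1$ per pixel and thresholds its derivative energy to locate object boundaries. The input array and the threshold are assumptions, and the Bayesian matting step is not included:

    import numpy as np

    def shadow_shading_invariant_energy(img):
        # img: H x W x 3 float RGB image f(x, y).
        f = img.astype(np.float64)
        fx = np.gradient(f, axis=1)     # derivative of f with respect to x
        fy = np.gradient(f, axis=0)     # derivative of f with respect to y

        f_hat = f / (np.linalg.norm(f, axis=2, keepdims=True) + 1e-8)

        def s2(d):
            # Remove the projection on the shadow-shading axis (3-3-8).
            s1 = (d * f_hat).sum(axis=2, keepdims=True) * f_hat
            return d - s1

        s2x, s2y = s2(fx), s2(fy)
        # Derivative energy of the shadow-shading invariant tensor components.
        return (s2x ** 2).sum(axis=2) + (s2y ** 2).sum(axis=2)

    # boundary = shadow_shading_invariant_energy(img) > threshold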
(a)
(b)
Figure 3-10: (a) Extraction results using the color-tensor-based method. Left: original; middle: hard segmentation; right: after matting. (b) Close-up of the segmentations in (a). Left: hard segmentation; right: after matting.
3.3.2 Moveable Camera System
Unlike the static camera system described before, the moveable
camera system will experience shaky motion during movement and
hence video stabilization has to be performed.
3.3.2.1 Video Stabilization
To ensure good tracking of objects and to obtain more image samples for high-quality rendering, the wheel chair is usually driven steadily during capturing. However, one problem with the M-IBR system is that the ground surfaces may not be smooth and the whole mechanical structure can vibrate considerably during movement. In our M-IBR system, the shaky motion of the camera array in the outdoor environment comes mainly from the roughness of the ground surfaces and the vibration of the mechanical structure during the movement. Besides, the captured video may also appear shaky when the system is moving and about to settle down in an indoor environment. To reduce these annoying effects, video stabilization [Hu 2007], [Mats 2005], [Mats 2006], [Rata 1998] is frequently employed to eliminate the undesired motion fluctuation in the captured videos.
As mentioned above, our M-IBR system was driven steadily during
capturing. Therefore, the undesired motion fluctuation will usually
appear as high frequency components compared to the intentional
motion. As a result, the problem of video stabilization can also be
viewed as the removal of high frequency components in the estimated
velocity. To this end, one needs to estimate the global motion of the camera, say by means of optical flow on the video sequence, so that this annoying high-frequency local motion can be removed to stabilize the videos.
The proposed video stabilization algorithm is divided into three major steps as follows (a sketch of step 1 is given after this list). 1) Global motion estimation: firstly, the geometric transformation between a location $\mathbf{x} = [x_1, x_2]^T$ in a frame and the corresponding location $\mathbf{x}'$ in an adjacent frame is modeled by an affine transformation $\mathbf{x}' = T[\mathbf{x}] = \mathbf{A}\mathbf{x} + \mathbf{t}$, where $\mathbf{t} = [t_{x_1}, t_{x_2}]^T$ is the translational component and the affine rotation, scaling, and stretch are represented by the matrix $\mathbf{A} = \begin{bmatrix} a_1 & a_2 \\ a_3 & a_4 \end{bmatrix}$. In homogeneous coordinates, $\mathbf{x}_h = [x_1, x_2, 1]^T$, T can be conveniently represented by a matrix multiplication $\mathbf{x}'_h = \mathbf{T}_h \mathbf{x}_h$, where $\mathbf{T}_h = \begin{bmatrix} \mathbf{A} & \mathbf{t} \\ \mathbf{0} & 1 \end{bmatrix}$. $\mathbf{T}_h$ is estimated from the tracked features in adjacent video frames using the scale-invariant feature transform (SIFT) [Lowe 2004], instead of the Lucas-Kanade tracker in [Chan 2010]. 2) Local smoothing of motion: the intentional motion, which is assumed to be slow and smooth, is then obtained by smoothing the estimated global motion using local polynomial regression (LPR) with adaptive bandwidth selection [Zhan 2009]. Unlike conventional methods, the bandwidth or window size for smoothing can be automatically determined; this will be further discussed below. 3) Video completion: the uncovered areas are filled using motion inpainting [Mats 2005], [Mats 2006].
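A minimal sketch of step 1, assuming OpenCV's SIFT implementation and a robust affine fit between consecutive grayscale frames; the ratio-test and RANSAC settings are illustrative choices, not those of [Chan 2010]:

    import cv2
    import numpy as np

    def estimate_global_motion(prev_gray, curr_gray):
        # Returns the 2x3 affine matrix [A | t] mapping prev_gray to curr_gray.
        sift = cv2.SIFT_create()
        kp1, des1 = sift.detectAndCompute(prev_gray, None)
        kp2, des2 = sift.detectAndCompute(curr_gray, None)

        matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
        good = [m for m, n in matches if m.distance < 0.75 * n.distance]

        src = np.float32([kp1[m.queryIdx].pt for m in good])
        dst = np.float32([kp2[m.trainIdx].pt for m in good])

        # Robust affine estimation; RANSAC rejects mismatched features.
        M, _ = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC,
                                    ransacReprojThreshold=3.0)
        return M    # A = M[:, :2], t = M[:, 2]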
We now describe each step in more detail. Let $\{I_t(\mathbf{x}) \mid t = 0, \ldots, N\}$, where $\mathbf{x} = [x_1, x_2]^T$, $1 \le x_1 \le \Gamma_1$, $1 \le x_2 \le \Gamma_2$, be a video sequence consisting of N video frames with resolution $\Gamma_1 \times \Gamma_2$ captured by our M-IBR system. Consider the global motion transformations up to time instant t, $\{\mathbf{T}_0^1, \ldots, \mathbf{T}_{t-1}^t\}$, where $\mathbf{T}_i^{i+1}$ is the coordinate transformation from the i-th to the (i+1)-th frame. If each $\mathbf{T}_i^{i+1}$ is smoothed separately, a smoothed transformation chain $\{\hat{\mathbf{T}}_0^1, \ldots, \hat{\mathbf{T}}_{t-1}^t\}$ is obtained and the t-th compensated image frame $I'_t$ can be obtained as

$$I'_t(\mathbf{x}) = I_t\Big(\Big(\prod_{i=0}^{t-1} \mathbf{T}_{i+1}^{i}\, \hat{\mathbf{T}}_i^{i+1}\Big)[\mathbf{x}]\Big), \qquad (3\text{-}3\text{-}11)$$

where $\mathbf{T}_{i+1}^{i}$ and $\hat{\mathbf{T}}_i^{i+1}$ denote respectively the transformation from frame i+1 to i and the smoothed transformation from frame i to i+1. In order to avoid error accumulation due to the cascade of original and
smoothed transformation chains, [Mats 2005] proposed to compute directly the transformation $\tilde{\mathbf{T}}_t$ from the current frame $I_t(\mathbf{x})$ to the corresponding motion-compensated frame $I'_t(\mathbf{x})$ using only the neighboring transformation matrices as

$$\tilde{\mathbf{T}}_t = \sum_{i \in \mathcal{N}_t} \mathbf{T}_i^t \circledast G(i - t),$$

where $\mathcal{N}_t = \{i : t - f \le i \le t + f\}$ denotes the indices of the neighboring frames, $G(x) = (2\pi\sigma^2)^{-1/2} e^{-x^2/(2\sigma^2)}$ is a Gaussian kernel, $2\sigma$ is the support of $\mathcal{N}_t$ or window size, and $\circledast$ denotes the element-wise convolution operation.
It can be seen that the selection of the kernel size affects the degree of smoothing. A large kernel size will lead to the problem of over-smoothing, while a small kernel size may not be able to remove the high-frequency undesirable motion. The green and black lines in Fig. 3-11 illustrate the effect of using a small kernel size of $\sigma = 3$ and a large kernel size of $\sigma = 20$, respectively, using the method in [Mats 2005].
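For intuition, a fixed-kernel version of this element-wise Gaussian smoothing can be sketched as follows, under the assumption that the six affine parameters per frame are stacked into an N × 6 array; applying the resulting compensating warps would then use a routine such as cv2.warpAffine:

    import numpy as np
    from scipy.ndimage import gaussian_filter1d

    def smooth_motion_chain(params, sigma):
        # params: N x 6 array, one flattened 2x3 affine [A | t] per frame.
        # Each parameter is convolved with a Gaussian along the time axis,
        # mimicking the element-wise smoothing of the transformation chain.
        return gaussian_filter1d(params, sigma=sigma, axis=0)

    # smooth_motion_chain(params, sigma=3)  -> may keep high-frequency shake
    # smooth_motion_chain(params, sigma=20) -> risks over-smoothing the intent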
To address this issue, we propose a new method for choosing the kernel size adaptively using local polynomial regression (LPR) with adaptive bandwidth selection. The close relationship between curve fitting and video stabilization has been recognized, for example, in [Hu 2007], where a local parabolic fitting is used to compute the smoothed motion path; however, the kernel size there is also fixed. The advantage of our method is that the kernel size can be adaptively selected from the data.
LPR is a very flexible and efficient nonparametric regression
method in statistics, and it has been widely applied in many research
areas such as data smoothing, density estimation, and nonlinear
modeling. Given a set of noisy samples of a signal, the data points are
fitted locally by a polynomial using the least-squares (LS) criterion with
a kernel function having certain bandwidth parameters. Since signals
may vary considerably over time, it is crucial to choose a proper kernel size or local bandwidth to achieve the best bias-variance tradeoff. In this thesis, we use the refined intersection of confidence intervals (R-ICI) method to perform bandwidth selection. Here, we follow the homoscedastic data model of the time series:
$$Y_i = m(X_i) + \sigma(X_i)\,\varepsilon_i, \qquad (3\text{-}3\text{-}12)$$

where $\{(Y_i, X_i) \mid i = 1, 2, \ldots, n\}$ is a set of univariate observations, $m(X_i)$ is a smooth function specifying the conditional mean of $Y_i$ given $X_i$, and $\varepsilon_i$ is an independent identically distributed (i.i.d.) additive white Gaussian noise. The problem is to estimate $m(X_i)$ and its k-th derivative $m^{(k)}(X_i)$ from the noisy samples $Y_i$ so as to achieve
smoothing. Since $m(X_i)$ is a smooth function, we can approximate it locally as a general degree-p polynomial at a given point $x_0$:

$$m(x) \approx m(x_0) + m'(x_0)(x - x_0) + \frac{m''(x_0)}{2!}(x - x_0)^2 + \cdots + \frac{m^{(p)}(x_0)}{p!}(x - x_0)^p = \beta_0 + \beta_1 (x - x_0) + \cdots + \beta_p (x - x_0)^p, \qquad (3\text{-}3\text{-}13)$$
where x is in the neighborhood of $x_0$ and $\beta_k$ $(k = 0, 1, \ldots, p)$ is the k-th polynomial coefficient. The coefficient vector $\boldsymbol{\beta} = [\beta_0, \beta_1, \ldots, \beta_p]^T$ at location $x_0$ can be obtained by solving the following weighted least-squares (WLS) regression problem:
$$\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} K_h(X_i - x_0)\Big[Y_i - \sum_{k=0}^{p} \beta_k (X_i - x_0)^k\Big]^2, \qquad (3\text{-}3\text{-}14)$$

where $K_h(X_i - x_0) = K\big((X_i - x_0)/h\big)/h$ and $K(\cdot)$ is a kernel function with bandwidth parameter h, which emphasizes the influence of the neighboring observations around $x_0$ in the estimation. The parameter h is adaptively chosen at different locations $x_0$ so as to adapt to the local characteristics of the signal (i.e. the intentional motion path). Differentiating the objective function in (3-3-14) with respect to $\boldsymbol{\beta}$ and setting the derivative to zero, we get the following LS solution in matrix form:
$$\hat{\boldsymbol{\beta}}(x_0, h) = (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \mathbf{y}, \qquad (3\text{-}3\text{-}15)$$

where

$$\mathbf{X} = \begin{bmatrix} 1 & (X_1 - x_0) & \cdots & (X_1 - x_0)^p \\ 1 & (X_2 - x_0) & \cdots & (X_2 - x_0)^p \\ \vdots & \vdots & & \vdots \\ 1 & (X_n - x_0) & \cdots & (X_n - x_0)^p \end{bmatrix}, \qquad \mathbf{y} = [Y_1, Y_2, \ldots, Y_n]^T,$$

and $\mathbf{W} = \mathrm{diag}\{K_h(X_i - x_0)\}$ is the weighting matrix.
By estimating $\hat{\boldsymbol{\beta}}(x_0, h)$ with an optimized bandwidth h at different $x_0$, we obtain a smoothed representation of the data from the noisy observations. In the context of video stabilization, a key problem of applying LPR is thus to select an optimal bandwidth parameter h to achieve the best bias-variance tradeoff in estimation. Here, we use the R-ICI bandwidth selection algorithm [Zhan 2008] to select the optimal bandwidth. The basic idea of the R-ICI adaptive bandwidth selection method is to calculate a set of smoothing results with different bandwidths and then to examine a sequence of confidence intervals of these smoothing results to determine and refine the optimal bandwidth.
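For a fixed bandwidth h, the WLS solution (3-3-15) reduces to a few lines of numpy. The sketch below uses the Epanechnikov kernel introduced next and a made-up noisy path as the example signal, and returns the smoothed estimate $\hat{\beta}_0 = \hat{m}(x_0)$; the R-ICI bandwidth search itself is omitted:

    import numpy as np

    def lpr_fit(X, Y, x0, h, p=2):
        # Weighted least-squares solution (3-3-15) at location x0.
        u = (X - x0) / h
        w = np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0) / h  # Epanechnikov K_h
        Xmat = np.vander(X - x0, N=p + 1, increasing=True)  # columns (X - x0)^k
        XtW = Xmat.T * w                                    # X^T W without forming W
        beta = np.linalg.solve(XtW @ Xmat, XtW @ Y)
        return beta[0]                                      # estimate of m(x0)

    # Example: smoothing a noisy 1-D motion path frame by frame.
    t = np.arange(100, dtype=float)
    noisy = np.sin(t / 15.0) + 0.05 * np.random.randn(t.size)
    smoothed = np.array([lpr_fit(t, noisy, x0, h=8.0) for x0 in t])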
In this thesis, the kernel $K(u)$ is chosen as the Epanechnikov kernel $K(u) = \frac{3}{4}(1 - u^2)$ for $|u| \le 1$, and the bandwidth parameter set for the R-ICI method is $\{h_j \mid h_j = a^j/N,\ j = 1, \ldots, 10\}$ with $a = 1.2$, where N is the total number of frames. The details of the algorithm are omitted, and interested readers are