Title: Mutual information-based depth estimation and 3D reconstruction for image-based rendering systems
Advisor(s): Chan, SC; Chang, C
Author(s): Zhu, Zhenyu
Citation:
Issued Date: 2012
URL: http://hdl.handle.net/10722/173910
Rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
Mutual Information-based Depth Estimation
and 3D Reconstruction for Image-basedRendering Systems
by
ZHU Zhenyu
B.Eng.
Ph. D. Thesis
A thesis submitted in partial fulfillment of the requirements for
the Degree of Doctor of Philosophy
at the University of Hong Kong
July 2012
Abstract of thesis entitled
Mutual Information-based Depth Estimation
and 3D Reconstruction for Image-based
Rendering Systems
Submitted by
ZHU Zhenyu
for the degree of Doctor of Philosophy
at the University of Hong Kong
in July 2012
Image-based rendering (IBR) is an emerging technology for
rendering photo-realistic views of scenes from a collection of densely
sampled images or videos. It provides a framework for developing
revolutionary virtual reality and immersive viewing systems. There has
been considerable progress recently in the capturing, storage and
transmission of image-based representations. This thesis proposes two
image-based rendering (IBR) systems for improving the viewing
freedom and environmental modeling capability of conventional static
IBR systems. The first system consists of a circular array with 13 still
cameras (Canon 550D) for capturing ancient Chinese artifacts at high
resolution. The second one is constructed by mounting a linear array of 8
video cameras (Sony HDR-TGIE) on an electrically controllable wheel
chair whose motion can be controlled manually or remotely through a wireless local area network (LAN) by means of additional hardware circuitry.
Both systems support object-based rendering and 3D reconstruction
capability and consist of two main components. 1) A novel view
synthesis algorithm using a new segmentation and mutual information
(MI)-based algorithm for dense depth map estimation, which relies on segmentation, local polynomial regression (LPR)-based depth map smoothing and an MI-based matching algorithm to iteratively estimate the
depth map. The method is very flexible and both semi-automatic and
automatic segmentation methods can be employed. They rank fourth and
sixth, respectively, in the Middlebury comparison of existing depth
estimation methods. This allows high quality renderings of outdoor and
indoor scenes with improved mobility/freedom to be obtained. This
algorithm can also be extended to object tracking. Experimental results
also show that the proposed MI-based algorithms are applicable to
robust registration in noisy dynamic ultrasound images. 2) A new 3D
reconstruction algorithm which utilizes sequential-structure-from-motion
(S-SFM) technique and the dense depth maps estimated previously. It
relies on a new iterative point cloud refinement algorithm based on
Kalman filter (KF) for outlier removal and the segmentation-MI-based
algorithm to further refine the correspondences and the projection
matrices. The mobility of our system allows us to recover the 3D models of static objects more conveniently from the improved point cloud
using a new robust radial basis function (RBF)-based modeling
algorithm to further suppress possible outliers and generate smooth 3D
meshes of objects. Moreover, a new rendering technique named view
dependent texture mapping is used to enhance the final rendering effect.
Experimental results show that the proposed 3D reconstruction
algorithm significantly reduces the adverse effect of the outliers and
produces high quality renderings using view dependent texture mapping
and the model reconstructed.
Overall, this study provides a framework for designing IBR systems
with improved viewing freedom and ability to cope with moving and
static objects in indoor and outdoor environment.
An abstract of exactly 439 words
Declaration
I hereby declare that this dissertation, submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy and entitled
“Mutual Information-based Depth Estimation and 3D Reconstruction for
Image-based Rendering Systems” represents my own work except where
due acknowledgement is made, and has not been previously included in
a thesis, dissertation, or report submitted to this or any other institution
for a degree, diploma or other qualification.
Zhu Zhenyu
August 2012
Acknowledgement
First of all, I would like to extend my sincere gratitude to my supervisors, Dr. S. C. Chan and Dr. C. Q. Chang, for their instructive advice and useful suggestions on my thesis. Without their consistent and illuminating guidance, this thesis would not have been possible.
Besides, I highly appreciate all the postgraduate students and staff in
the Digital Signal Processing (DSP) Laboratory for their helpful
discussion and support. They are: Dr. K. T. Ng, Dr. Z. G Zhang, Dr. K.
M. Tsui, Mr. James Koo, Mr. B. Liao, Mr. C. Wang, Mr. S. Zhang, Mr.
H. C. Wu and Miss Y. J. Chu.
Last but not least, my thanks go to my beloved family for their patience and love all through these years.
Contents
DECLARATION .......................................................... IV
ACKNOWLEDGEMENT ...................................................... V
CONTENTS ............................................................. VI
LIST OF FIGURES ...................................................... X
LIST OF TABLES ....................................................... XV
LIST OF ABBREVIATIONS ................................................ XVI
CHAPTER 1 INTRODUCTION ............................................... 1
1.1 BACKGROUND ....................................................... 1
1.2 THESIS OUTLINE ................................................... 4
CHAPTER 2 REVIEW OF BASIC TOPICS IN IMAGE-BASED RENDERING ........... 9
2.1 INTRODUCTION ..................................................... 9
2.2 REVIEW OF PLENOPTIC FUNCTION .................................... 9
2.2.1 Basic Theory ................................................... 10
2.3 REVIEW OF LIGHT FIELD ............................................ 13
2.3.1 Creating/capturing light field ................................. 13
2.4 REVIEW OF RENDERING TECHNIQUES ................................... 14
2.5 SUMMARY .......................................................... 18
CHAPTER 3 THE PROPOSED IMAGE-BASED RENDERING SYSTEMS ................ 20
3.1 INTRODUCTION ..................................................... 20
3.2 CONSTRUCTION OF THE PROPOSED IBR SYSTEMS ........................ 23
3.2.1 Still Camera System ............................................ 23
3.2.2 Moveable Camera System ......................................... 26
3.3 PRE-PROCESSING ................................................... 30
3.3.1 Still Camera System ............................................ 30
3.3.1.1 Camera Calibration ........................................... 30
3.3.1.2 Color-Tensor-based Segmentation and Matting .................. 33
3.3.2 Moveable Camera System ......................................... 36
3.3.2.1 Video Stabilization .......................................... 36
3.4 SUMMARY .......................................................... 44
CHAPTER 4 A NEW COMBINED SEGMENTATION-MUTUAL-INFORMATION (MI)-BASED ALGORITHM FOR DENSE DEPTH MAP ESTIMATION ........ 45
4.1 INTRODUCTION ..................................................... 45
4.2 COMBINED SEGMENTATION-MI-BASED DEPTH ESTIMATION ................. 46
4.2.1 Object Segmentation Using Level-Set Method ..................... 47
4.2.2 Mutual Information Matching .................................... 49
4.3 DEPTH MAP REFINEMENT ............................................. 54
4.3.1 Occlusion Detection and Inpainting ............................. 56
4.3.2 Smoothing of Depth Maps ........................................ 56
4.4 MUTUAL INFORMATION (MI)-BASED OBJECT TRACKING ................... 63
4.5 MORE RESULTS AND COMPARISON ...................................... 65
4.6 SUMMARY .......................................................... 69
CHAPTER 5 3D RECONSTRUCTION AND MODELING ............................. 71
5.1 INTRODUCTION ..................................................... 71
5.2 HOMOGENEOUS GEOMETRY ............................................. 74
5.3 POINT MATCHING IN THE STILL IBR SYSTEM ........................... 76
5.3.1 Epipolar Geometry .............................................. 76
5.3.2 Finding Correspondent Points ................................... 77
5.4 VIEW DEPENDENT TEXTURE MAPPING ................................... 82
5.5 POINT MATCHING AND REFINEMENT IN THE MOVEABLE IBR SYSTEM ........ 88
5.5.1 Structure-from-motion .......................................... 88
5.5.2 Point Cloud Generation and Refinement: KF-based Outlier Detection and Point Cloud Fusion ........ 90
5.6 RBF MODELING AND MESH GENERATION ................................. 98
5.7 SUMMARY .......................................................... 102
CHAPTER 6 CONCLUSION AND FUTURE RESEARCH ............................. 103
6.1 CONCLUSION ....................................................... 103
6.2 FUTURE RESEARCH .................................................. 105
APPENDIX I PUBLICATIONS .............................................. 107
REFERENCES ........................................................... 109
List of Figures
Figure 1-1  Spectrum of IBR representations [Chan 2010] ........ 2
Figure 2-1  Light field describes the amount of light in radiance along light rays traveling in every direction through every point in empty space [Ikeu 2012] ........ 10
Figure 2-2  Forward mapping ........ 16
Figure 2-3  Example renderings using (a) forward mapping in point rendering [Chan 2005], (b) layered representation (with two layers – dancer and background) [Chan 2009], (c) monolithic rendering using 3D polygonal mesh (left) and rendering results (right) [Zhu 2010] ........ 19
Figure 3-1  Plenoptic videos: multiple linear camera arrays of a 4D simplified dynamic light field with viewpoints constrained along line segments. The camera arrays were developed in [Chan 2009]; each consists of 6 JVC video cameras ........ 21
Figure 3-2  Circular camera array constructed ........ 24
Figure 3-3  Snapshots: (a) Buddha, (b) Dragon Vase ........ 25
Figure 3-4  Block diagram of the proposed IBR system ........ 26
Figure 3-5  The proposed moveable image-based rendering system ........ 27
Figure 3-6  Snapshots of the plenoptic videos at a given time instance: (a) the "Podium" outdoor video from camera 1 to camera 4 and (b) the "Presentation" indoor video from camera 1 to camera 4 ........ 29
Figure 3-7  Block diagram of the proposed M-IBR system constructed ........ 30
Figure 3-8  Relationship between the world coordinate and the camera coordinate ........ 31
Figure 3-9  Planar pattern ........ 33
Figure 3-10 (a) Extraction results using the color-tensor-based method. Left: original; middle: hard segmentation; right: after matting. (b) Close-up of the segmentations in (a). Left: hard segmentation; right: after matting ........ 36
Figure 3-11 Motion smoothing results for the horizontal (Translation-x) and vertical (Translation-y) directions. The original motion path and the smoothed motion paths with different methods are shown. In (a)-(b), the blue dotted lines correspond to the shaky original motion path; green and black lines correspond to the smoothed motion paths using the method in [Mats 2005] with a small and a large kernel size respectively ........ 43
Figure 3-12 Video stabilization results. The first row shows the original images captured by our system; the second row shows the stabilized images without video completion; the third row shows the completed results ........ 43
Figure 4-1  Segmentation results using the level-set-based tracking method. (a) is the initial segmentation obtained by lazy snapping; (b) is the initial segmentation obtained by the graph cut method ........ 48
Figure 4-2  A regular grid for local transformation ........ 51
Figure 4-3  (a) is an example depth map obtained by using MI matching without segmentation information; (b) shows the depth map obtained by using automatic segmentation MI matching; (c) shows the depth map obtained by using semi-automatic segmentation MI matching. Green areas in (c) are the occlusion areas detected by our algorithm. (d)-(e) show the refined depth maps of (c) by inpainting and smoothing (c) using SK-LPR-R-ICI and a 25×25 ideal low-pass filter, respectively ........ 55
Figure 4-4  (a) and (b) show the renderings obtained by Figs. 4-2(d) and (b); (c) and (d) are the enlargements of the red boxes ........ 59
Figure 4-5  Rendering results obtained by the proposed algorithm. (a) shows the depth maps corresponding to the images in (b). The highlighted images in (b) show the rendered views from the adjacent views in (b) using the depth maps in (a). (c) shows depth maps at other positions ........ 62
Figure 4-6  Example rendering results. The first row shows the original images captured by our M-IBR system. The second and third rows show renderings with a step-in ratio of about 1.15 to 1.25 times ........ 63
Figure 4-7  Object tracking at different time instances ........ 65
Figure 4-8  Teddy test images [Scha 2002] and depth maps for comparison. (a) LEFT image; (b) RIGHT image; (c) ground truth depth map; (d) depth map calculated by semi-automatic segmentation-based MI matching; (e) depth map calculated by automatic segmentation-based MI matching ........ 67
Figure 4-9  Results for the "conference" sequence. (a) and (c) are two sample frames. (b) and (d) are the depth maps of (a) and (c), respectively ........ 68
Figure 4-10 Ultrasound images of the RF muscle under relaxed condition and at a 50% maximal voluntary contraction (MVC) level, and the corresponding images with outlined boundary contours. The tracked boundaries are highlighted in green ........ 69
Figure 5-1  Epipolar geometry ........ 76
Figure 5-2  Feature point detection. The red points in (a) and (b) are the feature points. (a) is from the first camera; (b) is from the second camera ........ 80
Figure 5-3  Epipolar line ........ 81
Figure 5-4  Rectified images. (a) is the rectified left image; (b) is the rectified right image. (c) is part of (a); (d) is part of (b) ........ 81
Figure 5-5  An initial point cloud extracted, with noise and outliers ........ 82
Figure 5-6  View dependent texture mapping ........ 83
Figure 5-7  View dependent texture. Left: blurred texture. Right: texture after the proposed view dependent texture mapping ........ 85
Figure 5-8  3D models of ancient Chinese artifacts. (a) Dragon Vase, (b) Buddha, (c) Green Bottle, (d) Bowl, (e) Brush Pot, (f) Tri-Pot, (g) Wine Glass ........ 87
Figure 5-9  Rendering results of ancient Chinese artifacts ........ 88
Figure 5-10 Iterative refinement of the point cloud: (a) initial point cloud; (b) point cloud after outlier detection and Kalman filtering; (c) point cloud after the proposed iteration method ........ 91
Figure 5-11 (a)-(b) show the 3D to 2D re-projection at frame 20 and frame 21, respectively. Blue points are inliers. Green points are outliers detected by the segmentation consistency check. Red points are the outliers detected by the intensity and location consistency checks. (c) shows the enlargement of the highlighted area in (a). The point cloud is down-sampled for better visualization ........ 97
Figure 5-12 Convergence behavior of the root mean square distance (RMSD) versus the number of iterations for the proposed iterative 3D reconstruction algorithm. The blue line shows the RMSD values with the KF-based outlier detection; the red line shows the RMSD values without KF-based outlier detection ........ 97
Figure 5-13 3D reconstruction results (a) without using RBF, (b) using RBF without outlier detection and (c) using RBF with outlier removal ........ 100
Figure 5-14 Object-based rendering results of the "Podium" sequence using the estimated 3D model and shadow field under different lighting conditions ........ 101
Figure 5-15 Object-based rendering results of the "conference" sequence. (a) and (b) are the 3D reconstruction results at two time instances; (c) and (d) are the rendering results of (a) and (b). Note that only partial geometry of the dynamic object is recovered, since it is only partially observable ........ 101
List of Tables
Table 2-1 A taxonomy of plenoptic functions…….................... 12
Table 4-1  Comparison of the ranks using the standard threshold of 1 pixel on the Middlebury test stereo images ........ 67
List of Abbreviations
BRDF Bidirectional Reflectance Distribution Function
BP Belief Propagation
CLF Circular Light Field
CPU Central Processing Unit
CSA Cross-Sectional Area
DCP Disparity Compensation Prediction
DCT Discrete Cosine Transform
DSCs Digital Still Cameras
DSP Digital Signal Processing
Fig. Figure
fps frames per second
GC Graph Cut
GPU Graphic Processing Unit
HD High Definition
IBR Image-Based Rendering
i.i.d. independent identically distributed
ISKR Iterative Steering Kernel Regression
JVT Joint View Triangulation
KF Kalman Filter
LAN Local Area Network
L-BFGS Limited-memory Broyden-Fletcher-Goldfarb-Shanno
LDIs Layered Depth Images
LPR Local Polynomial Regression
LS Least Square
MCU Micro-Controller Unit
MI Mutual Information
M-IBR Moveable Image-Based Rendering
MRF Markov Random Field
MVC Maximal Voluntary Contraction
PCA Principal Component Analysis
pdf probability density function
PRT Pre-computed Radiance Transfer
QPP Quadratic Programming Problem
RANSAC RANdom SAmple Consensus
RBF Radial Basis Function
RF Rectus Femoris
R-ICI Refined Intersection of Confidence Intervals
RMSD Root-Mean Squared Distance
SCLF Simplified Circular Light Field
SFM Structure-From-Motion
SPIHT Set Partitioning In Hierarchical Trees
S-SFM Sequential-Structure-From-Motion
Chapter 1 Introduction
1.1 Background
Image-based rendering/representation (IBR) [Chen 1995], [Debe
1996], [Gort1996], [Levo 1996], [McMi 1995], [Pele 1997], [Szel 1997],
[Shad 1998], [Shum 1999] is a promising technology for rendering new
views of scenes from a collection of densely sampled images or videos.
It has potential applications in virtual reality, immersive television and
visualization systems. Central to IBR is the plenoptic function [Adel 1991], which describes the intensity of each light ray in the world as a function of visual angle, wavelength, time, and viewing position. The plenoptic function is thus a 7-dimensional function of the viewing position $(V_x, V_y, V_z)$, the azimuth and elevation angles $(\theta, \phi)$, time $t$, and wavelength $\lambda$. Traditional images and videos are just 2D and 3D special cases of the plenoptic function. In principle, one can reconstruct any view in space and time if a sufficient number of samples of the plenoptic function are available. The rendering of novel views can
therefore be viewed as the reconstruction of the plenoptic function from
its samples. Image-based representations are usually densely sampled
high dimensional data with large data sizes, but their samples are highly
correlated. Because of the multidimensional nature of image-based
representations and scene geometry, much research has been devoted to
the efficient capturing, sampling, rendering and compression of IBR.
Depending on the functionality required, there is a spectrum of IBR representations, as shown in Fig. 1-1. They differ from each other in the amount of geometry information of the scenes/objects being used. At one end of the spectrum, as in traditional texture mapping, we have very accurate
geometric models of the scenes and objects, say generated by animation techniques, and only a few images are required to generate the textures. Given the 3-D models and the lighting conditions, novel views can be
rendered using conventional graphic techniques. Moreover, interactive
rendering with moveable objects and light sources can be supported
using advanced graphic hardware.
Figure 1-1 Spectrum of IBR representations [Chan 2010].
At the other extreme, light field or lumigraph rendering relies on dense sampling (by capturing more images/videos) with no or very little geometry information for rendering, without recovering the exact 3-D models. An important advantage of the latter is its superior image quality, compared with 3-D model building for complicated real world scenes.
Another important advantage is that it requires much less computational
resources for rendering regardless of the scene complexity, because most
of the quantities involved are pre-computed or recorded. This has
attracted considerable attention in the computer graphics community
recently in developing fast and efficient rendering algorithms for real-
time relighting and soft-shadow generation [Agra 2000], [Ng 2004],
[Sloa 2002], [Zhou 2005].
Broadly speaking, image-based representations can be classified
according to the geometry information used into three main categories: 1)
representations with no geometry, 2) representations with implicit
geometry and 3) representations with explicit geometry. 2-D Panoramas,
McMillan and Bishop's plenoptic modeling [McMi 1995], 3-D
concentric mosaics and light field/lumigraph belong to the first category
and they can be viewed as the direct interpolation of the plenoptic
function. Layer-based and object-based representations [Chan 2009] and pop-up light field [Shum 2004] using depth maps fall into the second. Finally,
conventional 3-D computer graphic models and other more sophisticated
representations [Deve 1998], [Wang 2005] belong to the last category.
Although these representations also sample the plenoptic function,
further processing of the plenoptic function has been performed to infer
the scene geometry or surface properties such as the bidirectional reflectance distribution function (BRDF) of objects. Such an image-based modeling approach has emerged as a more promising approach to enrich the photorealism and user interactivity of IBR. Moreover, since 3-D models of the scenes are unavailable, conventional image-based representations are limited to changes of viewpoint and sometimes a limited amount of
relighting. Recently, it was found that real-time relighting and soft-
shadow computation are feasible using the IBR concepts and the
associated 3-D models using pre-computed radiance transfer (PRT)
[Sloa 2002] and precomputed shadow fields [Zhou 2005].
For multiple camera arrays, the huge amount of data and vast
amount of viewpoints to be provided present one of the major challenges
to IBR. Advanced algorithms for processing and manipulation of the
high dimensional representation to achieve such functions as
segmentation, depth estimation, object tracking, 3D reconstruction, etc.
are all major challenges to be addressed. Finally, the efficient
transmission, compression and display of dynamic IBR and models are
also urgent issues awaiting satisfactory solutions in order for IBR to establish itself as an essential medium for communication and presentation.
All of these motivate us to study the design and construction of the
image-based rendering systems based on plenoptic videos. The system
can potentially provide improved viewing freedom to users and the ability to cope with moving and static objects for 3D reconstruction.
1.2 Thesis Outline
This thesis is devoted to the design of image-based rendering systems and their associated algorithms, so as to provide improved viewing freedom and modeling of stationary and moveable objects in outdoor and indoor environments. The major contributions of
this thesis are summarized as follows:
1) The construction of a high resolution IBR system for capturing and
rendering of ancient Chinese artifacts and a moveable IBR system
for capturing and rendering indoor and outdoor objects.
2) Development of a novel mutual information (MI)-based algorithm
combined with segmentation for dense depth map estimation and
object tracking.
3) A 3D reconstruction algorithm for objects, which employs the
estimated dense depth maps to obtain dense point correspondences
from multiple views for 3D reconstruction.
Details of these contributions are briefly described below:
1) The first prototype system uses a multiple still camera array to
capture ancient Chinese artifacts. Because of the high resolution of
the still camera (Canon 550D), we can obtain excellent rendering
quality. This system can be used for digital preservation and
dissemination of cultural artifacts with high digital quality. To avoid
possible damage to the artifacts and speed up the capturing process,
we propose to employ the image-based approach instead of using
traditional 3D laser scanners. A circular array consisting of multiple
digital still cameras was therefore constructed in this work. Using
this circular camera array, we developed novel techniques for
rendering new views of the artifacts from the images captured using
the object-based approach. The multiple views so synthesized enable
the ancient artifacts to be displayed in modern multi-view displays.
A number of ancient Chinese artifacts from the University Museum
and Art Gallery at the University of Hong Kong were captured and
excellent rendering results were obtained. The second prototype
system uses a linear camera array consisting of 8 video cameras
(Sony HDR-TGIE) mounted on an electrically controllable wheel
chair. Its motion can be controlled manually or remotely by means of
additional hardware circuitry. Unlike previous multiple-camera systems, which are not designed to be moveable, so that their viewpoints are somewhat limited and they usually cannot cope with moving objects or perform 3D reconstruction of objects in an open environment, our moveable image-based rendering system can be used to render large environments and moving objects.
2) A new combined segmentation-mutual-information (MI)-based
algorithm for dense depth map estimation is presented. It relies on
segmentation, local polynomial regression (LPR)-based depth map
smoothing and an MI-based matching algorithm to iteratively estimate
the depth map. The method is very flexible and both semi-automatic
and automatic segmentations can be used. The semi-automatic and automatic versions rank fourth and sixth, respectively, in the Middlebury
comparison of existing depth estimation methods. Using the depth
maps captured and the object-based approach, high quality
renderings of outdoor scenes along the trajectory can be obtained, which considerably improves the viewing freedom. The mutual
information-based matching algorithm is also extended to object
tracking algorithms. It can be used to track the boundary of an object
in a video sequence. Experimental results show that its performance
is reliable even for noisy videos such as dynamic ultrasound images.
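To make the MI matching criterion concrete, the following sketch estimates the mutual information between two grayscale patches from a joint intensity histogram; the 32-bin quantization and 8-bit intensity range are illustrative assumptions, not the exact settings used in the thesis.

```python
import numpy as np

def mutual_information(patch_a, patch_b, bins=32):
    # Joint histogram of corresponding pixel intensities (assumes 8-bit input)
    joint, _, _ = np.histogram2d(patch_a.ravel(), patch_b.ravel(),
                                 bins=bins, range=[[0, 256], [0, 256]])
    pxy = joint / joint.sum()          # joint pdf estimate
    px = pxy.sum(axis=1)               # marginal pdf of patch_a
    py = pxy.sum(axis=0)               # marginal pdf of patch_b
    nz = pxy > 0                       # skip empty bins to avoid log(0)
    # I(A;B) = sum_{a,b} p(a,b) * log( p(a,b) / (p(a) p(b)) )
    denom = (px[:, None] * py[None, :])[nz]
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / denom)))
```

A depth hypothesis for a pixel can then be scored by the MI between a window around it and the correspondingly displaced window in the adjacent view, keeping the hypothesis that maximizes the score.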
3) Using the IBR systems, correspondences from different views can be
integrated together for 3D reconstruction. For both of the systems,
camera calibration is first used to determine the values of the internal and external parameters of the cameras. For the still camera array, a major technique for finding corresponding points is epipolar geometry, which constrains corresponding points to lie on conjugate epipolar lines. By combining epipolar lines with scale-invariant feature transform (SIFT) [Lowe 2004] feature detection, accurate sparse corresponding points can be located. Then a Gabor filter, which is rather insensitive to noise, is used to obtain dense corresponding points. For the moveable image-based rendering
(M-IBR) system, the sequential-structure-from-motion (S-SFM)
technique is adopted to estimate the locations of the M-IBR system
so as to obtain an initial, fairly reliable 3D point cloud from the
2D correspondences. New iterative Kalman filter (KF)-based and
segmentation-MI-based algorithms are proposed to fuse the
correspondences from different views and remove possible outliers
to obtain an improved point cloud. More precisely, the proposed
algorithm relies on the KF to track the correspondences across
different views so as to suppress possible outliers while fusing
correspondences from different views. With these reliable matched
points, the camera parameters and hence the image correspondences
can be further refined by re-projecting the updated correspondences
to successive views to serve as prior features/correspondences for
MI-based matching. By iterating these processes, an improved point
cloud with reliable correspondences can be recovered. Simulation
results show that the proposed algorithm significantly reduces the
adverse effect of the outliers and generates a more reliable point
cloud. To recover the 3D model from the improved point cloud, a new robust RBF-based modeling algorithm is proposed to further
suppress possible outliers and generate smooth 3D surfaces from the
raw 3D point cloud. Compared with the conventional RBF-based
smoothing, it is more robust and reliable. Finally, view dependent
texture is incorporated to enhance the final rendering effect.
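As a rough illustration of the sparse matching step above, the sketch below uses OpenCV's SIFT detector, Lowe's ratio test and RANSAC fundamental-matrix estimation to keep only correspondences consistent with the epipolar constraint; the file names, ratio threshold and RANSAC parameters are placeholders rather than the thesis's actual settings.

```python
import cv2
import numpy as np

img1 = cv2.imread("view1.png", cv2.IMREAD_GRAYSCALE)  # placeholder inputs
img2 = cv2.imread("view2.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Ratio-test matching in the spirit of [Lowe 2004]
matcher = cv2.BFMatcher()
matches = [m for m, n in matcher.knnMatch(des1, des2, k=2)
           if m.distance < 0.75 * n.distance]

pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# RANSAC fundamental-matrix estimation rejects matches violating the
# epipolar constraint x2^T F x1 = 0.
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
inliers1 = pts1[mask.ravel() == 1]
inliers2 = pts2[mask.ravel() == 1]
```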
This thesis is divided into six chapters. In Chapter 2, some background materials on IBR are briefly reviewed. They include the plenoptic function, the light field and rendering techniques. In Chapter 3, the
design and construction of two IBR systems are presented. Some
pre-processing techniques for capturing the ancient Chinese artifacts
including camera calibration and color-tensor-based segmentation
are also introduced. Chapter 4 is devoted to a new combined segmentation-MI-based depth estimation algorithm. The 3D reconstruction and modeling algorithms are presented in Chapter 5. Two different point matching algorithms are studied first, and then an RBF modeling algorithm is proposed for mesh generation. After that, a view dependent texture mapping method for improving the rendering quality is presented. Finally, conclusions and future research topics are given in Chapter 6.
Chapter 2 Review of Basic Topics in Image-Based
Rendering
2.1 Introduction
In this chapter, the fundamental topics in image-based rendering are
briefly reviewed. In Section 2.2, the plenoptic function and its history are introduced. The theory of the light field is discussed in Section 2.3. Section 2.4 is devoted to the rendering techniques in IBR.
2.2 Review of Plenoptic Function
The plenoptic function was proposed by Adelson and Bergen [Adel 1991]. It is a function parameterized by visual angle, wavelength, time and viewing position that describes the intensity of each light ray in the world. All the information captured by an optical sensor can be depicted by this function. The plenoptic function is a 7-dimensional (7D) function consisting of a 3-dimensional (3D) position, a 2-dimensional visual angle, wavelength and time.
Sampling and processing of the plenoptic function were main research topics in early computer vision. For example, object motion can be described by the derivatives of the plenoptic function with respect to position and time. Because the wavelength is usually represented by the red, green and blue channels in digital image processing, the plenoptic functions of images and videos can be simplified into 2D and 3D special cases. Theoretically, if the sampling rate is sufficiently high, novel views at intermediate positions can be recovered from the samples. Algorithms that try to solve this problem are usually called image-based rendering.
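As a toy illustration of rendering as plenoptic-function reconstruction, the sketch below approximates a novel in-between view by blending the two nearest captured views; practical light field renderers interpolate per ray and use depth to correct parallax, rather than blending whole images.

```python
import numpy as np

def interpolate_view(img_left, img_right, alpha):
    # alpha in [0, 1] is the normalized position of the virtual viewpoint
    # between the two cameras. Ignoring parallax is only reasonable when
    # the views are densely sampled, which is exactly the IBR premise.
    return (1.0 - alpha) * img_left.astype(np.float32) \
         + alpha * img_right.astype(np.float32)
```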
Figure 2-1: Light field describes the amount of light in radiance along light rays
traveling in every direction through every point in empty space [Ikeu 2012].
Because the plenoptic function also describes the geometry and
surface properties, many algorithms are proposed to integrate the
geometry and surface information into image-based rendering to
improve user interaction and reduce the amount of samples required.
The capturing, sampling, rendering and processing of the plenoptic
function are all important research topics in IBR and related applications
such as computational photography, 3D/multiview videos and displays,
etc.
2.2.1 Basic Theory
The 7D plenoptic function is usually defined as $l(V_x, V_y, V_z, \theta, \phi, \lambda, t)$, where $(V_x, V_y, V_z)$ is the viewing position, $\theta$ and $\phi$ are the elevation and azimuth angles respectively, as shown in Fig. 2-1, and $\lambda$ and $t$ denote the wavelength and time respectively. By employing different parameterizations and simplifications, different image-based rendering algorithms can be derived from the plenoptic function.
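For exposition only, the dimensional hierarchy can be summarized with illustrative type signatures (a practical system stores sampled arrays rather than continuous functions; the names below are not from the thesis):

```python
from typing import Callable

# Full 7D plenoptic function l(Vx, Vy, Vz, theta, phi, lambda, t)
Plenoptic7D = Callable[[float, float, float, float, float, float, float], float]

# Static scene with wavelength folded into RGB channels: 5D plenoptic
# modeling l(Vx, Vy, Vz, theta, phi)
Plenoptic5D = Callable[[float, float, float, float, float], float]

# Static scene with radiance constant along rays: 4D light field l(u, v, s, t)
LightField4D = Callable[[float, float, float, float], float]

# Fixed viewpoint: 2D panorama l(theta, phi)
Panorama2D = Callable[[float, float], float]
```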
There are several camera systems that are usually used in image-based rendering for capturing. For a static scene, one camera can be rotated around the camera centre at a given position $V$ with different elevation and azimuth angles. The plenoptic function is then simplified to a panorama $l(\theta, \phi; V)$. A spherical camera array can provide another panoramic representation, because the captured images can be projected onto a cylinder. If a multiple video camera array is employed instead, a panoramic video can be obtained, and the plenoptic function is simplified to a 3D panorama $l(\theta, \phi, t; V)$ for dynamic scenes. The close relationship between the plenoptic function and image-based rendering is due to McMillan and Bishop [McMi 1995], who proposed plenoptic modeling using the 5D complete plenoptic function $l(V_x, V_y, V_z, \theta, \phi)$ for static scenes.
In a static scene, the radiance along rays is constant. Therefore the plenoptic function can be re-written as a 4D function, which is called the light field [Levo 1996] or lumigraph [Gort 1996] in computer graphics. The set of light rays in a 4D static light field can be parameterized in many ways; for example, the two-plane-based parameterization is usually used. By adding time to the static light field, a 5D plenoptic function can be obtained. The lumigraph incorporates depth maps into image-based rendering to improve the rendering quality, which can produce more accurate representations. In [Shum 1999], an outward facing camera moving on a circle was used to capture a series of densely sampled images.
A commonly used parameterization is the two-plane
parameterization, where a light ray in the light field is parameterized as
its intersections or coordinates with two parallel planes. These rays can
be captured by taking a series of pictures on a 2D rectangular plane, which results in an array of images. The light field concept can be similarly extended to time-varying or dynamic scenes, which results in a 5D function. The lumigraph is different from the light field, because geometry in the form of depth maps is used to improve the rendering quality, which can produce more sophisticated representations in image-based modeling. In [Shum 1999], a set of densely sampled images was captured by an outward facing camera moving on a circle; the resulting representation is called concentric mosaics. This system can render new views inside the circle. Some simplified systems have since been proposed to reduce the complexity, such as restricting the camera locations to line segments [Zitn 2004], [Chan 2005], [Chan 2009]. For time-varying or dynamic scenes, similar representations can be used. Because the light may change at different viewing positions and times, it has to be captured continuously, which can be done by video camera arrays. For static scenes, the light directions can be recorded first. Then one can relight the rendering
with arbitrary lightings. A brief summary of these plenoptic function representations is given in Table 2-1 [Ikeu 2012].

Table 2-1: A taxonomy of plenoptic functions.

Dimension   Year   View space        Name
7           1991   Free              Plenoptic function
5           1995   Free              Plenoptic modeling
4           1996   Bounding box      Light field / Lumigraph
3           1999   Bounding circle   Concentric Mosaics
2           1994   Fixed point       Cylindrical/Spherical panorama
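The following sketch shows how a stored two-plane light field might be sampled, assuming the data is a 4D array indexed by camera-plane coordinates (u, v) and image-plane coordinates (s, t); nearest-neighbour rounding stands in for the quadrilinear interpolation a real renderer would perform.

```python
import numpy as np

def sample_light_field(lf, u, v, s, t):
    # lf is assumed to be a 4D array indexed as lf[u, v, s, t]:
    # (u, v) selects the camera on the capture plane and (s, t) the
    # pixel on the image plane.
    ui, vi = int(round(u)), int(round(v))
    si, ti = int(round(s)), int(round(t))
    return lf[ui, vi, si, ti]

# A novel ray is rendered by intersecting it with both planes to obtain
# (u, v, s, t) and then sampling/interpolating the stored images there.
```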
2.3 Review of Light Field
The light field was first introduced in a paper by A. Gershun [Gers 1939] for studying surface illumination by artificial lighting. A similar concept was introduced to the computer graphics community as the light field in [Levo 1996] and the lumigraph in [Gort 1996]. The motivation is to render new views or images of objects or scenes from densely sampled images previously taken, so as to avoid building or capturing complicated 3D models. Light field or lumigraph rendering is a special representation in image-based rendering, requiring either no geometry [Levo 1996] or limited geometry in terms of depth maps [Gort 1996]. The light field and lumigraph are four-dimensional (4D) simplifications of the plenoptic function for static scenes.
2.3.1 Creating/capturing light field
A light field can be created by rendering 3D models with computer graphics or by capturing real objects with camera arrays. For real and static scenes, a light field can be captured by one still camera controlled by a mechanical arm, as in lumigraph rendering [Bueh 2001]. In [Adel 1991], [Ng 2005], a lenticular lens array was used to capture the light field. In [Veer 2007], [Lian 2008], a coded aperture, which can map rays from different directions to nearby pixels in the sensor array, was used to record images. Each of these images contains a set of pixels recording light from different directions. Novel views can be estimated by combining these 4D samples of the light field. In [Ng 2005], a microlens array was placed in front of the sensor of a handheld digital camera. The images
captured in this way can be refocused after they have been taken. A light field video can be obtained in a similar way.
Multiple camera systems are usually used to achieve large disparity in dynamic scenes, and much research effort has been devoted to the construction of 2D camera arrays. To simplify the capturing hardware, light fields captured on line segments and circular arcs, as mentioned in Section 2.2, have also been reported.
2.4 Review of Rendering Techniques
Rendering is the process of creating a new view from several images and other auxiliary information obtained from the representations. In the early stage of image-based rendering, no geometry information was employed. Image blending in panoramas [Chen 1995] and ray space interpolation in light fields [Levo 1996] were used for image rendering. In ray space interpolation, each ray that goes through a target pixel is mapped to nearby sampled rays. Since some sophisticated representations
use more geometry information such as layered depth images [Shad
1998], surface light field [Wood 2000], and pop-up light field [Shum
2004], graphics hardware has been exploited to accelerate the rendering
process. The geometry information can either be implicit, relying on positional correspondences, or explicit, in the form of depth along known lines-of-sight or 3D coordinates. Representations of the former
usually involve weakly calibrated cameras and rely on image
correspondences to render new views, say by triangulating two reference
images into patches according to the correspondences as in joint view
triangulation (JVT) [Lhui 2003]. These include view interpolation, view
morphing, JVT and transfer methods with fundamental matrices and
trifocal tensors. Representations employing explicit geometry include
sprites, relief textures, Layered Depth Images (LDIs), view-dependent
texture, surface light field, pop-up light field, shadow light field, etc.
In general, the rendering methods can be broadly classified into
three categories: 1) point-based, 2) layer-based, and 3) monolithic.
Point-based rendering works on 3D point clouds or point
correspondences and typically each point is rendered independently.
Points are mapped to the target image plane through forward mapping.
For the 3D point $X$ in Fig. 2-2, the mapping can be written as

$\lambda_t x_t = P_t (X - C_t), \quad \lambda_r x_r = P_r (X - C_r)$,   (2-4-1)

where $x_t$ and $x_r$ are homogeneous coordinates of the projection of $X$ on the target screen and reference image, respectively, $C$ and $P$ are the camera centre and projection matrix respectively, and $\lambda$ is a scale factor. Since $C_t$, $P_t$ and the focal length $f_t$ are known for the target image, $\lambda_t$ can be computed using the depth of $X$. Given $x_r$ and $\lambda_r$, one can compute the exact position of $x_t$ on the target screen and transfer the color accordingly. Gaps or holes may exist due to magnification, and disocclusion and splatting techniques have been proposed to solve this problem. The painter's algorithm is frequently used to avoid the problem that multiple pixels from the reference view are mapped to the same pixel in the target image.
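A minimal sketch of forward mapping under Eq. (2-4-1), assuming $P_t$ is the target camera's 3×3 projection matrix and $C_t$ its centre, both known from calibration:

```python
import numpy as np

def forward_map(X, P_t, C_t):
    # lambda_t * x_t = P_t (X - C_t); the third homogeneous component is
    # the scale factor lambda_t, proportional to the depth of X.
    x = P_t @ (X - C_t)
    lam = x[2]
    return x / lam, lam   # homogeneous pixel (u, v, 1) and the scale factor

# Point-based rendering forward-maps every reconstructed point this way;
# splatting then fills the gaps caused by magnification.
```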
Layered techniques usually separate the scene into a group of
planar layers consisting of a 3D plane with texture and optionally a
transparency map. The layers can be thought of as a continuous set of
polygonal models, which are amenable to conventional texture mapping
and view-dependent texture mapping. Usually, each layer is rendered
using either point-based rendering or polygon meshes as in monolithic rendering
techniques before being composed in the back-to-front order using the
painter’s algorithm to produce the final view. Layer-based rendering can
be implemented easily using a graphics processing unit (GPU). Since the rendering of IBR requires very low complexity, it is even possible to perform the calculation using the central processing unit (CPU) by working on individual layers or objects [Chan 2009].
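The back-to-front composition can be sketched as follows, assuming each layer has already been rendered into a color image and an alpha (transparency) map, and that the list is sorted from farthest to nearest:

```python
import numpy as np

def composite_layers(layers):
    # layers: list of (color, alpha) pairs, back to front; color is HxWx3
    # float in [0, 1] and alpha is HxW float in [0, 1].
    h, w, _ = layers[0][0].shape
    out = np.zeros((h, w, 3), dtype=np.float32)
    for color, alpha in layers:
        a = alpha[..., None]               # broadcast alpha over RGB
        out = a * color + (1.0 - a) * out  # standard "over" compositing
    return out
```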
Monolithic rendering usually represents the geometry as continuous
polygon meshes with textures, which can be readily rendered using
Figure 2-2: Forward mapping.
graphics hardware. The 3D model normally consists of vertices, normals
of vertices, faces, and texture mapping coordinates. The data can be
stored in a variety of data formats. The most popular formats
are .obj, .3ds, .max, .stl, .ply, .wrl, .dxf, etc.
Relighting, shadow generation and interactivity have played an
increasingly important role in 3D interactive rendering. The most
popular algorithms are shadow mapping, shadow volume, ray-tracing,
pre-computed radiance transfer, pre-computed shadow field, etc. Some
of them have better rendering quality, while others are more efficient for
real time rendering. Thanks to the development of GPUs, basic lighting
and shading algorithms like shadow mapping and shadow volume have
been realized on the fly. Modern GPUs can even offer programmable
rendering pipelines for customized rendering effects and “shader ” is a
set of software instructions running on these GPUs to control the
pipelines. Using shader programming, high quality shadow rendering
algorithms like the precomputed shadow field can be done in real time. Fig. 2-3 shows example renderings of the three techniques.
Though there has been substantial progress in capturing,
representing, rendering and modeling scenes, the ability to handle
general complex scenes remains challenging for IBR. A lot of work is
still required to ensure robustness in handling reflection, translucency, highlights, depth estimation, capturing complexity, object manipulation,
etc. Interacting with IBR representations remains challenging because
IBR uses images for rendering. Recent approaches have been focused on
using advanced computer vision techniques, such as stereo/multiview
vision and photometric stereo, and depth sensing devices to extract more
geometry information from the scene so as to enhance the functionalities
of IBR representations. While there has been considerable progress in
relighting and interactive rendering of individual real static objects, such
operations are still difficult for real and complicated scenes. For
dynamic scenes, the huge amount of data and vast amount of viewpoints
to be provided present one of the major challenges to IBR. Advanced
algorithms for processing and manipulation of the high dimensional
representation to achieve such functions as object extraction, model
completion, scene inpainting, etc. are all major challenges to be
addressed. Finally, the efficient transmission, compression and display
of dynamic IBR and models are also urgent issues awaiting satisfactory solutions in order for IBR to establish itself as an essential medium for communication and presentation. All of these motivate us to
study the design and construction of new image-based rendering systems
based on plenoptic videos. The system can potentially provide improved
viewing freedom to users and ability to cope with moving and static
objects and perform 3D reconstruction.
2.5 Summary
In this chapter, the basic topics in image-based rendering have been reviewed. The plenoptic function, which serves as an important concept for describing visual information in our world, was introduced. Then a brief review of the light field was given. How to achieve high quality rendering and display of light fields with a wide range of viewing positions in large-scale environments will be studied in Chapters 3 and 4. Finally, some rendering techniques, including point-based, layer-based, and monolithic methods, were discussed. An extension of these rendering techniques will be further studied in Chapter 5.
Figure 2-3: Example renderings using (a) forward mapping in point rendering [Chan
2005], (b) layered representation (with two layers – dancer and background) [Chan
2009], (c) monolithic rendering using 3D polygonal mesh (left) and rendering results
(right) [Zhu 2010].
Chapter 3 The Proposed Image-based Rendering
Systems
3.1 Introduction
Both IBR systems are based on the simplified light field. As
mentioned earlier, two IBR systems are constructed and studied in this
thesis, one for capturing and rendering ancient Chinese artifacts and the
other for environmental modeling. They belong to the general class of
image-based representations. Since capturing 3D models in real time is still a very difficult problem, light field- or lumigraph-based dynamic IBR representations with a small amount of geometry information have received considerable attention in immersive TV (also called 3D or multi-view TV) applications. Because of the multidimensional nature of
the plenoptic function and the scene geometry, much research has been
devoted to the efficient capturing, sampling, rendering and compression
of IBR. There has been considerable progress in these areas since the pioneering work on the lumigraph by Gortler et al. [Gort 1996] and the light field by Levoy and Hanrahan [Levo 1996]. Other IBR representations include the 2D panorama [Szel 1997], [Pele 1997], Chen and Williams' view interpolation [Chen 1993], McMillan and Bishop's plenoptic modeling [McMi 1995], layered depth images [Shad 1998] and the 3D concentric mosaics [Shum 1999], etc. Motivated by the light field and lumigraph, the
predecessors in the author's lab have developed a real-time system for
capturing and rendering a simplified dynamic light field called the
“plenoptic videos” [Chan 2003], [Chan 2004], [Chan 2005], [Chan
2009], [Gan2005] with four dimensions. It is a simplified dynamic light
field, where videos are taken along line segments as shown in Fig. 3-1,
instead of a 2D plane, to simplify the capturing hardware for dynamic
scenes.
Figure 3-1: Plenoptic videos: Multiple linear camera array of 4D simplified dynamic
light field with viewpoints constrained along line segments. The camera arrays
developed at [Chan 2009]. Each consists of 6 JVC video cameras.
Pioneering projects in the cultural heritage preservation of large-scale structures and sculptures include the Digital Michelangelo Project [Levo 2002], the 3D facial reconstruction and visualization of ancient Egyptian mummies [Atta 1999] and the Great Buddha Project [Ikeu 2003], to name just a few. To avoid possible damage to the ancient artifacts and speed up the capturing process, we propose to employ the image-based
approach instead of using 3D laser scanners. A circular array consisting
of multiple digital still cameras (DSCs) was therefore constructed in this
thesis to capture the simplified light field of the ancient artifacts along
circular arcs, which we shall call the simplified circular light field
(SCLF) or circular light field (CLF) in short. The circular array is chosen
to provide users with a better visual experience, because it supports fly-over effects and close-ups of the artifacts uniformly in the angular domain.
We also developed novel techniques for rendering new views of the
ancient artifacts from the images captured using the object-based
approach. The details will be discussed later in Chapters 4 and 5. A
number of ancient Chinese artifacts from the University Museum and
Art Gallery at The University of Hong Kong were captured and
excellent rendering results in ordinary as well as 3D/multiview displays
were achieved. The proposed IBR system and associated algorithms serve as a framework for the cultural preservation of medium-sized ancient artifacts.
While a considerable number of IBR systems have been proposed previously, few of them are moveable. Therefore, another objective of this thesis is to design a moveable IBR system for modeling objects in outdoor environments. The proposed moveable IBR system uses a linear camera array consisting of 8 video cameras mounted on an electrically controllable wheel chair. Its motion can be controlled manually or remotely by means of additional hardware circuitry. Unlike previous multiple-camera systems, which are not designed to be moveable, so that their viewpoints are somewhat limited and they usually cannot cope with moving objects or perform 3D reconstruction of objects in an open environment, our moveable image-based rendering system can be used to render large environments and moving objects. In particular, the
system supports object-based rendering and 3D reconstruction capability
and consists of two main components. 1) A novel view synthesis
algorithm using a new segmentation and mutual-information (MI)-based
algorithm for dense depth map estimation, which relies on segmentation,
LPR-based depth map smoothing and an MI-based matching algorithm to
iteratively estimate the depth map. The method is very flexible and both
semi-automatic and automatic segmentation methods can be employed.
They rank fourth and sixth, respectively, in the Middlebury comparison
of existing depth estimation methods. This allows high quality
renderings of outdoor scenes with improved mobility/freedom to be
obtained. 2) A new 3D reconstruction algorithm which utilizes
sequential-structure-from-motion (S-SFM) technique and the dense
depth maps estimated previously. It relies on a new iterative point cloud
refinement algorithm based on Kalman filter (KF) for outlier removal
and the segmentation-MI-based algorithm to further refine the
correspondences and the projection matrices. The mobility of our system allows us to recover the 3D models of static objects more conveniently from the improved point cloud using a new robust radial basis function
(RBF)-based modeling algorithm to further suppress possible outliers
and generate smooth 3D meshes of objects. Experimental results show
that the proposed 3D reconstruction algorithm significantly reduces the
adverse effect of the outliers and produces high quality renderings using
shadow light field and the model reconstructed. The details will be
discussed later in Chapters 4 and 5.
The rest of this chapter is devoted to the general design and
construction of the systems. More precisely, Section 3.2 is devoted to the
design and configuration of the IBR systems. Section 3.3 presents some
pre-processing including camera calibration, color-tensor-based
segmentation and matting. Finally, conclusions are drawn in Section 3.4.
3.2 Construction of the Proposed IBR systems
3.2.1 Still Camera System
As mentioned previously in Section 3.1, the first prototype system
consists of an array of 13 Canon 550D cameras mounted on a camera
stand. The images/videos are captured and then processed and viewed on multiview TVs. A circular array is chosen to provide users with a better visual experience, because it emulates "fly over" and "rotate" kinds of special effects. Fig. 3-2 shows the proposed capturing
system. Fig. 3-3 shows some snapshots captured by this system called
Buddha and Dragon Vase. The resolution of these images is 3465×2304.
The operation flow is illustrated in Fig. 3-4. First, the objects are captured by this system from different angles. Then the objects are segmented using the color tensor, which is insensitive to shadows and shading. Natural matting can be adopted to improve the rendering quality when objects are composited onto other backgrounds. From the segmented objects, approximate geometry information for each object can be estimated by point-based matching for rendering and 3D reconstruction. Finally, other rendering techniques such as shadow field relighting and view dependent texture mapping are added in the rendering. The details of these algorithms will be discussed in the rest of the current chapter and the next chapter.
Figure 3-2: Circular camera array constructed.
Figure 3-3: Snapshots: (a) Buddha, (b) Dragon Vase.
Figure 3-4: Block diagram of the proposed IBR system.
3.2.2 Moveable Camera System
The second moveable IBR (M-IBR) system consists of a linear
array of cameras mounted on an electrically controllable wheel chair so
as to cope with moving objects in large environments and hence improve
the viewing freedom of users. Fig. 3-5 shows the moveable IBR system
that we have constructed. It consists of a linear array of 8 Sony HDR-
TGIE high definition (HD) video cameras which is mounted on a
FS122LGC wheel chair.
The motion of the wheel chair is originally controlled manually through a VR2 joystick and power controller modules from PG Drives Technology [PGDT]. To make it electronically controllable, we
examined the output of the joystick and generated the (x-,y-) motion
control voltages to the power controller using a Devasys USB-I2C/IO
[USBI] micro-controller unit (MCU). By appropriately controlling these
voltages, we can control the motion of the wheel chair electronically.
Figure 3-5: The proposed moveable image-based rendering system.
Moreover, by using the wireless LAN of a portable notebook mounted
on the wheel chair, its motion can be controlled remotely. By improving
the mobility of the IBR capturing system, we are able to cope with
moving objects in large environments.
The HD videos are captured in real time onto the storage cards of the camcorders. They can be downloaded to a PC for further processing such as calibration, depth estimation, and rendering using the object-based approach. For real-time transmission, the camcorders are equipped with a composite video output, which can be further compressed and transmitted. To illustrate the concept of multiview conferencing, a ThinkSmart IVS-MV02 intelligent video surveillance system [IVS] was used to compress the (320×240) 30 frames/sec videos online, which can be retrieved remotely through the wireless LAN for viewing or further
processing. The system is built around an Analog Devices DSP and performs real-time compression at a bit rate of 400 kbps.
Before the cameras can be used for depth estimation, they must be calibrated to determine their intrinsic parameters as well as their extrinsic parameters, i.e. their relative positions and poses. This can be accomplished by using a sufficiently large checkerboard calibration pattern. We follow the plane-based calibration method [Zhan 2000] to determine the projective matrix of each camera, which connects the world coordinates and the image coordinates. The projection matrix of a camera allows a 3D point in the world coordinate system to be mapped to the corresponding 2D coordinate in the image captured by that camera. This will facilitate depth estimation. Fig. 3-6 shows snapshots of the outdoor and indoor videos captured by the proposed system, called "podium" and "presentation", respectively. The resolution of these real-scene videos is 1920×1080i at 25 frames per second (fps) in 24-bit RGB format. The system flow of the proposed moveable IBR system is summarized in Fig. 3-7. Firstly, we need to stabilize the videos to reduce the shaky motion frequently encountered in typical moveable IBR systems. Then, a novel view synthesis algorithm using a new segmentation and mutual-information (MI)-based algorithm for dense depth map estimation is used to iteratively estimate the depth map. Finally, we reconstruct the 3D model using a new 3D reconstruction algorithm which utilizes the sequential-structure-from-motion (S-SFM) technique and the dense depth maps estimated previously. A new robust radial basis function (RBF)-based modeling algorithm is used to further suppress possible outliers and generate smooth 3D meshes of objects.
(a)
(b)
Figure 3-6: Snapshots of the plenoptic videos at a given time instance: (a) the "Podium" outdoor video from camera 1 to camera 4; (b) the "Presentation" indoor video from camera 1 to camera 4.
[Figure 3-7 blocks: video stabilization → segmentation-MI-based depth estimation → depth map refinement → image-based rendering and 3D reconstruction.]
Figure 3-7: Block diagram of the proposed M-IBR system constructed.
3.3 Pre-Processing
3.3.1 Still Camera System
In order to speed up the whole processing procedure of the proposed still camera system, some pre-processing needs to be done at the start. First, all the cameras need to be calibrated. Because this system is static, the intrinsic and extrinsic parameters of the cameras can be obtained precisely by following the plane-based calibration method. The proposed still camera system focuses only on the objects of interest, which are segmented out of the images to reduce the noise from the background.
3.3.1.1 Camera Calibration
In computer vision, the link between 3D real-world points and image pixels is the camera parameters, which comprise the extrinsic parameters and the intrinsic parameters. Estimation of the extrinsic and intrinsic parameters is called camera calibration [Truc
1998]. The extrinsic parameters define the transformation between the camera reference frame and the world reference frame. A 3D translation vector T and a 3 × 3 rotation matrix R are used to represent the extrinsic parameters [Truc 1998]. The relationship (see Fig. 3-8) between a point in the world frame and the corresponding point in the camera frame is

$$\mathbf{P}_c = \mathbf{R}(\mathbf{P}_W - \mathbf{T}). \qquad (3\text{-}3\text{-}1)$$

Figure 3-8: Relationship between the world coordinate and the camera coordinate.

The intrinsic parameters are defined in the form of the camera matrix C:

$$\mathbf{C} = \begin{bmatrix} f_x & d & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}, \qquad (3\text{-}3\text{-}2)$$

where $f_x$ and $f_y$ represent the focal length of the camera in the x and y directions, $c_x$ and $c_y$ are the coordinate values of the principal point, and d is the skew parameter, which is zero for pinhole cameras. Since the camera type is uncertain, the skew parameter is retained in the model. By combining the extrinsic parameters and intrinsic parameters, the perspective projection equation becomes

$$\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} = \mathbf{C}\mathbf{R}\,[\,\mathbf{I} \mid -\mathbf{T}\,]\begin{bmatrix} X_W \\ Y_W \\ Z_W \\ 1 \end{bmatrix}, \qquad (3\text{-}3\text{-}3)$$

where $(x_c, y_c, z_c)$ is the point in the image coordinate system and $(X_W, Y_W, Z_W, 1)$ is the point in the world coordinate system; both coordinate systems are expressed in homogeneous coordinates. By defining the projective matrix P as

$$\mathbf{P} = \mathbf{C}\mathbf{R}\,[\,\mathbf{I} \mid -\mathbf{T}\,], \qquad (3\text{-}3\text{-}4)$$

where P is a 3 × 4 matrix, equation (3-3-3) can be rewritten as

$$\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} = \mathbf{P}\begin{bmatrix} X_W \\ Y_W \\ Z_W \\ 1 \end{bmatrix}. \qquad (3\text{-}3\text{-}5)$$

Camera calibration is to estimate the matrix P. Zhang has proposed an algorithm for camera calibration using planar patterns [Zhan 1999]. The planar pattern is usually chosen as a chessboard-like plane, as shown in Fig. 3-9, which is used in our system.
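To make the use of P concrete, the following minimal numpy sketch builds P via (3-3-4) and projects a world point to pixel coordinates via (3-3-5). The values of C, R, and T are purely illustrative, not the parameters of our actual cameras:

    import numpy as np

    # Illustrative intrinsics (3-3-2): focal lengths, principal point, zero skew.
    C = np.array([[1200.0,    0.0, 640.0],
                  [   0.0, 1200.0, 480.0],
                  [   0.0,    0.0,   1.0]])
    R = np.eye(3)                   # camera axes aligned with the world frame
    T = np.array([0.0, 0.0, -2.0])  # camera centre 2 units behind the origin

    # Projective matrix P = C R [I | -T]   (equation (3-3-4))
    P = C @ R @ np.hstack([np.eye(3), -T.reshape(3, 1)])

    X_W = np.array([0.1, 0.2, 1.0, 1.0])     # homogeneous world point
    x_c = P @ X_W                            # equation (3-3-5)
    u, v = x_c[0] / x_c[2], x_c[1] / x_c[2]  # pixel coordinates after dehomogenization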
Figure 3-9: Planar pattern.
The basic procedure of Zhang's algorithm is:
1. Take a few images of the test pattern in different orientations.
2. Detect the feature points in the test images (often the corners).
3. Estimate the five intrinsic parameters (no skew parameter) and the extrinsic parameters by using the closed-form solutions.
4. Estimate the radial distortion by solving a linear least-squares problem.
5. Refine all the parameters by minimizing error functions.
In this work, the plane-based algorithm is changed slightly to fit our situation. The skew parameter is added, and the distortion is not estimated at first; both the radial and the tangential distortion are then estimated.
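As an illustration of this plane-based procedure, the sketch below uses OpenCV's implementation of Zhang's method. The board size and image file pattern are placeholder assumptions, and the exact skew/distortion handling described above is not reproduced:

    import glob
    import cv2
    import numpy as np

    board = (9, 6)  # inner-corner count of the chessboard (placeholder)
    objp = np.zeros((board[0] * board[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:board[0], 0:board[1]].T.reshape(-1, 2)  # planar, Z = 0

    obj_pts, img_pts = [], []
    for fname in glob.glob('calib_*.png'):          # a few different orientations
        gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
        found, corners = cv2.findChessboardCorners(gray, board)
        if found:
            obj_pts.append(objp)
            img_pts.append(corners)

    # Closed-form initialization followed by nonlinear refinement; the default
    # distortion model includes both radial and tangential terms.
    rms, C, dist, rvecs, tvecs = cv2.calibrateCamera(
        obj_pts, img_pts, gray.shape[::-1], None, None)
    print('reprojection RMS error:', rms)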
3.3.1.2 Color-Tensor-based Segmentation and Matting
The first step in processing these images is to segment the objects from the background. In the still camera system, we employ the photometric invariant features [Weij 2006] to extract the foreground from the
monochromatic screen background. More precisely, the color tensor
describes the local orientation of a color vector f ( x, y) as:
$$\mathbf{T}(x, y) = \begin{bmatrix} \mathbf{f}_x^T \mathbf{f}_x & \mathbf{f}_x^T \mathbf{f}_y \\ \mathbf{f}_y^T \mathbf{f}_x & \mathbf{f}_y^T \mathbf{f}_y \end{bmatrix}, \qquad (3\text{-}3\text{-}6)$$
where $\mathbf{f}(x, y)$ is a vector which contains the color component values at position $(x, y)$, and the subscripts x and y in $\mathbf{f}_x(x, y)$ and $\mathbf{f}_y(x, y)$ denote respectively the derivatives of $\mathbf{f}(x, y)$ with respect to x and y, the image coordinates. According to [Weij 2006], the color vector can be seen as a weighted sum of two component vectors, $\mathbf{c} = [R, G, B]^T = e(m_b \mathbf{c}_b + m_i \mathbf{c}_i)$, where $\mathbf{c}_b$ is the color vector of the body reflectance, $\mathbf{c}_i$ is the color vector of the interface reflectance (i.e. specularities or highlights), $m_b$ and $m_i$ are scalars representing the corresponding magnitudes of reflection, and e is the intensity of the light source. Thus,

$$\mathbf{f}_x = [R_x, G_x, B_x]^T = e m_b \mathbf{c}_{b_x} + (e_x m_b + e m_{b_x})\,\mathbf{c}_b + (e_x m_i + e m_{i_x})\,\mathbf{c}_i, \qquad (3\text{-}3\text{-}7)$$

which suggests that the spatial derivative is a sum of three weighted vectors, successively caused by body reflectance, shading-shadow and specular changes. For matte surfaces, the intensity of the interface reflectance is zero (i.e. $m_i = 0$) and the projection of the spatial derivative $\mathbf{f}_x$ on the shadow-shading axis is the shadow-shading variant, containing all the energy which can be explained by changes due to shadow and shading. The shadow-shading axis direction is $\mathbf{c}_b$, which is parallel to $\mathbf{f} = e m_b \mathbf{c}_b$ for matte surfaces. So the projection $\mathbf{s}_1$ of the spatial derivative $\mathbf{f}_x$ on the shadow-shading axis is
$$\mathbf{s}_1 = \big(\mathbf{f}_x^T\, \mathbf{f}/\|\mathbf{f}\|\big)\, \mathbf{f}/\|\mathbf{f}\|. \qquad (3\text{-}3\text{-}8)$$

Subtraction of the shadow-shading variant $\mathbf{s}_1$ from the total derivative $\mathbf{f}_x$ results in the shadow-shading quasi-invariant $\mathbf{s}_2 = \mathbf{f}_x - \mathbf{s}_1$. In summary, the derivative of the color tensor can be separated into a shadow-shading variant part $\mathbf{s}_1$ and a shadow-shading invariant part $\mathbf{s}_2$.
The shadow-shading invariant part does not contain the derivative energy caused by shadows and shading. To construct a shadow-shading-specular quasi-invariant, this part is combined with the hue direction, which is perpendicular to the light source direction $\mathbf{c}_i$ and the shadow and shading direction $\mathbf{c}_b$. Therefore the hue direction is

$$\mathbf{h} = (\mathbf{c}_b \times \mathbf{c}_i)/\|\mathbf{c}_b \times \mathbf{c}_i\|. \qquad (3\text{-}3\text{-}9)$$

The projection of the derivative on the hue direction is the desired shadow-shading-specular quasi-invariant part:

$$\mathbf{H} = \big(\mathbf{f}_x^T\, \mathbf{h}/\|\mathbf{h}\|\big)\, \mathbf{h}/\|\mathbf{h}\|. \qquad (3\text{-}3\text{-}10)$$
By replacing $\mathbf{f}_x$ in the color tensor equation (3-3-6) by $\mathbf{s}_2$ or $\mathbf{H}$, we obtain the shadow-shading quasi-invariant color tensor and the shadow-shading-specular quasi-invariant color tensor, respectively. By setting a suitable threshold value on the color tensor, we can detect the boundary of the object. Fig. 3-10 shows some segmentation results that were obtained using the color tensor method, followed by Bayesian matting for extracting the foreground from the background. After segmentation, the hard boundary of the object is obtained. Matting can then be applied to obtain soft segmentation information, called the matte, of the object. The matte, which is an image containing the portion of foreground with respect to the background (from 0 to 1) at a particular
location, greatly improves the visual quality of mixing the objects onto
other backgrounds.
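A minimal numpy sketch of this shadow-shading quasi-invariant computation is given below; it forms $\mathbf{s}_2 = \mathbf{f}_x - \mathbf{s}_1$ per pixel and thresholds its derivative energy to locate object boundaries. The input array and the threshold are assumptions, and the Bayesian matting step is not included:

    import numpy as np

    def shadow_shading_invariant_energy(img):
        # img: H x W x 3 float RGB image f(x, y).
        f = img.astype(np.float64)
        fx = np.gradient(f, axis=1)     # derivative of f with respect to x
        fy = np.gradient(f, axis=0)     # derivative of f with respect to y

        f_hat = f / (np.linalg.norm(f, axis=2, keepdims=True) + 1e-8)

        def s2(d):
            # Remove the projection on the shadow-shading axis (3-3-8).
            s1 = (d * f_hat).sum(axis=2, keepdims=True) * f_hat
            return d - s1

        s2x, s2y = s2(fx), s2(fy)
        # Derivative energy of the shadow-shading invariant tensor components.
        return (s2x ** 2).sum(axis=2) + (s2y ** 2).sum(axis=2)

    # boundary = shadow_shading_invariant_energy(img) > threshold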
(a)
(b)
Figure 3-10: (a) Extraction results using the color-tensor-based method. Left: original; middle: hard segmentation; right: after matting. (b) Close-up of the segmentations in (a). Left: hard segmentation; right: after matting.
3.3.2 Moveable Camera System
Unlike the static camera system described before, the moveable
camera system will experience shaky motion during movement and
hence video stabilization has to be performed.
3.3.2.1 Video Stabilization
To ensure good tracking of objects and to obtain more image samples for high-quality rendering, the wheel chair is usually driven steadily during capturing. However, one problem with the M-IBR system is that the ground surfaces may not be smooth and the whole mechanical structure can vibrate considerably during movement. In our M-IBR system, the shaky motion of the camera array in the outdoor environment comes mainly from the roughness of the ground surfaces and the vibration of the mechanical structure during the movement. Besides, the captured video may also appear shaky when the system is moving and about to settle down in an indoor environment. To reduce these annoying effects, video stabilization [Hu 2007], [Mats 2005], [Mats 2006], [Rata 1998] is frequently employed to eliminate the undesired motion fluctuation in the captured videos.
As mentioned above, our M-IBR system was driven steadily during
capturing. Therefore, the undesired motion fluctuation will usually
appear as high frequency components compared to the intentional
motion. As a result, the problem of video stabilization can also be
viewed as the removal of high frequency components in the estimated
velocity. To this end, one needs to estimate the global motion of the camera, say by means of optical flow on the video sequence, so that this annoying high-frequency local motion can be removed to stabilize the videos.
The proposed video stabilization algorithm is divided into three major steps as follows (a sketch of step 1 is given after this list). 1) Global motion estimation: firstly, the geometric transformation between a location $\mathbf{x} = [x_1, x_2]^T$ in a frame and the corresponding location $\mathbf{x}'$ in an adjacent frame is modeled by an affine transformation $\mathbf{x}' = T[\mathbf{x}] = \mathbf{A}\mathbf{x} + \mathbf{t}$, where $\mathbf{t} = [t_{x_1}, t_{x_2}]^T$ is the translational component and the affine rotation, scaling, and stretch are represented by the matrix $\mathbf{A} = \begin{bmatrix} a_1 & a_2 \\ a_3 & a_4 \end{bmatrix}$. In homogeneous coordinates, $\mathbf{x}_h = [x_1, x_2, 1]^T$, T can be conveniently represented by a matrix multiplication $\mathbf{x}'_h = \mathbf{T}_h \mathbf{x}_h$, where $\mathbf{T}_h = \begin{bmatrix} \mathbf{A} & \mathbf{t} \\ \mathbf{0} & 1 \end{bmatrix}$. $\mathbf{T}_h$ is estimated from the tracked features in adjacent video frames using the scale-invariant feature transform (SIFT) [Lowe 2004], instead of the Lucas-Kanade tracker in [Chan 2010]. 2) Local smoothing of motion: the intentional motion, which is assumed to be slow and smooth, is then obtained by smoothing the estimated global motion using local polynomial regression (LPR) with adaptive bandwidth selection [Zhan 2009]. Unlike conventional methods, the bandwidth or window size for smoothing can be automatically determined; this will be further discussed below. 3) Video completion: the uncovered areas are filled using motion inpainting [Mats 2005], [Mats 2006].
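A minimal sketch of step 1, assuming OpenCV's SIFT implementation and a robust affine fit between consecutive grayscale frames; the ratio-test and RANSAC settings are illustrative choices, not those of [Chan 2010]:

    import cv2
    import numpy as np

    def estimate_global_motion(prev_gray, curr_gray):
        # Returns the 2x3 affine matrix [A | t] mapping prev_gray to curr_gray.
        sift = cv2.SIFT_create()
        kp1, des1 = sift.detectAndCompute(prev_gray, None)
        kp2, des2 = sift.detectAndCompute(curr_gray, None)

        matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
        good = [m for m, n in matches if m.distance < 0.75 * n.distance]

        src = np.float32([kp1[m.queryIdx].pt for m in good])
        dst = np.float32([kp2[m.trainIdx].pt for m in good])

        # Robust affine estimation; RANSAC rejects mismatched features.
        M, _ = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC,
                                    ransacReprojThreshold=3.0)
        return M    # A = M[:, :2], t = M[:, 2]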
We now describe each step in more detail. Let $\{I_t(\mathbf{x}) \mid t = 0, \ldots, N\}$, where $\mathbf{x} = [x_1, x_2]^T$, $1 \le x_1 \le \Gamma_1$, $1 \le x_2 \le \Gamma_2$, be a video sequence consisting of N video frames with resolution $\Gamma_1 \times \Gamma_2$ captured by our M-IBR system. Consider the global motion transformations up to time instant t, $\{\mathbf{T}_0^1, \ldots, \mathbf{T}_{t-1}^t\}$, where $\mathbf{T}_i^{i+1}$ is the coordinate transformation from the i-th to the (i+1)-th frame. If each $\mathbf{T}_i^{i+1}$ is smoothed separately, a smoothed transformation chain $\{\hat{\mathbf{T}}_0^1, \ldots, \hat{\mathbf{T}}_{t-1}^t\}$ is obtained and the t-th compensated image frame $I'_t$ can be obtained as

$$I'_t(\mathbf{x}) = I_t\Big(\Big(\prod_{i=0}^{t-1} \mathbf{T}_{i+1}^{i}\, \hat{\mathbf{T}}_i^{i+1}\Big)[\mathbf{x}]\Big), \qquad (3\text{-}3\text{-}11)$$

where $\mathbf{T}_{i+1}^{i}$ and $\hat{\mathbf{T}}_i^{i+1}$ denote respectively the transformation from frame i+1 to i and the smoothed transformation from frame i to i+1. In order to avoid error accumulation due to the cascade of original and
smoothed transformation chains, [Mats 2005] proposed to compute directly the transformation $\tilde{\mathbf{T}}_t$ from the current frame $I_t(\mathbf{x})$ to the corresponding motion-compensated frame $I'_t(\mathbf{x})$ using only the neighboring transformation matrices as

$$\tilde{\mathbf{T}}_t = \sum_{i \in \mathcal{N}_t} \mathbf{T}_i^t \circledast G(i - t),$$

where $\mathcal{N}_t = \{i : t - f \le i \le t + f\}$ denotes the indices of the neighboring frames, $G(x) = (2\pi\sigma^2)^{-1/2} e^{-x^2/(2\sigma^2)}$ is a Gaussian kernel, $2\sigma$ is the support of $\mathcal{N}_t$ or window size, and $\circledast$ denotes the element-wise convolution operation.
It can be seen that the selection of the kernel size affects the degree of smoothing. A large kernel size will lead to the problem of over-smoothing, while a small kernel size may not be able to remove the high-frequency undesirable motion. The green and black lines in Fig. 3-11 illustrate the effect of using a small kernel size of $\sigma = 3$ and a large kernel size of $\sigma = 20$, respectively, using the method in [Mats 2005].
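For intuition, a fixed-kernel version of this element-wise Gaussian smoothing can be sketched as follows, under the assumption that the six affine parameters per frame are stacked into an N × 6 array; applying the resulting compensating warps would then use a routine such as cv2.warpAffine:

    import numpy as np
    from scipy.ndimage import gaussian_filter1d

    def smooth_motion_chain(params, sigma):
        # params: N x 6 array, one flattened 2x3 affine [A | t] per frame.
        # Each parameter is convolved with a Gaussian along the time axis,
        # mimicking the element-wise smoothing of the transformation chain.
        return gaussian_filter1d(params, sigma=sigma, axis=0)

    # smooth_motion_chain(params, sigma=3)  -> may keep high-frequency shake
    # smooth_motion_chain(params, sigma=20) -> risks over-smoothing the intent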
To address this issue, we propose a new method for choosing the kernel size adaptively using local polynomial regression (LPR) with adaptive bandwidth selection. The close relationship between curve fitting and video stabilization has been recognized, for example, in [Hu 2007], where a local parabolic fitting is used to compute the smoothed motion path; however, the kernel size there is also fixed. The advantage of our method is that the kernel size can be adaptively selected from the data.
LPR is a very flexible and efficient nonparametric regression
method in statistics, and it has been widely applied in many research
areas such as data smoothing, density estimation, and nonlinear
modeling. Given a set of noisy samples of a signal, the data points are
fitted locally by a polynomial using the least-squares (LS) criterion with
a kernel function having certain bandwidth parameters. Since signals
may vary considerably over time, it is crucial to choose a proper kernel size or local bandwidth to achieve the best bias-variance tradeoff. In this thesis, we use the refined intersection of confidence intervals (R-ICI) method to perform bandwidth selection. Here, we follow the homoscedastic data model of the time series:
$$Y_i = m(X_i) + \sigma(X_i)\,\varepsilon_i, \qquad (3\text{-}3\text{-}12)$$

where $\{(Y_i, X_i) \mid i = 1, 2, \ldots, n\}$ is a set of univariate observations, $m(X_i)$ is a smooth function specifying the conditional mean of $Y_i$ given $X_i$, and $\varepsilon_i$ is an independent identically distributed (i.i.d.) additive white Gaussian noise. The problem is to estimate $m(X_i)$ and its k-th derivative $m^{(k)}(X_i)$ from the noisy samples $Y_i$ so as to achieve
smoothing. Since $m(X_i)$ is a smooth function, we can approximate it locally as a general degree-p polynomial at a given point $x_0$:

$$m(x) \approx m(x_0) + m'(x_0)(x - x_0) + \frac{m''(x_0)}{2!}(x - x_0)^2 + \cdots + \frac{m^{(p)}(x_0)}{p!}(x - x_0)^p = \beta_0 + \beta_1 (x - x_0) + \cdots + \beta_p (x - x_0)^p, \qquad (3\text{-}3\text{-}13)$$
where x is in the neighborhood of $x_0$ and $\beta_k$ $(k = 0, 1, \ldots, p)$ is the k-th polynomial coefficient. The coefficient vector $\boldsymbol{\beta} = [\beta_0, \beta_1, \ldots, \beta_p]^T$ at location $x_0$ can be obtained by solving the following weighted least-squares (WLS) regression problem:
$$\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} K_h(X_i - x_0)\Big[Y_i - \sum_{k=0}^{p} \beta_k (X_i - x_0)^k\Big]^2, \qquad (3\text{-}3\text{-}14)$$

where $K_h(X_i - x_0) = K\big((X_i - x_0)/h\big)/h$ and $K(\cdot)$ is a kernel function with bandwidth parameter h, which emphasizes the influence of the neighboring observations around $x_0$ in the estimation. The parameter h is adaptively chosen at different locations $x_0$ so as to adapt to the local characteristics of the signal (i.e. the intentional motion path). Differentiating the objective function in (3-3-14) with respect to $\boldsymbol{\beta}$ and setting the derivative to zero, we get the following LS solution in matrix form:
$$\hat{\boldsymbol{\beta}}(x_0, h) = (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \mathbf{y}, \qquad (3\text{-}3\text{-}15)$$

where

$$\mathbf{X} = \begin{bmatrix} 1 & (X_1 - x_0) & \cdots & (X_1 - x_0)^p \\ 1 & (X_2 - x_0) & \cdots & (X_2 - x_0)^p \\ \vdots & \vdots & & \vdots \\ 1 & (X_n - x_0) & \cdots & (X_n - x_0)^p \end{bmatrix}, \qquad \mathbf{y} = [Y_1, Y_2, \ldots, Y_n]^T,$$

and $\mathbf{W} = \mathrm{diag}\{K_h(X_i - x_0)\}$ is the weighting matrix.
By estimating $\hat{\boldsymbol{\beta}}(x_0, h)$ with an optimized bandwidth h at different $x_0$, we obtain a smoothed representation of the data from the noisy observations. In the context of video stabilization, a key problem of applying LPR is thus to select an optimal bandwidth parameter h to achieve the best bias-variance tradeoff in estimation. Here, we use the R-ICI bandwidth selection algorithm [Zhan 2008] to select the optimal bandwidth. The basic idea of the R-ICI adaptive bandwidth selection method is to calculate a set of smoothing results with different bandwidths and then to examine a sequence of confidence intervals of these smoothing results to determine and refine the optimal bandwidth.
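For a fixed bandwidth h, the WLS solution (3-3-15) reduces to a few lines of numpy. The sketch below uses the Epanechnikov kernel introduced next and a made-up noisy path as the example signal, and returns the smoothed estimate $\hat{\beta}_0 = \hat{m}(x_0)$; the R-ICI bandwidth search itself is omitted:

    import numpy as np

    def lpr_fit(X, Y, x0, h, p=2):
        # Weighted least-squares solution (3-3-15) at location x0.
        u = (X - x0) / h
        w = np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0) / h  # Epanechnikov K_h
        Xmat = np.vander(X - x0, N=p + 1, increasing=True)  # columns (X - x0)^k
        XtW = Xmat.T * w                                    # X^T W without forming W
        beta = np.linalg.solve(XtW @ Xmat, XtW @ Y)
        return beta[0]                                      # estimate of m(x0)

    # Example: smoothing a noisy 1-D motion path frame by frame.
    t = np.arange(100, dtype=float)
    noisy = np.sin(t / 15.0) + 0.05 * np.random.randn(t.size)
    smoothed = np.array([lpr_fit(t, noisy, x0, h=8.0) for x0 in t])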
In this thesis, the kernel $K(u)$ is chosen as the Epanechnikov kernel $K(u) = \frac{3}{4}(1 - u^2)$ for $|u| \le 1$, and the bandwidth parameter set for the R-ICI method is $\{h_j \mid h_j = a^j/N,\ j = 1, \ldots, 10\}$ with $a = 1.2$, where N is the total number of frames. The details of the algorithm are omitted, and interested readers are