3D Reconstruction from a Single Still Image Based on Monocular Vision of an Uncalibrated Camera

Abstract: We propose a framework that combines machine learning with dynamic optimization to reconstruct a 3D scene automatically from a single still image of an unstructured outdoor environment, based on the monocular vision of an uncalibrated camera. After a first segmentation of the image, a searching-tree strategy based on Bayes' rule is used to identify the occlusion hierarchy of all areas. After a second, superpixel-level segmentation, the AdaBoost algorithm is applied to the integrated detection of depth cues from lighting, texture and material. Finally, all of the factors above are combined in a constrained optimization, yielding the complete depthmap of the image. Integrating the source image with its depthmap, in point-cloud or bilinear-interpolation style, realizes the 3D reconstruction. Experiments comparing our method with typical methods on an associated database demonstrate that it improves, to a certain extent, the reasonability of the estimated overall 3D architecture of the image's scene, and that it needs neither manual assistance nor any camera model information.


Introduction
A 3D scene carries more valuable information than a 2D scene because it includes depth. However, acquiring a 3D scene is usually complex and difficult: it often requires many cameras taking pictures from multiple view angles, together with camera calibration and image matching techniques, or else sophisticated depth-sensing instruments. In reality, we sometimes cannot acquire images from different view angles because of limited conditions, such as vehicles in the distance, ships at sea, or stars in interstellar space, yet we are still interested in understanding the front-and-back hierarchy among the objects. In these situations, we can only hope to utilize the information from monocular vision to recover or approximate the real 3D effects directly.
At present, with the rapid development of machine learning, some investigators use theories such as Bayes [18,19], AdaBoost [20][21][22][23] and Markov random fields (MRF) [24][25][26] to train models that estimate the structural information of scenes. However, pure machine learning often suffers from under-fitting or over-fitting, which leads to incorrect or inaccurate estimates of the whole 3D structure of a scene. To make up for this deficiency, and since optimization theory and methods deal with selecting the best alternative in the sense of a given objective function, we consider introducing an optimization strategy on top of the machine learning to estimate the final 3D structure of the scene.
In this paper, a method combining machine learning with dynamic optimization is used to reconstruct the 3D scene directly from a single still image, based on the monocular vision of an uncalibrated camera. It improves, to a certain extent, the reasonability of the estimated whole 3D structure of the image's scene, and it needs neither manual assistance nor any camera model information.
The remainder of this paper is arranged as follows. Section 2 interprets the principle of 3D reconstruction from a single still image based on the monocular vision of an uncalibrated camera; Section 3 provides the related experiments and analysis of results; Section 4 concludes the paper.

2. The Principle of 3D Reconstruction from a Single Still Image Based on Monocular Vision of an Uncalibrated Camera

Global Principle
Compared with a 3D image, monocular vision produces a 2D image, which only carries information in the X and Y orientations, without Z. That is to say, it lacks depth information. So, in order to realize 3D reconstruction, we need to estimate the depth information from the 2D image.
If we close one eye and look at a picture with the other, we can still feel the back-and-forth relationship of the different parts of an object in the image. For a complex image, following certain rules, we can also often infer an object's front-or-back relationship with the objects around it, and further infer the whole depth architecture of the image. These phenomena suggest that 3D reconstruction of a scene from a monocular image is possible. We believe the perception above may come from past experience, and the rules above may amount to a kind of optimization. So this paper attempts to combine machine learning with dynamic optimization so that a computer can solve this problem automatically.
In a picture, we usually see familiar or unfamiliar objects, each involving shape, material, texture, color, and the effects of illumination. And we usually accept the following inferences:
A. For a familiar object, if only part of its shape can be seen, it is usually sheltered by something.
B. For material, the part near to us usually looks clearer and rougher than the part in the distance.
C. For texture, the near part usually appears sparser, while the part in the distance appears denser.
D. For lighting, the part near the light source is usually brighter than the part far away.
E. Color usually changes across different objects, different parts of one object, or even different light sources.
According to the experience above, we can use samples with different depth levels for the material, texture and brightness of pictures to train the respective learning machines; then we use certain priorities to infer the depth level of an area in the image, and for objects with familiar shapes we use a decision algorithm to infer possible occlusions; finally, over the whole image, we use an optimization algorithm to integrate all of the inferences above. This yields several possible depth architectures for the image, from which we select the most likely one as the final result, according to the computed optimal value or the largest probability. In addition, because we do not use any camera model information, the depths studied here are not real absolute depths, but the relative depths of the different components of an image.

Fig.1 The principle of the 3D reconstruction

Fig.1 gives this paper's main principle of 3D reconstruction directly from a monocular 2D image: the source image is segmented at a larger and then a smaller scale, for occlusion identification and for integrated depth detection on material, texture and lighting; then the global depth architecture is estimated by dynamic optimization based on these two stages' results; at last, combining the global depth architecture with the source image, the 3D reconstruction is realized via bilinear interpolation or point-cloud rendering. Next, we describe each part of the principle in detail.
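The overall flow of Fig.1 can be sketched as a minimal pipeline. Everything below (the function names, the band-based stand-in for the segmentation of Ref. [27], the trivial fusion step) is our own illustrative assumption, not the paper's implementation:

```python
import numpy as np

def segment(image, scale):
    """Toy stand-in for the two-scale segmentation of Ref. [27]:
    splits the image into equal horizontal bands, one label per band."""
    h = image.shape[0]
    bands = max(1, h // scale)                   # coarser scale -> fewer regions
    labels = np.minimum(np.arange(h) // scale, bands - 1)
    return np.repeat(labels[:, None], image.shape[1], axis=1)

def reconstruct_depth(image, coarse_scale=4, fine_scale=2):
    """Skeleton of the overall flow: a coarse segmentation feeding occlusion
    identification, a fine segmentation feeding lighting/texture/material
    cue detection, then a (here trivial) fusion step standing in for the
    dynamic optimization."""
    coarse = segment(image, coarse_scale)        # input to occlusion identification
    fine = segment(image, fine_scale)            # input to superpixel cue detection
    # Placeholder fusion: average the normalized region indices of both scales.
    depth = 0.5 * (coarse / max(coarse.max(), 1) + fine / max(fine.max(), 1))
    return depth

img = np.random.rand(8, 8)
depth = reconstruct_depth(img)
print(depth.shape, float(depth.min()), float(depth.max()))
```

The returned map is a relative depth in [0, 1], matching the paper's point that no absolute depths are recovered without camera information.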

Implementation
To estimate the depth structure reasonably, the key is to represent the image's important information as correctly as possible. Image representation is usually hierarchical, so it is carried out on the basis of image segmentation at different scales [27,28], in which the image's different components represent different meanings.

Occlusion Identification
First, we segment the image at the larger scale with the method of Ref. [27] to acquire familiar objects and identify the associated occlusions. Here, the occlusion phenomenon is defined as follows: when object A can only be seen in partial shape, and the area that should contain the rest of A's shape is occupied by object B, then we say that object A is occluded by object B, or that B is in front of A. The division of the occluded areas follows [29]. The related principle is Eq. (2-1), in which M, N are two neighboring objects in an image, X represents the associated traits, Y is the associated prior knowledge for familiar objects, and Q is the number of objects in the image. According to Bayes' rule, the more remarkable the trait information P(X|(M,N)) and the prior information P(Y|(M,N,X)) are, the bigger the posterior probability P((M,N)|(X,Y)) of occlusion will be:

P((M,N)|(X,Y)) = P(X|(M,N)) P(Y|(M,N,X)) P(M,N) / Σ_{M',N'=1..Q} P(X|(M',N')) P(Y|(M',N',X)) P(M',N')    (2-1)

To find all the familiar objects' occlusion relationships in an image, we adopt a kind of searching-tree algorithm; the searching process is shown in Fig.2.

After occlusion identification is over, the image is segmented again at the smaller scale with the method of Ref. [27] to acquire superpixels, each of which represents a coherent region with similar properties such as lighting, texture, material or color. Next, we process the superpixel image from the aspects of lighting, texture and material to extract more depth information. Before further depth identification, we predefine multiple depth classes according to how lighting, texture and material change with depth. We use the logistic-regression version of the AdaBoost algorithm of Ref. [30] to train on each class of depth samples and acquire the corresponding detector: for each depth class, a detector based on weighted textural or material traits is trained, and the corresponding lighting model is also trained.
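As a toy illustration of scoring the occlusion hypotheses of Eq. (2-1), the sketch below enumerates ordered pairs of neighboring objects and picks the most probable one. The probability values and object names are invented, and the flat enumeration only stands in for the searching tree of Fig.2:

```python
import itertools

def occlusion_posterior(pairs, p_x, p_y, p_mn):
    """Normalized posterior P((M,N)|(X,Y)) for each ordered pair,
    where (M, N) reads 'M is occluded by N' (Bayes' rule, Eq. (2-1))."""
    unnorm = {mn: p_x[mn] * p_y[mn] * p_mn[mn] for mn in pairs}
    z = sum(unnorm.values())                     # normalizing constant
    return {mn: v / z for mn, v in unnorm.items()}

objects = ["car", "tree"]
pairs = list(itertools.permutations(objects, 2))
p_x  = {("car", "tree"): 0.7, ("tree", "car"): 0.2}   # trait evidence P(X|(M,N))
p_y  = {("car", "tree"): 0.8, ("tree", "car"): 0.3}   # prior knowledge P(Y|(M,N,X))
p_mn = {("car", "tree"): 0.5, ("tree", "car"): 0.5}   # pair prior P(M,N)

post = occlusion_posterior(pairs, p_x, p_y, p_mn)
best = max(post, key=post.get)
print(best)   # -> ('car', 'tree'): the car is judged occluded by the tree
```

In the paper's setting the same scoring would be applied at each node of the searching tree, over all neighboring familiar objects.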
In identification, following a certain priority order, we integrate the detectors to acquire the combined depth estimate from material, texture and lighting. Fig.4 displays the associated detection principle, in which AdaBoost's property of building strong classifiers from weak ones is applied both to single depth-class detection and to the integrated depth detection.
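The weak-to-strong combination that Fig.4 relies on can be illustrated with a generic, self-contained AdaBoost over 1-D decision stumps. This is a textbook sketch on toy data, not the exact logistic-regression variant of Ref. [30]:

```python
import numpy as np

def fit_stump(x, y, w):
    """Best threshold/polarity stump under sample weights w; labels y in {-1,+1}."""
    best = (None, None, np.inf)                  # (threshold, polarity, error)
    for t in np.unique(x):
        for pol in (1, -1):
            pred = pol * np.where(x >= t, 1, -1)
            err = w[pred != y].sum()
            if err < best[2]:
                best = (t, pol, err)
    return best

def adaboost(x, y, rounds=5):
    w = np.full(len(x), 1.0 / len(x))
    ensemble = []
    for _ in range(rounds):
        t, pol, err = fit_stump(x, y, w)
        err = max(err, 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)    # weight of this weak learner
        pred = pol * np.where(x >= t, 1, -1)
        w *= np.exp(-alpha * y * pred)           # up-weight misclassified samples
        w /= w.sum()
        ensemble.append((t, pol, alpha))
    return ensemble

def predict(ensemble, x):
    score = sum(a * p * np.where(x >= t, 1, -1) for t, p, a in ensemble)
    return np.sign(score)

# Toy "near vs far" labels separable by one feature (e.g. a texture energy).
x = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
y = np.array([-1, -1, -1, 1, 1, 1])
model = adaboost(x, y)
print(predict(model, x))   # -> [-1. -1. -1.  1.  1.  1.]
```

In the paper, one such detector would be trained per depth class, and their outputs integrated by priority as in Fig.4.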
On the selection of the associated traits: the textural energies computed from Laws' masks in Refs. [31,32] act as textural features, embodying the denseness of texture at different depths; the Haar-like traits of Ref. [33] act as material features, describing the smoothness and ambiguity of material as depth changes; the lighting model on depth follows [34,35], in which the distance from the light source to the viewer is taken as the relative depth, and the image's other components are then assigned to the corresponding lighting depth classes and model parameters in turn.
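As a hedged sketch of the textural feature, Laws' 2-D masks are outer products of standard 1-D kernels, and a simple texture energy is the mean absolute filter response. The kernels below are the standard ones; the toy images and the plain mean-absolute energy are our own simplifications:

```python
import numpy as np

L5 = np.array([1, 4, 6, 4, 1])        # level (local average)
E5 = np.array([-1, -2, 0, 2, 1])      # edge
S5 = np.array([-1, 0, 2, 0, -1])      # spot

def conv2_valid(img, kernel):
    """Plain 'valid'-mode 2-D convolution (no padding)."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def laws_energy(img, k1, k2):
    mask = np.outer(k1, k2)            # e.g. E5E5 responds to fine texture
    return np.abs(conv2_valid(img, mask)).mean()

rng = np.random.default_rng(0)
smooth = np.ones((12, 12))             # distant, blurred-looking region
textured = rng.random((12, 12))        # near, rough-looking region
print(laws_energy(textured, E5, E5) > laws_energy(smooth, E5, E5))  # -> True
```

Because the E5 kernel sums to zero, a constant region yields zero energy, while a rough region yields a large one, which is exactly the near/far cue described above.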

Global Depth Architecture
To acquire the final global depth architecture of an image, we combine all the inferences of the factors above and apply a kind of dynamic optimization with constraint conditions to solve the problem. In addition, an image's gray values usually reflect some depth information, so they are also used as one basis for estimating the final global depth architecture. The corresponding formula is Eq. (2-2), where m, n represent two neighboring areas in an image and g_q(m,n,d) is the depth mapping function between neighboring areas m and n corresponding to the constraints above. Here q=1 is based on the gray image, q=2 on lighting, q=3 on texture, q=4 on material, and q=5 on occlusion; λ and μ are the associated Lagrange multipliers and penalty parameter; L is the total number of correlations of neighboring areas in the image; d is the final depth correlation to be established for all the smaller areas of the image; and c is a slack variable introduced to transform the inequality constraints into equality constraints:

min over (d, c) of  Σ_{q=1..5} Σ_{l=1..L} g_q(m_l, n_l, d) + Σ_{l=1..L} λ_l h_l(d, c) + (μ/2) Σ_{l=1..L} h_l(d, c)²    (2-2)

where h_l(d, c) = 0 denotes the l-th equality constraint after introducing the slack variable c. The multipliers and penalty parameter are updated iteratively:

for k = 0, 1, 2, ...
    minimize Eq. (2-2) over (d, c) from the starting point d_s(k), obtaining d_k;
    update the multipliers λ from the constraint residuals;
    choose a new penalty parameter μ_{k+1} ≥ μ_k;
    set the starting point for the next iteration to d_s(k+1) = d_k;
end (for)

According to the method described above, d, λ and μ are updated jointly until convergence. Note that d refers to relative depth, not absolute depth. Fig.5 illustrates an example of the estimation of an image's global depth architecture with the method of this paper, in which different gray levels indicate different depths.
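A toy augmented-Lagrangian iteration in the spirit of Eq. (2-2) can be run on two "neighboring depths" d = (d1, d2) with objective (d1-1)² + (d2-3)² and an inequality constraint d2 - d1 ≤ 1, turned into the equality h(d, c) = d2 - d1 - 1 + c² = 0 by a slack variable c. All the numbers below are invented for illustration; the paper optimizes many such terms jointly:

```python
import numpy as np

def solve_d(c2, lam, mu):
    """Exact minimizer over d of the quadratic augmented Lagrangian
    f(d) + lam*h + (mu/2)*h^2 for fixed slack value c2 = c^2 (derived by hand
    for this particular two-variable objective)."""
    s = (lam + mu * (1.0 + c2)) / (1.0 + mu)    # s = lam + mu*h at the optimum
    return np.array([1.0 + s / 2.0, 3.0 - s / 2.0])

d, lam, mu = np.zeros(2), 0.0, 1.0
for k in range(20):                              # outer multiplier updates
    for _ in range(10):                          # alternate exact minimizations
        h0 = d[1] - d[0] - 1.0
        c2 = max(0.0, -lam / mu - h0)            # optimal slack value c^2
        d = solve_d(c2, lam, mu)
    lam += mu * (d[1] - d[0] - 1.0 + c2)         # multiplier update
    mu *= 2.0                                    # growing penalty parameter
print(np.round(d, 3))   # -> [1.5 2.5], the constrained optimum
```

The unconstrained optimum (1, 3) violates the neighbor constraint, and the iteration pulls the solution onto the boundary d2 - d1 = 1, with the multiplier converging to its KKT value λ = 1.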

3D Reconstruction Realization
At last, after acquiring the complete depthmap of an image, we combine the depthmap with its source image to reconstruct the 3D image. Two kinds of methods are used to generate the voxels of the 3D image: for images dominated by continuous qualities, bilinear interpolation is applied to fill in and smooth the color areas among the reconstructed 3D pixel points; for images dominated by discrete qualities, the point-cloud method is applied directly to reconstruct the 3D image's voxels. In addition, during 3D image reconstruction, the borderlines left by the earlier segmentation are replaced with the corresponding pixels of the original image. Fig.6 gives an example of reconstructed 3D images viewed from different visual angles from a single 2D source image. From the different view angles, we can see that they all approximate the usual real objects and scenes.
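For the continuous case, the smoothing step amounts to bilinearly interpolating a coarse depthmap onto a denser pixel grid. The sizes and values below are toy examples:

```python
import numpy as np

def bilinear_upsample(depth, out_h, out_w):
    """Resample a coarse depthmap to (out_h, out_w) by bilinear interpolation."""
    h, w = depth.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    out = np.empty((out_h, out_w))
    for i, y in enumerate(ys):
        y0 = min(int(y), h - 2); fy = y - y0     # cell row and fractional offset
        for j, x in enumerate(xs):
            x0 = min(int(x), w - 2); fx = x - x0
            out[i, j] = ((1 - fy) * (1 - fx) * depth[y0, x0]
                         + (1 - fy) * fx * depth[y0, x0 + 1]
                         + fy * (1 - fx) * depth[y0 + 1, x0]
                         + fy * fx * depth[y0 + 1, x0 + 1])
    return out

coarse = np.array([[0.0, 1.0],
                   [2.0, 3.0]])
fine = bilinear_upsample(coarse, 3, 3)
print(fine)   # the centre value is the average of the four corners: 1.5
```

For the discrete (point-cloud) case, each pixel (x, y) with depth d would instead be emitted directly as a 3D point, with no interpolation between neighbors.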

Testing Environment
To demonstrate the applicability of the proposed method in a general environment, all experiments are conducted in Microsoft Visual C++ 6.0 on a Pentium 1.73 GHz personal computer. The framework of 3D reconstruction from monocular vision proposed in this paper is tested on the Make3D Range Image Data dataset [24,25], which consists of 534 image+depthmap pairs, with an image resolution of 2272×1704 and a depthmap resolution of 55×305.
In our experiment, 400 of the image/depthmap pairs are used for training the associated detectors and the remaining 134 for testing. We compare the experimental results with the typical methods of Saxena [24] and HEH [23]. For fairness, their depthmaps are scaled and shifted before computing the errors, to match the global scale of our test images. Fig.7 gives some examples of the real effects of the typical methods, Saxena [24] and HEH [23], contrasted with ours. In the comparisons: HEH generates a pop-up effect by folding the images at "ground-vertical" boundaries, which is not feasible for all images; especially for multiple non-ground, discontinuous areas with front-and-back relationships in an image, it often fails to describe the scene structure correctly, as in fig.7-b1 and 7-b2. Saxena uses a Markov random field (MRF) to infer both the 3D location and orientation of the patches in an image without any explicit assumptions about the structure of the scene, which makes it generalize well, even to scenes with significant non-vertical structure; its deficiency is that, in reconstructing the whole structure, it lacks a comprehensive consideration of many factors, so its global 3D structural estimate is sometimes inaccurate, as in fig.7-c3. Our algorithm's global dynamic optimization appears more reasonable and feasible for whole 3D architectural reconstruction to a certain extent, producing better effects, although a few small local parts of the images are estimated somewhat coarsely.
Fig.7 Examples of 3D reconstruction from a single image with different algorithms

Table.1 and Fig.8 give the final statistical results of the comparisons of our algorithm with the two typical algorithms above, in which both 'Depth error' and 'Relative Depth error' are averaged over all pixels in the hold-out test set, 'Correct rate' is the percentage of models that are qualitatively correct, and 'Planes Correct rate' is the percentage of major planes correctly identified. Combining this with the preceding analysis, we can see from the testing data as a whole that Saxena's method is slightly better than ours, probably because our method accumulates several small local estimation errors during computation, and that our method outperforms HEH, mainly because HEH sometimes fails to estimate the whole structure of a scene; all of this further verifies our algorithm's feasibility and rationality.
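The evaluation metrics can be sketched as follows: average absolute depth error and average relative depth error over all test pixels, computed after least-squares fitting a global scale and shift (a, b) to the predicted depths, as is done before comparing methods. The arrays here are toy stand-ins for real depthmaps:

```python
import numpy as np

def align_scale_shift(pred, gt):
    """Least-squares fit gt ≈ a*pred + b, returning the aligned prediction."""
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, gt.ravel(), rcond=None)
    return a * pred + b

def depth_errors(pred, gt):
    aligned = align_scale_shift(pred, gt)
    abs_err = np.abs(aligned - gt).mean()              # 'Depth error'
    rel_err = (np.abs(aligned - gt) / gt).mean()       # 'Relative Depth error'
    return abs_err, rel_err

gt = np.array([[2.0, 4.0], [6.0, 8.0]])
pred = 0.5 * gt + 1.0          # correct up to a global scale and shift
abs_err, rel_err = depth_errors(pred, gt)
print(round(abs_err, 6), round(rel_err, 6))   # -> 0.0 0.0 after alignment
```

This illustrates why a method that recovers only relative depth (as here, with no camera model) can still be scored fairly against metric ground truth.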

Conclusion
This paper proposes a framework for 3D reconstruction from a single still image of an unstructured outdoor environment, based on the monocular vision of an uncalibrated camera. It integrates machine learning for each individual factor with a dynamic optimization that considers multiple factors comprehensively to estimate the scene's 3D structure, and it needs neither manual assistance nor any camera model information. In the experiments, we use the Make3D Range Image Data to compare it with the typical methods of HEH and Saxena. The results basically confirm our method's reasonability and feasibility for the global architectural 3D reconstruction of an image's scene, to a certain extent.
In future work, we will aim to improve the accuracy of estimation, especially for the scene's local patches but also for the global structure. Meanwhile, whether this paper's method applies to other databases requires further testing.