Monocular Height Estimation Method with 3 Degree-Of-Freedom Compensation of Road Unevennesses

Height estimation of objects is valuable information for the locomotion of autonomous robots and vehicles. Even though several sensors such as stereo cameras have been applied in these systems, cost and processing time have been motivating solutions with monocular cameras. This research proposes two new methods: i) height estimation of objects using only a monocular camera based on flat surface constraints and ii) 3 degree-of-freedom compensation of errors caused by roll, pitch and yaw variations of the camera when applying the Flat Surface Model. Outdoor experiments with the KITTI benchmark data (4997 frames and 436 objects) improved the accuracy of the estimated heights from a maximum error of 1.51 m to 1.12 m and reduced the number of estimation failures by 4 times, proving the validity and effectiveness of the proposed method.


Introduction
This paper is an extension of work originally presented at the 2016 IEEE/SICE International Symposium on System Integration [1]. That previous work proposed a monocular height estimation method with chronological correction of road unevenness, in which basic experiments in the laboratory and on asphalt, correcting camera pitch variations, proved the validity of the method. However, since only pitch variations were considered and the experiments were conducted on asphalt in a single environment, several items remained as future work. Therefore, we extended our work with the following:
• Analysis and 3 degree-of-freedom (3DOF) compensation of errors caused by roll, pitch and yaw variations.
• Extended experiments on asphalt with several conditions of road and objects, permitting further analysis of external disturbances.
Cameras have been applied in many fields such as robotics and autonomous driving for localization and object recognition [2]-[7]. Even though stereo cameras can provide the depth to obstacles, their higher cost and required processing have motivated several studies on estimating depth or height with monocular cameras. Height estimation permits the robot to detect and avoid potential obstacles on the road, becoming valuable information about the surrounding environment. Studies [8] and [9] estimate the height of objects with a steady camera. While the first relies on a previous calculation of the vanishing point, the latter relies on a known object height in the scene. On the other hand, other studies focus on height estimation using a moving camera. Study [10] estimates height by computing the focus of expansion in the scene and segmenting ground and plane by sinusoidal model fitting in reciprocal-polar space. The method proposed in [11] estimates the height of objects on the road by obstacle segmentation and known camera displacement from odometric measurements, refining the measurements over several frames. Although there are many promising height estimation methods, they still strongly rely on extra extraction of information from the scene or on external sensors. Moreover, the presented methods assume that the ground is flat, and no discussions or analyses of eventual pose variations of the camera were mentioned. In this context, we propose in this work a height estimation method that requires no previous information about the scene, nor information from external sensors. Furthermore, even though we assume a flat surface, we analyze and compensate the effects of roll, pitch and yaw variations of the camera. The rest of this paper is organized as follows.

A. M. Kaneko et al. / Advances in Science, Technology and Engineering Systems Journal Vol. 2, No. 3, 1443-1452 (2017)
Section 2 introduces the existing issues of monocular cameras due to depth ambiguity and roll, pitch and yaw variations. Section 3 explains the proposed monocular height estimation method and the compensations of roll, pitch and yaw variations. The conducted experiments are detailed in Section 4. The obtained results are discussed in Section 5. Lastly, the conclusions and future work are provided in Section 6.

Issues of Monocular Cameras
In this section, two main issues of monocular cameras are introduced: A) depth ambiguity and B) effects of roll, pitch and yaw variations.

Depth Ambiguity
In order to explain this limitation, we first briefly introduce the Flat Surface Model (FSM), a technique that provides a simple relation between pixels (in camera coordinates) and meters (in real-world coordinates) and is commonly applied with monocular cameras [12]-[14]. Figure 1 shows a moving body, a camera and a flat surface. The camera is attached to the moving body at a known height H and angle α in relation to the flat surface. The line connecting the camera center and the flat surface at angle α is called the principal ray. The coordinate system of the camera is defined by pixels (u,v), with a known and fixed vertical length V and horizontal length W. The maximum angle seen by the camera (field of view) in the u and v directions is fixed and defined by the angles FOV_u and FOV_v. The coordinate system defined by (X,Y,Z) is fixed on the moving body. If we assume that the surface is flat, the relation between a pixel (u,v) and its real position (X,Y,Z) in meters is given by (1) to (3), where β is the angle in relation to the principal ray.

One common application of the FSM is to estimate the camera displacement by Visual Odometry (VO) [12]-[14], which is briefly explained in "Case A" of Figure 2. First, consider the moving body and camera described in Figure 1 running on a flat surface. In an initial position S_N−1, the camera takes a frame and shoots a point P_R on the ground, computing Y_N−1 = d_R,N−1. Next, consider that the moving body moves ∆d_N in direction I of a coordinate system (I,J) fixed on the ground, reaching position S_N. In this new position, it takes another frame, tracks point P_R, and Y_N = d_R,N is obtained by the FSM. Using the computed information Y_N−1 and Y_N from the two positions S_N−1 and S_N, the real camera displacement ∆Cam_N can be correctly estimated by (4). By repeating the previous steps, the displacement of the moving body can be estimated in the following positions.
Notice that we showed a simple case with only one point P R and direction I, but many points and directions can be considered in VO.
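Since (1) to (4) appear only in the paper's figures, the sketch below uses one common formulation of the flat-surface projection to illustrate the idea; the function names and the exact angle mapping are assumptions, not the authors' code.

```python
import math

def fsm_pixel_to_ground(u, v, H, alpha, fov_u, fov_v, W, V):
    """Map a pixel (u, v) to ground coordinates (X, Y) under the Flat
    Surface Model: a camera at height H, tilted by alpha toward the
    ground, observing a perfectly flat surface (simplified form)."""
    # beta: vertical angle of pixel row v relative to the principal ray
    beta = math.atan((v - V / 2.0) * math.tan(fov_v / 2.0) / (V / 2.0))
    # Forward distance on the assumed flat surface
    Y = H / math.tan(alpha + beta)
    # Lateral angle of pixel column u, then lateral offset
    phi = math.atan((u - W / 2.0) * math.tan(fov_u / 2.0) / (W / 2.0))
    X = Y * math.tan(phi)
    return X, Y

def vo_displacement(Y_prev, Y_curr):
    """Camera displacement between two frames from a tracked ground
    point, in the spirit of (4): the point's apparent distance shrinks
    by the amount the camera advanced."""
    return Y_prev - Y_curr
```

For example, with a KITTI-like camera height H = 1.65 m, a pixel at the image center column gives X = 0, and pixels lower in the image (larger v) map to closer ground points.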
The depth ambiguity can be visualized in "Case B" of Figure 2, which contains an object of height h on the ground. Let us assume that in position S_N−1 the camera shoots a point P′_R on the top of the object, whose projection on the flat surface is point P_P,N−1, exactly at the same location as point P_R in "Case A". Although points P_R and P′_R belong to different positions (X,Y,Z) in the world, the FSM cannot distinguish this ambiguity and computes Y_N−1 = d′_R,N−1 = d_R,N−1. However, when the camera moves ∆d_N in direction I, it tracks point P′_R and consequently computes Y_N = d′_R,N at position S_N. Finally, the displacement ∆Cam′_N seen by the camera becomes as in (5), which shows that the real displacement ∆d_N is not correctly estimated due to the presence of the object.
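The ambiguity can be reproduced numerically with similar triangles: the ray to an object top of height h intersects the flat surface farther away than the object really is, inflating the apparent displacement by H/(H − h). The numbers below are illustrative, not from the paper.

```python
def apparent_ground_distance(d, H, h):
    """Where the ray from a camera at height H to the top of an object
    of height h at true distance d intersects the flat surface
    (similar triangles)."""
    return d * H / (H - h)

H, h = 1.65, 0.5              # camera height, object height (m), illustrative
d_prev, d_curr = 12.0, 11.0   # true distances; real displacement = 1.0 m

delta_real = d_prev - d_curr
delta_seen = (apparent_ground_distance(d_prev, H, h)
              - apparent_ground_distance(d_curr, H, h))
# delta_seen = delta_real * H / (H - h): larger than the true displacement,
# so the FSM overestimates the camera motion when tracking the object top
```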

Effects of Roll, Pitch and Yaw Variations
Even though the FSM assumes that the surface is perfectly flat, in fact small unevennesses exist. Such influences are detailed in Figure 3, with a simple analysis done on real asphalt. A camera was fixed on a moving body and moved while tracking four points: i) on a static object with height h (point O), ii) on the ground far from the camera (point F), iii) on the ground close to the camera (point C) and iv) on the ground at a distance between F and C (point M). In each frame, the distances Y_N = d_W,N obtained by the FSM were computed. In order to better visualize the influences of unevennesses, the corresponding displacements ∆d_W,N = d_W,N − d_W,N−1 are also displayed in the figure. If the surface of the asphalt were really flat, the displacements ∆d_W,N of points C, M and F would be expected to be the same in each frame. However, while this happened in frame f_a = 2 for example (suggesting that the pose of the camera was exactly the one expected by the FSM), in frames f_b = 5 and f_c = 17 the displacements were different (suggesting that the pose of the camera differed from the one expected by the FSM). Moreover, Figure 3 shows another important pattern: points closer to the camera have smaller magnitudes of ∆d_W,N. The magnitude of ∆d_W,N calculated with C in a given frame is smaller than the one calculated with M, which is smaller than the one calculated with F, which in turn is smaller than the one calculated with O in the same frame. This example clearly shows the effect of unevennesses on the FSM. A further analysis according to roll, pitch and yaw variations is presented in Table 1 and Figure 4. Figure (a) of Table 1 shows the influence of pitch variations on the FSM. Consider that in a certain frame N, a moving body that is shooting a point O_R on a static object on the surface has its pitch changed (represented by σ_p,N) by an unevenness.
This variation shifts the height H to H_p,N and therefore the camera computes the distances by the FSM in relation to a new, wrong flat surface, FS_p,N. On this wrong surface, the projection of point O_R becomes P_p,N and consequently Y_N = d_p,N is calculated. However, the correct distance in such a configuration is the one computed with G_R, the projection of O_R on the real surface on the ground, resulting in Y_N = d_R,N. The relation between the wrong (d_p,N) and correct (d_R,N) distances is shown by (6), (7) and (8), where l_p is the axis of rotation and γ_p,N is the angle between the camera height H and l_p. It is important to notice that the presented equations do not explicitly contain the object height h. This happens because the computed Y_N = d_p,N itself contains this information, since it is a function of O_R, P_p,N and h. We define the error caused by the pitch variation σ_p,N as ε_p,N, the difference between the wrongly estimated distance (d_p,N) and the correct one (d_R,N), according to (9).

Figure (b) of Table 1 describes the influence of yaw variations on the FSM. Consider that in a certain frame N, a moving body that is shooting a point O_R on a static object on the surface has its yaw changed (σ_y,N). In the former pose, the camera computes the distance to point G_R as Y_N = d_y,N and X_N = X_y,N. However, the correct distance in the Y direction in the new pose is d_R,N. Equations (10) and (11) show the relation between the wrong and correct distances, where φ_y,N is the angle at which the camera sees the object in the former pose. The error ε_y,N caused by σ_y,N is defined as (12).
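To give a feel for the magnitude of pitch-induced errors, the sketch below uses a deliberately simplified model in which the camera rotates about its own center (i.e., the axis l_p and the height shift H_p,N of (6) to (8) are ignored); only the depression angle to the point changes. It is an assumption-based illustration, not the paper's equations.

```python
import math

def pitch_error(d_true, H, sigma_p):
    """Simplified FSM distance error for a ground point at true
    distance d_true when the camera pitches down by sigma_p (rad).
    Rotation is assumed about the camera center, so H is unchanged."""
    theta = math.atan(H / d_true)            # true depression angle to the point
    d_wrong = H / math.tan(theta + sigma_p)  # FSM reads the tilted ray as flat geometry
    return d_wrong - d_true
```

Even with this simplification the key trends of Figure 4 appear: the error grows quickly with distance, so a small pitch disturbance of 0.01 rad already produces a multi-meter error at a 30 m range.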

Finally, Figure (c) of Table 1 describes the influence of roll variations on the FSM. The relation between the wrong and correct distances is shown by (13), (14) and (15), where l_r is the axis of rotation and γ_r,N is the angle between the camera height H and l_r. We define the error caused by the roll variation σ_r,N as ε_r,N, the difference between the wrongly estimated distance (d_r,N) and the correct one (d_R,N), as shown in (16). Figure 4 (C) shows the error ε_r,N as a function of the camera roll variation (σ_r,N), X_r and the apparent distance to the point (Y_N = d_r,N). We can observe that ε_r,N increases with the increase of σ_r,N, X_r and Y_N. For roll variations of σ_r,N = 0.1 rad, ε_r,N > 2.0 m can occur within the adopted range of the parameters in the example, but those errors are still smaller than the ones caused by pitch variations.

Proposed Method
This section is divided into three parts: A) proposed height estimation method, B) compensation of roll, pitch and yaw variations and C) proposed algorithm.

Proposed Height Estimation
Although the FSM has the ambiguity limitation when computing VO, the difference between the obtained displacements caused by the object (∆Cam_N, ∆Cam′_N) in fact contains useful information. First, the presence of objects in the scene influences the apparent displacements in pixels seen from the camera. Such a difference in apparent displacements is explored in [12] to find irregularities in the optic flow and detect precipices.
On the other hand, no further information can be extracted by existing techniques. Here, our method is based on the principle that, since the FSM assumes known H and α, the resulting projections are also functions of these dimensions. If we further observe the geometrical relations caused by the FSM and triangulation in two positions (Case B in Figure 2), we can in fact obtain geometrical relations as functions of H and α, as shown in (17), (18) and (19). From these equations and (4), we obtain (20) to (22). Several relations can be observed from the equations. First, the object causes an extra amount of apparent displacement, defined as m_N, which is proportional to the object height h and the camera height H. Second, the object height is a function of the correct camera displacement ∆Cam_N and the wrong apparent displacement ∆Cam′_N. As aforementioned, ∆Cam_N can be estimated by traditional VO; even though this technique is applied here, it is not the focus of this work.
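Since (20) to (22) are given only in the paper's figures, the following sketch shows the relation that the same similar-triangle geometry yields: the apparent displacement is inflated by H/(H − h), which can be inverted to recover h. The function name is an assumption for illustration.

```python
def estimate_height(H, delta_cam, delta_cam_apparent):
    """Object height from the true camera displacement delta_cam
    (e.g. from VO) and the apparent displacement delta_cam_apparent
    of the tracked object point under the FSM. Follows from similar
    triangles: delta_cam_apparent = delta_cam * H / (H - h)."""
    return H * (1.0 - delta_cam / delta_cam_apparent)
```

For instance, with H = 1.65 m, a true displacement of 1.0 m and an apparent displacement of 1.0 · 1.65/1.15 m, the recovered height is exactly the 0.5 m object assumed in the ambiguity example.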

Compensation of Roll, Pitch and Yaw Variations
The equations displayed in Table 1 show that the compensation of the influences caused by roll, pitch and yaw is straightforward and can easily be done if σ_p,N, σ_y,N and σ_r,N are known. Thus, for each frame acquired by the camera, we compute these variations and substitute them, together with the constant and known parameters of the vehicle (l_p, l_r) and of the FSM (α, H, FOV_u, FOV_v), into (8), (11) and (15). However, since the compensations strongly rely on the estimated roll, pitch and yaw, wrong estimations can, on the contrary, lead to high errors; therefore, a filter is applied. The filter works according to the behavior observed in Figure 3. If a tracked point belongs to an object above the surface, then its apparent displacement must be bigger than the camera displacement (∆Cam_N < ∆Cam′_N). Therefore, we check this condition by calculating the apparent displacement of the object with compensations (∆Cam^c_N) and without them (∆Cam^nc_N). If only one of them satisfies the condition, then we adopt that displacement as ∆Cam′_N. If both or neither of them satisfies the condition, then we use the average of the two displacements to estimate the height, as shown in (23). Finally, the median (h*_N) of the previous estimations (h_1, h_2, ..., h_N−1, h_N) is also applied to filter eventual noise, as in (24).
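The selection rule of (23) and the median filter of (24) can be sketched as follows; names and signatures are illustrative assumptions, not the paper's implementation.

```python
from statistics import median

def select_displacement(delta_cam, delta_comp, delta_nocomp):
    """Filter in the spirit of (23): a point on an object above the
    surface must show a larger apparent displacement than the camera
    itself moved. Keep whichever candidate (compensated or not)
    satisfies the condition; average them when both or neither do."""
    comp_ok = delta_comp > delta_cam
    nocomp_ok = delta_nocomp > delta_cam
    if comp_ok and not nocomp_ok:
        return delta_comp
    if nocomp_ok and not comp_ok:
        return delta_nocomp
    # Both or neither satisfy the condition: use the average
    return (delta_comp + delta_nocomp) / 2.0

def filtered_height(history):
    """(24): median of the height estimates so far, to reject noise."""
    return median(history)
```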

Experiments
The proposed method was evaluated with data from the KITTI Vision Benchmark Suite (hereafter called "KITTI") [15] and processed on a computer with an Intel(R) Core(TM) i7-4600 at 2.10 GHz, the Ubuntu(TM) 14.04 operating system, the Eclipse(TM) development environment and the OpenCV libraries [16]. Here, we want to verify the validity and effectiveness of the proposed height estimation and of the proposed compensations of errors caused by roll, pitch and yaw variations. Thus, all heights were estimated with two methods for later comparison: i) with roll, pitch and yaw compensation and ii) without compensation.

Applied VO
In order to estimate the camera displacement ∆Cam_N in each frame, we adopted a simple VO with rotation estimation by Nister's 5-point algorithm [17] and translation by the FSM using features in a ground region close to the camera (details in the Appendix section), similarly to [12]. The parameters (FOV_u, FOV_v, W, V, H, etc.) necessary for the experiments were adopted according to those provided by KITTI. The camera inclination in relation to the ground was not directly provided, but we estimated α = 1.1° using the provided velodyne data. The feature extractor applied was FAST [18]. Features were automatically re-extracted when their number fell below 1500. In order to evaluate the proposed monocular height estimation method, only the left images of the KITTI grayscale camera were used.

Evaluation Criteria
The evaluation was based on the error (ε_N) between the ground truth (h_GT,N) and the estimated height (h*_N) in each frame N, according to (25). The ground truth adopted was mainly the height provided by the velodyne, available in KITTI. Further details can be found in the Appendix section.
We also adopted a criterion to decide whether the height estimation in a frame failed or not. For example, since we considered only objects above the ground surface, the estimated heights must be higher than 0 m (i.e., h_min = 0). Furthermore, since the FSM considers objects below the horizon line, all objects used in the experiment should have a maximum height equal to the camera height (h_max = H). All estimated heights outside these limits were considered failures, as detailed in (26).
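The failure criterion of (26) reduces to a simple range check; the function below is a sketch of that rule, with h_min and h_max = H as described above.

```python
def estimation_failed(h_est, H, h_min=0.0):
    """Failure criterion in the spirit of (26): an estimate is valid
    only if it lies above the ground (h > h_min = 0) and at or below
    the camera height (h <= h_max = H)."""
    return not (h_min < h_est <= H)
```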

Results
Experiments were conducted with 10 video sequences of KITTI, which contained many static objects (parked cars, poles, fences, houses, people, boxes, etc.) in the scene. In total, height was estimated 4561 times with 436 objects. Objects with height 0 ≤ h ≤ H and distances from 4 to 31 m from the camera were used. The obtained results are summarized in Table 2, Table 3 and Figure 6.

Table 2 shows the 10 data sets and the corresponding video sequences belonging to KITTI. The number of valid frames and objects used in each data set is also displayed. The estimated heights of all objects chosen within the data are presented in relation to the pixel positions (u,v) in Figure 6. We can observe that the objects were well distributed along the pixel u direction, covering many possible positions during the experiments. On the other hand, due to the geometrical limitations of the FSM, only pixels below the horizon line (i.e., v > 200 pixels) were used, but we can also observe objects distributed along this interval. Next, the used data are displayed in relation to σ_p,N, σ_y,N and σ_r,N. Variations of roll (σ_r,N) and pitch (σ_p,N) in the used data were smaller than those of yaw (σ_y,N): while the magnitudes of the roll and pitch variations were within 0.03 rad, the magnitudes of the yaw variations exceeded 0.05 rad. The average error over all used data resulted in 0.20 m with the 3DOF compensations and 0.23 m when no compensations were done. The maximum error resulted in 1.12 m for the compensated case and 1.51 m for the non-compensated one.

Discussions
Since both the average and maximum errors were improved with the compensations, we can affirm that the proposed method is valid and effective. The cases of maximum error are illustrated in Figure 7. In (a), the case of maximum error with compensations is displayed. The point was chosen too close to the horizon line, becoming very sensitive to noise and causing the high error. However, for the same case, the method without compensation failed to estimate the height at all. In (b), the case of maximum error without compensation occurred when the camera estimated the object height while climbing a slope. As presented in the previous sections, pitch variations cause higher errors compared to yaw and roll, so a high error such as the observed 1.51 m was expected. Nevertheless, when the proposed compensations were applied to the same data, the estimation error dropped to 0.56 m. Furthermore, the distribution of σ_p,N, σ_y,N and σ_r,N in the experiments (Figure 8) shows that, even though the data were taken on public roads, most of the pose variations were below 0.01 rad: 98.8% of the σ_p,N, 76.6% of the σ_y,N and 99.7% of the σ_r,N. We can observe that the errors were higher for higher values of pitch variation σ_p,N when no compensation was done, as expected from Figure 4. Even though on average the compensated and non-compensated methods had similar errors (0.20 and 0.23 m), the difference between the errors was larger for higher variations of σ_p,N, σ_y,N and σ_r,N.
During the experiments, objects at distances greater than 20 m were used, which would be enough to cause more than 5 m of error according to Figure 4. Since the obtained errors were below these expected values, we can affirm that the obtained results were satisfactory. Examples of cases with higher yaw variations are shown in the Appendix section. We can also observe a significant difference in the number of successful height estimations. During the experiments, the compensated method failed to estimate height 65 times, while the non-compensated one failed 283 times (4 times more). Such failures require further analysis in future work, but the effectiveness of the proposed compensations became clear. Although this work relied on VO, it did not focus on improving its accuracy. However, we estimated that the applied VO had around 13% error per frame, which directly influenced the results; this means that the proposed method becomes more accurate as VO itself improves. Such VO errors led to wrong estimations of the camera pose, generating higher height estimation errors; examples are shown in the Appendix section. Even though the proposed method improved the average and maximum errors of the estimated heights, some limitations still exist. The small camera pose variation per frame suggests that further evaluation with higher variations is necessary, for example by increasing the moving body's velocity and analyzing the relation between the camera frame rate and the obtained height estimation errors. Finally, the experiments made evident another necessary correction: we considered angular variations (rotation) during the proposed compensations; however, translations in Y also occurred. According to the FSM, such translations change the camera height H and must be considered when computing the distance to objects. We believe that this consideration can further increase the accuracy of our height estimation method.

Conclusions and Future Work
A novel method of monocular height estimation with 3DOF compensation of roll, pitch and yaw variations was proposed. The method can estimate the height of objects with only two frames of a monocular camera. Outdoor experiments with the KITTI benchmark data (4997 frames and 436 objects) improved the accuracy of the estimated heights from a maximum error of 1.51 m to 1.12 m and reduced the number of estimation failures by 4 times, proving the validity and effectiveness of the proposed method. This method can be enhanced by improving monocular visual odometry techniques and by considering translational variations of the camera during height estimation. Further investigation of the influences of frame rate, moving velocity and robustness is planned for the future.

Appendices
In this section, we provide further details of the experimental conditions and examples of compensations.

A Experimental Conditions
We further explain the experimental conditions adopted in the paper. The experiments were conducted according to Figure 9. The algorithm executed VO automatically for each frame (a), and the frames were advanced manually by a human operator. Among all the extracted feature points, the operator chose with the mouse any point on a desired object (b). Here, since the FSM has geometrical restrictions, only points below the horizon (approximately at v = V/2) were chosen. The chosen point (in red) was tracked over the frames until the last frame possible (c), and its height and ground truth were computed and stored in each frame. The estimated position, stored data and extracted points of the current object were reset after the heights were estimated (d), and the process was repeated for all used data. All estimated heights and relevant data were saved and later verified for validity. Frames whose ground region contained few or no features (due to light conditions, for example) or which contained a moving object were considered invalid (Figure 10). Tracking was also verified (Figure 11). Although we needed only two frames per estimation, we focused on features consistent over many frames (a minimum of 3 estimations per object). Thus, in case the tracking failed before the fourth frame, that object and its estimations were considered invalid. If the tracking failed after the fourth frame, only the estimations before the failure were considered valid.
Figure 10: Examples of invalid frames: dark region (few or no extracted features) and interference from a moving object.

We adopted as ground truth the height from the velodyne. The information necessary for converting between the camera and velodyne coordinates was also provided by KITTI. Due to limitations of the sensor resolution, we chose as ground truth the closest pixel with available velodyne data. This search was also done automatically by the algorithm in each frame. However, since the velodyne fails in some situations, we also computed the height from the disparity between the left and right images of the camera, using Semi-Global Block Matching [19] in OpenCV. For each point chosen by the operator, its corresponding velodyne data was recorded. Since the camera was moving forward and approaching the object, we considered that the depth from the velodyne to the object in the next frame should be smaller than the one in the previous frame. In case it was not, we considered that the height estimated by the velodyne was also not consistent, and the one calculated by the stereo camera was used instead (Figure 12). In case neither of the depths from the sensors was smaller than the previous one, the current estimation and frame were considered invalid. Since we considered the initial frame of each object as reference, variations in Y of the camera itself were also compensated when computing the ground truth.

Figure 12: Example of adopted ground truth and correction. Data from the stereo camera was used when the velodyne data was considered inconsistent.
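The ground-truth selection rule described above can be sketched as follows; the function and parameter names are illustrative assumptions, not the paper's code.

```python
def choose_ground_truth(prev_depth, velo_depth, stereo_depth,
                        velo_height, stereo_height):
    """Since the camera approaches the object, a consistent sensor must
    report a smaller depth than in the previous frame. Prefer the
    velodyne height; fall back to the stereo (SGBM) height; mark the
    frame invalid (None) when neither depth decreased."""
    if velo_depth < prev_depth:
        return velo_height       # velodyne consistent
    if stereo_depth < prev_depth:
        return stereo_height     # velodyne inconsistent: use stereo
    return None                  # both inconsistent: estimation invalid
```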

B Further Examples of Compensations
Even though there were few occurrences of yaw variations higher than 0.01 rad per frame during the experiments, we can further observe the benefits of the compensations in the example of Figure 13, which shows the estimated heights during a curve. In the first frames, the camera was mostly moving forward, so ε_N was small for both the compensated and non-compensated cases. When the yaw variation started to increase, the errors also increased. However, in frame 10 we can see ε_N decreasing to a minimum due to the decrease of φ_y,N, increasing again when φ_y,N started to increase, as foreseen by Figure 4. Figure 14 presents other examples of compensation. The cases in Figure 14 (iii) to (vi) had smaller average errors per frame after the compensations. On the other hand, even though the proposed method improved the average and maximum errors of the estimated heights, some limitations, as in Figure 14 (i) and (ii), still exist. We believe that such errors were caused by the errors of the adopted VO itself (13% error), and improvements will be further investigated in future work.