Modified 2017-12-11 by tanij
Here we briefly describe the theory behind the model in the visual odometry project. The discussion begins with a review of epipolar geometry and a description of the depth image-based rendering problem, then moves to the description of the deep learning model used.
We follow the discussion in Sun et al. (2010). First, consider the stereo setup and recall the relationship between a world point $M$ and its projections $m_1$ and $m_2$ in the two image planes:
\begin{align} m_1 = \frac{1}{Z_1} K_1 \cdot R_1 [I \mid -C_1] M \\ m_2 = \frac{1}{Z_2} K_2 \cdot R_2 [I \mid -C_2] M \end{align}
If we choose the left camera as the reference, we can set $R_1 = I$ and $C_1 = 0$ (so that $R_2 = R$ and $C_2 = C$) in order to get: \begin{align} m_1 = \frac{1}{Z_1} K_1 \cdot [I \mid 0] M \\ m_2 = \frac{1}{Z_2} K_2 \cdot R [I \mid -C] M \end{align}
Eliminating $M$, we obtain the relationship with depth:
\begin{align} Z_2 m_2 = Z_1 K_2 R K_1^{-1} m_1 - K_2 R C \end{align}
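To make the geometry concrete, here is a minimal NumPy sketch of this pixel-to-pixel mapping, writing the translation term as $-K_2 R C$ (equivalently $t = -RC$); the intrinsics and baseline are made-up example values:

```python
import numpy as np

def warp_pixel(m1, Z1, K1, K2, R, C):
    """Map a homogeneous pixel m1 = (u, v, 1) in the reference camera to the
    second camera, given its depth Z1 and the stereo geometry.

    Implements Z2 * m2 = Z1 * K2 @ R @ inv(K1) @ m1 - K2 @ R @ C,
    i.e. the translation term uses t = -R C. Names follow the text.
    """
    rhs = Z1 * K2 @ R @ np.linalg.inv(K1) @ m1 - K2 @ R @ C
    Z2 = rhs[2]        # depth of the same point in the second camera
    m2 = rhs / Z2      # normalize back to (u, v, 1)
    return m2, Z2

# Made-up intrinsics and a pure 10 cm horizontal baseline (no rotation).
K = np.array([[300.0, 0.0, 160.0],
              [0.0, 300.0, 120.0],
              [0.0, 0.0, 1.0]])
m2, Z2 = warp_pixel(np.array([160.0, 120.0, 1.0]), Z1=2.0,
                    K1=K, K2=K, R=np.eye(3), C=np.array([0.1, 0.0, 0.0]))
# Disparity f*b/Z = 300*0.1/2 = 15 px: m2 = (145, 120, 1), Z2 = 2.
```

With an identity rotation this reduces to the familiar stereo disparity $f b / Z$, which is a quick sanity check on the signs.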
Using this relationship, we can learn a prediction for $Z$ and use the first image to synthesize the second; we describe the experiments below.
We consider an end-to-end CNN-based unsupervised learning system for depth estimation, following the paper Unsupervised Learning of Depth and Ego-Motion from Video by Zhou et al., CVPR 2017. The model was originally trained on the KITTI dataset; we take the pre-trained weights, evaluate them in Duckietown, and show that the results improve significantly after fine-tuning on Duckietown images. Our goal is real-time inference, so at test time we only use the depth prediction network:
The model consists of two networks: a pose estimation network that gives us the relative pose $R, t$ between the source and target frames, and a depth prediction network that gives us $Z$. Together they allow us to warp the source view to the target view using the pose and the RGB values of the source image. Over the course of training, the depth prediction network learns to predict reasonable depth values. Below we compare fine-tuning on Duckietown data (on top of KITTI pretraining) with applying the KITTI-trained model directly:
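The warping step that produces the training signal can be sketched as follows. This is a simplified NumPy version: it uses nearest-neighbor sampling for brevity, whereas the paper uses differentiable bilinear sampling; all names are illustrative:

```python
import numpy as np

def view_synthesis_loss(I_src, I_tgt, depth_tgt, K, R, t):
    """Photometric loss from warping the source image into the target view.

    For each target pixel: back-project with the predicted depth, apply the
    predicted relative pose (R, t), reproject into the source image, and
    sample. Nearest-neighbor sampling is used here for brevity; the paper
    uses differentiable bilinear sampling. Names are illustrative.
    """
    H, W = depth_tgt.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)]).reshape(3, -1)      # homogeneous pixels
    cam = depth_tgt.reshape(1, -1) * (np.linalg.inv(K) @ pix)   # 3-D points, target frame
    proj = K @ (R @ cam + t.reshape(3, 1))                      # reproject into source
    us = np.round(proj[0] / proj[2]).astype(int)
    vs = np.round(proj[1] / proj[2]).astype(int)
    valid = (us >= 0) & (us < W) & (vs >= 0) & (vs < H)
    warped = np.zeros_like(I_tgt).reshape(-1)
    warped[valid] = I_src[vs[valid], us[valid]]
    return np.abs(warped - I_tgt.reshape(-1))[valid].mean()
```

With the identity pose and identical source and target frames the loss is zero, which is a useful unit test for the indexing.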
When the bot turns, motion blur heavily degrades the model's predictions. One way to alleviate this problem would be to preprocess the input images: first deblur them, then feed them to the depth prediction network. Including blurred images in the training set would also slightly improve the results:
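As an illustration, synthetic motion blur for augmenting the training set could be generated with a simple linear kernel. This is only a rough approximation; real blur from a turning robot is rotational and nonuniform, and the kernel size here is a made-up parameter:

```python
import numpy as np

def motion_blur(img, ksize=9):
    """Blur a grayscale image with a horizontal linear motion-blur kernel.

    A cheap way to synthesize blurred frames for training-set augmentation.
    Real blur from a turning robot is rotational and nonuniform, so this is
    only a first approximation; the kernel size is a made-up parameter.
    """
    kernel = np.ones(ksize) / ksize
    # Convolve every row with the 1-D averaging kernel; 'same' keeps the width.
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), axis=1, arr=img)
```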
We benchmark our results against the true distance from the camera obtained from AprilTag detections. In the figure below, we show the predicted depth values versus the depth estimated from the AprilTags:
Outliers at low depths are due to the lack of texture around the AprilTags. As we can see, with proper scaling our predicted depth aligns well with the depth estimated from the AprilTags; using a sparse set of true depth values, we can rescale our pixelwise predictions into metric units.
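The rescaling step can be sketched as follows, assuming we have pixel locations and metric distances for the detected AprilTags; the function name and input format are illustrative:

```python
import numpy as np

def metric_scale(pred_depth, tag_pixels, tag_depths_m):
    """Rescale a scale-ambiguous predicted depth map into metric units.

    tag_pixels: (row, col) pixel locations of detected AprilTags (illustrative
    input format); tag_depths_m: their true distances in meters. The median
    ratio is robust to the low-depth outliers mentioned above.
    """
    rows, cols = zip(*tag_pixels)
    ratios = np.asarray(tag_depths_m) / pred_depth[rows, cols]
    return np.median(ratios) * pred_depth
```

Taking the median rather than the mean keeps a few texture-poor outlier detections from skewing the global scale.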
We demonstrated a monocular depth estimation pipeline trained with no annotated data. The approach gives reasonable depth predictions in Duckietown, but we found several notable limitations. The results on motion-blurred frames are poor, even after fine-tuning with a small number of blurred images; for good results in fast-moving environments, we would likely need to train with more blurred data. In addition, our approach does not incorporate any traditional depth post-processing, which should significantly improve the results.
Our depth prediction node could be used for a variety of tasks, including point-cloud-based SLAM as well as obstacle detection.
Zhou et al., Unsupervised Learning of Depth and Ego-Motion from Video, CVPR 2017.
Maintainer: Igor Vasiljevic