

Real-time object detection: Project report

Modified 2021-01-01 by Dishank Bansal

Authors :

Dishank BANSAL
Bhavya PATWA

The final result


Let’s start from a teaser.

Video: Object Detection on Simulator

Video: Object Detection on Real Duckie

The instructions and code to reproduce the results mentioned in this report can be found in the project’s GitHub repository.

Mission, Scope and motivation


Perception is a key component of any autonomous system. State-of-the-art autonomous driving technologies include object detection as part of their perception stack. Unfortunately, the benefits of object detection were never fully leveraged in Duckietown because this method runs into one big obstacle: real-time performance. Real-time performance is crucial in robotics, and since object detection is quite computationally expensive, its performance on the duckiebot is limited. The goal of this project is to propose a method for real-time object detection and tracking that can run on a duckiebot with acceptable performance using a Jetson Nano.

Existing solution


For this work, and given this difficult context, we were not able to build our own dataset to train our models. Fortunately, a previous project built a whole dataset to implement object detection in Duckietown. We also found this project, which aimed at implementing an object detector in DuckieTown; unfortunately, it seems to be deprecated, so we did not use it. Finally, we used the exercise 3 structure to implement our final object detector.


As stated before, no object detector was implemented in the DuckieTown pipeline for one good reason: it could not run on the Raspberry Pi in real time. This year, we were lucky to also have a Jetson Nano, which has a more powerful GPU. We therefore decided to implement an object detector that:

  • runs faster with tracking while maintaining good accuracy,
  • can run on the Jetson Nano at a reasonable speed.

To do this, we compared different object detection methods to find out which offers the best compromise between speed and accuracy in the DuckieTown setting. To further increase the speed of object detection, we used tracking between two detections. With the final real-time object detection pipeline, we tried to implement a way to avoid detected obstacles.

Background and Preliminaries


Some preliminary knowledge is needed for this report.

The main mathematical tool the reader will encounter is the Kalman filter, which is used in the tracking step of this project. The main idea of the Kalman filter is that, given a model of the evolution of our state with its noise model, and a measurement model of our system with its noise model, we can first predict the next state and then, using the measurement corresponding to this new step, update the estimate so that it takes both the dynamic model and the measurement into account. A full lecture on this subject was given by Dr. Forbes.

On another subject, the two neural networks that are compared in this report have very different architectures. There are mainly two types of object detectors. On the one hand, one-stage object detectors, such as YOLO or SSD-MobileNet, make a fixed number of predictions on a grid. On the other hand, two-stage object detectors, such as FasterRCNN or MaskRCNN, use a proposal network to roughly locate objects and then a second network to refine these detections and produce the final predictions. One-stage detectors tend to have a faster inference time, while two-stage detectors tend to have a higher mean average precision. This article explains quite thoroughly the differences and similarities between the two architectures.

Object detection models : FasterRCNN vs. YOLOv5


FasterRCNN architecture and performance


As mentioned above, FasterRCNN is a two-stage detector. The first stage, called the RPN (Region Proposal Network), processes the image with a feature extractor and keeps only the topmost feature maps to predict bounding box proposals. The second stage then crops features from the topmost feature maps using these bounding box proposals. The cropped features are passed on to the FastRCNN head for bounding box regression and classification.

As its name suggests, FastRCNN is a faster version of R-CNN. Its architecture is presented in figure 3.1.

In figure 3.2, you can see the two stages mentioned above and the FastRCNN module.

Fast-RCNN architecture

Faster-RCNN architecture
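To make the two-stage pipeline more concrete, here is a minimal inference sketch (not the project’s actual code) using torchvision’s Faster R-CNN with a ResNet-50 FPN backbone; the rpn_post_nms_top_n_test argument corresponds to the “Proposals” setting varied in the results section, and the random tensor only stands in for a camera frame.

```python
# Minimal sketch: torchvision Faster R-CNN inference on one frame (illustrative only).
import torch
import torchvision

# Limit the RPN to 50 proposals at test time (one of the settings benchmarked below).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    pretrained=True, rpn_post_nms_top_n_test=50
)
model.eval()

image = torch.rand(3, 480, 640)  # placeholder frame, (C, H, W) with values in [0, 1]
with torch.no_grad():
    prediction = model([image])[0]  # dict with 'boxes', 'labels' and 'scores'
print(prediction["boxes"].shape, prediction["scores"][:5])
```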

Yolo architecture and performance


YOLOv5 is a one-stage object detector. Like any one-stage detector, it is made of three main parts:

  • model backbone
  • model neck
  • model head

The model backbone is used in object detection to extract the most important features : the richest and most distinctive ones. In YOLOv5, the backbone used is CSPNet which stands for Cross Stage Partial Networks.

The model neck is used in object detectors to build feature pyramids in order to detect objects of different sizes and scales. There are many different feature pyramid techniques available; YOLOv5 uses PANet, which stands for Path Aggregation Network.

The YOLOv5 model head is the same as in the previous versions of YOLO.

Figure 3.3 gives an overall representation of YOLOv5 architecture.

YOLOv5 architecture
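For reference, here is a minimal sketch of running YOLOv5 through the public torch.hub entry point (the project trained its own weights on Duckietown data, so the yolov5s checkpoint and the image path below are only illustrative):

```python
# Minimal sketch: YOLOv5 inference via torch.hub (illustrative only).
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
results = model("frame.jpg", size=320)  # inference at 320 px, as in the low-resolution test
print(results.xyxy[0])                  # rows of [x1, y1, x2, y2, confidence, class]
```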


Tracking

Now that we have found the detector that provides the best compromise between speed and accuracy, we want to speed up the detection process by tracking the detected bounding boxes between two detections, which allows the object detector to skip frames.

Moreover, tracking can help recover dropped detections in the in-between frames.
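The idea can be summarised by the following sketch, where the neural detector runs only every few frames and the tracker’s prediction fills the gaps; DETECT_EVERY and the detector/tracker interfaces are assumptions, not the project’s exact API:

```python
# Minimal sketch of the detect-every-N-frames, track-in-between loop.
DETECT_EVERY = 2  # run the (slow) neural detector on every 2nd frame (assumed value)

def process_stream(frames, detector, tracker):
    for i, frame in enumerate(frames):
        predicted_boxes = tracker.predict()   # Kalman prediction for every track
        if i % DETECT_EVERY == 0:
            detections = detector(frame)      # expensive: run the neural network
            tracker.update(detections)        # data association + Kalman update
            yield tracker.current_boxes()
        else:
            yield predicted_boxes             # cheap: prediction only, detector skipped
```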

Kalman filter


To track a bounding box over frames, we work in pixel space. For tracking, we assume that the bounding boxes move at constant velocity in pixel space.

We will use a Kalman filter to track our bounding boxes. Let \mathrm{X_k} be the state vector containing the bounding box coordinates and their velocities.

\begin{equation} \mathrm{X_k} = [x_1, y_1, x_2, y_2, v_{x,1}, v_{y,1}, v_{x,2}, v_{y,2}] \end{equation}

The motion model of the system that will be used for prediction is quite simple (as velocity is assumed constant):

\begin{equation} \mathrm{F}=\left[\begin{array}{cccccccc} 1 & 0 & 0 & 0 & d t & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & d t & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & d t & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & d t \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{array}\right] \end{equation}

The bounding box returned by the object detector is used as the measurement in the update step. The measurement is the bounding box coordinates:

\begin{equation} \mathrm{z_k} = [x_1, y_1, x_2, y_2] \end{equation}

The measurement model of the system is then :

\begin{equation} \mathrm{H}=\left[\begin{array}{cccccccc} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \end{array}\right] \end{equation}

The final Kalman filter equations for bounding box tracking are:

  • Prediction step :

\begin{equation} \begin{aligned} \hat{\mathbf{x}}_{k}^{\prime} &=\mathbf{F} \hat{\mathbf{x}}_{k-1} \\ \mathbf{P}_{k}^{\prime} &=\mathbf{F} \mathbf{P}_{k-1} \mathbf{F}^{T}+\mathbf{Q} \end{aligned} \end{equation}

  • Update step :

\begin{equation} \begin{aligned} \hat{\mathbf{x}}_{k} &=\hat{\mathbf{x}}_{k}^{\prime}+\mathbf{K}_{k}\left(\mathbf{z}_{k}-\mathbf{H} \hat{\mathbf{x}}_{k}^{\prime}\right) \\ \mathbf{P}_{k} &=\left(\mathbf{I}-\mathbf{K}_{k} \mathbf{H}\right) \mathbf{P}_{k}^{\prime} \\ \mathbf{K}_{k} &=\mathbf{P}_{k}^{\prime} \mathbf{H}^{T}\left(\mathbf{H} \mathbf{P}_{k}^{\prime} \mathbf{H}^{T}+\mathbf{R}_{\mathbf{k}}\right)^{-1} \end{aligned} \end{equation}
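Putting the equations above together, here is a minimal numpy sketch of the constant-velocity bounding-box Kalman filter (not the project’s exact implementation; the Q and R covariances are placeholder values):

```python
# Minimal sketch of the constant-velocity bounding-box Kalman filter described above.
import numpy as np

class BoxKalmanFilter:
    def __init__(self, box, dt=1.0):
        # State: [x1, y1, x2, y2, vx1, vy1, vx2, vy2]
        self.x = np.concatenate([np.asarray(box, dtype=float), np.zeros(4)])
        self.P = np.eye(8)
        self.F = np.eye(8)
        self.F[:4, 4:] = dt * np.eye(4)                     # constant-velocity motion model
        self.H = np.hstack([np.eye(4), np.zeros((4, 4))])   # we only measure the box corners
        self.Q = 1e-2 * np.eye(8)                           # process noise (assumed)
        self.R = 1e-1 * np.eye(4)                           # measurement noise (assumed)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:4]                                   # predicted box coordinates

    def update(self, z):
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)            # Kalman gain
        self.x = self.x + K @ (np.asarray(z, dtype=float) - self.H @ self.x)
        self.P = (np.eye(8) - K @ self.H) @ self.P
```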

Hungarian algorithm


In the previous section, we only discussed tracking a single bounding box across frames.

In reality, the situation is more complex: when there are multiple detections (i.e. multiple bounding boxes), how do we know which observation to associate with which prediction at the update step?

A solution is to use the Hungarian algorithm for data association.

Let there be N predicted bounding boxes and M observations (detected bounding boxes). The Hungarian algorithm matches the N predicted boxes to N observations among the M possible ones so that the assignment is optimal with respect to a given metric. Here, the metric used is IoU, which stands for Intersection over Union. It is computed using this formula:

\begin{equation} IoU = \frac{Area\; of \; Overlap}{Area\; of \; Union} \end{equation}
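Below is a minimal sketch of IoU-based data association using scipy’s implementation of the Hungarian algorithm (the IoU threshold is an assumed value, not the project’s tuned parameter):

```python
# Minimal sketch: match predicted boxes to detections with the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    # a, b: boxes as [x1, y1, x2, y2]
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(predicted, detected, iou_threshold=0.3):
    # Cost matrix is negative IoU because linear_sum_assignment minimises cost.
    cost = np.array([[-iou(p, d) for d in detected] for p in predicted])
    rows, cols = linear_sum_assignment(cost)
    # Keep only matches whose IoU exceeds the threshold.
    return [(r, c) for r, c in zip(rows, cols) if -cost[r, c] >= iou_threshold]
```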

Object avoidance


In this section, we will see how we can use the duckiebot and duckie detector to implement certain behaviours in order to make DuckieTown safe again. We had two main target behaviours:

  • Stopping in front of an obstacle
  • Overtaking an obstacle

Compute obstacle position in lane


First of all, from the bounding boxes returned by our detector, we need to compute the position of the obstacles. This position is used to pass two pieces of information to the lane controller:

  • Is there an obstacle close enough in our lane?
  • Is there a close obstacle in the other (left) lane?

Let’s take a bounding box. Its coordinates are given in pixels in the distorted image (due to the camera lens). First, we compute the center of the obstacle on the ground (the center of the box’s bottom edge). Then, we rectify these coordinates so that they correspond to the rectified image. Since the point is considered to be on the ground, we can use the GroundProjection module (also used for line detection) to estimate its real coordinates with respect to the duckiebot’s origin. Then, with the duckiebot’s lane pose, we can compute the obstacle’s lane pose using this formula: \begin{equation} pose_y = \cos(\phi) (y + d) + \sin(\phi) x \end{equation}
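As a small illustration of that formula, here is a sketch (the function and variable names are ours: x and y are the ground-projected obstacle coordinates in the duckiebot frame, and d and phi come from the lane pose):

```python
# Minimal sketch of the obstacle lateral-position computation described above.
import math

def obstacle_pose_y(x, y, d, phi):
    # x, y: ground-projected obstacle coordinates in the duckiebot frame
    # d, phi: duckiebot lateral offset and heading angle from the lane pose
    return math.cos(phi) * (y + d) + math.sin(phi) * x
```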

A diagram given in figure 3.4 illustrates the situation, the grey rectangle being the duckiebot and the yellow cross an obstacle.

Obstacle position diagram

The flowchart used to make the decision is detailed in figure 3.5.

Obstacle position flowchart

Stopping in front of an obstacle


The first behaviour is quite straightforward: if an obstacle (duckiebot or duckie) is detected in our lane and close enough to the bot, the lane controller passes v = 0 to the wheels command. Figure 3.6 details the algorithm used to stop in front of an obstacle.

Algorithm to stop in front of an obstacle
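A minimal sketch of that stopping rule follows (the distance threshold and the function interface are assumptions, not the actual node parameters):

```python
# Minimal sketch of the stop-in-front-of-obstacle behaviour.
STOP_DISTANCE = 0.3  # metres; hypothetical threshold

def velocity_command(obstacle_in_my_lane, obstacle_distance, nominal_v):
    if obstacle_in_my_lane and obstacle_distance < STOP_DISTANCE:
        return 0.0       # obstacle in our lane and too close: stop
    return nominal_v     # otherwise keep lane following at the nominal speed
```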

Overtaking an obstacle


Here the problem is more challenging: avoiding an obstacle. Obstacle avoidance is well researched in the literature, and most of the proposed solutions rely on path planning, for instance building a graph in which the vehicle must find the shortest path while respecting constraints.

In DuckieTown, we thought it would be simpler to overtake the obstacle by switching lanes.

Our solution is to change the d_off parameter used in the Lane Controller Node to make the duckiebot believe it is not in the right lane.

First, we need to increase the d_off parameter so that the bot moves to the left lane, then keep it increased while it passes the obstacle and finally decrease it to switch back to the right lane.
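A possible d_off schedule is sketched below (the lane width and timing values are assumptions, not the parameters actually used on the robot):

```python
# Minimal sketch of a time-based d_off schedule for overtaking.
def overtake_d_offset(t, lane_width=0.22, merge_time=1.0, pass_time=3.0):
    """Lane-offset parameter as a function of time since the overtake started:
    ramp into the left lane, hold while passing, then ramp back."""
    if t < merge_time:                                  # move into the left lane
        return lane_width * t / merge_time
    if t < merge_time + pass_time:                      # hold while passing the obstacle
        return lane_width
    if t < 2 * merge_time + pass_time:                  # ramp back to the right lane
        return lane_width * (2 * merge_time + pass_time - t) / merge_time
    return 0.0                                          # overtake finished
```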

The decision process used to know when to overtake is detailed in figure 3.5. The flowchart for overtaking the obstacle is shown in figure 3.7, and the corresponding algorithm is detailed in figure 3.8.

Obstacle overtaking flowchart
Algorithm to overtake an obstacle

Formal performance evaluation / Results


We compared the different detectors on different processors to assess which would perform best on the duckiebot.

Here are the specifications of the hardware used to obtain the metrics reported below:

  • CPU : AMD Ryzen Threadripper 1950X 16-Core Processor
  • GPU : GeForce RTX 2080 Ti, 11 GB
  • RAM : 32 GB
  • Jetson Nano : specifications can be found here

The metrics used to assess the object detector’s performance are FPS (Frames Per Second) and mAP (mean Average Precision). The first one measures the detector’s speed and the second one its accuracy.

We would like to remind the reader that the following numbers are obtained without tracking, and that the FPS can easily be doubled if we run the detector only on every other frame and use the tracker’s prediction for the frames in between.

Here is the performance of FasterRCNN with two different backbones: Resnet50 and Resnet18. Both were tested using the DuckieTown gym mentioned above.

  • Using the Resnet50 backbone :

    Proposals | FPS (on GPU)  | FPS (on CPU) | mAP
    300       | 55.5 (0.018s) | 1.8 (0.55s)  | 83.9%
    50        | 77 (0.013s)   | 2.6 (0.38s)  | 83.8%
    10        | 77 (0.013s)   | 2.7 (0.36s)  | 74.3%

  • Using the Resnet18 backbone :

    Proposals | FPS (on GPU)  | FPS (on CPU) | mAP
    300       | 111 (0.009s)  | 5.55 (0.18s) | 86.472%
    50        | 142 (0.007s)  | 7.69 (0.13s) | 86.462%

YOLOv5 has been tested on high and low-resolution images.

Resolution | FPS (on GPU) | FPS (on CPU)  | FPS (on Jetson Nano) | mAP
640x480    | 110 (0.009s) | 9.6 (0.104s)  | 5 (0.200s)           | 71.14%
320x240    | 113 (0.009s) | 19.5 (0.051s) | 10 (0.100s)          | 68.56%

Hence, by combining a deep-learning-based object detector with tracking, we achieve roughly real-time performance of about 20 FPS on the Jetson Nano.

Future avenues of development


Making a Docker image with the object detection pipeline that can be run on the Jetson Nano.