
The Saviors: Final Report

Modified 2018-05-27 by Andrea Censi

TODO for Jacopo Tani: fix math formatting, video aspect ratio, standardize appearance, distribute “preliminaries” contributions in here appropriately throughout the book


This is the final report of the fall 2017 Saviors group from ETH Zurich, namely Fabio Meier, Julian Nubert, Fabrice Oehler and Niklas Funk. We enjoyed contributing to this great project. If any questions remain open after reading this report, do not hesitate to contact us.

The Final Result

Modified 2018-06-25 by Andrea Censi

The Saviors Teaser:


See the operation manual to reproduce these results.

The code description can be found here in the Readme.

Mission and Scope

Modified 2018-02-18 by fabitosh


The goal of Duckietown is to provide a relatively simple platform to explore, tackle and solve many problems linked to autonomous driving. “Duckietown” is simple in its basics but an infinitely expandable environment. From controlling a single Duckiebot to complete fleet management, every scenario is possible and can be put into practice. Thanks to the previous classes and the great work of many volunteers, many software packages had already been developed and provided a solid basis. But something was still missing.


Modified 2018-02-19 by fabitosh

So far, none of the mentioned modules was capable of reliably detecting obstacles and reacting to them in real time. We were not the only ones who saw a problem in this situation: “Ensuring safety is a paramount concern for the citizens of Duckietown. The city has therefore commissioned the implementation of protocols for guaranteed obstacle detection and avoidance.”[4]. The foundation of our module therefore lies in remedying this shortcoming. Finding a good solution to this very important, safety-related problem kept us motivated every day we worked on improving our solution.

The goal of our module is to detect obstacles and react accordingly. Due to the limited amount of time, we focused the scope of our module to two points:

1. In terms of detection, we focused on the one hand on reliably detecting yellow duckies, and therefore on saving the little duckies that want to cross the road. On the other hand, we had to detect orange cones so as not to crash into any construction site in Duckietown.

2. In terms of reacting to the detected obstacles, we were mainly restricted by the constraint imposed by the controllers of our Duckiebots, which do not allow us to cross the middle of the road. This eliminated the need to also implement a Duckiebot detection algorithm. So we focused on writing software which tries to avoid obstacles within our own lane where possible (e.g. cones on the side of the lane) and stops otherwise.

Besides the aforementioned restrictions and simplifications, we faced the general problem of detecting obstacles given images from a monocular RGB camera mounted at the front of our Duckiebot and reacting to them properly without crashing or erroneously stopping the Duckiebot. Both processes have to run on the Raspberry Pi in real time. Due to the strong hardware limitations, we decided not to use any learning algorithms for the obstacle detection part. As it later transpired, working “hard coded” software needs thorough analysis and understanding of the given problem. However, in the future, considering additional hardware such as that described in Tung, “Google offers Raspberry Pi owners this new AI vision kit” (2017), this decision might have to be reevaluated.

In practice, well-working obstacle detection is one of the most important parts of an autonomous system, as it improves the reliability of the outcome even in unexpected situations. Obstacle detection is therefore highly relevant in a framework like “Duckietown”, especially because the aim of “Duckietown” is to simulate the real world as realistically as possible. It also matters for other topics such as fleet planning, where a system with obstacle detection behaves completely differently from a system without.

Existing Solution

Modified 2018-02-19 by fabitosh

There was a previous implementation from the MIT classes in 2016. Of course we had a look at the old software and found that one of its steps was quite similar to ours: they based their obstacle detection on the colors of the obstacles, and therefore also did their processing in the HSV color space, as we did. Further information on why filtering colors in the HSV space is advantageous can be found in the Theory Chapter.

Nevertheless, we implemented our solution from scratch and didn’t base it on any further concepts found in their software. That is why you won’t find any further similarities between the two implementations. The reasons for implementing our own code from scratch can be found in the next section, Opportunity. In short, last year’s solution considered the image from the original camera perspective and tried to classify the objects based on their contour. We use a very different approach for those two crucial parts, as you can see in the Contribution section.


Modified 2018-02-18 by Niklas Funk

From the beginning it was quite clear that the old software was not working reliably enough. We were told that it was far from detecting all obstacles and that there were quite a few false positives: it detected yellow line segments in the middle of the road as obstacles (their color and size are quite similar to those of typical duckies), which made the car stop. Furthermore, extracting the contour of every potential obstacle is computationally very expensive. As mentioned, we had a look at the software and tried to understand it as well as possible, but because it was not documented at all we couldn’t go into much detail. On top of that, from the very beginning we had a completely different idea of how we wanted to tackle these challenges.

We also tried to start their software, but even after a significant amount of time we couldn’t make it run. The readme file didn’t contain any information and the rest of the software was not documented either. This reinforced our decision to write our own implementation from scratch.


Modified 2018-06-25 by Andrea Censi

Since our task was to reliably detect obstacles using a monocular camera only, we mainly dealt with processing the camera image, extracting the needed information, visualizing the results and acting accordingly in the real world.

To make our approach understandable, we explain and summarize the needed concepts in the theory chapter (see section Theory Chapter). There you will find all the references to the relevant sources.

Definition of the Problem

Modified 2018-02-19 by Niklas Funk

In this chapter we describe our problem more formally and show all the steps needed to fulfill the overarching functionality of “avoiding obstacles”.

The only input is an RGB color image taken by a monocular camera (only one camera). The input image could look like the one in Figure 3.2.

Sample image including some obstacles

With this information given, we want to find out whether an obstacle is in our way or not. If so, we want to either stop or adapt the trajectory to pass without crashing into the obstacle. This information is then forwarded as an output to the controllers, which process our commands and try to act accordingly.

Therefore one of the first very important decisions was to separate the detection and reaction parts of our saviors pipeline. This decision allowed us to divide our work efficiently, to start right away and is supposed to ensure a wide range of flexibility in the future by making it possible to easily replace, optimize or work on one of the parts separately (either the obstacle avoidance strategies or obstacle detection algorithms). Of course it also includes having to define a clear, reasonable interface in between the two modules, which will later be explained in detail.

You can have a look at our Preliminary Design Document and Intermediate Report to see how we defined the following topics in the beginning: the problem statement, our final objective, the underlying assumptions we rely on and the performance measurement to quantitatively check the performance of our algorithms. For the most part we were able to adhere to these documents, but for the sake of completeness we briefly repeat them in the following for each of the two submodules.

Part 1: Computer Vision - Description

Modified 2018-02-19 by fabitosh

In principle we wanted to use the camera image only to reach the following:

  1. Detect the obstacles in the camera image
  2. Visualize them in the camera image for tuning parameters and optimizing the code
  3. Give the 3D coordinates of every detected obstacle in the real world
  4. Give the size of every detected obstacle in the form of a radius around the 3D coordinate
  5. Label each obstacle if it’s inside or outside the lane boundaries (e.g. for the purpose of not stopping in a curve)
  6. Visualize them as markers in the 3D world (rviz)

Since every algorithm has its limitations, we made the following assumptions:

  • Obstacles are only yellow duckies and orange cones
  • Calibrated camera including intrinsics and extrinsics

Those assumptions changed slightly since the Preliminary Design Document because we are now also able to detect duckies on the middle line and in intersections.

It was our aim to reach the maximum within these specified limits. Therefore our goal was not only the detection and visualization in general but we also wanted to reach a maximum in robustness with respect to changes in:

  • Obstacle size
  • Obstacle color (within orange, and yellow to detect different traffic cones and duckies)
  • Illumination

For evaluating the performance, we used the following metrics, evaluated under different light conditions and different velocities (static and in motion):

  • Percentage of correctly classified obstacles on our picture datasets
  • Percentage of false positives
  • Percentage of missed obstacles

An evaluation of our goals and the reached performance can be found in the Performance Evaluation section.

Our approach is based on analysing incoming pictures for obstacles and tracking them to make the algorithm more robust against outliers. Since we rely only on the monocular camera, we have no direct depth information. In theory, it would be possible to estimate the depth of each pixel through a monocular visual odometry algorithm considering multiple consecutive images. However, this would be extremely computationally expensive. The large amount of motion blur in our setup and the missing IMU (for estimating the absolute scale) further argue against such an approach. Instead, we use the extrinsic calibration to estimate the position of the obstacles. The intuition is that all pixels seen by the camera can be assumed to belong to the ground plane (except for obstacles, which stand out of it) and that the Duckiebot’s position relative to this ground plane stays constant. Therefore you can assign a real world 3D coordinate to every pixel seen by the camera. For more details refer to the section below.
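The ground-plane assumption can be illustrated with a minimal sketch. It assumes a 3x3 homography that maps a pixel in homogeneous coordinates to a point on the ground plane. The matrix `H_EXAMPLE` and the function names are hypothetical and chosen for illustration only; real values come from the extrinsic calibration of each Duckiebot.

```python
import numpy as np

# Hypothetical homography (NOT a real calibration): maps image pixels
# (u, v, 1) to ground-plane coordinates (x, y, 1) in metres.
H_EXAMPLE = np.array([[1e-3, 0.0, -0.32],
                      [0.0, 2e-3, -0.20],
                      [0.0, 1e-3, 0.50]])


def pixel_to_ground(H, u, v):
    """Project a pixel onto the ground plane: P_world ~ H * p_camera."""
    p = H @ np.array([u, v, 1.0])
    return p[:2] / p[2]  # normalise the homogeneous coordinate


def ground_to_pixel(H, x, y):
    """Inverse mapping: p_camera ~ H^-1 * P_world."""
    p = np.linalg.solve(H, np.array([x, y, 1.0]))
    return p[:2] / p[2]


x, y = pixel_to_ground(H_EXAMPLE, 320, 400)
u, v = ground_to_pixel(H_EXAMPLE, x, y)
print("ground point:", (x, y), "back-projected pixel:", (u, v))
```

The mapping is only meaningful for pixels that really lie on the ground. A pixel on top of an obstacle is projected to a point behind the obstacle, which is why obstacles appear stretched in the bird's view.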

The final output is supposed to look like Figure 3.4.

Final output image including visualization of detected obstacles

Part 2: Avoidance in Real World - Description

Modified 2018-02-19 by fabitosh

Given the 3D position, the size and the label from Part 1 indicating whether the object is inside the lane boundaries or not, we wanted to reach the following final objectives:

  1. Plan a path around the obstacle if possible (we have to stay within our lane)
  2. If this is not possible, simply stop

The assumptions for correctly reacting to the previously detected obstacles are:

  • Heading and position relative to track given
  • “The Controllers” are responsible for following our trajectory
  • Possibility to influence vehicle speed (slow down, stop)

As we now know, the first assumption is normally not fulfilled. We describe in the functionality section why this turns out to be a problem.

For measuring the performance we used:

  • Avoid/hit ratio
  • Also performed during changing light conditions

Contribution / Added Functionality

Modified 2018-02-18 by Niklas Funk

Software Architecture

Modified 2018-02-18 by Niklas Funk

In general we have four interfaces which had to be created throughout the implementation of our software:

1. First, we need to receive an incoming picture which we want to analyse. Since our approach includes filtering for specific colors, we are obviously dependent on the lighting conditions. In the first stage of our project, we nevertheless simply subscribed to the raw camera image because of the considerable effort of integrating the Anti Instagram Color Transformation and since the Anti Instagram team first had to further develop their algorithms. During our tests we quickly recognized that our color-filtering approach would always have trouble if we did not compensate for lighting changes. Therefore, in the second part of the project we collaborated closely with the Anti Instagram team and now subscribe to a color-corrected image provided by them. Currently, to keep the computational load on our Raspberry Pi low, the corrected image is published at only 4 Hz and the color transformation needs at most 0.2 seconds.

2. The second part of our System Integration is the internal interface between the object detection and avoidance parts. The interface is defined as a PoseArray which has the same timestamp as the picture from which the obstacles have been extracted. This array, as the name suggests, is made up of single poses, whose fields have the following meaning:

The position x and y describe the real world position of the obstacle, which in our case is the center front coordinate of the obstacle. Since we assume planarity, the z coordinate of the position is not needed. That is why we use this z coordinate to describe the radius of the obstacle.

Furthermore, a negative z coordinate indicates that there is a white line between us and the obstacle, meaning that it is not dangerous to us since we assume that we always have to stay within the lane boundaries. This information allows us not to stop if there is an obstacle behind a turn.

As the orientation of the obstacles is not really important for the scope of our project, we use the remaining four elements of the Pose message to pass the pixel coordinates of the bounding box of the obstacle seen in the bird’s view. This is not needed for our “Reaction” module but allows us to implement an efficient visualisation, which is described in detail later. Furthermore, we expect our obstacle detection module to add an additional delay of at most about 0.3 s.

3. The third part is the interface between our obstacle avoidance node and the Controllers. The obstacle avoidance node generates an obstacle avoidance pose array and obstacle avoidance active flag.

The obstacle avoidance pose array is the main interface between the Saviors and the group doing lane control. We use the pose array to transmit d_ref (target distance to the middle of the lane) and v_ref (target robot speed). The d_ref is our main control output, which enables us to position the robot inside the lane and therefore to avoid objects which are placed close to the lane line on the track. Furthermore, v_ref is used to stop the robot when there is an unavoidable object by setting the target speed to zero.

The flag is used to communicate to the lane control nodes when the obstacle avoidance is activated which then triggers d_ref and v_ref tracking.

4. The fourth part is an optional interface between the Duckiebot and the user’s personal laptop. Especially for debugging and inferring what is going on, we decided to implement a visualisation node. On the one hand, it can visualize the input image including bounding boxes around all the objects which were classified as obstacles. On the other hand, it can output the obstacles as markers which can be displayed in rviz.
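The encoding used on the detection-to-avoidance interface (item 2 above) can be sketched as follows. The `Pose` class is a plain stand-in that mirrors the seven numeric fields of a ROS `geometry_msgs/Pose` (position x, y, z and orientation x, y, z, w); it is not the ROS type itself. The function names and the order in which the bounding box corners are stored in the orientation fields are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Pose:
    """Stand-in mirroring the fields of geometry_msgs/Pose (not the ROS type)."""
    px: float
    py: float
    pz: float
    ox: float
    oy: float
    oz: float
    ow: float


def encode_obstacle(x, y, radius, behind_white_line, bbox):
    """Pack one obstacle into a pose as described in the text."""
    top, left, bottom, right = bbox  # assumed field order, see lead-in
    # z carries the radius; a negative sign marks "white line in between".
    z = -radius if behind_white_line else radius
    return Pose(x, y, z, top, left, bottom, right)


def decode_obstacle(pose):
    """Recover the obstacle description on the avoidance side."""
    return {
        "position": (pose.px, pose.py),
        "radius": abs(pose.pz),
        "relevant": pose.pz >= 0,  # False if outside the lane boundary
        "bbox": (pose.ox, pose.oy, pose.oz, pose.ow),
    }


pose = encode_obstacle(0.5, 0.1, 0.04, True, (10, 20, 30, 40))
print(decode_obstacle(pose))
```

Reusing the z coordinate and the orientation fields keeps the interface to a standard message type, at the price of field names that no longer describe their content.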

In the following (Figure 3.6) you find a graph which summarises our software packages and gives a brief overview.

Module overview 'The Saviors'

Part 1: Computer Vision - Functionality

Modified 2018-02-18 by Niklas Funk

Let’s again have a look at the usual incoming camera picture in Figure 3.2.

At the very beginning of the project, like the previous implementation in 2016, we tried to do the detection in the normal camera image, but we tried to optimize for more efficient and general obstacle descriptors. Due to the perspective projection of a normal camera, lines which are parallel in the real world are in general no longer parallel in the image, and so the size and shape of the obstacles are distorted (elements of the same size appear larger in the front than in the back). This made it very difficult to reliably differentiate between yellow ducks and line segments. We tried several different approaches to overcome this problem, namely:

  • Patch matching of duckies viewed from different directions
  • Patch matching with some kind of an ellipse (because line segments are supposed to be square)
  • Measuring the maximal diameter
  • Comparing the height and the width of the objects
  • Taking the pixel volume of the duckies

Unfortunately none of the described approaches provided sufficient performance, and a combination of them didn’t have the desired impact either. All metrics which are somehow associated with the size of the object just won’t work, because duckies further away from the Duckiebot are simply a lot smaller than those very close to the Duckiebot. All metrics associated with the “squareness” of the lines were strongly disturbed by the occurring motion blur. This makes finding a general criterion very difficult and made us think about changing the approach entirely.

We therefore came up with the following new approach.

Theoretical Description

In our setup, through the extrinsic camera calibration, we are given a mapping from each pixel in the camera frame to a corresponding real world coordinate. It is important to mention that this transformation assumes all seen pixels in the camera frame to lie in one plane which is in our case the ground plane/street. Our approach exactly exploits this fact by transforming the given camera image into a new, bird’s view perspective which basically shows one and the same scene from above. Therefore the information provided by the extrinsic calibration is essential for our algorithm to work properly. In Figure 3.8 you can see the newly warped image seen from the bird’s view perspective. This is one of the most important steps in our algorithm.

Image now seen from the bird's view perspective

This approach has already been shown by Prof. Davide Scaramuzza (UZH) and in several papers and is referred to as the Inverse Perspective Mapping algorithm (see [5], [6], [7]).

What stands out is that lines which are parallel in the real world are also parallel in this view. Generally, in this bird’s view, all objects which really belong to the ground plane are represented by their real shape (e.g. the line segments are exact rectangles), while all objects which are not on the ground plane (namely our obstacles) are heavily distorted. The top view roughly preserves the size of elements on the ground, whereas the obstacles are displayed a lot larger.

The theory behind the calculations and why the objects are so heavily distorted can be found in the Theory Chapter.
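To make the bird's view transformation more concrete, the following sketch computes a perspective transform from four point correspondences by solving the standard linear system with NumPy. This is the kind of matrix that `cv2.getPerspectiveTransform` returns, but the code below is an illustration and not the module's implementation. The corner coordinates are made-up values: a trapezoid in the camera image (a road section that narrows towards the horizon) is mapped to a rectangle in the top view.

```python
import numpy as np


def homography_from_points(src, dst):
    """Solve for the 3x3 matrix H (with H[2, 2] = 1) mapping src[i] to dst[i]."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h0*x + h1*y + h2) / (h6*x + h7*y + 1), same for v with h3..h5
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    h = np.linalg.solve(np.array(rows, dtype=float), np.array(rhs, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)


def warp_point(H, point):
    """Apply H to a single 2D point using homogeneous coordinates."""
    p = H @ np.array([point[0], point[1], 1.0])
    return p[:2] / p[2]


# Hypothetical corners: trapezoid in the camera image, rectangle in the top view.
src = [(200, 300), (440, 300), (640, 480), (0, 480)]
dst = [(0, 0), (640, 0), (640, 480), (0, 480)]
H = homography_from_points(src, dst)
print(warp_point(H, (320, 390)))
```

The two slanted edges of the trapezoid converge in the camera image but become the vertical, parallel edges of the rectangle after the transformation, which is the property exploited above.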

Either way, we take advantage of this property. Given this bird’s view perspective, we still have to extract the obstacles from it. To achieve this, we first filter out everything except orange and yellow elements, since we assumed that we only want to detect yellow duckies and orange cones. To simplify this step significantly, we transform the color-corrected images (provided by the Anti Instagram module) to the HSV color space. We use HSV rather than RGB because it is much easier to account for slightly different illuminations in HSV; such differences still exist, since the color correction is not perfect. For the theory behind the HSV space, please refer to the corresponding Theory Chapter.
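Why HSV helps can be seen in a small sketch that uses only the Python standard library. The threshold ranges `YELLOW` and `ORANGE` are hypothetical values chosen for illustration; they are not the thresholds used in our module, and they use hue in degrees rather than OpenCV's 8-bit scale.

```python
import colorsys

# Hypothetical thresholds: hue in degrees, saturation and value in [0, 1].
YELLOW = {"h": (45.0, 70.0), "s": (0.4, 1.0), "v": (0.4, 1.0)}
ORANGE = {"h": (10.0, 35.0), "s": (0.5, 1.0), "v": (0.4, 1.0)}


def in_range(rgb, bounds):
    """Return True if an 8-bit RGB pixel falls inside the given HSV bounds."""
    r, g, b = (channel / 255.0 for channel in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    h *= 360.0  # colorsys returns hue in [0, 1)
    return (bounds["h"][0] <= h <= bounds["h"][1]
            and bounds["s"][0] <= s <= bounds["s"][1]
            and bounds["v"][0] <= v <= bounds["v"][1])


bright_yellow = (255, 220, 0)
dim_yellow = (128, 110, 0)  # roughly the same color under weaker light
print(in_range(bright_yellow, YELLOW), in_range(dim_yellow, YELLOW))
```

Halving the brightness changes all three RGB channels but leaves the hue almost untouched, so a single hue interval covers both pixels. In RGB the same tolerance would require a much less intuitive region.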

After this first color filtering process, only objects which have approximately the colors of the expected obstacles remain. To separate the real obstacles from all the remaining objects which passed the color filter, we segment the filtered image, i.e. all connected pixels receive the same label so that the objects can later be analysed one by one. Each label then represents one candidate object. For the segmentation we used the algorithm described in [8].
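The idea of labelling connected pixels can be sketched with a simple breadth-first flood fill. This is a didactic illustration only and not the algorithm of [8]. The function name and the choice of 8-connectivity (diagonal neighbours count as connected) are assumptions.

```python
from collections import deque

import numpy as np


def label_regions(mask):
    """Label connected regions of a boolean mask; 0 is background."""
    labels = np.zeros(mask.shape, dtype=int)
    current = 0
    for seed in zip(*np.nonzero(mask)):
        if labels[seed]:
            continue  # pixel already belongs to a labelled region
        current += 1
        labels[seed] = current
        queue = deque([seed])
        while queue:
            r, c = queue.popleft()
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nr, nc = r + dr, c + dc
                    inside = 0 <= nr < mask.shape[0] and 0 <= nc < mask.shape[1]
                    if inside and mask[nr, nc] and not labels[nr, nc]:
                        labels[nr, nc] = current
                        queue.append((nr, nc))
    return labels, current


mask = np.array([[1, 1, 0, 0, 0],
                 [1, 0, 0, 0, 1],
                 [0, 0, 0, 1, 1]], dtype=bool)
labels, count = label_regions(mask)
print(count)
print(labels)
```

A pure Python loop like this is slow on full images, which matches our experience that an optimized library routine is needed for this step on the Raspberry Pi.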

Given the isolated objects, the task remains to finally decide which objects are considered obstacles and which not. In a first stage, there is a filter criterion based on a rotation invariant feature, namely the two eigenvalues of the inertia_tensor of the segmented region when rotating around its center of mass. (see [9])
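The rotation invariance of this feature can be checked with a short NumPy sketch. It computes the second central moments of the pixel coordinates of a region and returns their eigenvalues. In two dimensions these coincide with the eigenvalues of the inertia tensor of a region of unit mass, but the normalisation is an assumption here and may differ from the library routine referenced in [9]. The function name is hypothetical.

```python
import numpy as np


def inertia_eigenvalues(mask):
    """Eigenvalues (largest first) of the second central moments of a region."""
    rows, cols = np.nonzero(mask)
    coords = np.stack([rows, cols]).astype(float)
    centred = coords - coords.mean(axis=1, keepdims=True)  # about centre of mass
    tensor = centred @ centred.T / centred.shape[1]        # 2x2, symmetric
    return np.sort(np.linalg.eigvalsh(tensor))[::-1]


# An elongated region (like a line segment) and a compact one (like a blob).
line = np.zeros((30, 30), dtype=bool)
line[14:16, 5:25] = True
blob = np.zeros((30, 30), dtype=bool)
blob[10:20, 10:20] = True

for name, region in (("line", line), ("line rotated", line.T), ("blob", blob)):
    eig = inertia_eigenvalues(region)
    print(name, eig, "ratio:", eig[0] / eig[1])
```

Rotating the line by 90 degrees leaves both eigenvalues unchanged, while the bounding box would swap its width and height. The ratio of the two eigenvalues separates elongated from compact regions independently of orientation.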

In a second stage, we apply a tracking algorithm to reject the remaining outliers and decrease the likelihood for misclassifications. The tracker especially aims for objects which passed the first stage’s criterion by a small margin.
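The decision rule for yellow objects can be written down compactly. The sketch below uses the thresholds stated in the implementation description (first eigenvalue greater than 20 as the initial condition, 100 for the relaxed rule, 50% size change, two or three consecutive frames). The function signature and the way the size change is expressed are assumptions, and how objects are matched between frames is not covered.

```python
def is_obstacle(first_eigenvalue, consecutive_frames, size_change):
    """Decide whether a yellow candidate counts as an obstacle.

    first_eigenvalue:   larger eigenvalue of the inertia tensor of the region
    consecutive_frames: number of consecutive frames in which the object was
                        seen at roughly the same place (including this one)
    size_change:        relative size change between frames (0.5 means 50%)
    """
    if first_eigenvalue <= 20:
        return False  # fails the initial condition
    # Large and stable objects only need to be seen twice.
    stable = first_eigenvalue > 100 and size_change <= 0.5
    required_frames = 2 if stable else 3
    return consecutive_frames >= required_frames


print(is_obstacle(150, 2, 0.1))  # large, stable, seen twice
print(is_obstacle(50, 2, 0.1))   # small object needs a third frame
```

The stricter rule for small or fast-changing objects trades roughly one additional frame of latency for fewer false positives from line segments.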

For further information and details about how we perform the needed operations, please refer to the next chapter.

The final output of the detection module is the one we showed in Figure 3.4.

Actual Implementation

Now we want to go into more detail on how we implemented the described steps.

We again start from the picture you can see in Figure 3.2. In our case this is now the corrected image coming out of the image_transformer_node, which was implemented by the Anti Instagram group. We then perform the following steps:

1. In a first step we crop this picture to make our algorithm a little more efficient: due to our limited velocities, it makes no sense to detect obstacles which our obstacle avoidance module does not need to take into consideration. However, we do not simply crop the picture by a fixed number of pixels. Instead, we use the extrinsic calibration to neglect all pixels which are farther away than a user-defined threshold, currently 1.7 meters. The number of neglected pixels therefore differs for every Duckiebot and depends on the extrinsic calibration. The resulting image can be seen in Figure 3.10. The calculation to find out where to cut the image is quite simple (note that it uses homogeneous coordinates):

$$ p_{camera} = H^{-1}P_{world} $$

Cropped image

2. Directly detecting the obstacles from this cropped input image failed for us due to the reasons described above. That is why the second step is the transformation to the bird’s view perspective. To transform the image, we first take the corners of the cropped image and transform them to the real world. Then we scale the real world coordinates to pixel coordinates so that the result has a width of 640 pixels. For warping all of the remaining pixels with few artifacts we use the function cv2.getPerspectiveTransform(). The obtained image can be seen in Figure 3.8.

3. Then we transform the given RGB picture into the HSV color space and apply the yellow and orange filters. While an HSV image is hardly readable for humans, it is much better suited to filtering for specific colors. The obtained pictures can be seen in Figure 3.12 and Figure 3.14. The color filter operation is performed by the cv2 function cv2.inRange(im_test, self.lower_yellow, self.upper_yellow) where lower_yellow and upper_yellow are the thresholds for yellow in the HSV color space.

Yellow filtered image
Orange filtered image

4. Next comes the task of segmenting/isolating the objects which remain after the color filtering process. At the beginning of the project we implemented our own segmentation algorithm, which was however inefficient and led to an overall computational load of 200% CPU usage and a maximum frequency of our whole module of only about 0.5 Hz. By using the scikit-image module, which provides a very efficient label function, the computational load could be reduced considerably to about 70% CPU usage, allowing the whole module to run at up to 3 Hz. It is important to remember that in our implementation the segmentation is the step which consumes the most computational power. The output after the segmentation is shown in Figure 3.16, where the different colors represent the different segmented objects.

Segmented image

5. After the segmentation, we analyse each of the objects separately. First, a general filter neglects all objects which contain fewer pixels than a user-influenced threshold. Since, as mentioned above, the homographies of all users are different, the exact number of pixels an object is required to have is again scaled by the individual homography. This is followed by a more detailed, color-dependent analysis. On the one hand there is the challenge of detecting the orange cones reliably. The only other objects that might be erroneously detected as orange are the stop lines. Of course, in general the goal should be to detect orange very reliably, but as the light may change during the drive, we prepared to also detect the stop lines and to cope with them when they are erroneously detected. The other general challenge was that all objects we have to detect can appear in any orientation. Simply inferring the height and width of the segmented box, as we did in the beginning, is obviously not a very good measure (e.g. in Figure 3.18 in the lower left the segmented box is square while the cone itself is not square at all).

Bird's view with displayed obstacle boxes

That is why it is best to use a rotation invariant feature to classify the segmented object. In our final implementation we use the two eigenvalues of the inertia tensor, which are rotation invariant (when ordered by their size). To be more specific about the detection of cones: a cone extracted from Figure 3.16 looks like Figure 3.20, while an erroneous detection of a stop line looks like Figure 3.22.

Segmented cone
Segmented stop line

Our filter criterion is now the ratio between the eigenvalues of the inertia tensor. This ratio is always greater by a factor of about 100 when the object is a cone compared to when we erroneously segment a red stop line. This criterion is very stable, which is why no additional filtering is needed to detect the cones.

If the segmented object is yellowish, things get a little more tricky as there are always many yellow objects in the picture, namely the middle lines. Line elements can be again observed under every possible orientation. Therefore the eigenvalues of the inertia tensor, which are as mentioned above rotation invariant, are again the way to go. In Figure 3.24 you can see a segmented line element and in Figure 3.26 again a segmented duckie.

Segmented middle line
Segmented duckie

As the labelled axes already reveal, the two objects are of different scale, but since we also get very small duckies, we had to choose a very small threshold. To detect the yellow duckies, the initial condition is that the first eigenvalue has to be greater than 20. This criterion alone, however, sometimes erroneously detects lines as obstacles, which is why we implemented an additional tracking algorithm which works as follows: if an object’s first eigenvalue is greater than 100 pixels and it is detected twice, meaning that in two consecutive images an object is detected at roughly the same place, it is labelled as an obstacle. However, if an object is smaller or changed its size by more than 50% between consecutive frames, a more restrictive criterion is enforced: we must have tracked this object for at least 3 consecutive frames before it is labelled as an obstacle. These criteria work pretty well, and a more thorough evaluation is provided in the next section. In general they allow obstacles to be detected in any orientation. The only danger to the yellow detection algorithm is motion blur, namely when the individual line segments are not separated but connected by “blur”.

6. After analysing each of the potential obstacle objects, we decide whether it is an obstacle or not. If so, we continue with steps 7 and 8.

7. Afterwards, we calculate the position and radius of all of the obstacles. After segmenting the object we calculate the 4 corners (which are connected in Figure 3.28 to form the green rectangle). We defined the obstacle’s position as the midpoint of the lower line (this point surely lies on the ground plane). For the radius, we use the distance in the real world between this point and the lower right corner. This turned out to be a good approximation of the radius. For an illustration you can have a look at Figure 3.28.

Position and radius of the obstacle

8. Towards the end of the project we came up with one additional last step based on the idea that only obstacles inside the white lane boundaries are of interest to us. That is why for each obstacle, we look whether there is something white in between us and the obstacle. In Figure 3.30 you can see an example situation where the obstacle inside the lane is marked as dangerous (red) while the other one is marked as not of interest to us since it is outside the lane boundary (green). In Figure 3.32 you see the search lines (yellow) along which we search for white elements.

Classification if objects are dangerous or not
Search lines to infer if something white is in between

9. As the last step of the detection pipeline we return a list of all obstacles, including all the information, via the PoseArray.
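Steps 7 and 8 above can be sketched in a few lines of plain Python. The first function follows the definition given in step 7. The second one is only one possible reading of step 8: it samples a straight segment between two pixels of the bird's view and reports whether any sample is white. The function names, the number of samples and the exact layout of the search lines are assumptions.

```python
import math


def position_and_radius(bottom_left, bottom_right):
    """Obstacle position and radius from the two lower bounding box corners.

    Both corners are expected in ground coordinates (metres).
    """
    x = (bottom_left[0] + bottom_right[0]) / 2.0
    y = (bottom_left[1] + bottom_right[1]) / 2.0
    # Radius: distance from the midpoint to the lower right corner.
    radius = math.hypot(bottom_right[0] - x, bottom_right[1] - y)
    return (x, y), radius


def white_in_between(white_mask, start, end, samples=50):
    """Return True if a sampled pixel on the segment start -> end is white.

    white_mask is a 2D grid of booleans indexed as white_mask[row][col];
    start and end are (row, col) pixels in the bird's view.
    """
    for i in range(samples + 1):
        t = i / samples
        row = round(start[0] + t * (end[0] - start[0]))
        col = round(start[1] + t * (end[1] - start[1]))
        if white_mask[row][col]:
            return True
    return False


position, radius = position_and_radius((0.5, 0.1), (0.5, -0.1))
print(position, radius)
```

An obstacle for which `white_in_between` returns True would be reported with a negative z coordinate in the PoseArray, so that the avoidance module can ignore it.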

Part 2: Avoidance in Real World - Functionality

Modified 2018-02-18 by fabitosh

The Avoidance deals with drawing the right conclusions from the received data and forwarding it.

Theoretical Description

With the detection separated out, an important part of the avoidance node is the interaction with the other work packages. We determined that, besides the detected obstacles, we need information about the rest of Duckietown. The obstacles need to be put in relation to the track in order to assess whether we have to stop, can drive around an obstacle, or whether it is already off the track. Since other teams were already working on the orientation within Duckietown, we deemed it best not to implement any further detections (lines, intersections etc.) in our visual perception pipeline. This avoids running similar algorithms twice on the processor. We decided to acquire the values of our current pose relative to the side lane, which is determined by the devel-linedetection group.

The idea was to make the system highly flexible. The option to adapt to the following situations was deemed desirable:

  • Multiple obstacles. Different path planning in case of a possible avoidance might be required.
  • Adapted behavior if the robot is at intersections.
  • Collision avoidance dependent on the fleet status within the Duckietown. Meaning if a Duckiebot drives alone in a town it should have the option to avoid a collision by driving onto the opposite lane.

Obstacles to the side of the robot were expected to appear, as Duckietowns tend to be flooded by duckies. Those detections on the side, as well as false positive detections far away, should not make the robot stop. To prevent that, we intended to implement a parametrized bounding box ahead of the robot. Only obstacles within that box would be considered. The parametrization would be tuned depending on the certainty of the detections as well as the yaw velocities.

The interface getting our computed desired values to impact the actual Duckiebot is handled by devel-controllers. We agreed on the usage of their custom message format, in which we send desired values for the lateral lane position and the longitudinal velocity. Our intention was to account for the delay of the physical system in the avoider node. Thus our planned trajectory will reach the offset earlier than the ideal-case trajectory would have to.

Due to the above-mentioned interfaces and multiple levels of goals, we were aiming for an architecture which allows gradual commissioning. The intent was to be able to go from basic to more advanced, for us as well as for groups in upcoming years. They should be able to extend our framework and not have to rebuild it.

The logic shown in Figure 3.34 displays one of the first stages in the commissioning. Key is the reaction to the number of detected obstacles. Later stages will not trigger an emergency stop in case of multiple obstacle detections within the bounding box.

Logic of one of the First Stages in Commissioning

Our biggest concern was the accumulation of inaccuracies up to the planning of the trajectory. Those include:

  • Inaccuracy of the currently determined pose
  • Inaccuracy of the obstacle detection
  • Inaccuracy of the effectively driven path, i.e. the controller performance

To us, the determination of the pose was expected to be the most critical. Our preliminary results of the obstacle detection seemed reasonably accurate. The controller could be tuned such that the robot would rather drive out of the track than into the obstacle. An inaccurate estimation of the pose, however, would effectively widen the duckie artificially.

Devel-controllers did not plan on being able to intentionally leave the lane. This means that the space left to avoid an obstacle on the side of the lane is tight, which makes the above uncertainties more severe.

We evaluated the option to keep track of our position inside the map. Given a decent accuracy of said position, we would be able to create a map of the detected obstacles. Afterwards, especially given multiple detections (also outside of the bounding box), we could achieve a further estimation of our pose relative to the obstacles. This essentially would mean creating a SLAM algorithm with obstacles as landmarks. We declared this as out of scope given the size of our team as well as the computational constraints. The goal was to make use of a stable, continuous detection and to react to it in each frame.

Actual Implementation


One important part of the software is the handling of the interfaces, mainly to devel_controllers. For further information on this, please refer to the Software Architecture chapter.


The obstacle avoidance part of the problem is handled by an additional node, called the obstacle_avoidance_node. The node uses two main inputs which are the obstacle pose and the lane pose. The obstacle pose is an input coming from the obstacle detection node, which contains an array of all the obstacles currently detected. Each array element consists of an x and y coordinate of an obstacle in the robot frame (with the camera as origin) and the radius of the detected object. By setting the radius to a negative value, the detection node indicates that this obstacle is outside the lane and should not be considered for avoidance. The lane pose is coming from the line detection node and contains among other unused channels the current estimated distance to the middle of the lane (d) as well as the current heading of the robot $\theta$. Figure 3.36 introduces the orientations and definitions of the different inputs which are processed in the obstacle avoidance node.

Variable Definitions seen from the Top

Using the obstacle pose array, we determine how many obstacles need to be considered for avoidance. If a detected obstacle is outside the lane, and therefore marked with a negative radius by the obstacle detection node, we can ignore it. Furthermore, we use the aforementioned bounding box with tunable size, which ensures that only objects in a certain range from the robot are considered. As soon as an object within limits is inside of the bounding box, the obstacle_avoidance_active flag is set to true and the algorithm already introduced in Figure 3.34 is executed.
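The filtering described above can be sketched as follows. This is an illustrative sketch only: the function name `obstacles_to_consider`, the tuple layout and the box limits are assumptions, and only the negative-radius convention and the idea of a tunable bounding box come from the text.

```python
def obstacles_to_consider(obstacles, x_min, x_max, y_min, y_max):
    """Return the obstacles that are relevant for the avoidance logic.

    obstacles: list of (x, y, radius) tuples in the robot frame (metres).
    A negative radius marks an obstacle outside the lane boundaries.
    The remaining arguments describe the tunable bounding box.
    """
    relevant = []
    for x, y, radius in obstacles:
        if radius < 0:
            continue  # outside the lane, flagged by the detection node
        if x_min <= x <= x_max and y_min <= y <= y_max:
            relevant.append((x, y, radius))
    return relevant


# Hypothetical detections and box limits.
detections = [(0.30, 0.02, 0.03), (0.35, 0.20, -0.03), (1.50, 0.00, 0.03)]
relevant = obstacles_to_consider(detections, 0.05, 0.60, -0.10, 0.10)
obstacle_avoidance_active = len(relevant) > 0
print(relevant, obstacle_avoidance_active)
```

In this example only the first detection survives: the second is flagged as outside the lane and the third is too far ahead.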

Case 1: Obstacle Avoidance

If there is only one obstacle in range and inside the bounding box, the obstacle avoidance code in the avoider function is executed. The first step of the avoider function is to transform the transmitted obstacle coordinates from the robot frame to a frame which is fixed to the middle of the lane, using the estimated measurements of $\theta$ and d. Doing this transformation allows us to calculate the distance of the object from the middle line. If the remaining space in the lane (reduced by a safety avoidance margin) is large enough for the robot to drive through, we proceed with the obstacle avoidance; if not, we switch to case 2 and stop the vehicle. Please refer to Figure 3.38.

Geometry of described Scene

If the transformation shows that an avoidance is possible, we calculate the d_ref we need to achieve to avoid the obstacle. This is sent to the lane control node and then processed as the new target distance to the middle of the lane. The lane control node uses this target and starts to correct the Duckiebot’s position in the lane. With each new obstacle pose being generated, this target is adapted so that the Duckiebot eventually reaches the target position. The slow transition movement allows us to avoid the obstacle even when it is not visible anymore shortly before the robot is at the same level as the obstacle.
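The transformation and the space check can be sketched as below. The sign conventions are assumptions (x forward and y to the left in the robot frame, d positive to the left of the lane middle, $\theta$ positive counterclockwise); the actual definitions are those of Figure 3.36. The function names and all numeric values (lane width, robot width, margin) are hypothetical and only serve the illustration.

```python
import math


def to_lane_frame(x, y, d, theta):
    """Transform an obstacle from the robot frame to the lane frame.

    Assumes the robot sits at lateral offset d from the lane middle with
    heading theta (radians) relative to the lane direction.
    """
    x_lane = x * math.cos(theta) - y * math.sin(theta)
    y_lane = d + x * math.sin(theta) + y * math.cos(theta)
    return x_lane, y_lane


def compute_d_ref(y_lane, radius, lane_width, robot_width, margin):
    """Return a target lateral offset, or None if the gap is too small."""
    half = lane_width / 2.0
    gap_left = half - (y_lane + radius)    # free space left of the obstacle
    gap_right = (y_lane - radius) + half   # free space right of the obstacle
    needed = robot_width + 2.0 * margin
    if max(gap_left, gap_right) < needed:
        return None                        # avoidance not possible: stop
    if gap_left >= gap_right:
        return half - gap_left / 2.0       # middle of the left gap
    return -half + gap_right / 2.0         # middle of the right gap


# Obstacle near the right lane boundary: avoidance is possible.
print(compute_d_ref(-0.08, 0.02, lane_width=0.22, robot_width=0.10, margin=0.01))
# Obstacle in the middle of the lane: no gap is wide enough.
print(compute_d_ref(0.0, 0.02, lane_width=0.22, robot_width=0.10, margin=0.01))
```

Returning `None` corresponds to switching to case 2. Since the gaps are measured up to the lane boundaries only, the sketch never plans a path that leaves the lane.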

At the current stage, the obstacle avoidance is not working due to very high inaccuracies in the estimation of $\theta$. The value fluctuates with an amplitude of about 10°, which leads to wrong results of the transformation and therefore to a misjudgement of d_ref. These imprecisions translate into an uncertainty factor of around 3, meaning that each object effectively appears around 3 times its actual size. As a result, even a small obstacle on the side of the lane would not allow a safe avoidance to take place. For this stage to work, the estimation of $\theta$ would need significant improvement.
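A rough first-order estimate illustrates why a heading error of this size is so harmful. Assuming an obstacle at a distance $x$ straight ahead of the robot (the 0.5 m used here is an illustrative value, not a measurement from our tests), a heading error $\Delta\theta$ shifts its computed lateral position by approximately

$$ \Delta y \approx x \sin(\Delta\theta) = 0.5\,\mathrm{m} \cdot \sin(10^\circ) \approx 0.087\,\mathrm{m} $$

An error of several centimeters is large compared to the tight space that is available inside a lane.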

Case 2: Emergency Stop

Conditions for triggering an emergency stop:

  • More than one obstacle in range
  • Avoidance not possible because the obstacle is in the middle of the lane
  • Currently every obstacle detection in the bounding box triggers an emergency stop due to the above reasons

If one of the above scenarios occurs, an avoidance is not possible and the robot needs to be stopped. By setting the target speed to zero, the lane controller node stops the Duckiebot. As soon as the situation is resolved by removing the obstacle which triggered the emergency stop, the robot can proceed with the lane following.

These tasks are then repeated at the frame rate of the obstacle detection array being sent.
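The decision taken in each frame can be summarized in a small sketch. The function name `avoidance_decision`, the returned labels and the `avoidance_enabled` switch are hypothetical; the switch only mirrors the fact that currently every detection inside the bounding box leads to an emergency stop.

```python
def avoidance_decision(num_obstacles, avoidance_possible, avoidance_enabled=False):
    """Return the action for the current frame.

    num_obstacles: number of relevant obstacles inside the bounding box.
    avoidance_possible: True if the remaining space in the lane suffices.
    avoidance_enabled: False reproduces the current behaviour, where every
    relevant detection triggers an emergency stop.
    """
    if num_obstacles == 0:
        return "lane_following"
    if num_obstacles == 1 and avoidance_possible and avoidance_enabled:
        return "avoid"
    return "emergency_stop"  # target speed is set to zero


print(avoidance_decision(0, False))
print(avoidance_decision(1, True))
print(avoidance_decision(1, True, avoidance_enabled=True))
print(avoidance_decision(2, True, avoidance_enabled=True))
```

Keeping the switch as a parameter reflects the idea of gradual commissioning: later stages can enable the avoidance without changing the surrounding logic.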

Required Infrastructure - Visualizer

Modified 2018-02-18 by Niklas Funk

Especially when dealing with a vision based obstacle detection algorithm, it is very hard to infer what is going on. One also has to keep the visual outputs low in order to consume as little computing power as possible, especially on the Raspberry Pi. This is why we decided not to implement one single obstacle detection node, but effectively two of them, together with some scripts which help to tune the parameters offline, to infer the number of false positives, etc. The node which is designed to be run on the Raspberry Pi is our normal obstacle_detection_node. In general, it should be run such that there is no visual output at all and only the PoseArray of obstacles is published.

The other node, namely the obstacle_detection_visual_node, is designed to be run on your own laptop and visualises the information given by the PoseArray. There are two visualisations available. On the one hand, there is a marker visualisation in rviz which shows the position and size of the obstacles. Here, all the dangerous obstacles which must be considered are shown in red, whereas the non-critical ones (which we think are outside the lane boundaries) are marked in green. On the other hand, there is also a visualisation which shows the camera image together with bounding boxes around the detected obstacles. Nevertheless, this online visualisation is still dependent on the connectivity, and it is hard to “freeze” single situations where our algorithm failed. That is why we also included some helpful scripts in our package. One script takes many pictures as input and outputs them labelled together with the bounding boxes, while another one outputs all the intermediate steps of our filtering process, which allows quickly adapting e.g. the color thresholds, which are in our opinion still the major reason for failure. More information on our scripts can be found in our Readme on GitHub.

Recorded Logs

Modified 2018-02-19 by Niklas Funk

To be able to thoroughly evaluate and tune our algorithms, we recorded various bags, which we uploaded to the Duckietown logs database.

Formal Performance Evaluation / Results

Modified 2018-02-18 by Niklas Funk

Evaluation of the Interface and Computational Load

Modified 2018-02-18 by Niklas Funk

In general, as we are dealing with many color filters, a reasonably color corrected image is the key to the good functioning of our whole module, but it turned out to be the greatest challenge when it comes down to computational efficiency and performance. As described above, we are really dependent on a color corrected image from the Anti Instagram module. Throughout the whole project we planned to use their continuous anti-instagram node, which is supposed to compute a color transformation in fixed intervals of time. However, in the end we actually had to change this for the following reason: the continuous anti-instagram node, running at an update interval of 10 seconds, consumes a considerable amount of computing power, namely 80%. In addition to that, the image transformer node, which transforms the whole image and currently runs at 4 Hz, needs another 74% of one core. If you now run those two algorithms combined with the lane-following demo, which makes the vehicle move, and with our own code, which needs an additional 75% of computing power, our safety critical module can only run at 1.5 Hz, which resulted in poor behaviour.

Even when we increased the time interval in which the continuous anti-instagram node computes a new transformation, there was no real improvement. That is why in our final setup we let the anti-instagram node compute a reasonable transformation once and then keep it for the entire drive. Through this measure we were able to save the 80% share entirely, which allowed our overall node to run at about 3 Hz while introducing an additional maximal delay of about 0.3 seconds. Nevertheless, we want to point out that all the infrastructure for using the continuous anti-instagram node in the future is provided in our package.

To sum up, the interface between our node and the Anti Instagram node was developed very well and the collaboration was very good, but when it came to getting the code to work, we had to take one step back to achieve good performance. That is why it might be reasonable to put effort into this interface in the future, in order to transform an entire image more efficiently and to reduce the computational power consumed by the node which continuously computes a new transformation.

Evaluation of the Obstacle Detection

Modified 2018-02-18 by Niklas Funk

In general, since our obstacle classification algorithm is based on the rotation-invariant feature of the eigenvalues of the inertia tensor, it is completely invariant to the current orientation of the Duckiebot and its position with respect to the lanes.

To rigorously evaluate our detection algorithm, we started off with evaluating static scenes, meaning the Duckiebot is standing still and not moving at all. Our algorithm performed extremely well in those static situations. You can place an arbitrary number of obstacles in front of the Duckiebot, and the orientation of the respective obstacles does not matter at all. In those situations, also when combined with changing the relative orientation of the Duckiebot itself, we achieved a false positive percentage of below 1% and we labelled all of the obstacles with respect to the lane boundaries correctly. The only static setup which is sometimes problematic is when we place the smallest duckies very close in front of our vehicle (below 4 centimeters), without approaching them. Then we sometimes cannot detect them. However, this problem is mostly avoided during dynamic driving, since we want to stop earlier than 4 centimeters in front of potential obstacles anyway. We are very happy with this static behaviour since, in the worst case, if something goes wrong during the dynamic drive, you can still simply stop and rely upon the fact that the static performance is very good before continuing your drive. The corresponding logs can be found in the log chapter.
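The rotation invariance can be checked on a toy example. The sketch below is not the classifier from the package; it only assumes that a segmented object is available as a list of pixel coordinates (the function name `inertia_eigenvalues` and the test blob are made up for illustration). It computes the eigenvalues of the matrix of second central moments, which has the same eigenvalues as the 2D inertia tensor of the blob up to a constant factor.

```python
import math


def inertia_eigenvalues(pixels):
    """Return (larger, smaller) eigenvalue of the second central moments.

    pixels: list of (row, col) coordinates belonging to one segmented blob.
    """
    n = len(pixels)
    mean_r = sum(p[0] for p in pixels) / n
    mean_c = sum(p[1] for p in pixels) / n
    s_rr = sum((p[0] - mean_r) ** 2 for p in pixels) / n
    s_cc = sum((p[1] - mean_c) ** 2 for p in pixels) / n
    s_rc = sum((p[0] - mean_r) * (p[1] - mean_c) for p in pixels) / n
    # Closed-form eigenvalues of the symmetric 2x2 matrix.
    trace = s_rr + s_cc
    det = s_rr * s_cc - s_rc ** 2
    disc = math.sqrt(max(trace * trace / 4.0 - det, 0.0))
    return trace / 2.0 + disc, trace / 2.0 - disc


# An elongated 2x6 blob and the same blob rotated by 90 degrees.
blob = [(r, c) for r in range(2) for c in range(6)]
rotated = [(c, -r) for r, c in blob]

print(inertia_eigenvalues(blob))
print(inertia_eigenvalues(rotated))
```

Both calls yield the same pair of eigenvalues, so a feature built on them does not change when the object or the Duckiebot is rotated.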

This in return also implies that most of the misclassification errors during our dynamic drive are due to the effect of motion blur, assuming a stable color transformation provided by the anti instagram module. E.g. in Figure 3.40 two line segments in the background “blurred” together for two consecutive frames resulting in being labelled as an obstacle.

Obstacle Detector Error due to motion blur

To put this into numbers, we took 2 Duckiebots at a gain of around 0.6 and performed two drives on different days, and therefore also under different lighting conditions. The results are the following: evaluating each picture given to the algorithm, we found that on average we detect 97% of all the yellow duckies in each picture. In terms of cones, we detect about 96% of all cones in the evaluated frames. We consider these to be very good results, as we also have a very low rate of false positives (below 3%).

| Date | Robot | #correctly detected duckies | #correctly detected cones | #missed ducks | #missed cones | #false positive | #false position |
|---|---|---|---|---|---|---|---|
| 19/12/2017 | Arki | 423 | 192 | 14 (3.2%) | 8 (4%) | 9 (1.4%) | 45 (7.2%) |
| 21/12/2017 | Dori | 387 | 103 | 10 (2.5%) | 5 (4.4%) | 15 (3%) | 28 (5.7%) |
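The detection rates quoted above can be reproduced from the counts in the table. The sketch assumes that the detection rate is defined as the correctly detected objects divided by the sum of correctly detected and missed objects; the function name `detection_rate` is only used for this illustration.

```python
def detection_rate(detected, missed):
    """Fraction of objects that were detected, given detected and missed counts."""
    return detected / (detected + missed)


# (detected, missed) counts taken from the table above.
counts = {
    "Arki duckies": (423, 14),
    "Arki cones": (192, 8),
    "Dori duckies": (387, 10),
    "Dori cones": (103, 5),
}

for name, (detected, missed) in counts.items():
    print(name, round(100.0 * detection_rate(detected, missed), 1))
```

Averaging the two drives gives roughly 97% for the duckies and roughly 96% for the cones.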

When it comes to classifying obstacles as dangerous or not dangerous, our performance is not as good as the detection itself, but we also did not put the same effort into it. As you can see in the table above, we have an error rate of above 5% when it comes to determining whether the obstacle’s position is inside or outside the lane boundaries (this is denoted as false position in the table above). We especially encounter problems when there is direct illumination on the yellow lines, which are very reflective and therefore appear whitish. Figure 3.42 shows such a situation where the current implementation of our obstacle classification algorithm fails.

Obstacle Detector Classification Error

Evaluation of the Obstacle Avoidance

Modified 2018-02-18 by Niklas Funk

Since at the current state we stop for every obstacle which is inside the lane and inside the bounding box, the avoidance process is very stable since it does not have to generate avoidance trajectories. The final performance on the avoidance is mainly relying on the placement of the obstacles:

1. Obstacle placement on a straight: If the obstacle is placed on a straight with a sufficient distance from the corner the emergency stop works nearly every time if the obstacle is detected correctly.

2. Obstacle in a corner: Due to the currently missing information of the curvature of the current tile the bounding box is always rectangular in front of the robot. This leads to problems if an obstacle is placed in a corner because it might enter the bounding box very late (if at all). Since the detection very close to the robot is not possible, this can lead to crashes.

3. Obstacles on intersection: These were not yet considered in our scope but still work if the detection is correct. It then behaves similar to case 1.

Furthermore, there are a few cases which can lead to problems independent of the obstacle placement:

1. Controller oscillations: If the lane controller sees a lot of lag due to high computing loads or similar, its control sometimes starts to oscillate. These oscillations lead to a lot of motion blur, which can induce problems in the detection and shorten the available reaction time to trigger an emergency stop.

2. Controller offsets: The current size of the bounding box assumes that the robot is driving in the middle of the lane. If the robot is driving with an offset to the middle of the lane it can happen that obstacles at the side of the lane aren’t detected. This however rarely leads to crashes because then the robot is just avoiding the obstacle instead of stopping for it.

While testing our algorithms we saw successful emergency stops in 10/10 cases for obstacles on a straight and in 3/10 cases for obstacles placed in corners, assuming that the controller was acting normally. It is to be noted that the focus was on reliable detections on the straights, which we intended to show on the demo day.

Future Avenues of Development

Modified 2018-02-18 by Niklas Funk

As already described above in the section on the evaluation of the interface, we think that there is still room for improving the interface between our code and the Anti Instagram module in terms of making the continuous anti-instagram node as well as the image transformer node more computationally efficient. Another interesting thought concerning this interface is the following: as long as the main part of the anti-instagram color correction is linear (which was sufficient in most of our cases), it might be reasonable to just adapt the filter values rather than to subscribe to a fully transformed image. This could save a whole publisher and subscriber, and it is by far more efficient to transform a few filter values once than to transform every pixel of every incoming picture. Towards the end of our project we invested some time in trying to get this approach to work, but as there was not enough time, we could not finish it. We especially struggled to transform the orange filter values, while it worked for the yellow ones (BRANCH: devel-saviors-ai-tryout2). We think that if one sticks to the current hardware in the future, this might be a very interesting approach, also for other software components such as the lane detection or any other picture related algorithms which are based on the concept of filtering colors.
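The idea of adapting the filter values instead of the image can be sketched for a single channel. The sketch assumes a correction of the form corrected = scale * raw + shift with a positive scale; this parametrization, the function name `raw_threshold` and all numbers are assumptions for illustration and not the actual Anti Instagram interface.

```python
def raw_threshold(corrected_threshold, scale, shift):
    """Map a threshold on the corrected channel back to the raw channel.

    Assumes corrected = scale * raw + shift with scale > 0.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    return (corrected_threshold - shift) / scale


# Hypothetical linear correction and filter range on the corrected channel.
scale, shift = 0.5, 20.0
lower, upper = 60.0, 120.0

# Transform the two filter values once instead of every pixel.
raw_lower = raw_threshold(lower, scale, shift)
raw_upper = raw_threshold(upper, scale, shift)


def passes_corrected(raw_value):
    """Filter applied to the corrected value (transforms every pixel)."""
    return lower <= scale * raw_value + shift <= upper


def passes_raw(raw_value):
    """Equivalent filter applied directly to the raw value."""
    return raw_lower <= raw_value <= raw_upper


print(raw_lower, raw_upper)
print(all(passes_corrected(v) == passes_raw(v) for v in range(256)))
```

This only works this simply in the color space in which the correction is linear. Thresholds defined in another color space, such as HSV, do not map back through a per-channel scale and shift in the same way.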

Another idea of our team would be to exploit the transformation to the bird’s view also for other modules. We think that this approach might be of interest e.g. for extracting the curvature of the road or performing the lane detection from the rather more undistorted top view.

Another area of improvement would be to further develop our provided scripts so that they can automatically evaluate the performance of our entire pipeline. As you can see in our code description on GitHub, there is a complete set of scripts available which makes it easy to transform a bag of raw camera images into a set of pictures on which we applied our obstacle detector, including the color correction part of Anti Instagram. The only missing step is an automatic check of whether the drawn box is correct and in fact around an object which is considered to be an obstacle.

Furthermore, to achieve more general performance, probably even adaptations in the hardware might be considered (see [10]) to tune the obstacle detection algorithm and especially its generality. We think that setting up a neural network might make it possible to relax the restrictions on the color of the obstacles.

In terms of avoidance, there would be possibilities to handle the high inaccuracies of the pose estimation by relying on the lane controller to not leave the lane and just using a kind of closed loop control to avoid the obstacle (using the new position of the detected obstacle in each frame to safely avoid the duckie). Applying filters to the signals, especially the heading estimation, could further improve the behaviour. This problem was detected late in the development and these ideas could not be tested due to time constraints. Going further, having both the line and obstacle detection in the same algorithm would directly provide the information on how far away obstacles are from the track. We expect that this would increase the accuracy compared to computing each individually and bringing them together.

The infrastructure is in place to include new scenarios like obstacles on intersection or multiple detected obstacles inside the bounding box. If multiple obstacles are in proximity, a more sophisticated trajectory generation could be put in place to avoid these obstacles in a safe and optimal way.

Furthermore, the avoidance in corners could easily be improved significantly if the line detection estimated the curvature of the current tile, which would enable adaptations of the bounding box on corner tiles. If the pose estimation is significantly improved, one could also implement an adaptive bounding box which takes exactly the form of the lane in front of the robot (see Figure 3.44).

Adaptive bounding box

Theory Chapter

Modified 2018-02-14 by Julian Nubert

Inverse Perspective Mapping / Bird’s View Perspective

Modified 2018-02-18 by Niklas Funk

The first chapter above introduced the rough theory which is needed for understanding the following parts. The important additional information that we exploited heavily in our approach is that in our special case we know the coordinate $Z_W$. The reason lies in the fact that, unlike in a more general use case of a mono camera, we know that our camera will always be at height $h$ with respect to the street plane and that the angle $\theta_0$ also always stays constant (Figure 3.54).

Illustration of our fixed camera position [15]

This information is used in the actual extrinsic calibration such that in Duckietown, due to the assumption that everything we see should in general be on the road, we can determine the full real world coordinates of every pixel, since we know the coordinate $Z_W$ which uniquely defines the absolute scale and can therefore uniquely determine $\lambda$ and H! Intuitively this comes from the fact that we can just intersect the known ray direction (see Figure 3.48) with the known “ground plane”.

This makes it possible to project every pixel back into the “road plane” by computing for each available pixel: $$ \vec{P_W} = H^{-1} \cdot \lambda \cdot P_{pix} $$

This “projection back onto the road plane” is called inverse perspective mapping!

If you now visualize this “back” projection, you basically get the bird’s view since you can now map back every pixel in the image plane to a unique place on the road plane.
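A minimal sketch of this back projection for a single pixel is shown below. The matrix values are purely hypothetical and do not come from a real calibration; `pixel_to_ground` is an illustrative name. The division by the third homogeneous component plays the role of the scale $\lambda$ in the formula above.

```python
def pixel_to_ground(h_inv, u, v):
    """Project pixel (u, v) onto the road plane.

    h_inv: 3x3 nested list mapping homogeneous pixel coordinates (u, v, 1)
    to homogeneous road-plane coordinates.
    """
    x = h_inv[0][0] * u + h_inv[0][1] * v + h_inv[0][2]
    y = h_inv[1][0] * u + h_inv[1][1] * v + h_inv[1][2]
    w = h_inv[2][0] * u + h_inv[2][1] * v + h_inv[2][2]
    if w == 0:
        raise ValueError("pixel maps to a point at infinity")
    return x / w, y / w  # normalizing by w fixes the scale


# Hypothetical matrix with a simple perspective term in the last row.
H_INV = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.01, 1.0],
]

print(pixel_to_ground(H_INV, 100.0, 100.0))
```

Applying this function to every pixel and drawing the result on a regular grid yields the bird's view image.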

The only trick of this easy maths is that we exploited the knowledge that everything we see in the image plane is in fact on the road and has one and the same z-coordinate. You can see that the original input image (Figure 3.56) is nicely transformed into the view from above (Figure 3.58), where every texture and shape is nicely reconstructed if this assumption is valid. You can especially see that all the yellow line segments in the middle of the road roughly have the same size in this bird’s view (Figure 3.58), which is very different if you compare it to the original image (Figure 3.56).

Normal incoming image without any obstacle
Incoming image without obstacle reconstructed in bird's view

The crucial part is now what happens in this bird’s view perspective if the camera sees an object which is not entirely part of the ground plane, but stands out of it. These are basically the obstacles we want to detect. If we still transform the whole image to the bird’s view, these obstacles which stand out of the ground plane get heavily distorted. Let’s explain this by having a look at Figure 3.60.

Illustration why obstacle standing out of ground plane is heavily distorted in bird's view, modified: [15]

The upper picture in Figure 3.60 depicts the real world situation, where the cone is standing out of the ground plane and therefore its tip is obviously not at the same height as the ground plane. However, as we still make this assumption and, as stated above, intuitively intersect the ray with the ground plane, the cone gets heavily distorted and will look like the lower picture in Figure 3.60 after performing the inverse perspective mapping. From this it follows that if there are any objects which DO stand out of the ground plane, then in the inverse perspective you basically see their shape being projected onto the ground plane. This behaviour can be easily exploited, since all of these objects are heavily distorted, drastically increase in size and can therefore be easily separated from the other objects which belong to the ground plane.

Let’s have one final look at an example in Duckietown. In Figure 3.62 you see an incoming picture seen from the normal camera perspective, including obstacles. If you now perform the inverse perspective mapping, the picture looks like Figure 3.64 and, as you can easily see, all the obstacles, namely the two yellow duckies and the orange cone which stand out of the ground plane, are heavily distorted and therefore it is quite easy to detect them as real obstacles.
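The stretching can be quantified with similar triangles. The sketch below is a simplified side view under stated assumptions: the camera is reduced to a point at height `camera_height` above the ground, and a point of the obstacle sits at horizontal distance `x` and height `z`. The function name and the numbers are hypothetical and only illustrate the effect.

```python
def ground_projection_distance(x, z, camera_height):
    """Distance at which the ray through a point (x, z) hits the ground.

    x: horizontal distance of the point from the camera (metres).
    z: height of the point above the ground (metres).
    camera_height: height of the camera above the ground (metres).
    """
    if z >= camera_height:
        raise ValueError("ray does not intersect the ground in front of the camera")
    return x * camera_height / (camera_height - z)


CAMERA_HEIGHT = 0.10  # hypothetical value

base = ground_projection_distance(0.30, 0.00, CAMERA_HEIGHT)
tip = ground_projection_distance(0.30, 0.05, CAMERA_HEIGHT)
print(base, tip)
```

In this example the base of the cone stays at 0.30 m, while the tip, at half the camera height, is projected to about 0.60 m. The object therefore appears stretched away from the camera in the bird's view.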

Normal situation with obstacles in Duckietown seen from Duckiebot perspective
Same situation seen from bird's perspective

HSV Color Space

Modified 2018-02-14 by Julian Nubert

Introduction and Motivation

The “typical” color model is called the RGB color model. It simply uses three numbers for the amount of the colors red, blue and green. It is an additive color system, so we can simply add two colors to produce a third one. Mathematically written it looks as follows and shows the way of how we deal with producing new colors:

$$ \left( \begin{array}{c} r_{res} \\ g_{res} \\ b_{res} \end{array} \right) = \left( \begin{array}{c} r_{1} \\ g_{1} \\ b_{1} \end{array} \right) + \left( \begin{array}{c} r_{2} \\ g_{2} \\ b_{2} \end{array} \right) $$

If the resulting color is white, the two colors 1 and 2 are called complementary (e.g. this is the case for blue and yellow).
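As a small worked example, the additive mixing can be written with channel values between 0 and 1. The clipping to 1 is an assumption of this sketch and is not part of the formula above; `add_rgb` is an illustrative name.

```python
def add_rgb(color_1, color_2):
    """Add two RGB colors channel by channel, clipping each channel at 1.0."""
    return tuple(min(a + b, 1.0) for a, b in zip(color_1, color_2))


BLUE = (0.0, 0.0, 1.0)
YELLOW = (1.0, 1.0, 0.0)
WHITE = (1.0, 1.0, 1.0)

# Blue and yellow add up to white, so they are complementary.
print(add_rgb(BLUE, YELLOW) == WHITE)
```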

This color system is very intuitive and is oriented on how the human vision perceives the different colors.

The HSV color space is an alternative representation of the RGB color model. HSV is an acronym for Hue, Saturation and Value. It is not as easily summable as the RGB model and it is also hardly readable for humans. So the big question is: why should we transform our colors to the HSV space? Is there a benefit?

The answer is yes. It is hard to read for humans, but it is far better suited for filtering specific colors. If we look at the description OpenCV gives for the RGB space, the higher complexity for some tasks becomes obvious:

In the RGB color space all “the three channels are effectively correlated by the amount of light hitting the surface”, so the color and light properties are simply not separated. (see: [17])

Expressed in a simpler way: in the RGB space the colors also influence the brightness and the brightness influences the colors. However, in the HSV space, there is only one channel, the H channel, to describe the color. The S channel represents the saturation and the V channel the intensity. This is the reason why it is very useful for specific color filtering tasks.

The HSV color space is therefore often used by people who try to select specific colors. It corresponds better to how we experience color. As we let the H (Hue) channel go from 0 to 1, the colors vary from red through yellow, green, cyan, blue, magenta and back to red. So we have red values at 0 as well as at 1. As we vary the S (saturation) from 0 to 1 the colors simply vary from unsaturated (more grey like) to fully saturated (no white component at all). Increasing the V (value) the colors just become brighter. This color space is illustrated in Figure 3.66. (see: [18])

Illustration of the HSV Color Space [18]
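The separation of color and brightness can be checked with the `colorsys` module of the Python standard library, which uses values between 0 and 1 for all channels as in the description above. The two example colors are arbitrary. Note that OpenCV uses different ranges for 8-bit images (H from 0 to 179, S and V from 0 to 255).

```python
import colorsys

BRIGHT_YELLOW = (1.0, 1.0, 0.0)
DARK_YELLOW = (0.5, 0.5, 0.0)
RED = (1.0, 0.0, 0.0)

h_bright, s_bright, v_bright = colorsys.rgb_to_hsv(*BRIGHT_YELLOW)
h_dark, s_dark, v_dark = colorsys.rgb_to_hsv(*DARK_YELLOW)
h_red, s_red, v_red = colorsys.rgb_to_hsv(*RED)

# Darkening yellow changes two RGB channels, but only V in HSV.
print(h_bright, s_bright, v_bright)
print(h_dark, s_dark, v_dark)
# Red sits at hue 0.
print(h_red)
```

Both yellows share the same hue and saturation and differ only in the value channel, which is what makes a hue threshold robust against changes in brightness.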

Most systems use the so-called RGB additive primary colors. The resulting mixtures can be very diverse. The variety of colors, called the gamut, can therefore be very large. However, the relationship between the constituent amounts of red, green, and blue light and the resulting color is unintuitive.


The HSV model can be derived using geometric strategies. The RGB color space is simply a cube where the addition of the three color components (with a scale from 0 to 1) is displayed. You can see this on the left of Figure 3.68.

Comparison between the two color spaces [20]

You can now simply take this cube and tilt it on its corner. We do it this way so that black rests at the origin whereas white is the highest point directly above it along the vertical axis. Afterwards you can just measure the hue of the colors by their angle around the vertical axis (red is denoted as 0°). Going from the middle to the outer parts from 0 (where the grey like parts are) to 1 determines the saturation. This is illustrated in Figure 3.70.

'Cutting the cube' [21]

The definitions of hue and chroma (proportion of the distance from the origin to the edge of the hexagon) amount to a geometric warping of hexagons into circles (for more information see: [21]). Each side of the hexagon is mapped linearly onto a 60° arc of the circle. This is visualized in Figure 3.72.

Warping hexagons to circles [21]

For the value or lightness there are several possibilities to define an appropriate dimension for the color space. The simplest one is just the average of the three components, which is nothing else than the vertical height of a point in our tilted cube. For this case we have:

$$ I = \frac{1}{3} (R + G + B) $$

For another definition the value is defined as the largest component of a color. This places all three primaries and also all of the “secondary colors” (cyan, magenta, yellow) into a plane with white. This forms a hexagonal pyramid out of the RGB cube. This is called the HSV “hexcone” model and is the common one. We get:

$$ V = \max(R, G, B) $$

(see: [21])
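The hexcone definitions can be written out in a few lines. This is a sketch of the common hexcone formulas with the hue given in degrees (red at 0°); the function names are illustrative and the channel values are assumed to lie between 0 and 1.

```python
def rgb_to_hsv_hexcone(r, g, b):
    """Return (hue in degrees, saturation, value) for r, g, b in [0, 1]."""
    value = max(r, g, b)
    chroma = value - min(r, g, b)
    if chroma == 0:
        hue = 0.0                              # grey: hue is undefined, use 0
    elif value == r:
        hue = 60.0 * (((g - b) / chroma) % 6)  # between magenta and yellow
    elif value == g:
        hue = 60.0 * ((b - r) / chroma + 2)    # between yellow and cyan
    else:
        hue = 60.0 * ((r - g) / chroma + 4)    # between cyan and magenta
    saturation = 0.0 if value == 0 else chroma / value
    return hue, saturation, value


def intensity(r, g, b):
    """Average of the three components, the alternative lightness measure."""
    return (r + g + b) / 3.0


print(rgb_to_hsv_hexcone(1.0, 0.0, 0.0))  # red
print(rgb_to_hsv_hexcone(0.0, 1.0, 1.0))  # cyan
print(intensity(1.0, 0.0, 0.0))
```

Each of the three branches covers two adjacent 60° sectors of the hexagon, which corresponds to the linear mapping of the hexagon sides onto arcs described above.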

In Practice

1. Form a hexagon by projecting the RGB unit cube along its principal diagonal onto a plane.

First layer of the cube (left) and flat hexagon (right) [25]

2. Repeat the projection with smaller RGB cubes (reducing the side length by 1/255 each time) to obtain smaller projected hexagons. In this way, an HSV hexcone is formed by stacking up the 256 hexagons in decreasing order of size.

Stacking hexagons together [25]

Then the value is again defined as:

$$ V = \max(R, G, B) $$

3. Smooth edges of hexagon to circles (see previous chapter).


One nice example of the application of the HSV color space can be seen in Figure 3.78.

Image on the left is original. Image on the right was simply produced by rotating the H of each color by -30° while keeping S and V constant [21]

It just shows how simple color manipulation can be performed in a very intuitive way. We can turn many different applications to good account using this approach. As you have seen, color filtering also simply becomes a threshold query.
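Both operations, rotating the hue and filtering by thresholds, can be sketched with the standard library. The threshold values are hypothetical and are not the ones used in our detector; `rotate_hue` and `is_in_hue_range` are illustrative names.

```python
import colorsys


def rotate_hue(rgb, degrees):
    """Rotate the hue of an RGB color while keeping S and V constant."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    h = (h + degrees / 360.0) % 1.0
    return colorsys.hsv_to_rgb(h, s, v)


def is_in_hue_range(rgb, h_min, h_max, s_min=0.5, v_min=0.3):
    """Threshold query in HSV space, with all channels in [0, 1]."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    return h_min <= h <= h_max and s >= s_min and v >= v_min


YELLOW = (1.0, 1.0, 0.0)
RED = (1.0, 0.0, 0.0)

# Rotating yellow by -30 degrees gives an orange tone.
print(rotate_hue(YELLOW, -30.0))

# A hypothetical "yellow" filter accepts yellow and rejects red.
print(is_in_hue_range(YELLOW, 0.12, 0.20), is_in_hue_range(RED, 0.12, 0.20))
```

A filter for red would need two hue intervals, since red sits at both ends of the hue range.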
