Detecting objects in our disparity map

Now that we are fairly happy with our disparity map, we need a way to find objects in it, so that we can calculate the distance to them and decide whether the car needs to stop or not.

There are several methods that we came across for doing this, but the one we’ve decided on is segmenting the image via blob detection.

I started implementing this from scratch using contours: you detect the contours in an image, which essentially gives you closed regions of the image. You can then analyse the contours further, discarding those below a certain area, calculating the mean intensity of the pixels within a contour, and so on.

Figure 1 Result of extracting contours, and filling them in with a random colour.

That’s how far I had got with my implementation when I discovered the cvBlobsLib library. It’s a complete library for blob detection that integrates with OpenCV (note that OpenCV has a SimpleBlobDetector class, but it is quite limited at the moment). cvBlobsLib basically implements all of the features that we might require, and probably does them faster than I could implement them, so why reinvent the wheel, right?

Installing cvBlobsLib on Linux

First, download the appropriate archive for Linux from here. Extract the contents into a directory on your desktop, then follow the instructions in the readme file.

Then, in your Eclipse project, under C/C++ Build -> Settings -> GCC C++ Linker -> Libraries, add blob in Libraries (-l), and under GCC C++ Compiler -> Includes, add /usr/local/include/cvblobs. Finally, in the working .cpp file, add an include directive #include <cvblobs/blob.h> (or #include "blob.h" if you stored the header files locally).

Using cvBlobsLib

cvBlobsLib only works with the C-style IplImage object rather than OpenCV's Mat object, but converting between the two is not that big an issue. Better still, the conversion only changes the image header, without copying any pixel data from one format to the other, so there is no real performance impact.

//bm is our disparity map in a Mat
IplImage *dispIpl = new IplImage(bm);	//create an IplImage header sharing the Mat's data

//Declare variables
CBlobResult blobs;
CBlob *currentBlob;
int minArea = 1;

blobs = CBlobResult(dispIpl, NULL, 0);  //get all blobs in the disparity map
blobs.Filter( blobs, B_EXCLUDE, CBlobGetArea(), B_LESS, minArea ); //filter blobs by area and remove all smaller than minArea

//Display blobs
IplImage *displayedImage = cvCreateImage(cvSize(640,480), 8, 3); //create image for outputting blobs
for (int i = 0; i < blobs.GetNumBlobs(); i++)
{
	currentBlob = blobs.GetBlob(i);
	Scalar color( rand()&255, rand()&255, rand()&255 );
	currentBlob->FillBlob( displayedImage, color);
}
Mat displayImage(displayedImage); //Convert to Mat for use in imshow()
imshow("Blobs", displayImage);

Figure 2 Disparity map

Figure 3 Result of blob detection with minArea set to 1

We can also do some noise filtering by excluding blobs that are below a certain size.

Figure 4 Result of blob detection with minArea set to 50. Note that there is a lot less noise.

Another big problem we can see immediately is that the person in the foreground and the car in the background are detected as one region. This is because the edge of the person is not closed, and so if you were to draw a contour, it would go around the edges of the person and the car, like so:

Figure 5 Contours in disparity map. Note that the person and the object to the right are surrounded by the same contour.

We’ve dealt with this problem using morphological filters.

Morphological filtering

“Morphological filtering is a theory developed in the 1960s for the analysis and processing of discrete images. It defines a series of operators which transform an image by probing it with a predefined shape element. The way this shape element intersects the neighbourhood of a pixel determines the result of the operation” [1].

OpenCV implements erosion and dilation filters as simple functions. For this problem we need the erode filter.

erode(src, dst, Mat());  //default kernel is of size 3×3

//Optionally, specify the kernel size
cv::Mat element(7,7,CV_8U,cv::Scalar(1));
erode(src, dst, element);

Figure 6 Filter with a 2×2 kernel. It does the job! But we need to experiment with more images to set a final value for the kernel size.

Figure 7 Erosion with kernel size 3×3 (left) and 7×7 (right) for illustration purposes. 7×7 is clearly very destructive.


We’ve isolated our objects of interest! Now all that remains to be done is go over the blobs, find their average intensity, and calculate the distance!

Sources: [1], cvBlobsLib


Video of a road scene

This is just a quick post to put up the video of the algorithm running on a road scene. The video source for the algorithm was actually a bunch of image files, which I had to first read in from the hard drive sequentially. I’ll write a post about that tomorrow, as well on how to create a video using OpenCV.

The video was generated by running the algorithm on the images, and simply collating the output at 10fps.

The original video (from the left camera) is below.

You can clearly see the closer objects as brighter, and I especially like how the cars in the background get darker as they move away!

Performance and Optimisation of the Stereo Vision algorithm on the Pandaboard

We’ve had a bit of a hiatus over the last few days, what with working on projects for other modules and all, but we’re back, and today, I’m going to talk about performance on the Pandaboard.

So far, all of the development and testing that I’ve been doing has been done on my laptop. We found in previous tests that on my laptop, using OpenCV’s StereoBM function, with images of 640 x 480, we can get roughly 10 frames per second. That’s probably good enough for controlling the car in real-time if we are not going too fast. But of course, on the Pandaboard, the frame rate is considerably slower.

One alleviating factor is that the Pandaboard’s USB bandwidth is limited, so we can only stream 320 x 240 images from both cameras simultaneously. This gives us an excuse to use a lower resolution, which improves performance. Below are the results of running the algorithm, without any optimisations, on the same image 200 times.

Table 1 Results of StereoBM algorithm over 200 runs, SAD window size = 21, number of disparities = 16*5

Min (s) | Max (s) | Average (s)
Despite the image having half the resolution in each dimension, the frame rate is still only half that of running the algorithm on the PC.


5 frames per second is not good enough. Even if the FPS was higher, we’d still be trying to optimise things to squeeze as much performance out of the board as we can. I cannot, unfortunately, disclose what optimisations I have implemented so far, as they have not been published yet.

As a starting point, I applied the techniques to a vanilla SAD implementation which, from a previous post, took roughly 60 seconds per frame. After said optimisations, the time dropped by a massive 30 seconds!

This was very encouraging, so I implemented the same sort of technique for use with OpenCV’s StereoBM function. To my surprise, the time taken per frame almost doubled! I looked at the source code of the implementation, and found that it was so well optimised for (I assume) the x86 architecture that it was practically unreadable! I can only conclude that the pre-processing required by the techniques I applied introduces too much overhead, and ends up degrading performance.

Assuming OpenCV isn’t as well optimised for the ARM architecture (and for systems with fewer cores), I tested the optimisations on the Pandaboard, and sure enough there was some improvement. The following table shows the results for two functions: one capable of dealing with any number of disparities (as long as it is a multiple of 16), and another tuned specifically for 16*5 disparities (with things like loop unrolling to minimise overhead as much as possible).

Table 2 Results of optimised implementations over 200 runs, SAD window size = 21, number of disparities = 16*5

Min (s) | Max (s) | Average (s)
There is an increase of almost 2.5 frames per second, which is a worthwhile improvement. There is not a huge difference between the specific implementation and the general implementation, however.

We can afford to be specific on the car, though, as once the parameters are set, they will not be changed (unless we allow for variable disparities depending on the scenario the car is driving in). The specific algorithm seems more consistent as well, with a smaller difference between the maximum time/frame and the average.

Bearing in mind that all of these implementations are still sequential (I will experiment with parallelising some parts of them soon), there is hope for yet more improvement. We will also probably calculate the disparity map for only part of the image, since we only really need to know what’s straight ahead of the car.

I am wary of the fact that after we have the disparity map, we will need to do further calculations per frame as well to determine if the map contains an object close enough to warrant action by the car. That is why it is so important to get the actual computation of the disparity map as fast as possible.


PandaBoard Streaming

In order to let us see what the car is doing, we needed a way to stream data between the car and a server. Having googled around, two methods so far fit the bill.

First Method – C++ Server → Websocket → HTML5 Client

This seems to be the best solution (if it works) for the project, as clients get the comfort of their HTML5-capable browser (while using Facebook at the same time) without any extra installation. I decided to follow the method implemented here:

Streaming a fixed video works well: there wasn’t any frame drop or lag, and it works for mp4 and ogg formats. To stream captured video, I used OpenCV to capture frames, then wrote every frame to a video container using the built-in OpenCV VideoWriter and writeframe(). Ogg was used as the container in this case, as OpenCV doesn’t have a codec for writing MP4 files. As the file is written, it is streamed through the server mentioned above.

At the receiving end, the results don’t look good at all. There is a massive lag and delay in the video received. This is expected, as there is overhead from file I/O, encoding and network delays. After some googling, there doesn’t seem to be a better solution to this problem at the moment, as OpenCV doesn’t allow writing frames to memory instead of files. On the bright side, the recording from the webcam is playable and seems to be real-time. We might incorporate this method into the project to record the journey of the car.


Pros –

  • Video stream is in colour
  • Can be accessed through a browser
  • Supports many devices
  • Streaming is recorded

Cons –

  • Streaming is sluggish, with a massive delay of around 30 seconds

Second Method – C++ Server → Websocket → C++ Client

Moving on: instead of streaming encoded data, I decided to try streaming raw OpenCV data instead. The downside of this method is that the client has to have OpenCV installed, rather than using a readily available browser. I managed to stream data using the method below:


Pros –

  • Faster speed
  • Much smaller delay (only about 5 seconds)
  • Doesn’t require a server to be set up

Cons –

  • At the moment, the image viewed is in grey-scale (working on colour streaming)
  • The remaining delay still defeats the objective of real-time viewing

Verdict –

At the moment, the second method is better than the first at displaying video obtained from OpenCV. More options remain to be explored; hopefully one of them is better than these two. Results to be uploaded soonish.

Aaaaaaa! Working with the pandaboard is so annoying!

“I have a secret vendetta against you, muhahaha”

This evening, I have been working on getting our stereo vision algorithms working on the pandaboard (so far, I’ve been testing them on my laptop). All well and good, I thought. But oh no, problems, as ever, arose by the tonne.

Firstly, some of the libraries that I was using (concurrency, time) had to be removed, because they are part of Visual C++, and not vanilla C++. The time library was just being used to test performance, so I’m going to have to come up with another solution for that, but that’s only a minor niggle.

The concurrency library, however, is what I was using for parallelisation. Thankfully, I haven’t really done any parallelisation of the algorithms (I had just done acquiring images from the camera in parallel), but we were planning on parallelising some other stuff. I suppose we’ll have to learn OpenMP as well or something, but maybe that’s not such a bad thing.

The biggest issue, however, is that the USB port on the pandaboard doesn’t have enough bandwidth to support streaming from both cameras simultaneously at a resolution of 640 x 480…

I kept getting mysterious errors, but eventually worked out the issue. Alright, I thought, we’ll capture the images at 320 x 240 instead. It will speed up the algorithms too, albeit make our visual demonstration a bit less fantastical.

The next, as yet unsolved, problem is that all the camera calibration was done at 640 x 480, and using those settings with 320 x 240 images produces just plain incorrect output. So I need to work out whether it is possible to scale the matrices appropriately (I think it is), or redo the calibration process at 320 x 240.

I was hoping to get this finished today, but ah well, almost there. On the plus side, the algorithm seemed to be running at a decent frame rate (it can run but the output is obviously wrong) on the board (not that I could test the performance numerically).

Can’t wait for portable supercomputers.


Comparison of some stereo vision algorithms


We’ve played with 4 different implementations of stereo vision algorithms. For two of these, Block Matching (BM) and Semi-Global Block Matching (SGBM), we are just using the implementations provided by OpenCV. The other two, simple Sum of Absolute Differences (SAD) and Normalised Cross Correlation (NCC), we have implemented ourselves.

I did some testing earlier to help us decide which implementation we should use. The results are presented below.


We are not doing any quantitative testing of the algorithms (like comparing the outputs to a ground truth and calculating percentage of error pixels), just eye-balling.

The following pictures show, from left to right (as you would read): the original image, BM, SGBM, SAD and NCC. The SAD window size is set to 9, and the number of disparity levels to 80, for all of the algorithms.



It should be noted that the OpenCV algorithms deal with occlusion (i.e. when pixels in one image have no corresponding pixels in the other); occluded areas appear in black. The algorithms we implemented (SAD and NCC) do not take care of occlusion, so it likely introduces errors.

All the algorithms are somewhat successful at giving an appropriate depth for objects in the original images. They all struggle with large homogeneous textures (like the road), which introduces some error. From these samples, SGBM appears to produce the cleanest map (notice especially how little noise there is in the sky region of the bottom two examples, and how the road is somewhat appropriately shaded). BM does not do a good job of the road at all, as it mostly shows up black in the disparity map.

It should also be noted that we can reduce the noise level in the images by increasing the window size, at the cost of some detail. For example, the following images were calculated using a window size of 21 instead of 9 (first to last: BM, SGBM, SAD, NCC):


The output is obviously smoother, and suddenly BM isn’t looking so bad. The only remaining problem is that the road is all black, whereas in theory it should shade smoothly from white to black. But I’m sure this is something we can deal with in the context of our car.

We ran all of the algorithms on the above three images 10 times and noted the average time it took to run the algorithms. This was done on a laptop with an AMD Phenom X4 running at 2GHz. Please note that the OpenCV functions are optimised, whereas we have done absolutely zero optimisation to the algorithms we wrote. And it clearly shows.


Image 1 | Road 1 | Road 2
Average (s) | Average (s) | Average (s)
The first thing to notice is that our algorithms are significantly slower than the OpenCV ones. The second is that BM is significantly faster than all the other algorithms. Something to keep in mind is that we are aiming for real-time stereo vision on a Pandaboard, with a processor much slower than my test rig. Bearing that in mind, and knowing that no matter how much optimisation we do in the limited time we have, our algorithms won’t perform nearly as well as the OpenCV ones. But that’s okay, because our algorithms’ results are not that great anyway.

After all that, I think we will go for BM as our algorithm of choice, primarily because it’s faster and the results are acceptable. There is that problem with large homogenous textures, but I’m sure we can work around it. Plus, I am confident that we can implement it at decent frame rates on the panda board with some optimisations of our own, and by making use of the DSP.


Rectifying images from stereo cameras

I mentioned in the earlier post about calibrating stereo cameras that the output of that process is a bunch of matrices. This post describes what those matrices are and how to use them to correct the images.

Description of files
The calibration process produces these files:

D1.xml D2.xml
M1.xml M2.xml
mx1.xml mx2.xml
my1.xml my2.xml
P1.xml P2.xml
R1.xml R2.xml

The files with a 1 are for camera 1, while the files with a 2 are for camera 2. The m*.xml files are the distortion models of the individual cameras; these would be used if you wanted to rectify an individual stream independently.
The D* files are the distortion matrices, M* the camera matrices, P* the projection matrices, and R* the rotation matrices. The book has a lot more information about these, including the maths behind them and how they are computed behind the scenes.

While using the files turned out to be really simple, it took me quite a long time of rooting around in the documentation to work out what to do.

Mat left, right; //Create matrices for storing input images

//Create transformation and rectification maps
Mat cam1map1, cam1map2;
Mat cam2map1, cam2map2;

initUndistortRectifyMap(M1, D1, R1, P1, Size(640,480), CV_16SC2, cam1map1, cam1map2);
initUndistortRectifyMap(M2, D2, R2, P2, Size(640,480), CV_16SC2, cam2map1, cam2map2);

Mat leftStereoUndistorted, rightStereoUndistorted; //Create matrices for storing rectified images

/*Acquire images into left and right*/

//Rectify and undistort images
remap(left, leftStereoUndistorted, cam1map1, cam1map2, INTER_LINEAR);
remap(right, rightStereoUndistorted, cam2map1, cam2map2, INTER_LINEAR);

//Show rectified and undistorted images
imshow("LeftUndistorted", leftStereoUndistorted);
imshow("RightUndistorted", rightStereoUndistorted);

That’s it! You first use the initUndistortRectifyMap() function with the appropriate parameters obtained from calibration to generate a joint undistortion and rectification matrix in the form of maps for remap().

One interesting point to make (from the documentation) is that the camera resulting from this process is oriented differently in coordinate space, according to R. This helps align the images so that the epipolar lines in both images become horizontal and have the same y-coordinate.

For more information, have a look at the OpenCV documentation.