Performance and System Optimisation of complete Stereo Vision program

Previously, I have written about the performance of different algorithms, and explained why we chose the OpenCV implementation of Block Matching (StereoBM). I have also written about some results we obtained after trying as yet unpublished technique. At that stage, we were able to achieve 7.5 frames per second with a maximum number of disparities of 80, and image size of 320 x 240.

After implementing the entire system around it, like capturing images from camera, doing blob detection and display, and taking it into account, we were getting a paltry 3 – 4 fps.

Initial Optimisations

  1. Decreased the maximum number of disparities from 80 to 48. This increases the distance to the closest object that we can detect to about 1m, but that is acceptable, as anything closer would probably be within the car’s stopping distance anyway, and so there would be a crash regardless of whether we detect it or not.
  2. Cropped the image to reduce the peripheral information in the image. The decisions should be made by what’s directly in front of the car, so there is no point in wasting time doing computations on the sky.

With these changes, we were able to achieve around 5 fps. A small improvement, but an improvement nonetheless, so we have made these changes permanent.

In-depth Performance Analysis

To understand what was eating up the CPU time, I measured the time taken for all of the individual steps in the process per iteration (i.e. one stereo image set). Bearing in mind that all of the steps are happening sequentially, the numbers (in fps) are as follows:

Acquiring Images Pre-processing images Calculating Disparity + Blob Detection Overall
25 24 8.5 5

Just an aside – we have recently discovered that acquiring images is dependent on the light level as well, because our camera is a bit rubbish. Assuming an adequate quantity of light, however, we see that we can acquire images at 25 fps from the camera, pre-process them in real-time (if happening in parallel). I combined Calculating Disparity and Blob detection because Blob Detection is very fast. Since all this is happening sequentially, and there are other smaller overheads, the overall speed turns out to be 5 fps.

System Optimisations

The Pandaboard ES has dual core ARM Cortex –A9 with the cores running at 1.2GHz each. At this stage, we are using one core for all of the process, and one core for hosting a server. The server core is mostly idle. We need to ignore the server for the moment, and modify our system to introduce some serious concurrency!

Implementation option 1

The simplest way to do this would be to acquire images and pre-process them on one core, and calculate the disparity and blob detection on the second core. This can be achieved by creating two storage buffers, buf0 and buf1. The cameras grab the image, and after the pre-processing, we store it in buf0. Buf0 can then be copied to buf1, and we can start doing disparity calculations on buf1. While this is happening, the first thread can refresh the images from the camera in buf0, and so on, so that acquiring + pre-processing images and calculating disparity + blob detection happen concurrently. Here is the idea in diagram form:

There are some drawbacks to this approach however, primarily the fact that we have to copy. Also, as we established before, image acquisition + pre-processing happens faster than calculating disparity, so we would have to frequently wait for disparity calculation to finish, before being able to copy data into buf1. This would affect performance significantly.

Implementation option 2

The second way to do this is a little more complicated, so naturally, that is what we have implemented. Instead of having a separate buffer for storing images, then copying them for calculating disparity, we now have a two way buffer system. We have two buffers that are being shared between the two cores, and we write alternatively to them. So we have two buffers, buf0, and buf1. We first grab the images from the camera and store them in buf0. We then grab the next pair of images and store them in buf1. While we are working on getting the next pair of images, we start calculating on buf0. Once that is complete, we start calculating on buf1, and storing images in buf0. This way, we deal with two sets of frames in one loop iteration, and avoid any overhead associated with copying the data.

I should mention that we are using OpenMP for parallelisation. We need to make sure that only one process is access any of the buffers at one time. We don’t want to be calculating on buf0 while grabbing and storing images in it at the same time! This would result in all sorts of horribleness. To avoid this issue, we use the concept of mutual exclusion, and make use of OpenMP critical directive. While a section of code is surrounded by a #pragma omp critical(name), only one thread is allowed to execute inside that region. So all of the areas of code which access buf0 wil be surrounded by #pragma omp critical(buffer0). If a thread reaches that statement, while the other thread is inside the critical region, that thread will be forced to wait until the other one has exited.

Figure 3 Diagramatically: the system only progresses when both conditions are met for a transition to take place.

Initially, before we get inside the loop for an indefinite amount of time, we need to fill the buffers.

The program has two distinct critical sections, one for each buffer. They occur on both cores, and the execution of the program is as in the animated diagram:

Now that we have the basic system in place, we need to take care of Blob detection. As we know, image acquisition is faster than disparity calculation, so to avoid Core 0 to wait long, we should give it as much work to do as possible. As such, we’ve decided to do the Blob Detection and decision making on core 0. To do this, the output of the disparity map is stored by core1 in a buffers dispBuf0, and dispBuf1. Before getting the next image from the camera, we perform blob detection on the previous disparity map. This is best explained with another diagram:

The different colours represent the two critical sections. The two rows inside the box are the processes happening on the two cores. The blue and orange boxes that are in the same columns execute concurrently. So while core1 is calculating disparity from imgBuf1, core0 is doing blob detection on the disparity calculated from imgBuf0, then getting the next images from the camera, and so on. Notice that, when we do blob detection, and decide that the car needs to slow down or speed up, we change the maximum number of disparities. However, by the time we are ready to communicate this decision, the system has already calculated disparity for the next frame using the existing max number of disparities, so we need to invalidate the data in that dispBuf, as it no longer represents the current situation.

In every iteration of our main program loop, we are able to process two new frames. And so with this system in place, we can get 15 – 16 frames per second, almost double of what we measured as the stand-alone time for calculating disparity + blob detection earlier!


With our new concurrent system, we are able to acquire and pre-process images while calculating disparities at the same time. We avoid unnecessary overhead by making use of a two way buffering system. Since we know that the bottleneck of the system is Calculating disparity, we can offload all of the other work to Core0 so it doesn’t have to wait as long for the buffers to be depleted. We have managed to get about 15 – 16 frames, which is good enough for real-time stereo vision, and is approximately 3 times faster than our previous sequential, single core system.


Changing disparity range dynamically

Hello! It’s been a while since the last post! That doesn’t mean we’ve been idle, though. There has been lots of progress on the stereo vision setup in terms of performance and robustness as we gear up for a demonstration in a week’s time. More on that in later posts though, right now I’d just like to talk about a small but important feature.

In a real car, when you see an object in the distance, you slow down as you approach the object, eventually coming to a halt close to it. We wanted to accomplish something similar to this for our autonomous car.

To achieve this, we need to be able to perceive the distance to said object. From the stereo pair of images and the disparity map, it is possible to calculate the absolute distance to an object, but that is computationally expensive, and we wanted to avoid this, and make the most of the necessary computations we have already performed.

As such, the solution we have implemented is thus:

  1. Let’s say the maximum number of disparities is set to 48 (16 * 3). With our camera setup, the closest object we can detect (i.e. brightest/whitest on the disparity map) is roughly 1 meter away.
  2. Let’s say we are 2 meters away from this object. As we get closer to it, the level of white corresponding to this object increases. Once this passes a certain threshold, we know we are close to the object, and can tell the car to slow down a notch.
  3. As the car slows down, we increase the maximum number of disparities to 16 * 4. Now the distance to the closest object we can detect decreases, so we don’t lose sight of it. We can then repeat the same procedure, where passing over a certain threshold level of white in our disparity map causes us to slow down, and increase the number of disparities.
  4. We repeat this process until we get to a maximum maximum number of disparities. Let’s say this is 128 (16 * 8), which gives a range of about 15 – 20 cm. Now when we cross over the whiteness threshold, we are travelling very slowly, and we know that we are about 15 – 20 cm away from an object, so we can tell the car to stop.

The actual numbers can be set depending on the required range, how close you want to stop to the object, and basically just experimenting around with the actual car.

There are some obvious limitations to this approach though:

  1. As we get closer to the object and increase the max number of disparities, the frame rate drops as calculating the disparity map takes longer. However, this coincides with the car slowing down, so we have more time per frame.
  2. Unless you have been ‘tracking’ an object as it gets closer, any object closer than what the initial number of disparities allows will not be detected. For example, if I stick my leg out about 20 cm in front of the car, it will not be detected and the car will crash into me. The disparity map will mostly be black (i.e. error pixels). To overcome this problem, we could perhaps modify the algorithm so that it increases the max number of disparities (thus allowing us to ‘look’ closer) if the number of error pixels is over a threshold, or simply stop the car if that is the case. We could also use a hardware proximity sensor as an input to put on an emergency break.

We have yet to see how this approach works with a relatively fast moving car, but testing it with moving objects in front of a stationary camera indicates promising results.


Quick way to make sure camera input is correct (overlaying images using OpenCV)

As I mentioned in one of the earlier posts, we need to limit the camera resolution to 320 x 240 to allow streaming to the Pandaboard. This means that the camera needed to recalibrated. I did that, but when using the images obtained at the lower resolution and rectified by the new calibration parameters, I ran into a strange problem: closer objects seemed darker than the ones further away in the disparity map! This is the opposite of what you’d except, and since running the same algorithm on a bunch of test images was producing the correct output, I knew it was something to do with the camera input. So after a suggestion from our supervisor, I wrote a small function that overlays one image on top of the other. Theoretically, objects closer to the camera should have a larger distance between them in the two pictures than objects further away.

The right image is superimposed on the left image.

Sorry about the picture above being quite dark (it was the first image captured by the camera as it turned on, converted to grayscale, and blurred), but you can see that there is a huge difference between the position of the microwave oven in the two pictures, more so than the head (in the bottom right). The microwave is far enough away that it should be roughly in the same place in both the images. This explains why the disparity was inverted on the disparity map, so this problem should be fixed once I correctly (re)calibrate the cameras.

I could not find a lot of resources for writing a function that overlays two images, so here it is:

//Takes in a custom data structure that holds the left and right image.
//Can be easily modified to take in individual images instead.
//Scale determines the weighting of each image
Mat OverlayImages(StereoPair camImages, double scale)
    //Create new matrix for storing output
    Mat overlay = Mat(camImages.leftImage.size(), camImages.leftImage.type());

    //Initialise pointers to data in each of the matrices
    uint8_t* l = (uint8_t*);
    uint8_t* r = (uint8_t*);
    uint8_t* n = (uint8_t*);

    int cn = camImages.leftImage.channels();
    int cols = camImages.leftImage.cols;

    for(int i = 0; i < camImages.leftImage.rows; i++)
        for(int j = 0; j < camImages.leftImage.cols; j += cn)
            Scalar_<uint8_t> lPixel;
            Scalar_<uint8_t> rPixel;

            lPixel.val[0] = l[i*cols*cn + j*cn + 0]; // B
            lPixel.val[1] = l[i*cols*cn + j*cn + 1]; // G
            lPixel.val[2] = l[i*cols*cn + j*cn + 2]; // R

            rPixel.val[0] = r[i*cols*cn + j*cn + 0]; // B
            rPixel.val[1] = r[i*cols*cn + j*cn + 1]; // G
            rPixel.val[2] = r[i*cols*cn + j*cn + 2]; // R

            n[i*cols*cn + j*cn + 0] = ((int)lPixel.val[0] * scale) + ((int)rPixel.val[0] * scale);
            n[i*cols*cn + j*cn + 1] = ((int)lPixel.val[1] * scale) + ((int)rPixel.val[1] * scale);
            n[i*cols*cn + j*cn + 2] = ((int)lPixel.val[2] * scale) + ((int)rPixel.val[2] * scale);
    return overlay;


Detecting objects in our disparity map

Now that we are fairly happy with our disparity map, we need a way to find objects in it, so that we can calculate the distance to them and decide whether the car needs to stop or not.

There are several methods that we came across for doing this, but the one we’ve decided on is segmenting the image via blob detection.

I started implementing this from scratch using contours, where you detect the contours in an image, and that essentially gives you a closed region of the image. You can then do further analysis on the contours, and discard those which are below a certain area, calculate the mean intensity of the pixels within a contour and so on.

Figure 1 Result of extracting contours, and filling them in with a random colour.

That’s how far I had got with my implementation when I discovered the cvBlobsLib library. It’s a complete library for blob detecting that integrates with OpenCV (note that OpenCV has a SimpleBlobDetector class but that is quite limited at the moment). cvBlobsLib basically implements all of the features that we might require, and probably does them faster than I could implemented, so why reinvent the wheel, right?

Installing cvBlobsLib on Linux

First, you must download appropriate archive for Linux from here. Extract the contents in a directory on your desktop, then follow the instructions in the readme file.

Then, in your eclipse project, under C/C++ Build -> Settings -> GCC C++ Linker -> Libraries, add blob in Libraries(-l), and under GCC C++ Compiler -> Includes, add /usr/local/include/cvblobs. And finally, in the working .cpp file, add an include directive #include <cvblobs/blob.h> (or #include “blob.h” if you stored the header files locally).

Using cvBlobsLib

cvBlobsLib only works with the C style IplImage object instead of the Mat object in OpenCV, but converting between the two is not is not that big an issue. Plus you can only change the header file, and then not have to copy all of the data from one format to the other, so there is no real performance impact.

//bm is our disparity map in a Mat
IplImage *dispIpl = new IplImage(bm);	//create an IplImage from Mat

//Declare variables
CBlobResult blobs;
CBlob *currentBlob;
int minArea = 1;

blobs = CBlobResult(dispIpl, NULL,0);  //get all blobs in the disparity map
blobs.Filter( blobs, B_EXCLUDE, CBlobGetArea(), B_LESS, minArea ); //filter blobs by area and remove all less than minArea

//Display blobs
IplImage *displayedImage = cvCreateImage(Size(640,480),8,3); //create image for outputting blobs
for (int i = 0; i &lt; blobs.GetNumBlobs(); i++ )
	currentBlob = blobs.GetBlob(i);
	Scalar color( rand()&amp;255, rand()&amp;255, rand()&amp;255 );
	currentBlob-&gt;FillBlob( displayedImage, color);
Mat displayImage = displayedImage; //Convert to Mat for use in imshow()
imshow("Blobs", displayImage);

Figure 2 Disparity map

Figure 3 Result of blob detection with minArea set to 1

We can also do some noise filtering by excluding blobs that are below a certain size.

Figure 4 Result of blob detection with minArea set to 50. Note that there is a lot less noise.

Another big problem we can see immediately is that the person in the foreground and the car in the background are detected as one region. This is because the edge of the person is not closed, and so if you were to draw a contour, it would go around the edges of the person and the car, like so:

Figure 5 Contours in disparity map. Note that the person and the object to the right share are surrounded by the same contour.

We’ve dealt with this problem using morphological filters.

Morphological filtering

“Morphological filtering is a theory developed in the 1960s for the analysis and processing of discrete images. It defines a series of operators which transform an image by probing it with a predefined shape element. The way this shape element intersects the neighbourhood of a pixel determines the result of the operation” [1].

OpenCV implements Erosion and dilation filters as simple functions. For this problem we need to use the erode filter.

erode(src, dst, Mat());  //default kernel is of size 3 x3

//Optionally, select kernel size
cv::Mat element(7,7,CV_8U,cv::Scalar(1));
erode(src, dst, element);

Figure 6 Filter with a 2×2 kernel. It does the job! But we need to experiment with more images to set a final value for the kernel size.

Figure 7 Erosion with kernel size 3×3 (left) and 7×7 (right) for illustration purposes. 7×7 is clearly very destructive.


We’ve isolated our objects of interest! Now all that remains to be done is go over the blobs, find their average intensity, and calculate the distance!

Sources: [1], cvBlobsLib


Video of a road scene

This is just a quick post to put up the video of the algorithm running on a road scene. The video source for the algorithm was actually a bunch of image files, which I had to first read in from the hard drive sequentially. I’ll write a post about that tomorrow, as well on how to create a video using OpenCV.

The video was generated by running the algorithm on the images, and simply collating the output at 10fps.

The original video (from the left camera is below.

You can clearly see the closer objects as brighter, and I especially like the cars in the background getting darker as they move away!.

Aaaaaaa! Working with the pandaboard is so annoying!

“I have a secret vendetta against you, muhahaha”

This evening, I have been working on getting our stereo vision algorithms working on the pandaboard (so far, I’ve been testing them on my laptop). All well and good, I thought. But oh no, problems, as ever, arose by the tonne.

Firstly, some of the libraries that I was using (concurrency, time) had to be removed, because they are part of Visual C++, and not vanilla C++. The time library was just being used to test performance, so I’m going to have to come up with another solution for that, but that’s only a minor niggle.

The concurrency library, however, is what I was using for parallelisation. Thankfully, I haven’t really done any parallelisation of the algorithms (I had just done acquiring images from the camera in parallel), but we were planning on parallelising some other stuff. I suppose we’ll have to learn OpenMP as well or something, but maybe that’s not such a bad thing.

The biggest issue, however, is that the USB port on the pandaboard doesn’t have enough bandwidth to support streaming from both cameras simultaneously at a resolution of 640 x 480…

I kept getting mysterious errors, but eventually worked out the issue. Alright, I thought, we’ll capture the images at 320 x 240 instead. It will speed up the algorithms too, albeit make our visual demonstration a bit less fantastical.

The next, as yet unsolved problem, is that all the camera calibration was done at 640 x480, and using those settings with 320 x 240 images results in just plain incorrect outputs. So I need to work out whether it is possible to scale the matrices appropriately (I think it is), or redo the calibration process at 320 x 240.

I was hoping to get this finished today, but ah well, almost there. On the plus side, the algorithm seemed to be running at a decent frame rate (it can run but the output is obviously wrong) on the board (not that I could test the performance numerically).

Can’t wait for portable supercomputers.


Comparison of some stereo vision algorithms


We’ve played with 4 different implementations of stereo vision algorithms. Two of these, Block Matching (BM), and Semi Global Block Matching (SGBM), we are just using implementations provided by OpenCV. The other two, simple Sum of Absolutely Differences (SAD) and Normalised Cross Correlation (NCC) we have implemented ourselves.

I did some testing earlier to help us decide which implementation we should use. The results are presented below.


We are not doing any quantitative testing of the algorithms (like comparing the outputs to a ground truth and calculating percentage of error pixels), just eye-balling.

The following pictures represent, looking from left to right (as you would read)are : the original image, BM, SGBM, SAD and NCC. The SAD window size is set to 9, and the number of disparity levels is set to 80 for all of the algorithms.



It should be noted that the OpenCV algorithms deal with occlusion (i.e. when pixels in an image have no corresponding pixels in the other image). Occluded areas appear in black. In the algorithms we implemented (SAD and NCC), they are not taken care of, and so errors are likely introduced.

All the algorithms are somewhat successful at giving an appropriate depth for objects in the original images. They all struggle with large homogenous textures (like the road) and which introduces some error. From these samples, SGBM appears to produce the cleanest map (notice especially how there is very little noise in the sky region for the bottom two examples, and the road is somewhat appropriately shaded). BM doesn’t do a good job, at all, of the road, as it’s all mostly shows up as black in the disparity map.

It should also be noted that we can reduce the noise level in the images by increasing the window size with the loss of some detail. For example, the following images are calculated using a window size of 21, instead of 9: (first to last: BM, SGBM, SAD, NCCR)


The output is obviously smoother, and suddenly BM isn’t looking so bad. The only problem still is that the road is all black, whereas theoretically it should be smoothly go from white to black. But I’m sure this is something we can deal with in the context of our car.

We ran all of the algorithms on the above three images 10 times and noted the average time it took to run the algorithms. This was done on a laptop with an AMD Phenom X4 running at 2GHz. Please note that the OpenCV functions are optimised, whereas we have done absolutely zero optimisation to the algorithms we wrote. And it clearly shows.


Image 1 

Road 1 

Road 2 


Average (s) 


Average (s) 


Average (s) 































The first thing to notice from that is our algorithms are significantly slower than the OpenCV ones. The second thing to notice is that BM is significantly faster than all the other algorithms. Something to keep in mind is that we are aiming for is real-time stereo vision on a pandaboard with a processor much slower than my test rig. Bearing that in mind, and knowing the no matter how much optimisation we do in the limited time we have, our algorithms won’t perform nearly as well as the OpenCV ones. But that’s okay, because our results are not that great anyway.

After all that, I think we will go for BM as our algorithm of choice, primarily because it’s faster and the results are acceptable. There is that problem with large homogenous textures, but I’m sure we can work around it. Plus, I am confident that we can implement it at decent frame rates on the panda board with some optimisations of our own, and by making use of the DSP.