Previously, I have written about the performance of different algorithms, and explained why we chose the OpenCV implementation of Block Matching (StereoBM). I have also written about some results we obtained after trying as yet unpublished technique. At that stage, we were able to achieve 7.5 frames per second with a maximum number of disparities of 80, and image size of 320 x 240.
After implementing the entire system around it, like capturing images from camera, doing blob detection and display, and taking it into account, we were getting a paltry 3 – 4 fps.
- Decreased the maximum number of disparities from 80 to 48. This increases the distance to the closest object that we can detect to about 1m, but that is acceptable, as anything closer would probably be within the car’s stopping distance anyway, and so there would be a crash regardless of whether we detect it or not.
- Cropped the image to reduce the peripheral information in the image. The decisions should be made by what’s directly in front of the car, so there is no point in wasting time doing computations on the sky.
With these changes, we were able to achieve around 5 fps. A small improvement, but an improvement nonetheless, so we have made these changes permanent.
In-depth Performance Analysis
To understand what was eating up the CPU time, I measured the time taken for all of the individual steps in the process per iteration (i.e. one stereo image set). Bearing in mind that all of the steps are happening sequentially, the numbers (in fps) are as follows:
||Calculating Disparity + Blob Detection
Just an aside – we have recently discovered that acquiring images is dependent on the light level as well, because our camera is a bit rubbish. Assuming an adequate quantity of light, however, we see that we can acquire images at 25 fps from the camera, pre-process them in real-time (if happening in parallel). I combined Calculating Disparity and Blob detection because Blob Detection is very fast. Since all this is happening sequentially, and there are other smaller overheads, the overall speed turns out to be 5 fps.
The Pandaboard ES has dual core ARM Cortex –A9 with the cores running at 1.2GHz each. At this stage, we are using one core for all of the process, and one core for hosting a server. The server core is mostly idle. We need to ignore the server for the moment, and modify our system to introduce some serious concurrency!
Implementation option 1
The simplest way to do this would be to acquire images and pre-process them on one core, and calculate the disparity and blob detection on the second core. This can be achieved by creating two storage buffers, buf0 and buf1. The cameras grab the image, and after the pre-processing, we store it in buf0. Buf0 can then be copied to buf1, and we can start doing disparity calculations on buf1. While this is happening, the first thread can refresh the images from the camera in buf0, and so on, so that acquiring + pre-processing images and calculating disparity + blob detection happen concurrently. Here is the idea in diagram form:
There are some drawbacks to this approach however, primarily the fact that we have to copy. Also, as we established before, image acquisition + pre-processing happens faster than calculating disparity, so we would have to frequently wait for disparity calculation to finish, before being able to copy data into buf1. This would affect performance significantly.
Implementation option 2
The second way to do this is a little more complicated, so naturally, that is what we have implemented. Instead of having a separate buffer for storing images, then copying them for calculating disparity, we now have a two way buffer system. We have two buffers that are being shared between the two cores, and we write alternatively to them. So we have two buffers, buf0, and buf1. We first grab the images from the camera and store them in buf0. We then grab the next pair of images and store them in buf1. While we are working on getting the next pair of images, we start calculating on buf0. Once that is complete, we start calculating on buf1, and storing images in buf0. This way, we deal with two sets of frames in one loop iteration, and avoid any overhead associated with copying the data.
I should mention that we are using OpenMP for parallelisation. We need to make sure that only one process is access any of the buffers at one time. We don’t want to be calculating on buf0 while grabbing and storing images in it at the same time! This would result in all sorts of horribleness. To avoid this issue, we use the concept of mutual exclusion, and make use of OpenMP critical directive. While a section of code is surrounded by a #pragma omp critical(name), only one thread is allowed to execute inside that region. So all of the areas of code which access buf0 wil be surrounded by #pragma omp critical(buffer0). If a thread reaches that statement, while the other thread is inside the critical region, that thread will be forced to wait until the other one has exited.
Figure 3 Diagramatically: the system only progresses when both conditions are met for a transition to take place.
Initially, before we get inside the loop for an indefinite amount of time, we need to fill the buffers.
The program has two distinct critical sections, one for each buffer. They occur on both cores, and the execution of the program is as in the animated diagram:
Now that we have the basic system in place, we need to take care of Blob detection. As we know, image acquisition is faster than disparity calculation, so to avoid Core 0 to wait long, we should give it as much work to do as possible. As such, we’ve decided to do the Blob Detection and decision making on core 0. To do this, the output of the disparity map is stored by core1 in a buffers dispBuf0, and dispBuf1. Before getting the next image from the camera, we perform blob detection on the previous disparity map. This is best explained with another diagram:
The different colours represent the two critical sections. The two rows inside the box are the processes happening on the two cores. The blue and orange boxes that are in the same columns execute concurrently. So while core1 is calculating disparity from imgBuf1, core0 is doing blob detection on the disparity calculated from imgBuf0, then getting the next images from the camera, and so on. Notice that, when we do blob detection, and decide that the car needs to slow down or speed up, we change the maximum number of disparities. However, by the time we are ready to communicate this decision, the system has already calculated disparity for the next frame using the existing max number of disparities, so we need to invalidate the data in that dispBuf, as it no longer represents the current situation.
In every iteration of our main program loop, we are able to process two new frames. And so with this system in place, we can get 15 – 16 frames per second, almost double of what we measured as the stand-alone time for calculating disparity + blob detection earlier!
With our new concurrent system, we are able to acquire and pre-process images while calculating disparities at the same time. We avoid unnecessary overhead by making use of a two way buffering system. Since we know that the bottleneck of the system is Calculating disparity, we can offload all of the other work to Core0 so it doesn’t have to wait as long for the buffers to be depleted. We have managed to get about 15 – 16 frames, which is good enough for real-time stereo vision, and is approximately 3 times faster than our previous sequential, single core system.