Performance and System Optimisation of the Complete Stereo Vision Program

Previously, I have written about the performance of different algorithms, and explained why we chose the OpenCV implementation of Block Matching (StereoBM). I have also written about some results we obtained after trying an as-yet-unpublished technique. At that stage, we were able to achieve 7.5 frames per second with a maximum of 80 disparities and an image size of 320 x 240.

After implementing the entire system around it (capturing images from the cameras, blob detection, and display) and taking all of that into account, we were getting a paltry 3–4 fps.

Initial Optimisations

  1. Decreased the maximum number of disparities from 80 to 48. This increases the distance to the closest object we can detect to about 1 m, but that is acceptable: anything closer would probably be within the car’s stopping distance anyway, so there would be a crash regardless of whether we detect it or not.
  2. Cropped the image to reduce the peripheral information. The decisions should be driven by what’s directly in front of the car, so there is no point wasting time doing computations on the sky. (A sketch of both changes follows this list.)
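For concreteness, here is a minimal sketch of both changes using the OpenCV 2.4 C++ API; the ROI values are illustrative, not our actual crop.

    #include <opencv2/opencv.hpp>

    // Disparity with 48 levels (must be a multiple of 16) on a cropped view.
    cv::Mat disparityOnCrop(const cv::Mat& left, const cv::Mat& right)
    {
        // 48 disparities instead of 80: the nearest detectable object
        // moves out to roughly 1 m.
        cv::StereoBM bm(cv::StereoBM::BASIC_PRESET, 48, 21);

        // Illustrative crop: discard the top of the frame (the sky).
        cv::Rect roi(0, 60, left.cols, left.rows - 60);

        cv::Mat disparity;
        bm(left(roi), right(roi), disparity);
        return disparity;
    }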

With these changes, we were able to achieve around 5 fps. A small improvement, but an improvement nonetheless, so we have made these changes permanent.

In-depth Performance Analysis

To understand what was eating up the CPU time, I measured the time taken by each individual step in the process per iteration (i.e. one stereo image pair). Bearing in mind that all of the steps happen sequentially, the numbers (in fps) are as follows:

Acquiring images:                         25 fps
Pre-processing images:                    24 fps
Calculating disparity + blob detection:   8.5 fps
Overall:                                  5 fps

Just an aside: we have recently discovered that image acquisition speed also depends on the light level, because our camera is a bit rubbish. Assuming an adequate quantity of light, however, we can acquire images from the camera at 25 fps and pre-process them at almost the same rate, so both could run in real time if done in parallel. I combined calculating disparity and blob detection because blob detection is very fast. Since all of this happens sequentially, and there are other smaller overheads, the overall speed turns out to be 5 fps.
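A minimal sketch of how per-stage rates like these can be measured using OpenCV’s tick counter; the Gaussian blur here is just a stand-in for a real pipeline stage:

    #include <opencv2/opencv.hpp>
    #include <cstdio>

    int main()
    {
        cv::Mat img = cv::Mat::zeros(240, 320, CV_8UC1), out;

        double t0 = (double)cv::getTickCount();
        cv::GaussianBlur(img, out, cv::Size(5, 5), 1.0);   // stand-in stage
        double seconds = ((double)cv::getTickCount() - t0)
                         / cv::getTickFrequency();

        std::printf("stage rate: %.1f fps\n", 1.0 / seconds);
        return 0;
    }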

System Optimisations

The Pandaboard ES has a dual-core ARM Cortex-A9, with the cores running at 1.2 GHz each. At this stage, we are using one core for all of the processing, and one core for hosting a server. The server core is mostly idle. Ignoring the server for the moment, we need to modify our system to introduce some serious concurrency!

Implementation option 1

The simplest way to do this would be to acquire images and pre-process them on one core, and calculate disparity and do blob detection on the second core. This can be achieved by creating two storage buffers, buf0 and buf1. The cameras grab the images and, after pre-processing, we store them in buf0. buf0 can then be copied to buf1, and we can start doing disparity calculations on buf1. While this is happening, the first thread can refresh the images from the camera in buf0, and so on, so that acquiring + pre-processing images and calculating disparity + blob detection happen concurrently. Here is the idea in diagram form, and as a code sketch below:
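A minimal sketch of this copy-based hand-off, with hypothetical grabAndPreprocess() and calculateDisparity() stand-ins for the real pipeline stages (shown sequentially; in the real thing the two halves would run on different cores):

    #include <opencv2/opencv.hpp>

    cv::Mat buf0, buf1;   // buf0: capture side, buf1: disparity side

    // Hypothetical stand-ins for the real pipeline stages.
    void grabAndPreprocess(cv::Mat& dst) { dst = cv::Mat::zeros(240, 320, CV_8UC1); }
    void calculateDisparity(const cv::Mat&) { /* StereoBM would run here */ }

    int main()
    {
        grabAndPreprocess(buf0);
        for (int i = 0; i < 100; ++i)   // stands in for the endless loop
        {
            buf0.copyTo(buf1);          // the copy this option cannot avoid
            calculateDisparity(buf1);   // thread B would work on buf1 here...
            grabAndPreprocess(buf0);    // ...while thread A refreshes buf0
        }
        return 0;
    }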

There are some drawbacks to this approach, however, primarily the fact that we have to copy. Also, as we established before, image acquisition + pre-processing is faster than calculating disparity, so we would frequently have to wait for the disparity calculation to finish before being able to copy data into buf1. This would affect performance significantly.

Implementation option 2

The second way to do this is a little more complicated, so naturally, that is what we have implemented. Instead of having a separate buffer for storing images and then copying them for the disparity calculation, we now have a two-way buffering system: two buffers, buf0 and buf1, shared between the two cores and written to alternately. We first grab the images from the camera and store them in buf0, then grab the next pair and store them in buf1. While we are getting the next pair of images, we start calculating on buf0. Once that is complete, we start calculating on buf1 while storing images in buf0. This way, we deal with two sets of frames in one loop iteration, and avoid any overhead associated with copying the data.

I should mention that we are using OpenMP for parallelisation. We need to make sure that only one thread is accessing either buffer at any one time. We don’t want to be calculating on buf0 while grabbing and storing images into it at the same time! That would result in all sorts of horribleness. To avoid this issue, we use the concept of mutual exclusion, via OpenMP’s critical directive. Within a section of code surrounded by #pragma omp critical(name), only one thread is allowed to execute at a time. So all of the areas of code which access buf0 will be surrounded by #pragma omp critical(buffer0). If a thread reaches that statement while the other thread is inside the critical region, it will be forced to wait until the other one has exited.

Figure 3: Diagrammatically, the system only progresses when both conditions are met for a transition to take place.

Initially, before we enter the loop (which then runs indefinitely), we need to fill both buffers.

The program has two distinct critical sections, one for each buffer. They occur on both cores, and the execution of the program is as in the animated diagram:
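In code, the structure is roughly as follows. This is a minimal sketch, again with hypothetical grabAndPreprocess() and calculateDisparity() stand-ins, and with the two OpenMP sections mapping onto the two cores:

    #include <opencv2/opencv.hpp>

    cv::Mat imgBuf0, imgBuf1;   // shared between the two threads

    // Hypothetical stand-ins for the real pipeline stages.
    void grabAndPreprocess(cv::Mat& dst) { dst = cv::Mat::zeros(240, 320, CV_8UC1); }
    void calculateDisparity(const cv::Mat&) { /* StereoBM would run here */ }

    int main()
    {
        // Fill both buffers once before entering the loop.
        grabAndPreprocess(imgBuf0);
        grabAndPreprocess(imgBuf1);

        #pragma omp parallel sections num_threads(2)
        {
            #pragma omp section                  // core 0: acquisition
            for (int i = 0; i < 100; ++i)        // stands in for the endless loop
            {
                #pragma omp critical(buffer0)
                { grabAndPreprocess(imgBuf0); }
                #pragma omp critical(buffer1)
                { grabAndPreprocess(imgBuf1); }
            }

            #pragma omp section                  // core 1: disparity
            for (int i = 0; i < 100; ++i)
            {
                #pragma omp critical(buffer0)
                { calculateDisparity(imgBuf0); }
                #pragma omp critical(buffer1)
                { calculateDisparity(imgBuf1); }
            }
        }
        return 0;
    }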

Now that we have the basic system in place, we need to take care of blob detection. As we know, image acquisition is faster than disparity calculation, so to stop Core 0 waiting for long, we should give it as much work as possible. As such, we’ve decided to do the blob detection and decision making on Core 0. To do this, the output of the disparity calculation is stored by Core 1 in buffers dispBuf0 and dispBuf1. Before getting the next images from the camera, Core 0 performs blob detection on the previous disparity map. This is best explained with another diagram:


The different colours represent the two critical sections. The two rows inside the box are the processes happening on the two cores; the blue and orange boxes in the same column execute concurrently. So while Core 1 is calculating disparity from imgBuf1, Core 0 is doing blob detection on the disparity calculated from imgBuf0, then getting the next images from the camera, and so on. Notice that when we do blob detection and decide that the car needs to slow down or speed up, we change the maximum number of disparities. However, by the time we are ready to communicate this decision, the system has already calculated disparity for the next frame using the old maximum number of disparities, so we need to invalidate the data in that dispBuf, as it no longer represents the current situation.
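One plausible shape for Core 0’s side of this, sketched with hypothetical helpers (decideSpeedChange(), the dispInvalid flags and the new disparity value are all illustrative, and the matching critical sections are omitted for brevity):

    #include <opencv2/opencv.hpp>

    cv::Mat imgBuf[2], dispBuf[2];            // shared with Core 1
    bool dispInvalid[2] = { false, false };   // stale-disparity flags
    int maxDisparities = 48;

    // Hypothetical stand-ins: capture, and blob detection + decision making.
    void grabAndPreprocess(cv::Mat& dst) { dst = cv::Mat::zeros(240, 320, CV_8UC1); }
    bool decideSpeedChange(const cv::Mat&) { return false; }

    // Core 0's work on buffer b; in the real code this sits inside
    // #pragma omp critical(buffer0) or critical(buffer1).
    void core0Step(int b)
    {
        // Blob-detect on the disparity map from the previous iteration,
        // unless it was computed with a stale disparity setting.
        if (!dispInvalid[b] && decideSpeedChange(dispBuf[b]))
        {
            maxDisparities = 32;          // illustrative new setting
            dispInvalid[1 - b] = true;    // Core 1 already started the other
                                          // buffer with the old setting
        }
        dispInvalid[b] = false;
        grabAndPreprocess(imgBuf[b]);     // then refill with the next frames
    }

    int main()
    {
        core0Step(0);
        core0Step(1);
        return 0;
    }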

In every iteration of our main program loop, we are able to process two new frames. With this system in place, we get 15–16 frames per second, almost double what we measured as the stand-alone rate for calculating disparity + blob detection earlier!

Conclusion

With our new concurrent system, we are able to acquire and pre-process images while calculating disparities at the same time. We avoid unnecessary overhead by making use of a two-way buffering system. Since we know that the bottleneck of the system is calculating disparity, we offload all of the other work to Core 0, so that it spends less time waiting on the buffers. We have managed to get about 15–16 frames per second, which is good enough for real-time stereo vision, and is approximately 3 times faster than our previous sequential, single-core system.

Aaaaaaa! Working with the pandaboard is so annoying!

“I have a secret vendetta against you, muhahaha”

This evening, I have been working on getting our stereo vision algorithms working on the pandaboard (so far, I’ve been testing them on my laptop). All well and good, I thought. But oh no, problems, as ever, arose by the tonne.

Firstly, some of the libraries that I was using (concurrency, time) had to be removed, because they are part of Visual C++, not vanilla C++. The time library was only being used to test performance, so I’ll have to come up with another solution for that, but that’s a minor niggle.

The concurrency library, however, is what I was using for parallelisation. Thankfully, I haven’t really parallelised any of the algorithms yet (only acquiring images from the camera was done in parallel), but we were planning on parallelising some other things. I suppose we’ll have to learn OpenMP or something as well, but maybe that’s not such a bad thing.

The biggest issue, however, is that the USB port on the pandaboard doesn’t have enough bandwidth to support streaming from both cameras simultaneously at a resolution of 640 x 480…

I kept getting mysterious errors, but eventually worked out the issue. Alright, I thought, we’ll capture the images at 320 x 240 instead. It will speed up the algorithms too, albeit making our visual demonstration a bit less fantastical.
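Requesting the lower resolution is straightforward with OpenCV’s capture API (the device indices here are illustrative):

    #include <opencv2/opencv.hpp>

    int main()
    {
        cv::VideoCapture capLeft(0), capRight(1);   // illustrative indices

        capLeft.set(CV_CAP_PROP_FRAME_WIDTH, 320);
        capLeft.set(CV_CAP_PROP_FRAME_HEIGHT, 240);
        capRight.set(CV_CAP_PROP_FRAME_WIDTH, 320);
        capRight.set(CV_CAP_PROP_FRAME_HEIGHT, 240);

        cv::Mat left, right;
        capLeft >> left;    // frames should now arrive at 320 x 240
        capRight >> right;
        return 0;
    }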

The next, as yet unsolved, problem is that all of the camera calibration was done at 640 x 480, and using those settings with 320 x 240 images results in just plain incorrect outputs. So I need to work out whether it is possible to scale the matrices appropriately (I think it is), or redo the calibration process at 320 x 240.
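For what it’s worth, the intrinsic camera matrix should scale linearly with image size (fx and cx by the width ratio, fy and cy by the height ratio), while the distortion coefficients are resolution-independent. A sketch of the scaling we would try, with scaleCameraMatrix() being our own illustrative helper:

    #include <opencv2/opencv.hpp>

    // Scale a 3x3 camera matrix (CV_64F) by the given width/height ratios.
    cv::Mat scaleCameraMatrix(const cv::Mat& K, double sx, double sy)
    {
        cv::Mat scaled = K.clone();
        scaled.at<double>(0, 0) *= sx;   // fx
        scaled.at<double>(0, 2) *= sx;   // cx
        scaled.at<double>(1, 1) *= sy;   // fy
        scaled.at<double>(1, 2) *= sy;   // cy
        return scaled;
    }

    // Usage, going from 640 x 480 calibration to 320 x 240 capture:
    //   cv::Mat K320 = scaleCameraMatrix(K640, 320.0 / 640.0, 240.0 / 480.0);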

I was hoping to get this finished today, but ah well, almost there. On the plus side, the algorithm seemed to be running at a decent frame rate on the board (it runs, but the output is obviously wrong, so I could not test the performance numerically).

Can’t wait for portable supercomputers.

Hassan

Awaken the Panda (Board)

This is a long-overdue post; that’s mainly my fault for being too focused on the technical stuff. Anyway, this post will give you a head start on making the Pandaboard work.

Installing Ubuntu

This part of the post covers how to get Ubuntu up and running on the Pandaboard.

http://www.omappedia.com/wiki/Ubuntu_Pre-built_Binaries_Guide

The link above covers the installation process step-by-step and is pretty comprehensive in my opinion. However, I’m still going to walk through it here.

For our project, the hardware we used to install Ubuntu is listed below:

  • PandaBoard (Upgrading to the ES version when we get it)
  • SanDisk Secure Digital High Capacity Card Extreme Video HD 16GB Class 10 (generally, Class 10 SD cards are recommended)
  • SD Card Reader

To get started, I would recommend downloading the Ubuntu operating system and installing it on your computer. This is recommended for Windows users, as it makes backing up images of the SD card more convenient. A virtual machine is fine if you don’t want to dual-boot.

After that, download the pre-built binaries from the links given below:

Ubuntu 12.04

http://cdimage.ubuntu.com/releases/12.04/release/ubuntu-12.04-preinstalled-desktop-armhf+omap4.img.gz

Moving on, you need to write the image to the SD card. I had trouble following the instructions, as they seem to omit how to prepare the SD card for imaging. All you need to do is delete any partitions on the SD card; I used the GParted partition editor to do this.

Linux
Steps:
1. Insert the SD card into your host computer.
2. Make sure the SD card is not mounted (just umount it if needed).
3. Identify the correct raw device name (like /dev/sde, not /dev/sde1; it can be found in GParted).
4. Run the following commands to write the image, replacing sde with the right device id (my card reader was /dev/mmcblk0):

  1. gunzip ubuntu-12.04-preinstalled-desktop-armhf+omap4.img.gz
  2. sudo dd bs=4M if=ubuntu-12.04-preinstalled-desktop-armhf+omap4.img of=/dev/sde
  3. sudo sync

Once done, you’re ready to boot up your PandaBoard. Plug the SD card into the SD card slot of the PandaBoard, turn on the power, and prepare to be amazed.

Follow the installation instructions on the screen. The installation is rather straightforward, so I won’t cover it here.

Backing up SD Card
The installation process can be quite a hassle, and you will probably want to keep your personal settings and files too, so it is essential to back up the card, especially if you are working on a project. Fortunately, backing up is a simple and hassle-free process.

The instructions are as follows:

  1. Identify the device id of your SD card in the card reader (as above).
  2. Open a Terminal in Ubuntu on your PC.
  3. Key in the following, replacing sde with the id of your card:
    sudo dd if=/dev/sde of=./[Name of backup].img

Restoring image to SD Card

  1. Delete any partitions on the SD card.
  2. Open a terminal and run the following command, replacing sde with the id of your card:
    sudo dd bs=4M if=[Name of backup].img of=/dev/sde

That’s all for getting the PandaBoard to work. The only driver you need to install is the PowerVR graphics driver; Ubuntu should notify you to install it.

Good Luck with the Panda :DD

– Chee –

Minor Setback

We had Ubuntu 12.04 running on the Pandaboard just fine a few days ago, but decided to upgrade to 12.10 because of supposed performance improvements. Unfortunately, the graphics driver now fails to load when you try to log in. We’ve decided to revert to a backup of 12.04, but that has cost us 3 or 4 days of work.

Installing OpenCV 2.4.3 on Ubuntu

As mentioned previously, this is applicable to the Pandaboard (with its ARM processor), the virtual machine, and any installation on an x86 system.

This was quite hard to figure out, as there are a lot of dependencies that need to be installed and the documentation is quite poor and out of date, but all credit to Chee, who spent a lot of time researching and coming up with a way.

Basically, someone has written a script which you can find here, and execute from a terminal, and it sorts everything out. The problem is that it installs version 2.4.2, and we wanted to install 2.4.3, so Chee modified the script accordingly.

The modified script is given below. You can paste it into a text editor, save it as opencv2_4_3.sh, and execute it from a terminal with the following command:

$ sh opencv2_4_3.sh

And everything will magically sort itself out. A word of caution though: I had an issue when executing the script because the line endings were of the wrong type. If that happens, simply resave the file with Unix (LF) line endings in your text editor, and you should be good to go (thanks, Chee!).

arch=$(uname -m)
# flag=1 on 32-bit x86 hosts, flag=0 otherwise (e.g. the Pandaboard's ARM);
# '=' rather than '==' so the test also works under plain sh (dash)
if [ "$arch" = "i686" ] || [ "$arch" = "i386" ] || [ "$arch" = "i486" ] || [ "$arch" = "i586" ]; then
    flag=1
else
    flag=0
fi
echo "Installing OpenCV 2.4.3"
mkdir OpenCV
cd OpenCV
echo "Removing any pre-installed ffmpeg and x264"
sudo apt-get remove ffmpeg x264 libx264-dev
echo "Installing Dependencies"
sudo apt-get install libopencv-dev
sudo apt-get install build-essential checkinstall cmake pkg-config yasm
sudo apt-get install libtiff4-dev libjpeg-dev libjasper-dev
sudo apt-get install libavcodec-dev libavformat-dev libswscale-dev libdc1394-22-dev libxine-dev libgstreamer0.10-dev libgstreamer-plugins-base0.10-dev libv4l-dev
sudo apt-get install python-dev python-numpy
sudo apt-get install libtbb-dev
sudo apt-get install libqt4-dev libgtk2.0-dev
echo "Downloading x264"
wget ftp://ftp.videolan.org/pub/videolan/x264/snapshots/x264-snapshot-20121107-2245-stable.tar.bz2
tar -xvf x264-snapshot-20121107-2245-stable.tar.bz2
cd x264-snapshot-20121107-2245-stable/
echo "Installing x264"
# static build on 32-bit x86; shared, position-independent build elsewhere
if [ $flag -eq 1 ]; then
    ./configure --enable-static
else
    ./configure --enable-shared --enable-pic
fi
make
sudo make install
cd ..
echo "Downloading ffmpeg"
wget http://ffmpeg.org/releases/ffmpeg-0.11.2.tar.bz2
echo "Installing ffmpeg"
tar -xvf ffmpeg-0.11.2.tar.bz2
cd ffmpeg-0.11.2/
if [ $flag -eq 1 ]; then
    ./configure --enable-gpl --enable-libfaac --enable-libmp3lame --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libtheora --enable-libvorbis --enable-libx264 --enable-libxvid --enable-nonfree --enable-postproc --enable-version3 --enable-x11grab
else
    ./configure --enable-gpl --enable-libfaac --enable-libmp3lame --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libtheora --enable-libvorbis --enable-libx264 --enable-libxvid --enable-nonfree --enable-postproc --enable-version3 --enable-x11grab --enable-shared
fi
make
sudo make install
cd ..
echo "Downloading v4l"
wget http://www.linuxtv.org/downloads/v4l-utils/v4l-utils-0.8.9.tar.bz2
echo "Installing v4l"
tar -xvf v4l-utils-0.8.9.tar.bz2
cd v4l-utils-0.8.9/
make
sudo make install
cd ..
echo "Downloading OpenCV 2.4.3"
wget -O OpenCV-2.4.3.tar.bz2 http://sourceforge.net/projects/opencvlibrary/files/opencv-unix/2.4.3/OpenCV-2.4.3.tar.bz2/download
echo "Installing OpenCV 2.4.3"
tar -xvf OpenCV-2.4.3.tar.bz2
cd OpenCV-2.4.3
mkdir build
cd build
cmake -D CMAKE_BUILD_TYPE=RELEASE ..
make
sudo make install
# 'sudo echo ... >> file' does not work (the redirection runs unprivileged),
# so append as root via tee instead
echo "/usr/local/lib" | sudo tee -a /etc/ld.so.conf
sudo ldconfig
echo "OpenCV 2.4.3 ready to be used"

Chee and Hassan