In this tutorial, we will compare the performance of the forEach method of the Mat class to other ways of accessing and transforming pixel values in OpenCV. We will show how forEach is much faster than naively using the at method or even efficiently using pointer arithmetic.
There are hidden gems inside OpenCV that are sometimes not very well known. One of these hidden gems is the forEach method of the Mat class that utilizes all the cores on your machine to apply any function at every pixel.
Let us first define a function complicatedThreshold. It takes in a three-channel pixel value and applies a complicated threshold to it. (Images loaded with imread are stored in BGR order, so pixel.x is the blue channel.)
// Define a pixel
typedef Point3_<uint8_t> Pixel;
// A complicated threshold is defined so
// a non-trivial amount of computation
// is done at each pixel.
void complicatedThreshold(Pixel &pixel)
{
    if (pow(double(pixel.x)/10,2.5) > 100)
    {
        pixel.x = 255;
        pixel.y = 255;
        pixel.z = 255;
    }
    else
    {
        pixel.x = 0;
        pixel.y = 0;
        pixel.z = 0;
    }
}
This function is computationally much heavier than a simple threshold. This way we are not just testing pixel access time but also how well forEach uses all the cores when each pixel operation is computationally heavy.
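To see what the test actually decides, note that pow(x/10, 2.5) > 100 is equivalent to x > 10 * 100^(1/2.5), which is roughly 63.1. For 8-bit values it therefore gives the same answer as the plain threshold x >= 64, only at a much higher cost per pixel. The sketch below checks this with the standard library alone; the helper names passesComplicatedTest and firstPassingValue are hypothetical and not part of the original code.

```cpp
#include <cassert>
#include <cmath>

// Same condition as in complicatedThreshold, applied to one 8-bit channel value.
bool passesComplicatedTest(int value)
{
    assert(value >= 0 && value <= 255);
    return std::pow(double(value) / 10, 2.5) > 100;
}

// Smallest 8-bit value that passes the test, or -1 if none does.
int firstPassingValue()
{
    for (int v = 0; v <= 255; ++v)
    {
        if (passesComplicatedTest(v))
        {
            return v;
        }
    }
    return -1;
}
```

Because the condition is monotonic in x, every value from firstPassingValue() upwards turns the pixel white and every value below it turns the pixel black.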
Next, we will go over four different ways of applying this function to every pixel in an image and examine the relative performance.
Method 1 : Naive Pixel Access Using the at Method
The Mat class has a convenient method called at to access a pixel at location (row, column) in the image. The following code uses the at method to access every pixel and applies complicatedThreshold to it.
// Naive pixel access
// Loop over all rows
for (int r = 0; r < image.rows; r++)
{
    // Loop over all columns
    for (int c = 0; c < image.cols; c++)
    {
        // Obtain pixel at (r, c)
        Pixel pixel = image.at<Pixel>(r, c);
        // Apply complicatedThreshold
        complicatedThreshold(pixel);
        // Put result back
        image.at<Pixel>(r, c) = pixel;
    }
}
The above method is considered inefficient because the location of a pixel in memory is being calculated every time we call the at method. This involves a multiplication operation. The fact that the pixels are located in a contiguous block of memory is not used.
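The address computation that at repeats for every pixel can be sketched as follows. This is a simplified model written against the standard library, not OpenCV's source code: byteOffset, rowStepBytes and pixelSizeBytes are hypothetical names that stand in for the row stride (Mat::step) and the element size (Mat::elemSize()).

```cpp
#include <cassert>
#include <cstddef>

// Byte offset of pixel (row, col) from the start of the image buffer.
// rowStepBytes plays the role of Mat::step (bytes per stored row, including
// any padding) and pixelSizeBytes the role of Mat::elemSize().
std::size_t byteOffset(int row, int col,
                       std::size_t rowStepBytes, std::size_t pixelSizeBytes)
{
    assert(row >= 0 && col >= 0);
    return static_cast<std::size_t>(row) * rowStepBytes
         + static_cast<std::size_t>(col) * pixelSizeBytes;
}
```

For an image 640 pixels wide with 3 bytes per pixel and no padding, rowStepBytes is 1920, so pixel (2, 5) sits at byte offset 2 * 1920 + 5 * 3 = 3855. Moving to the next pixel in the same row only requires adding 3 to the previous address, which is the idea behind pointer arithmetic.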
Method 2 : Pixel Access Using Pointer Arithmetic
In OpenCV, all pixels in a row are stored in one contiguous block of memory. If the Mat object is created using the create method, all pixels of the image are stored in one contiguous block of memory (this can be checked with image.isContinuous()). Since we are reading the image from disk and imread uses the create method, we can loop over all pixels using simple pointer arithmetic that does not require a multiplication.
The code is shown below.
// Using pointer arithmetic
// Get pointer to first pixel
Pixel* pixel = image1.ptr<Pixel>(0,0);
// Mat objects created using the create method are stored
// in one continuous memory block.
const Pixel* endPixel = pixel + image1.cols * image1.rows;
// Loop over all pixels
for (; pixel != endPixel; pixel++)
{
    complicatedThreshold(*pixel);
}
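The single flat loop above is only valid when the rows are stored without gaps between them. That holds for a freshly loaded image but not, for example, for a region of interest inside a larger image; Mat::isContinuous() reports which case applies. A minimal sketch of the safer pattern, which resets the pointer at the start of each row, is shown below. It uses a hypothetical PaddedImage struct in place of Mat so that it compiles with the standard library alone, and it measures the row stride in pixels rather than in bytes for simplicity. With OpenCV, the row pointer would come from image.ptr<Pixel>(r).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Pixel
{
    std::uint8_t x, y, z;
};

// Hypothetical stand-in for a Mat whose rows may be followed by padding.
struct PaddedImage
{
    int rows;
    int cols;
    std::size_t stepPixels;   // pixels per stored row, at least cols
    std::vector<Pixel> data;  // rows * stepPixels entries

    PaddedImage(int r, int c, std::size_t step)
        : rows(r), cols(c), stepPixels(step),
          data(static_cast<std::size_t>(r) * step, Pixel{0, 0, 0})
    {
        assert(step >= static_cast<std::size_t>(c));
    }

    // True when there is no padding between rows.
    bool isContinuous() const
    {
        return stepPixels == static_cast<std::size_t>(cols);
    }

    // Pointer to the first pixel of row r.
    Pixel* rowPtr(int r)
    {
        return data.data() + static_cast<std::size_t>(r) * stepPixels;
    }
};

// Apply op to every visible pixel, skipping any padding at the end of a row.
template <typename Op>
void applyRowByRow(PaddedImage& image, Op op)
{
    for (int r = 0; r < image.rows; ++r)
    {
        Pixel* pixel = image.rowPtr(r);
        const Pixel* endPixel = pixel + image.cols;
        for (; pixel != endPixel; ++pixel)
        {
            op(*pixel);
        }
    }
}
```

The inner loop is still pure pointer arithmetic; only one row-start computation per row is added, so the cost over the flat loop is negligible for large images.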
Method 3 : Using forEach
The forEach method of the Mat class takes a function object (functor) and applies it to every pixel. Its declaration is
template<typename _Tp, typename Functor> void cv::Mat::forEach(const Functor &operation)
The easiest way to understand the above usage is by way of the example shown below. We define a function object (Operator) for use with forEach.
// Parallel execution with function object.
struct Operator
{
    void operator ()(Pixel &pixel, const int * position) const
    {
        // Apply the complicated threshold to this pixel
        complicatedThreshold(pixel);
    }
};
Calling forEach is straightforward and takes just one line of code.
// Call forEach
image2.forEach<Pixel>(Operator());
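The second argument of the function object, position, was not used above. For a two-dimensional Mat, OpenCV passes the row index in position[0] and the column index in position[1], which is how an operation can depend on where the pixel is. The sketch below mimics that calling convention with a hypothetical, single-threaded forEachPixel helper over a std::vector so that it compiles without OpenCV. It is not how cv::Mat::forEach is implemented (the real one distributes the work across threads), and the BlackenLeftHalf functor and the Pixel struct are likewise made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Pixel
{
    std::uint8_t x, y, z;
};

// Hypothetical stand-in for forEach on a 2-D image stored row by row.
// It calls op(pixel, position) with position[0] = row and position[1] = column.
template <typename Op>
void forEachPixel(std::vector<Pixel>& data, int rows, int cols, const Op& op)
{
    assert(data.size() == static_cast<std::size_t>(rows) * cols);
    for (int r = 0; r < rows; ++r)
    {
        for (int c = 0; c < cols; ++c)
        {
            const int position[2] = {r, c};
            op(data[static_cast<std::size_t>(r) * cols + c], position);
        }
    }
}

// Example operation that depends on the pixel location:
// set every pixel in the left half of the image to black.
struct BlackenLeftHalf
{
    int cols;

    void operator()(Pixel& pixel, const int* position) const
    {
        if (position[1] < cols / 2)
        {
            pixel.x = 0;
            pixel.y = 0;
            pixel.z = 0;
        }
    }
};
```

With OpenCV and the Pixel typedef from this post, a functor of the same shape could be passed directly, for example image2.forEach<Pixel>(BlackenLeftHalf{image2.cols}). Since the real forEach runs the operation on several threads at once, the functor should only modify the pixel it is given and should not write to shared state without synchronization.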
Method 4 : Using forEach with C++11 Lambda
Some of you are looking at Method 3, shaking your head in disgust and shouting, “lambda, Lambda, LAMBDA!”
Well, here you go, C++11 junkie!
image3.forEach<Pixel>
(
    [](Pixel &pixel, const int * position) -> void
    {
        complicatedThreshold(pixel);
    }
);
Comparing Performance of forEach
The function complicatedThreshold was applied to all pixels of a large 9000 x 6750 image five times in a row. The 2.5 GHz Intel Core i7 processor used in the experiment has four cores. The following timings were obtained. Note that using forEach made the code about five times faster than either naive pixel access or pointer arithmetic.
Method Type | Time (milliseconds) |
---|---|
Naive Pixel Access | 6656 |
Pointer Arithmetic | 6575 |
forEach | 1221 |
forEach (C++11 Lambda) | 1272 |
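The post does not show how these timings were taken. One way such numbers could be collected is sketched below, assuming a steady clock from the standard library rather than whatever the original experiment used; timeRepeatedMs is a hypothetical helper.

```cpp
#include <cassert>
#include <chrono>

// Run `work` `repetitions` times and return the total wall-clock time
// in milliseconds.
template <typename Work>
double timeRepeatedMs(Work work, int repetitions)
{
    assert(repetitions >= 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repetitions; ++i)
    {
        work();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Each method would be wrapped in a lambda and passed in with repetitions set to 5. From the table, the speedup of forEach over pointer arithmetic is 6575 / 1221, or about 5.4, and over naive access 6656 / 1221, or about 5.5.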
I have been writing code in OpenCV for more than a decade, and whenever I had to write optimized code that accessed pixels, I used pointer arithmetic instead of the naive at method. However, while writing this post, I was shocked to find that there does not seem to be much of a difference between the two methods, even for large images.
It would be interesting to add the performance of OpenMP’s #pragma omp parallel for collapse(2) directive to the above comparison.
Hi Neerav,
I think it will be same or similar in operation because forEach uses standard parallel frameworks listed below as per availability. OpenMP is the third on the list. See more here
1. Intel Threading Building Blocks (3rdparty library, should be explicitly enabled)
2. C= Parallel C/C++ Programming Language Extension (3rdparty library, should be explicitly enabled)
3. OpenMP (integrated to compiler, should be explicitly enabled)
4. APPLE GCD (system wide, used automatically (APPLE only))
5. Windows RT concurrency (system wide, used automatically (Windows RT only))
6. Windows concurrency (part of runtime, used automatically (Windows only – MSVC++ >= 10))
7. Pthreads (if available)
any python code?
Not at the moment but you are giving me ideas to write about it. In the python version, OpenCV uses numpy arrays and there are a few different ways of parallel element access in numpy.
hope to have that code sir .. tnx
This is very good post!
I have a question. Is this same as parallel_for_() in OpenCV?
Thanks.
Yes, internally it uses parallel_for_
You can see the code here at line 507.
Congrats on this post. I always use pointer access to optimize too, but from version 3 there is not much difference between naive and pointer access. But the forEach performance is awesome! I’m going to start using it from now on 😀
Awesome blog!
Thanks, David. Do you know what changed in version 3 so the naive access became faster?
Thank you very much. I learned a lot
I am glad it was useful.
Thanks, It was really helpful. I executed using your tip on raspi platform. Previously i was using pointer arithmetic, and the task was taking 41ms. Using forEach now it takes 25ms, really a time saver.
Amit, thanks for letting me know.
So in the end the speedup is only due to the use of parallelism in forEach.
If you use an OpenMP parallel for loop with Naive Pixel Access or Pointer Arithmetic, you will also get better performance.
But forEach is much more convenient: it uses the best parallel method for each platform (such as GCD on iOS and macOS), and it also handles non-contiguous matrix memory.
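A rough sketch of that suggestion, applied to the pointer-arithmetic loop, might look like the following. It assumes a compiler with OpenMP enabled (for example the -fopenmp flag in GCC or Clang); without it the pragma is ignored and the loop simply runs on one thread. The Pixel struct is a stand-in for the post's typedef so the snippet compiles without OpenCV, and thresholdAllOpenMP is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Pixel
{
    std::uint8_t x, y, z;
};

// Same per-pixel operation as in the post.
void complicatedThreshold(Pixel& pixel)
{
    const std::uint8_t value =
        (std::pow(double(pixel.x) / 10, 2.5) > 100) ? 255 : 0;
    pixel.x = value;
    pixel.y = value;
    pixel.z = value;
}

// Apply the threshold to `count` contiguous pixels. Each iteration touches
// a different pixel, so the iterations can safely run in parallel.
void thresholdAllOpenMP(Pixel* pixels, std::ptrdiff_t count)
{
    assert(count >= 0);
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < count; ++i)
    {
        complicatedThreshold(pixels[i]);
    }
}
```

This only covers the contiguous case; a non-contiguous matrix would need the loop to run over rows instead.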
True that :).
I think everybody here must have used tbb::parallel_for .
How would I go about replicating this with mat.forEach()? I am interested in getting the pixel index (r, c) of the specific pixel being accessed in the loop. How would I get the row and column values while using mat.forEach()?
As you said, OpenCV is full of hidden gems, thank you for highlighting it ! I am wondering if this could be used to apply a simple convolution. For instance a 3×3 window sliding over the image. Can you think of something I missed ?
Hugo, convolution requires multiple pixel access at the same time. So this won’t work. But convolution is a basic operation and very efficiently implemented in OpenCV using filter2D. You may want to look at the Transparent API for speedups. https://learnopencv.com/opencv-transparent-api/
Great post! So fast and so convenient!
Is there any way to visit multiple image concurrently(e.g blend two images) using this method?
Thanks. forEach can’t handle multiple images. But you can use the Transparent API. https://learnopencv.com/opencv-transparent-api/
How to capture row and column value of pixel using forEach ??