A master wordsmith can tell a heartbreaking story in just a few words.
For sale: baby shoes, never worn.
A great artist can do so much with so little! The same holds true for great programmers and engineers. They always seem to eke out that extra ounce of performance from their machines. This is what often differentiates a great product from a mediocre one, and an exceptional programmer from a run-of-the-mill coder. Such mastery appears magical, but dig a bit deeper and you will notice that the knowledge was available to everyone. Few chose to utilize it.
In this post we will unlock the easiest and probably the most important performance trick you can use in OpenCV 3. It is called the Transparent API (T-API or TAPI).
What is the Transparent API (T-API or TAPI)?
The Transparent API is an easy way to seamlessly add hardware acceleration to your OpenCV code with minimal change to existing code. You can make your code almost an order of magnitude faster by making a laughably small change.
Using Transparent API is super easy. You can get significant performance boost by changing ONE keyword.
Don’t trust me? Here is an example of standard OpenCV code that does not utilize the Transparent API. It reads an image, converts it to grayscale, applies a Gaussian blur, and finally does Canny edge detection.
C++
#include "opencv2/opencv.hpp"
using namespace cv;
int main(int argc, char** argv)
{
Mat img, gray;
img = imread("image.jpg", IMREAD_COLOR);
cvtColor(img, gray, COLOR_BGR2GRAY);
GaussianBlur(gray, gray, Size(7, 7), 1.5);
Canny(gray, gray, 0, 50);
imshow("edges", gray);
waitKey();
return 0;
}
Python
import cv2
img = cv2.imread("image.jpg", cv2.IMREAD_COLOR)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray = cv2.GaussianBlur(gray, (7, 7), 1.5)
gray = cv2.Canny(gray, 0, 50)
cv2.imshow("edges", gray)
cv2.waitKey()
Let us see how the same code looks with Transparent API.
OpenCV Transparent API example
I have modified the code above slightly to utilize the Transparent API. The difference between the standard OpenCV code and one utilizing the T-API is highlighted below. Notice that all we had to do was to copy the Mat image to the UMat (Unified Matrix) class and use standard OpenCV functions thereafter.
C++
#include "opencv2/opencv.hpp"
using namespace cv;
int main(int argc, char** argv)
{
UMat img, gray;
imread("image.jpg", IMREAD_COLOR).copyTo(img);
cvtColor(img, gray, COLOR_BGR2GRAY);
GaussianBlur(gray, gray, Size(7, 7), 1.5);
Canny(gray, gray, 0, 50);
imshow("edges", gray);
waitKey();
return 0;
}
Python
import cv2
img = cv2.UMat(cv2.imread("image.jpg", cv2.IMREAD_COLOR))
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray = cv2.GaussianBlur(gray, (7, 7), 1.5)
gray = cv2.Canny(gray, 0, 50)
cv2.imshow("edges", gray)
cv2.waitKey()
On my MacBook Pro this small change makes the code run 5x faster.
Note: It makes sense to use the Transparent API only when you are doing a few expensive operations on the image. Otherwise the overhead of moving the image to the GPU dominates the timing.
Let us quickly summarize the steps needed to use the Transparent API.
Convert Mat to UMat. There are a couple of ways of doing this in C++.
C++
Mat mat = imread("image.jpg", IMREAD_COLOR);
// Copy Mat to UMat
UMat umat;
mat.copyTo(umat);
Alternatively, you can use getUMat:
Mat mat = imread("image.jpg", IMREAD_COLOR);
// Get umat from mat.
UMat umat = mat.getUMat( flag );
Python
mat = cv2.imread("image.jpg", cv2.IMREAD_COLOR)
umat = cv2.UMat(mat)
flag can take the values ACCESS_READ, ACCESS_WRITE, ACCESS_RW, and ACCESS_FAST. At this point it is not clear what ACCESS_FAST does, but I will update this post once I figure it out. Next, use the standard OpenCV functions that you would use with Mat. Finally, if necessary, convert UMat back to Mat. Most of the time you do not need to do this. Here is how you do it in case you need to.
C++
Mat mat = umat.getMat( flag );
Python
mat = umat.get()
where umat is a UMat image, and flag is the same as described above.
Now we know how to use the Transparent API. So what is under the hood that magically improves performance? The answer is OpenCL. In the section below I briefly explain OpenCL.
What is the Open Computing Language (OpenCL)?
If you are reading this article on a laptop or a desktop computer, it has a graphics card (either integrated or discrete) connected to a CPU, which in turn has multiple cores. On the other hand, if you are reading this on a cell phone or tablet, your device probably has a CPU, a GPU, and a Digital Signal Processor (DSP). So you have multiple processing units that you can use. The fancy industry term for your computer or mobile device is a “heterogeneous platform”.
OpenCL is a framework for writing programs that execute on these heterogeneous platforms. The developers of an OpenCL library utilize all OpenCL-compatible devices (CPUs, GPUs, DSPs, FPGAs, etc.) they find on a computer / device and assign the right tasks to the right processor. Keep in mind that as a user of the OpenCV library you are not developing any OpenCL library. In fact, you are not even a user of the OpenCL library, because all the details are hidden behind the Transparent API.
What is the difference between the OCL module and the Transparent API?
Short answer : The OCL module is dead. Long live the Transparent API!
OpenCL was supported in OpenCV 2.4 via the OCL module. There were a set of functions defined under the ocl namespace that you could use to call the underlying OpenCL code. Below is an example for reading an image, and using OpenCL to convert it to grayscale.
// Example of using OpenCL in OpenCV 2.4
// In OpenCV 3 the OCL module is gone.
// It is replaced by the much nicer Transparent API
// Initialize OpenCL
std::vector<ocl::Info> param;
ocl::getDevice(param, ocl::CVCL_DEVICE_TYPE_GPU);
// Read image
Mat im = imread("image.jpg");
// Convert it to oclMat
ocl::oclMat ocl_im(im);
// Container for OpenCL gray image.
ocl::oclMat ocl_gray;
// BGR2GRAY using OpenCL.
cv::ocl::cvtColor( ocl_im, ocl_gray, CV_BGR2GRAY );
// Container for OpenCV Mat gray image.
Mat gray;
// Convert back to OpenCV Mat
ocl_gray.download(gray);
As you can see, it was a lot more cumbersome. With OpenCV 3 the OCL module is gone! All this complexity is hidden behind the so-called Transparent API, and all you need to do is use UMat instead of Mat; the rest of the code remains unchanged. You just need to write the code once!
Doesn’t seem to have any difference for detectMultiScale but hope it works for other functions…
What kind of graphics card do you have on your machine ?
Onboard Intel graphics hd 2000, probably very weak, so that’s the answer?
That may be the case. When I use the onboard graphics card, the improvement is not huge.
I read in the documentation that the Transparent API uses GPU optimizations and CUDA, so maybe it needs some CUDA cores to run better.
Intel HD 2000 does not support OpenCL. That could be the reason.
Hi Folks,
I currently have a system with Ubuntu 16.04 LTS, OpenCV 3.4, Python 3.5, an NVIDIA GeForce 1080 Ti GPU, and an Intel i7 processor.
Can I use UMat to process videos using cv2.VideoCapture()? Can you please let me know, since I couldn’t use CUDA 9.2 for Python.
Thanks
Guru
I encountered the same situation as you; it seems it does not work with detectMultiScale. I tested on a platform which can use OpenCL to accelerate OpenCV.
Do I have to install OpenCL on my machine or is it bundled with OpenCV? I have a GeForce GTS 240. I tested some feature detection code and GPU load went from 0% to 2%, while GPU memory went up about 20MB, calculation time is about the same.
Is there a list of what OpenCV features support the T-API? I’ve tested some code using features2d and the GPU load only went up 2% (the card is an old GeForce GTS 240).
In the sources/modules/features2d/src/opencl folder there are only three files: brute_force_matcher.cl, fast.cl and orb.cl. I’m guessing only these have OpenCL implementations, am I right?
I don’t think there is a list. There are 67 OpenCL (.cl) files in different modules, and these are used by many different functions. So it is tough to say. BTW you can force OpenCV to use a particular device by setting the environment variable OPENCV_OPENCL_DEVICE. E.g. export OPENCV_OPENCL_DEVICE=:GPU:0
Do you know how to use the T-API in Python? I’ve also posted the question over on SO but it’s getting barely any attention.
http://stackoverflow.com/questions/31990646/using-opencl-accelerated-functions-with-opencv3-in-python
You may already know the answer to this. For other readers: the Python bindings still don’t use the T-API.
https://github.com/Itseez/opencv/issues/5043
Now it does; maybe you can update your post, as it is the first result on Google…
Thanks, Jeff.
I have updated the post. Any idea how to use the flag parameter in Python? I could not find the documentation and the few things I tried did not work.
Hi Satya, I think there’s a bug in your Python example:
img = cv2.UMat(cv2.imread(“image.jpg”, cv2.IMREAD_COLOR))
imgUMat = cv2.UMat(img)
You make this imgUMat and then don’t use it.
Thanks for this though! I’ve been wating for T-API in python for 2 years, and haven’t been checking in often enough. Thanks for the email!
Thanks a bunch! I have fixed it.
Satya, thanks for the explanation. To use this feature, I suppose the CMake makefile generator should enable compiling WITH_OPENCL, though you don’t need to directly invoke OpenCL from code.
Yes I think so. Sorry for this late reply.
That is .. fun~
https://i.stack.imgur.com/kicIw.jpg
haha… that is very odd indeed. But I have seen such behavior before. Could be a variety of factors like GPU is busy etc.
Hi all,
I’m observing similar results on my machine (Ubuntu 16.04, GPU Nvidia NVS5200M ) with your code and the typical lena.png image as test input
Time Mat: 0.15s
Time UMat: 0.30s
These results are consistent over multiple test runs.
Do you have any ideas? Maybe the GPU help would pay off only for larger images?
Thank you!
great article as usual.
Thanks!
Hmmm, how come the fresh post has comments that are 3 years old? Sorry for the off-topic, but it just drew my attention.
I assume one must compile the OpenCV library with CUDA and OpenCL support for that trick to work?
It looks like Python didn’t add the T-API until a few months ago, so he updated the article. Previously it only had C++. I’m not sure what new update he has done since then. But I think the reason the article is “new” is because of an update/change.
Yes. That’s right, because there was an important change, I wanted to make sure people understand the information is not stale.
CUDA is not necessary. OpenCL is.
Thanks a lot Satya. I’ve run some tests varying image size and number of operations per image (number of consecutive Gaussian blurs, for example) and I can confirm that on a 16 GB i7 with a 2 GB NVIDIA 1050 I consistently get between 3x-6x. I’ve also measured the transport/conversion from Mat to UMat to Mat, and I found that it accounts for around 10% of the speed-up on my system.
So, quite satisfying results.
Thanks,
Lubo
https://uploads.disquscdn.com/images/1707b631adf37c54d610526542d14ad03ccd5be191cc401e45d22e3e96fd6595.png
Nice!
I received this article by email. After running your code/example, I didn’t understand how UMat is faster than normal?
https://uploads.disquscdn.com/images/b370f7dda7f8be14da4f6ea7603ea1e8304d4c34ecdf579001de81ab08f6224b.png
T-API needs a bit of knowledge about what you are doing.
You must not use UMat to perform a “single” operation on an image; the “memory bottleneck” affects the timing too much! UMat (as with all operations performed on the GPU) must be used when you have a “long” pipeline of operations to be applied to an image.
When you code on the GPU you must take the CPU memory that contains the image and copy it to the GPU memory… this often takes a long time compared to a single operation performed on a CPU.
The only way to understand when using the T-API is convenient is to run a few tests applying a variable number of filters to the original image.
So based on your answer, I cannot use UMat on a Raspberry Pi? Thanks for answering!
If the RPi has OpenCL, yes… but I do not know if an OpenCL driver is available on the RPi.
Thanks, Walter. I think I should add a section about this in the post. A couple of people have already asked me this question.
About “ACCESS_FAST”… it seems that using that flag, all the “memcpy” code is avoided, increasing speed.
I’m not sure what this implies… I think that you can use it only when your src and dst images are shared. Otherwise you modify the original image.
The flag is used only in two “if” statements without a corresponding “else” branch:
https://github.com/opencv/opencv/blob/1389fd67ab4f164f71877be499ff7da4d38b797d/modules/core/src/ocl.cpp#L4626
and
https://github.com/opencv/opencv/blob/1389fd67ab4f164f71877be499ff7da4d38b797d/modules/core/src/ocl.cpp#L4673
A little more documentation wouldn’t be bad 😉
Very nice! The content is excellent!
https://uploads.disquscdn.com/images/dd36dc5281f8595ef160804271c0e755246d7382e7aceef013eab6bcda9073ed.png Hello Satya, great article! Simple and clear. But I ran into a problem that I could not solve for a long time!!!
Problem: I used the UMat::getMat() method to copy the matrix from the GPU to the CPU memory. On further processing, I encountered a message about a memory leak (CV_Assert(u->refcount == 0 && “UMat deallocation error: some derived Mat is still alive”) in ocl.cpp, line 4730).
In the official documentation there is no information about a similar problem, but in a document from Intel
https://software.intel.com/sites/default/files/managed/2f/19/inde_opencv_3.0_arch_guide.pdf
I found very useful information that helped me solve the problem. Could you mention some peculiarities of using cv::UMat from this document (PDF link above) in this article? This would help beginners avoid problems. Thank you very much for the useful articles!
Hi, Satya:
Recently, I have been confused by the difference between UMat and Mat. When I tested the OpenCV “remap” function on TI’s platform, which can use OpenCL to accelerate OpenCV, I found that when src_img and dst_img are defined as Mat, the remap costs about 22 ms and saving dst_img to a JPEG file costs about 55 ms; but when src_img and dst_img are defined as UMat, the remap costs only 0.239608 ms while saving dst_img to a JPEG file costs about 1.12 s. Can you tell me why, when using UMat, the file-saving time increases so dramatically?
Is there any mutex or other kind of lock protecting the UMat-to-Mat conversion?
# loop over the contours individually
for c in cnts:
# if the contour is not sufficiently large, ignore it
if cv2.contourArea(c) < 100:
continue
How do I convert this code to C++?
I tried to use bilateralFilter; it does not give any runtime errors, but it produces just a black image as a result. This was tested under Linux with an Nvidia GTX 750.
When disabling the GPU via cv::ocl::setUseOpenCL(false), the bilateral filter works again.
I tried the same program on a Windows Laptop with an Intel HD graphics 520 and there it works with and without GPU activation.
If anyone has an idea why it doesn’t work on Linux with the GTX 750, let me know.
In any case thanks for the article, it’s always a pleasure to read them!
Best regards
Wim.
Thanks for the great article. Just a quick question: as you say, “all you need to do is use UMat instead of Mat”, but when I do this:
img = cv2.UMat(frame)
imageUMat = cv2.UMat(img)
image = cv2.cvtColor(imageUMat, cv2.COLOR_BGR2RGB)
image = image.reshape((image.shape[0] * image.shape[1], 3))
it gives me “AttributeError: ‘cv2.UMat’ object has no attribute ‘reshape’”. Any idea how to solve that?
Hi great tutorial!
I have tried it with the ORB detector/extractor and sparse optical flow (Lucas-Kanade) and there is no improvement in speed; it even runs slower. Tried on an Nvidia GTX 1050. I have read the source code of those functions and they have OpenCL optimizations. So why is it slower?
Also, can I use the T-API with Android OpenCV compiled WITH_OPENCL? I need to run those two functions faster.
Thanks.