If you have been working with OpenCV for a while, you have probably noticed that in most scenarios OpenCV runs on the CPU, which doesn't always guarantee the desired performance. To tackle this problem, in 2010 a new module providing GPU acceleration with CUDA was added to OpenCV. You can find a benchmark demonstrating the advantage of the GPU module below:
To find out the benchmark details, you can refer to the Realtime Computer Vision with OpenCV article.
Overview
Let’s briefly list what we will do in this post:
- Overview the OpenCV modules that already support CUDA.
- Take a look at the basic block, cv::cuda::GpuMat (cv2.cuda_GpuMat).
- Learn how to transfer data between CPU and GPU.
- Learn how to utilize multiple GPUs.
- Write a simple demo (in both C++ and Python) to get to know the CUDA support API provided by OpenCV and to calculate the performance boost we can gain.
Supported Modules
Even though not all of the library's functionality is covered, it is claimed that the module "still keeps growing and is being adapted for the new computing technologies and GPU architectures."
Let's take a look at the official documentation of the CUDA-accelerated OpenCV. Here we can see the modules that are already supported:
- Core part
- Operations on Matrices
- Background Segmentation
- Video Encoding/Decoding
- Feature Detection and Description
- Image Filtering
- Image Processing
- Legacy support
- Object Detection
- Optical Flow
- Stereo Correspondence
- Image Warping
- Device layer
Basic Block – GpuMat
To keep data in GPU memory, OpenCV introduces a new class, cv::cuda::GpuMat (or cv2.cuda_GpuMat in Python), which serves as the primary data container. Its interface is similar to cv::Mat (cv2.Mat), making the transition to the GPU module as smooth as possible. Another thing worth mentioning is that all GPU functions receive GpuMat as input and output arguments. This design lets you chain GPU algorithms in your code and reduce the overhead of copying data between the CPU and GPU.
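For instance, here is a minimal sketch in Python of that idea (assuming a CUDA-enabled OpenCV build; image.png is a placeholder file name), chaining two CUDA operations with a single upload at the start and a single download at the end:
import cv2

# upload once: a single host-to-device copy
gpu_img = cv2.cuda_GpuMat()
gpu_img.upload(cv2.imread("image.png"))
# chain GPU operations; intermediate results stay in GPU memory
gpu_small = cv2.cuda.resize(gpu_img, (640, 360))
gpu_gray = cv2.cuda.cvtColor(gpu_small, cv2.COLOR_BGR2GRAY)
# download once: a single device-to-host copy at the very end
result = gpu_gray.download()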
CPU/GPU Data Transfer
To transfer data from GpuMat to Mat and vice-versa, OpenCV provides two functions:
- upload, which copies data from host memory to device memory
- download, which copies data from device memory to host memory.
Below is a simple example in C++ of their usage in a context:
#include <opencv2/highgui.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/cudaimgproc.hpp>

cv::Mat img = cv::imread("image.png", cv::IMREAD_GRAYSCALE);
cv::cuda::GpuMat dst, src;
src.upload(img);
cv::Ptr<cv::cuda::CLAHE> ptr_clahe = cv::cuda::createCLAHE(5.0, cv::Size(8, 8));
ptr_clahe->apply(src, dst);
cv::Mat result;
dst.download(result);
cv::imshow("result", result);
cv::waitKey();
And the same example in Python:
import cv2

img = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE)
src = cv2.cuda_GpuMat()
src.upload(img)
clahe = cv2.cuda.createCLAHE(clipLimit=5.0, tileGridSize=(8, 8))
dst = clahe.apply(src, cv2.cuda_Stream.Null())
result = dst.download()
cv2.imshow("result", result)
cv2.waitKey(0)
Utilizing Multiple GPUs
By default, each of the OpenCV CUDA algorithms uses a single GPU. If you need to utilize multiple GPUs, you have to distribute the work between them manually. To switch the active device, use the cv::cuda::setDevice (cv2.cuda.setDevice) function.
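Here is a minimal sketch in Python of such manual distribution (the batch of random frames is a hypothetical stand-in for a real workload, and we assume a build with at least one CUDA device):
import cv2
import numpy as np

# hypothetical workload: a batch of independent frames
images = [np.random.randint(0, 255, (540, 960, 3), np.uint8) for _ in range(8)]
results = []

num_devices = cv2.cuda.getCudaEnabledDeviceCount()
for i, image in enumerate(images):
    # round-robin over the available GPUs; subsequent GpuMat
    # allocations happen on the currently active device
    cv2.cuda.setDevice(i % num_devices)
    gpu_img = cv2.cuda_GpuMat()
    gpu_img.upload(image)
    gpu_gray = cv2.cuda.cvtColor(gpu_img, cv2.COLOR_BGR2GRAY)
    results.append(gpu_gray.download())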
Sample Demo
OpenCV provides samples on how to work with the already implemented methods with GPU support using the C++ API. But much less information comes up when you want to try out the Python API, which is also supported. Let's implement a simple demo of how to use CUDA-accelerated OpenCV with the C++ and Python APIs, using dense optical flow calculation with Farneback's algorithm as an example.
We will first take a look at how this could be done using the CPU. Then we will do the same using the GPU. Finally, we are going to compare the elapsed times to calculate the gained speedup. If you'd like to run the code yourself, check out the README.md file with the installation instructions before you start.
FPS Calculation
Since our primary goal is to find out how fast the algorithm works on different devices, we need to choose how to measure it. A common way of doing so in the computer vision field is to calculate the number of processed frames per second (FPS). You can take a look at our earlier post for a quick reminder of how this can be done.
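The snippets below append per-stage durations to a timers dictionary. As a minimal sketch (its exact structure is our assumption, since the full script is not reproduced here), it can be initialized and used like this:
import time

# per-stage lists of per-frame durations
timers = {
    "full pipeline": [],
    "reading": [],
    "pre-process": [],
    "optical flow": [],
    "post-process": [],
}

start = time.time()
# ... process one frame ...
timers["full pipeline"].append(time.time() - start)

# FPS for a stage = number of processed frames / total time spent in it,
# e.g. fps = num_frames / sum(timers["full pipeline"])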
CPU Pipeline
1. Video and Its Attributes
We will start with video capture initialization and getting its attributes, such as the frame rate and the number of frames. This part is common to both the CPU and GPU pipelines:
Python
# init video capture with video
cap = cv2.VideoCapture(video)
# get default video FPS
fps = cap.get(cv2.CAP_PROP_FPS)
# get total number of video frames
num_frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)
C++
// init video capture with video
VideoCapture capture(videoFileName);
if (!capture.isOpened())
{
    // error in opening the video file
    cout << "Unable to open file!" << endl;
    return;
}
// get default video FPS
double fps = capture.get(CAP_PROP_FPS);
// get total number of video frames
int num_frames = int(capture.get(CAP_PROP_FRAME_COUNT));
2. Reading the First Frame
Because the algorithm uses two frames for its calculation, we need to read the first frame before we move on. Some pre-processing is also needed, such as resizing and converting to grayscale:
Python
# read the first frame
ret, previous_frame = cap.read()
if device == "cpu":
    # proceed if frame reading was successful
    if ret:
        # resize frame
        frame = cv2.resize(previous_frame, (960, 540))
        # convert to gray
        previous_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # create hsv output for optical flow
        hsv = np.zeros_like(frame, np.float32)
        # set saturation to 1
        hsv[..., 1] = 1.0
C++
// read the first frame
cv::Mat frame, previous_frame;
capture >> frame;
if (device == "cpu")
{
    // resize frame
    cv::resize(frame, frame, Size(960, 540), 0, 0, INTER_LINEAR);
    // convert to gray
    cv::cvtColor(frame, previous_frame, COLOR_BGR2GRAY);
    // declare outputs for optical flow
    cv::Mat magnitude, normalized_magnitude, angle;
    cv::Mat hsv[3], merged_hsv, hsv_8u, bgr;
    // set saturation to 1
    hsv[1] = cv::Mat::ones(frame.size(), CV_32F);
You may notice that we've also created the HSV output here, which we will use later.
3. Reading and Pre-processing Other Frames
Before reading the rest of the frames in a loop, we start two timers: one tracks the full pipeline time, the other tracks the frame reading time. Since Farneback's optical flow algorithm works with grayscale frames, we need to make sure we're passing a grayscale video as input. That's why we first pre-process each frame, converting it from BGR to grayscale. Also, since the original resolution might be too large, we resize each frame to a smaller size, the same way we did for the first frame. We set up one more timer to measure the time spent on the pre-processing stage:
Python
while True:
    # start full pipeline timer
    start_full_time = time.time()
    # start reading timer
    start_read_time = time.time()
    # capture frame-by-frame
    ret, frame = cap.read()
    # end reading timer
    end_read_time = time.time()
    # add elapsed iteration time
    timers["reading"].append(end_read_time - start_read_time)
    # if frame reading was not successful, break
    if not ret:
        break
    # start pre-process timer
    start_pre_time = time.time()
    # resize frame
    frame = cv2.resize(frame, (960, 540))
    # convert to gray
    current_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # end pre-process timer
    end_pre_time = time.time()
    # add elapsed iteration time
    timers["pre-process"].append(end_pre_time - start_pre_time)
C++
while (true)
{
    // start full pipeline timer
    auto start_full_time = high_resolution_clock::now();
    // start reading timer
    auto start_read_time = high_resolution_clock::now();
    // capture frame-by-frame
    capture >> frame;
    if (frame.empty())
        break;
    // end reading timer
    auto end_read_time = high_resolution_clock::now();
    // add elapsed iteration time
    timers["reading"].push_back(duration_cast<milliseconds>(end_read_time - start_read_time).count() / 1000.0);
    // start pre-process timer
    auto start_pre_time = high_resolution_clock::now();
    // resize frame
    cv::resize(frame, frame, Size(960, 540), 0, 0, INTER_LINEAR);
    // convert to gray
    cv::Mat current_frame;
    cv::cvtColor(frame, current_frame, COLOR_BGR2GRAY);
    // end pre-process timer
    auto end_pre_time = high_resolution_clock::now();
    // add elapsed iteration time
    timers["pre-process"].push_back(duration_cast<milliseconds>(end_pre_time - start_pre_time).count() / 1000.0);
4. Calculating Dense Optical Flow
We use the corresponding function, calcOpticalFlowFarneback, to calculate dense optical flow between the two frames:
Python
# start optical flow timer
start_of = time.time()
# calculate optical flow
flow = cv2.calcOpticalFlowFarneback(
previous_frame, current_frame, None, 0.5, 5, 15, 3, 5, 1.2, 0,
)
# end of timer
end_of = time.time()
# add elapsed iteration time
timers["optical flow"].append(end_of - start_of)
C++
// start optical flow timer
auto start_of_time = high_resolution_clock::now();
// calculate optical flow
cv::Mat flow;
calcOpticalFlowFarneback(previous_frame, current_frame, flow, 0.5, 5, 15, 3, 5, 1.2, 0);
// end optical flow timer
auto end_of_time = high_resolution_clock::now();
// add elapsed iteration time
timers["optical flow"].push_back(duration_cast<milliseconds>(end_of_time - start_of_time).count() / 1000.0);
We wrap its usage between two timer calls, again, to measure the elapsed time.
5. Post-processing
Farneback's optical flow algorithm outputs a two-channel array of 2D flow vectors. We convert this output to polar coordinates, encoding the angle (direction) of flow as hue and the normalized magnitude of flow as value in the HSV color representation. For visualization, all we have left to do is convert the result to BGR space. After that, we stop all the remaining timers to get the elapsed time:
Python
# start post-process timer
start_post_time = time.time()
# convert from cartesian to polar coordinates to get magnitude and angle
magnitude, angle = cv2.cartToPolar(
flow[..., 0], flow[..., 1], angleInDegrees=True,
)
# set hue according to the angle of optical flow
hsv[..., 0] = angle * ((1 / 360.0) * (180 / 255.0))
# set value according to the normalized magnitude of optical flow
hsv[..., 2] = cv2.normalize(
magnitude, None, 0.0, 1.0, cv2.NORM_MINMAX, -1,
)
# multiply each pixel value by 255
hsv_8u = np.uint8(hsv * 255.0)
# convert hsv to bgr
bgr = cv2.cvtColor(hsv_8u, cv2.COLOR_HSV2BGR)
# update previous_frame value
previous_frame = current_frame
# end post-process timer
end_post_time = time.time()
# add elapsed iteration time
timers["post-process"].append(end_post_time - start_post_time)
# end full pipeline timer
end_full_time = time.time()
# add elapsed iteration time
timers["full pipeline"].append(end_full_time - start_full_time)
C++
// start post-process timer
auto start_post_time = high_resolution_clock::now();
// split the output flow into 2 vectors
cv::Mat flow_xy[2], flow_x, flow_y;
split(flow, flow_xy);
// get the result
flow_x = flow_xy[0];
flow_y = flow_xy[1];
// convert from cartesian to polar coordinates
cv::cartToPolar(flow_x, flow_y, magnitude, angle, true);
// normalize magnitude from 0 to 1
cv::normalize(magnitude, normalized_magnitude, 0.0, 1.0, NORM_MINMAX);
// get angle of optical flow
angle *= ((1 / 360.0) * (180 / 255.0));
// build hsv image
hsv[0] = angle;
hsv[2] = normalized_magnitude;
merge(hsv, 3, merged_hsv);
// multiply each pixel value by 255
merged_hsv.convertTo(hsv_8u, CV_8U, 255);
// convert hsv to bgr
cv::cvtColor(hsv_8u, bgr, COLOR_HSV2BGR);
// update previous_frame value
previous_frame = current_frame;
// end post pipeline timer
auto end_post_time = high_resolution_clock::now();
// add elapsed iteration time
timers["post-process"].push_back(duration_cast<milliseconds>(end_post_time - start_post_time).count() / 1000.0);
// end full pipeline timer
auto end_full_time = high_resolution_clock::now();
// add elapsed iteration time
timers["full pipeline"].push_back(duration_cast<milliseconds>(end_full_time - start_full_time).count() / 1000.0);
6. Visualization
We visualize the original frame resized to 960×540 and the result using the imshow function:
Python
# visualization
cv2.imshow("original", frame)
cv2.imshow("result", bgr)
k = cv2.waitKey(1)
if k == 27:
    break
C++
// visualization
imshow("original", frame);
imshow("result", bgr);
int keyboard = waitKey(1);
if (keyboard == 27)
    break;
Here’s what we get with a sample “boat.mp4” video:
7. Time and FPS Calculation
All we have to do now is calculate the elapsed time at each stage of the pipeline and measure FPS for the optical flow part and for the full pipeline:
Python
# elapsed time at each stage
print("Elapsed time")
for stage, seconds in timers.items():
    print("-", stage, ": {:0.3f} seconds".format(sum(seconds)))
# calculate frames per second
print("Default video FPS : {:0.3f}".format(fps))
of_fps = (num_frames - 1) / sum(timers["optical flow"])
print("Optical flow FPS : {:0.3f}".format(of_fps))
full_fps = (num_frames - 1) / sum(timers["full pipeline"])
print("Full pipeline FPS : {:0.3f}".format(full_fps))
C++
// elapsed time at each stage
cout << "Elapsed time" << std::endl;
for (auto const& timer : timers)
{
    cout << "- " << timer.first << " : " << accumulate(timer.second.begin(), timer.second.end(), 0.0) << " seconds" << endl;
}
// calculate frames per second
cout << "Default video FPS : " << fps << endl;
float optical_flow_fps = (num_frames - 1) / accumulate(timers["optical flow"].begin(), timers["optical flow"].end(), 0.0);
cout << "Optical flow FPS : " << optical_flow_fps << endl;
float full_pipeline_fps = (num_frames - 1) / accumulate(timers["full pipeline"].begin(), timers["full pipeline"].end(), 0.0);
cout << "Full pipeline FPS : " << full_pipeline_fps << endl;
GPU Pipeline
The algorithm stays the same when we move it to CUDA, but there are some differences related to GPU usage. Let's go through the pipeline once again and see what has changed:
1. Video and Its Attributes
This part is common to both the CPU and GPU pipelines, so it stays the same.
2. Reading the First Frame
Notice that we use the same CPU functions for reading and resizing, but upload the result to a cv::cuda::GpuMat (cv2.cuda_GpuMat) instance:
Python
# proceed if frame reading was successful
if ret:
    # resize frame
    frame = cv2.resize(previous_frame, (960, 540))
    # upload resized frame to GPU
    gpu_frame = cv2.cuda_GpuMat()
    gpu_frame.upload(frame)
    # convert to gray
    previous_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # upload pre-processed frame to GPU
    gpu_previous = cv2.cuda_GpuMat()
    gpu_previous.upload(previous_frame)
    # create gpu_hsv output for optical flow
    gpu_hsv = cv2.cuda_GpuMat(gpu_frame.size(), cv2.CV_32FC3)
    gpu_hsv_8u = cv2.cuda_GpuMat(gpu_frame.size(), cv2.CV_8UC3)
    gpu_h = cv2.cuda_GpuMat(gpu_frame.size(), cv2.CV_32FC1)
    gpu_s = cv2.cuda_GpuMat(gpu_frame.size(), cv2.CV_32FC1)
    gpu_v = cv2.cuda_GpuMat(gpu_frame.size(), cv2.CV_32FC1)
    # set saturation to 1
    gpu_s.upload(np.ones_like(previous_frame, np.float32))
C++
// resize frame
cv::resize(frame, frame, Size(960, 540), 0, 0, INTER_LINEAR);
// convert to gray
cv::cvtColor(frame, previous_frame, COLOR_BGR2GRAY);
// upload pre-processed frame to GPU
cv::cuda::GpuMat gpu_previous;
gpu_previous.upload(previous_frame);
// declare cpu outputs for optical flow
cv::Mat hsv[3], angle, bgr;
// declare gpu outputs for optical flow
cv::cuda::GpuMat gpu_magnitude, gpu_normalized_magnitude, gpu_angle;
cv::cuda::GpuMat gpu_hsv[3], gpu_merged_hsv, gpu_hsv_8u, gpu_bgr;
// set saturation to 1
hsv[1] = cv::Mat::ones(frame.size(), CV_32F);
gpu_hsv[1].upload(hsv[1]);
3. Reading and Pre-processing Other Frames
Python
while True:
    # start full pipeline timer
    start_full_time = time.time()
    # start reading timer
    start_read_time = time.time()
    # capture frame-by-frame
    ret, frame = cap.read()
    # if frame reading was not successful, break
    if not ret:
        break
    # upload frame to GPU
    gpu_frame.upload(frame)
    # end reading timer
    end_read_time = time.time()
    # add elapsed iteration time
    timers["reading"].append(end_read_time - start_read_time)
    # start pre-process timer
    start_pre_time = time.time()
    # resize frame
    gpu_frame = cv2.cuda.resize(gpu_frame, (960, 540))
    # convert to gray
    gpu_current = cv2.cuda.cvtColor(gpu_frame, cv2.COLOR_BGR2GRAY)
    # end pre-process timer
    end_pre_time = time.time()
    # add elapsed iteration time
    timers["pre-process"].append(end_pre_time - start_pre_time)
C++
while (true)
{
    // start full pipeline timer
    auto start_full_time = high_resolution_clock::now();
    // start reading timer
    auto start_read_time = high_resolution_clock::now();
    // capture frame-by-frame
    capture >> frame;
    if (frame.empty())
        break;
    // upload frame to GPU
    cv::cuda::GpuMat gpu_frame;
    gpu_frame.upload(frame);
    // end reading timer
    auto end_read_time = high_resolution_clock::now();
    // add elapsed iteration time
    timers["reading"].push_back(duration_cast<milliseconds>(end_read_time - start_read_time).count() / 1000.0);
    // start pre-process timer
    auto start_pre_time = high_resolution_clock::now();
    // resize frame
    cv::cuda::resize(gpu_frame, gpu_frame, Size(960, 540), 0, 0, INTER_LINEAR);
    // convert to gray
    cv::cuda::GpuMat gpu_current;
    cv::cuda::cvtColor(gpu_frame, gpu_current, COLOR_BGR2GRAY);
    // end pre-process timer
    auto end_pre_time = high_resolution_clock::now();
    // add elapsed iteration time
    timers["pre-process"].push_back(duration_cast<milliseconds>(end_pre_time - start_pre_time).count() / 1000.0);
4. Calculating Dense Optical Flow
Instead of calling cv::calcOpticalFlowFarneback (cv2.calcOpticalFlowFarneback) directly, we first use cv::cuda::FarnebackOpticalFlow::create (cv2.cuda_FarnebackOpticalFlow.create) to create an instance of the algorithm class, and then call its cv::cuda::FarnebackOpticalFlow::calc (cv2.cuda_FarnebackOpticalFlow.calc) method to calculate optical flow between the two frames:
Python
# start optical flow timer
start_of = time.time()
# create optical flow instance
gpu_flow = cv2.cuda_FarnebackOpticalFlow.create(
5, 0.5, False, 15, 3, 5, 1.2, 0,
)
# calculate optical flow
gpu_flow = cv2.cuda_FarnebackOpticalFlow.calc(
gpu_flow, gpu_previous, gpu_current, None,
)
# end of timer
end_of = time.time()
# add elapsed iteration time
timers["optical flow"].append(end_of - start_of)
C++
// start optical flow timer
auto start_of_time = high_resolution_clock::now();
// create optical flow instance
Ptr<cuda::FarnebackOpticalFlow> ptr_calc = cuda::FarnebackOpticalFlow::create(5, 0.5, false, 15, 3, 5, 1.2, 0);
// calculate optical flow
cv::cuda::GpuMat gpu_flow;
ptr_calc->calc(gpu_previous, gpu_current, gpu_flow);
// end optical flow timer
auto end_of_time = high_resolution_clock::now();
// add elapsed iteration time
timers["optical flow"].push_back(duration_cast<milliseconds>(end_of_time - start_of_time).count() / 1000.0);
5. Post-processing
For post-processing, we use the GPU variants of the same functions that we used in the CPU pipeline:
Python
# start post-process timer
start_post_time = time.time()
gpu_flow_x = cv2.cuda_GpuMat(gpu_flow.size(), cv2.CV_32FC1)
gpu_flow_y = cv2.cuda_GpuMat(gpu_flow.size(), cv2.CV_32FC1)
cv2.cuda.split(gpu_flow, [gpu_flow_x, gpu_flow_y])
# convert from cartesian to polar coordinates to get magnitude and angle
gpu_magnitude, gpu_angle = cv2.cuda.cartToPolar(
gpu_flow_x, gpu_flow_y, angleInDegrees=True,
)
# set value to normalized magnitude from 0 to 1
gpu_v = cv2.cuda.normalize(gpu_magnitude, 0.0, 1.0, cv2.NORM_MINMAX, -1)
# get angle of optical flow
angle = gpu_angle.download()
angle *= (1 / 360.0) * (180 / 255.0)
# set hue according to the angle of optical flow
gpu_h.upload(angle)
# merge h,s,v channels
cv2.cuda.merge([gpu_h, gpu_s, gpu_v], gpu_hsv)
# multiply each pixel value by 255
gpu_hsv.convertTo(cv2.CV_8U, 255.0, gpu_hsv_8u, 0.0)
# convert hsv to bgr
gpu_bgr = cv2.cuda.cvtColor(gpu_hsv_8u, cv2.COLOR_HSV2BGR)
# send original frame from GPU back to CPU
frame = gpu_frame.download()
# send result from GPU back to CPU
bgr = gpu_bgr.download()
# update previous_frame value
gpu_previous = gpu_current
# end post-process timer
end_post_time = time.time()
# add elapsed iteration time
timers["post-process"].append(end_post_time - start_post_time)
# end full pipeline timer
end_full_time = time.time()
# add elapsed iteration time
timers["full pipeline"].append(end_full_time - start_full_time)
C++
// start post-process timer
auto start_post_time = high_resolution_clock::now();
// split the output flow into 2 vectors
cv::cuda::GpuMat gpu_flow_xy[2];
cv::cuda::split(gpu_flow, gpu_flow_xy);
// convert from cartesian to polar coordinates
cv::cuda::cartToPolar(gpu_flow_xy[0], gpu_flow_xy[1], gpu_magnitude, gpu_angle, true);
// normalize magnitude from 0 to 1
cv::cuda::normalize(gpu_magnitude, gpu_normalized_magnitude, 0.0, 1.0, NORM_MINMAX, -1);
// get angle of optical flow
gpu_angle.download(angle);
angle *= ((1 / 360.0) * (180 / 255.0));
// build hsv image
gpu_hsv[0].upload(angle);
gpu_hsv[2] = gpu_normalized_magnitude;
cv::cuda::merge(gpu_hsv, 3, gpu_merged_hsv);
// multiply each pixel value by 255
gpu_merged_hsv.convertTo(gpu_hsv_8u, CV_8U, 255.0);
// convert hsv to bgr
cv::cuda::cvtColor(gpu_hsv_8u, gpu_bgr, COLOR_HSV2BGR);
// send original frame from GPU back to CPU
gpu_frame.download(frame);
// send result from GPU back to CPU
gpu_bgr.download(bgr);
// update previous_frame value
gpu_previous = gpu_current;
// end post pipeline timer
auto end_post_time = high_resolution_clock::now();
// add elapsed iteration time
timers["post-process"].push_back(duration_cast<milliseconds>(end_post_time - start_post_time).count() / 1000.0);
// end full pipeline timer
auto end_full_time = high_resolution_clock::now();
// add elapsed iteration time
timers["full pipeline"].push_back(duration_cast<milliseconds>(end_full_time - start_full_time).count() / 1000.0);
Also note that we use the download function to move the result back to the CPU before visualization.
6. Visualization
The visualization part is common for CPU and GPU pipelines and stays the same.
7. Time and FPS Calculation
That stage also stays the same.
Results
Now we're ready to compare the metrics from the CPU and GPU versions on a sample video. The configuration we use for the CPU is:
Intel Core i7-8700
After running the script using a CPU device the result is:
Configuration
- device : cpu
- video file : video/boat.mp4
Number of frames: 320
Elapsed time
- full pipeline : 37.355 seconds
- reading : 3.327 seconds
- pre-process : 0.027 seconds
- optical flow : 32.706 seconds
- post-process : 0.641 seconds
Default video FPS : 29.97
Optical flow FPS : 9.75356
Full pipeline FPS : 8.53969
The configuration we use for GPU is:
Nvidia GeForce GTX 1080 Ti
And after running the script using a GPU device we get:
Configuration
- device : gpu
- video file : video/boat.mp4
Number of frames: 320
Elapsed time
- full pipeline : 8.665 seconds
- reading : 4.821 seconds
- pre-process : 0.035 seconds
- optical flow : 1.874 seconds
- post-process : 0.631 seconds
Default video FPS : 29.97
Optical flow FPS : 170.224
Full pipeline FPS : 36.8148
That gives us a ~17x speedup of the optical flow calculation when we use CUDA acceleration! Unfortunately, we live in a real world where not all stages of a pipeline can be accelerated, so for the whole pipeline we only get a ~4x speedup.
Conclusion
In today's post, we overviewed the OpenCV GPU module and wrote a simple demo to find out how Farneback's optical flow algorithm can be accelerated. We looked at the API that OpenCV provides for this module, which you can reuse to try your hand at accelerating other OpenCV algorithms with CUDA as well.