If you have been working with OpenCV for a while, you have probably noticed that in most scenarios OpenCV runs on the CPU, which doesn’t always deliver the performance you need. To tackle this problem, in 2010 a new module providing GPU acceleration using CUDA was added to OpenCV. You can find a benchmark demonstrating the advantage of the GPU module below:

To find out the benchmark details, you can refer to the Realtime Computer Vision with OpenCV article.
Overview
Let’s briefly list what we will do in this post:
- Review the OpenCV modules that already have CUDA support.
- Take a look at the basic building block, cv::cuda::GpuMat (cv2.cuda_GpuMat in Python).
- Learn how to transfer data between CPU and GPU.
- Learn how to utilize multiple GPUs.
- Write a simple demo (in both C++ and Python) to get to know the CUDA support API provided by OpenCV and to measure the performance boost we gain.
Supported Modules
Even though not all of the library’s functionality is covered, it is claimed that the module “still keeps growing and is being adapted for the new computing technologies and GPU architectures.”
Let’s take a look at the official documentation of CUDA-accelerated OpenCV, which lists the modules that are already supported:
- Core part
- Operations on Matrices
- Background Segmentation
- Video Encoding/Decoding
- Feature Detection and Description
- Image Filtering
- Image Processing
- Legacy support
- Object Detection
- Optical Flow
- Stereo Correspondence
- Image Warping
- Device layer
Basic Block – GpuMat
To keep data in GPU memory, OpenCV introduces a new class, cv::cuda::GpuMat (or cv2.cuda_GpuMat in Python), which serves as the primary data container. Its interface is similar to cv::Mat (cv2.Mat), making the transition to the GPU module as smooth as possible. Another thing worth mentioning is that all GPU functions receive GpuMat as input and output arguments. This design lets you chain GPU algorithms in your code and reduce the overhead of copying data between the CPU and GPU.
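To see the benefit of such chaining, here is a minimal Python sketch (with an arbitrary image path, and operations chosen purely for illustration): the image is uploaded once, two CUDA operations run back-to-back on GpuMat instances, and only the final result is downloaded:

import cv2

# upload the input image to GPU memory once
gpu = cv2.cuda_GpuMat()
gpu.upload(cv2.imread("image.png"))

# chain two CUDA operations with no intermediate download:
# resize on the GPU...
gpu = cv2.cuda.resize(gpu, (960, 540))
# ...then convert to grayscale, still on the GPU
gpu = cv2.cuda.cvtColor(gpu, cv2.COLOR_BGR2GRAY)

# a single download at the end moves the result back to host memory
result = gpu.download()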
CPU/GPU Data Transfer
To transfer data between Mat and GpuMat, OpenCV provides two functions:
- upload, which copies data from host memory to device memory
- download, which copies data from device memory to host memory.
Below is a simple example in C++ of their usage in a context:
#include <opencv2/highgui.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/cudaimgproc.hpp>

cv::Mat img = cv::imread("image.png", cv::IMREAD_GRAYSCALE);
cv::cuda::GpuMat dst, src;
src.upload(img);

cv::Ptr<cv::cuda::CLAHE> ptr_clahe = cv::cuda::createCLAHE(5.0, cv::Size(8, 8));
ptr_clahe->apply(src, dst);

cv::Mat result;
dst.download(result);

cv::imshow("result", result);
cv::waitKey();
And the same example in Python:
import cv2

img = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE)

src = cv2.cuda_GpuMat()
src.upload(img)

clahe = cv2.cuda.createCLAHE(clipLimit=5.0, tileGridSize=(8, 8))
dst = clahe.apply(src, cv2.cuda_Stream.Null())

result = dst.download()

cv2.imshow("result", result)
cv2.waitKey(0)
Utilizing Multiple GPUs
By default, each of the OpenCV CUDA algorithms uses a single GPU. If you need to utilize multiple GPUs, you have to distribute the work between them manually. To switch the active device, use the cv::cuda::setDevice (cv2.cuda.setDevice) function.
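As a minimal sketch of this idea in Python (assuming a machine with at least two CUDA devices, and with frames standing in for a list of BGR images you have already read), you could split the work in half and switch the active device before each batch:

import cv2

def process_batch(frames, device_id):
    # make the given GPU the active CUDA device for subsequent calls
    cv2.cuda.setDevice(device_id)
    results = []
    for frame in frames:
        gpu = cv2.cuda_GpuMat()
        gpu.upload(frame)
        # any CUDA-accelerated operation now runs on the selected device
        gpu = cv2.cuda.cvtColor(gpu, cv2.COLOR_BGR2GRAY)
        results.append(gpu.download())
    return results

# distribute the frames between two GPUs (device ids 0 and 1)
half = len(frames) // 2
results = process_batch(frames[:half], 0) + process_batch(frames[half:], 1)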
Sample Demo
OpenCV provides samples on how to work with the already implemented methods with GPU support using the C++ API. But much less information comes up when you want to try out the Python API, which is also supported. Let’s implement a simple demo of how to use CUDA-accelerated OpenCV with the C++ and Python APIs, taking dense optical flow calculation with Farneback’s algorithm as an example.
We will first take a look at how this can be done using the CPU. Then we will do the same using the GPU. Finally, we will compare the elapsed times to calculate the speedup we gained. If you’d like to run the code yourself, check out the README.md file with the installation instructions before you start.
FPS Calculation
Since our primary goal is to find out how fast the algorithm works on different devices, we need to choose how to measure it. A common way of doing so in the computer vision field is to calculate the number of frames processed per second (FPS). You can take a look at our earlier post for a quick reminder of how it can be done.
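In short, the scheme we will use throughout the demo looks like this (a minimal sketch; the timers dictionary matches the name used in the code below, and num_frames is a placeholder):

import time

timers = {"optical flow": []}
num_frames = 100  # placeholder: number of frames processed in the loop

for _ in range(num_frames):
    # time one iteration of a pipeline stage
    start = time.time()
    # ... the stage's work (e.g., the optical flow call) happens here ...
    timers["optical flow"].append(time.time() - start)

# FPS = number of processed frames / total time spent in that stage
fps = num_frames / sum(timers["optical flow"])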
CPU Pipeline
1. Video and Its Attributes
We will start with video capture initialization and getting its attributes, such as the frame rate and the number of frames. This part is common to the CPU and GPU pipelines:
Python
# init video capture with video
cap = cv2.VideoCapture(video)

# get default video FPS
fps = cap.get(cv2.CAP_PROP_FPS)

# get total number of video frames
num_frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)
C++
// init video capture with video
VideoCapture capture(videoFileName);
if (!capture.isOpened())
{
    // error in opening the video file
    cout << "Unable to open file!" << endl;
    return;
}

// get default video FPS
double fps = capture.get(CAP_PROP_FPS);

// get total number of video frames
int num_frames = int(capture.get(CAP_PROP_FRAME_COUNT));
2. Reading the First Frame
Because the algorithm uses two frames for each calculation, we need to read the first frame before we move on. Some pre-processing is also needed, such as resizing and converting to grayscale:
Python
# read the first frame
ret, previous_frame = cap.read()

if device == "cpu":
    # proceed if frame reading was successful
    if ret:
        # resize frame
        frame = cv2.resize(previous_frame, (960, 540))

        # convert to gray
        previous_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

        # create hsv output for optical flow
        hsv = np.zeros_like(frame, np.float32)

        # set saturation to 1
        hsv[..., 1] = 1.0
C++
// read the first frame
cv::Mat frame, previous_frame;
capture >> frame;

if (device == "cpu")
{
    // resize frame
    cv::resize(frame, frame, Size(960, 540), 0, 0, INTER_LINEAR);

    // convert to gray
    cv::cvtColor(frame, previous_frame, COLOR_BGR2GRAY);

    // declare outputs for optical flow
    cv::Mat magnitude, normalized_magnitude, angle;
    cv::Mat hsv[3], merged_hsv, hsv_8u, bgr;

    // set saturation to 1
    hsv[1] = cv::Mat::ones(frame.size(), CV_32F);
You may notice that we’ve also created an output frame, which we will use later.
3. Reading and Pre-processing Other Frames
Before reading the rest of the frames in a loop, we start two timers: one will track the full pipeline working time, and the second one the frame reading time. Since Farneback’s optical flow algorithm works with grayscale frames, we need to make sure we’re passing a grayscale video as input. That’s why we first pre-process each frame, converting it from BGR to grayscale. Also, since the original resolution might be too large, we resize it to a smaller size, the same way we did for the first frame. We set up one more timer to measure the time spent on the pre-processing stage:
Python
while True:
    # start full pipeline timer
    start_full_time = time.time()

    # start reading timer
    start_read_time = time.time()

    # capture frame-by-frame
    ret, frame = cap.read()

    # end reading timer
    end_read_time = time.time()

    # add elapsed iteration time
    timers["reading"].append(end_read_time - start_read_time)

    # if frame reading was not successful, break
    if not ret:
        break

    # start pre-process timer
    start_pre_time = time.time()

    # resize frame
    frame = cv2.resize(frame, (960, 540))

    # convert to gray
    current_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # end pre-process timer
    end_pre_time = time.time()

    # add elapsed iteration time
    timers["pre-process"].append(end_pre_time - start_pre_time)
C++
while (true)
{
    // start full pipeline timer
    auto start_full_time = high_resolution_clock::now();

    // start reading timer
    auto start_read_time = high_resolution_clock::now();

    // capture frame-by-frame
    capture >> frame;
    if (frame.empty())
        break;

    // end reading timer
    auto end_read_time = high_resolution_clock::now();

    // add elapsed iteration time
    timers["reading"].push_back(duration_cast<milliseconds>(end_read_time - start_read_time).count() / 1000.0);

    // start pre-process timer
    auto start_pre_time = high_resolution_clock::now();

    // resize frame
    cv::resize(frame, frame, Size(960, 540), 0, 0, INTER_LINEAR);

    // convert to gray
    cv::Mat current_frame;
    cv::cvtColor(frame, current_frame, COLOR_BGR2GRAY);

    // end pre-process timer
    auto end_pre_time = high_resolution_clock::now();

    // add elapsed iteration time
    timers["pre-process"].push_back(duration_cast<milliseconds>(end_pre_time - start_pre_time).count() / 1000.0);
4. Calculating Dense Optical Flow
We use the corresponding method, calcOpticalFlowFarneback, to calculate the dense optical flow between two frames:
Python
# start optical flow timer
start_of = time.time()

# calculate optical flow
flow = cv2.calcOpticalFlowFarneback(
    previous_frame, current_frame, None, 0.5, 5, 15, 3, 5, 1.2, 0,
)

# end of timer
end_of = time.time()

# add elapsed iteration time
timers["optical flow"].append(end_of - start_of)
C++
// start optical flow timer
auto start_of_time = high_resolution_clock::now();

// calculate optical flow
cv::Mat flow;
calcOpticalFlowFarneback(previous_frame, current_frame, flow, 0.5, 5, 15, 3, 5, 1.2, 0);

// end optical flow timer
auto end_of_time = high_resolution_clock::now();

// add elapsed iteration time
timers["optical flow"].push_back(duration_cast<milliseconds>(end_of_time - start_of_time).count() / 1000.0);
We wrap its usage between two timer calls, again, to measure the elapsed time.
5. Post-processing
Farneback’s optical flow algorithm outputs a two-dimensional flow vector for every pixel. We convert these outputs to polar coordinates, encoding the angle (direction) of the flow as the hue and the normalized distance (magnitude) of the flow as the value of an HSV color representation. For visualization, all we have left to do is convert the result to BGR space. After that, we stop all the remaining timers to get the elapsed time:
Python
# start post-process timer
start_post_time = time.time()

# convert from cartesian to polar coordinates to get magnitude and angle
magnitude, angle = cv2.cartToPolar(
    flow[..., 0], flow[..., 1], angleInDegrees=True,
)

# set hue according to the angle of optical flow
hsv[..., 0] = angle * ((1 / 360.0) * (180 / 255.0))

# set value according to the normalized magnitude of optical flow
hsv[..., 2] = cv2.normalize(
    magnitude, None, 0.0, 1.0, cv2.NORM_MINMAX, -1,
)

# multiply each pixel value to 255
hsv_8u = np.uint8(hsv * 255.0)

# convert hsv to bgr
bgr = cv2.cvtColor(hsv_8u, cv2.COLOR_HSV2BGR)

# update previous_frame value
previous_frame = current_frame

# end post-process timer
end_post_time = time.time()

# add elapsed iteration time
timers["post-process"].append(end_post_time - start_post_time)

# end full pipeline timer
end_full_time = time.time()

# add elapsed iteration time
timers["full pipeline"].append(end_full_time - start_full_time)
C++
// start post-process timer
auto start_post_time = high_resolution_clock::now();

// split the output flow into 2 vectors
cv::Mat flow_xy[2], flow_x, flow_y;
split(flow, flow_xy);

// get the result
flow_x = flow_xy[0];
flow_y = flow_xy[1];

// convert from cartesian to polar coordinates
cv::cartToPolar(flow_x, flow_y, magnitude, angle, true);

// normalize magnitude from 0 to 1
cv::normalize(magnitude, normalized_magnitude, 0.0, 1.0, NORM_MINMAX);

// get angle of optical flow
angle *= ((1 / 360.0) * (180 / 255.0));

// build hsv image
hsv[0] = angle;
hsv[2] = normalized_magnitude;
merge(hsv, 3, merged_hsv);

// multiply each pixel value to 255
merged_hsv.convertTo(hsv_8u, CV_8U, 255);

// convert hsv to bgr
cv::cvtColor(hsv_8u, bgr, COLOR_HSV2BGR);

// update previous_frame value
previous_frame = current_frame;

// end post pipeline timer
auto end_post_time = high_resolution_clock::now();

// add elapsed iteration time
timers["post-process"].push_back(duration_cast<milliseconds>(end_post_time - start_post_time).count() / 1000.0);

// end full pipeline timer
auto end_full_time = high_resolution_clock::now();

// add elapsed iteration time
timers["full pipeline"].push_back(duration_cast<milliseconds>(end_full_time - start_full_time).count() / 1000.0);
6. Visualization
We visualize the original frame resized to 960×540 and the result using the imshow function:
Python
# visualization
cv2.imshow("original", frame)
cv2.imshow("result", bgr)
k = cv2.waitKey(1)
if k == 27:
    break
C++
// visualization
imshow("original", frame);
imshow("result", bgr);
int keyboard = waitKey(1);
if (keyboard == 27)
    break;
Here’s what we get with a sample “boat.mp4” video:

7. Time and FPS Calculation
All we have to do now is calculate the elapsed time at each stage of the pipeline and measure the FPS for the optical flow part and for the full pipeline:
Python
# elapsed time at each stage
print("Elapsed time")
for stage, seconds in timers.items():
    print("-", stage, ": {:0.3f} seconds".format(sum(seconds)))

# calculate frames per second
print("Default video FPS : {:0.3f}".format(fps))

of_fps = (num_frames - 1) / sum(timers["optical flow"])
print("Optical flow FPS : {:0.3f}".format(of_fps))

full_fps = (num_frames - 1) / sum(timers["full pipeline"])
print("Full pipeline FPS : {:0.3f}".format(full_fps))
C++
// elapsed time at each stage
cout << "Elapsed time" << std::endl;
for (auto const& timer : timers)
{
    cout << "- " << timer.first << " : " << accumulate(timer.second.begin(), timer.second.end(), 0.0) << " seconds" << endl;
}

// calculate frames per second
cout << "Default video FPS : " << fps << endl;

float optical_flow_fps = (num_frames - 1) / accumulate(timers["optical flow"].begin(), timers["optical flow"].end(), 0.0);
cout << "Optical flow FPS : " << optical_flow_fps << endl;

float full_pipeline_fps = (num_frames - 1) / accumulate(timers["full pipeline"].begin(), timers["full pipeline"].end(), 0.0);
cout << "Full pipeline FPS : " << full_pipeline_fps << endl;
GPU Pipeline
The algorithm stays the same when we move it to CUDA, but the pipeline has some differences related to GPU usage. Let’s go through it once again and see what has changed:
1. Video and Its Attributes
This part is common to the CPU and GPU pipelines, so it stays the same.
2. Reading the First Frame
Notice that we use the same CPU functions for reading and resizing, but upload the result to a cv::cuda::GpuMat (cv2.cuda_GpuMat) instance:
Python
# proceed if frame reading was successful
if ret:
    # resize frame
    frame = cv2.resize(previous_frame, (960, 540))

    # upload resized frame to GPU
    gpu_frame = cv2.cuda_GpuMat()
    gpu_frame.upload(frame)

    # convert to gray
    previous_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # upload pre-processed frame to GPU
    gpu_previous = cv2.cuda_GpuMat()
    gpu_previous.upload(previous_frame)

    # create gpu_hsv output for optical flow
    gpu_hsv = cv2.cuda_GpuMat(gpu_frame.size(), cv2.CV_32FC3)
    gpu_hsv_8u = cv2.cuda_GpuMat(gpu_frame.size(), cv2.CV_8UC3)

    gpu_h = cv2.cuda_GpuMat(gpu_frame.size(), cv2.CV_32FC1)
    gpu_s = cv2.cuda_GpuMat(gpu_frame.size(), cv2.CV_32FC1)
    gpu_v = cv2.cuda_GpuMat(gpu_frame.size(), cv2.CV_32FC1)

    # set saturation to 1
    gpu_s.upload(np.ones_like(previous_frame, np.float32))
C++
// resize frame
cv::resize(frame, frame, Size(960, 540), 0, 0, INTER_LINEAR);

// convert to gray
cv::cvtColor(frame, previous_frame, COLOR_BGR2GRAY);

// upload pre-processed frame to GPU
cv::cuda::GpuMat gpu_previous;
gpu_previous.upload(previous_frame);

// declare cpu outputs for optical flow
cv::Mat hsv[3], angle, bgr;

// declare gpu outputs for optical flow
cv::cuda::GpuMat gpu_magnitude, gpu_normalized_magnitude, gpu_angle;
cv::cuda::GpuMat gpu_hsv[3], gpu_merged_hsv, gpu_hsv_8u, gpu_bgr;

// set saturation to 1
hsv[1] = cv::Mat::ones(frame.size(), CV_32F);
gpu_hsv[1].upload(hsv[1]);
3. Reading and Pre-processing Other Frames
Python
while True:
    # start full pipeline timer
    start_full_time = time.time()

    # start reading timer
    start_read_time = time.time()

    # capture frame-by-frame
    ret, frame = cap.read()

    # if frame reading was not successful, break
    if not ret:
        break

    # upload frame to GPU
    gpu_frame.upload(frame)

    # end reading timer
    end_read_time = time.time()

    # add elapsed iteration time
    timers["reading"].append(end_read_time - start_read_time)

    # start pre-process timer
    start_pre_time = time.time()

    # resize frame
    gpu_frame = cv2.cuda.resize(gpu_frame, (960, 540))

    # convert to gray
    gpu_current = cv2.cuda.cvtColor(gpu_frame, cv2.COLOR_BGR2GRAY)

    # end pre-process timer
    end_pre_time = time.time()

    # add elapsed iteration time
    timers["pre-process"].append(end_pre_time - start_pre_time)
C++
while (true)
{
    // start full pipeline timer
    auto start_full_time = high_resolution_clock::now();

    // start reading timer
    auto start_read_time = high_resolution_clock::now();

    // capture frame-by-frame
    capture >> frame;
    if (frame.empty())
        break;

    // upload frame to GPU
    cv::cuda::GpuMat gpu_frame;
    gpu_frame.upload(frame);

    // end reading timer
    auto end_read_time = high_resolution_clock::now();

    // add elapsed iteration time
    timers["reading"].push_back(duration_cast<milliseconds>(end_read_time - start_read_time).count() / 1000.0);

    // start pre-process timer
    auto start_pre_time = high_resolution_clock::now();

    // resize frame
    cv::cuda::resize(gpu_frame, gpu_frame, Size(960, 540), 0, 0, INTER_LINEAR);

    // convert to gray
    cv::cuda::GpuMat gpu_current;
    cv::cuda::cvtColor(gpu_frame, gpu_current, COLOR_BGR2GRAY);

    // end pre-process timer
    auto end_pre_time = high_resolution_clock::now();

    // add elapsed iteration time
    timers["pre-process"].push_back(duration_cast<milliseconds>(end_pre_time - start_pre_time).count() / 1000.0);
4. Calculating Dense Optical Flow
Instead of calling the cv::calcOpticalFlowFarneback (cv2.calcOpticalFlowFarneback) function, we first use cv::cuda::FarnebackOpticalFlow::create (cv2.cuda_FarnebackOpticalFlow.create) to create an instance of the cuda_FarnebackOpticalFlow class, and then call cv::cuda::FarnebackOpticalFlow::calc (cv2.cuda_FarnebackOpticalFlow.calc) to calculate the optical flow between two frames:
Python
# start optical flow timer
start_of = time.time()

# create optical flow instance
gpu_flow = cv2.cuda_FarnebackOpticalFlow.create(
    5, 0.5, False, 15, 3, 5, 1.2, 0,
)
# calculate optical flow
gpu_flow = cv2.cuda_FarnebackOpticalFlow.calc(
    gpu_flow, gpu_previous, gpu_current, None,
)

# end of timer
end_of = time.time()

# add elapsed iteration time
timers["optical flow"].append(end_of - start_of)
C++
// start optical flow timer
auto start_of_time = high_resolution_clock::now();

// create optical flow instance
Ptr<cuda::FarnebackOpticalFlow> ptr_calc = cuda::FarnebackOpticalFlow::create(5, 0.5, false, 15, 3, 5, 1.2, 0);
// calculate optical flow
cv::cuda::GpuMat gpu_flow;
ptr_calc->calc(gpu_previous, gpu_current, gpu_flow);

// end optical flow timer
auto end_of_time = high_resolution_clock::now();

// add elapsed iteration time
timers["optical flow"].push_back(duration_cast<milliseconds>(end_of_time - start_of_time).count() / 1000.0);
5. Post-processing
For post-processing, we use the GPU variants of the same functions that we used in the CPU pipeline:
Python
# start post-process timer
start_post_time = time.time()

gpu_flow_x = cv2.cuda_GpuMat(gpu_flow.size(), cv2.CV_32FC1)
gpu_flow_y = cv2.cuda_GpuMat(gpu_flow.size(), cv2.CV_32FC1)
cv2.cuda.split(gpu_flow, [gpu_flow_x, gpu_flow_y])

# convert from cartesian to polar coordinates to get magnitude and angle
gpu_magnitude, gpu_angle = cv2.cuda.cartToPolar(
    gpu_flow_x, gpu_flow_y, angleInDegrees=True,
)

# set value to normalized magnitude from 0 to 1
gpu_v = cv2.cuda.normalize(gpu_magnitude, 0.0, 1.0, cv2.NORM_MINMAX, -1)

# get angle of optical flow
angle = gpu_angle.download()
angle *= (1 / 360.0) * (180 / 255.0)

# set hue according to the angle of optical flow
gpu_h.upload(angle)

# merge h,s,v channels
cv2.cuda.merge([gpu_h, gpu_s, gpu_v], gpu_hsv)

# multiply each pixel value to 255
gpu_hsv.convertTo(cv2.CV_8U, 255.0, gpu_hsv_8u, 0.0)

# convert hsv to bgr
gpu_bgr = cv2.cuda.cvtColor(gpu_hsv_8u, cv2.COLOR_HSV2BGR)

# send original frame from GPU back to CPU
frame = gpu_frame.download()

# send result from GPU back to CPU
bgr = gpu_bgr.download()

# update previous_frame value
gpu_previous = gpu_current

# end post-process timer
end_post_time = time.time()

# add elapsed iteration time
timers["post-process"].append(end_post_time - start_post_time)

# end full pipeline timer
end_full_time = time.time()

# add elapsed iteration time
timers["full pipeline"].append(end_full_time - start_full_time)
C++
// start post-process timer
auto start_post_time = high_resolution_clock::now();

// split the output flow into 2 vectors
cv::cuda::GpuMat gpu_flow_xy[2];
cv::cuda::split(gpu_flow, gpu_flow_xy);

// convert from cartesian to polar coordinates
cv::cuda::cartToPolar(gpu_flow_xy[0], gpu_flow_xy[1], gpu_magnitude, gpu_angle, true);

// normalize magnitude from 0 to 1
cv::cuda::normalize(gpu_magnitude, gpu_normalized_magnitude, 0.0, 1.0, NORM_MINMAX, -1);

// get angle of optical flow
gpu_angle.download(angle);
angle *= ((1 / 360.0) * (180 / 255.0));

// build hsv image
gpu_hsv[0].upload(angle);
gpu_hsv[2] = gpu_normalized_magnitude;
cv::cuda::merge(gpu_hsv, 3, gpu_merged_hsv);

// multiply each pixel value to 255
gpu_merged_hsv.convertTo(gpu_hsv_8u, CV_8U, 255.0);

// convert hsv to bgr
cv::cuda::cvtColor(gpu_hsv_8u, gpu_bgr, COLOR_HSV2BGR);

// send original frame from GPU back to CPU
gpu_frame.download(frame);

// send result from GPU back to CPU
gpu_bgr.download(bgr);

// update previous_frame value
gpu_previous = gpu_current;

// end post pipeline timer
auto end_post_time = high_resolution_clock::now();

// add elapsed iteration time
timers["post-process"].push_back(duration_cast<milliseconds>(end_post_time - start_post_time).count() / 1000.0);

// end full pipeline timer
auto end_full_time = high_resolution_clock::now();

// add elapsed iteration time
timers["full pipeline"].push_back(duration_cast<milliseconds>(end_full_time - start_full_time).count() / 1000.0);
Also note that we use the download function to move the result back to the CPU before visualization.
6. Visualization
The visualization part is common for CPU and GPU pipelines and stays the same.
7. Time and FPS Calculation
This stage also stays the same.
Results
Now we’re ready to compare the metrics of the CPU and GPU versions on a sample video. The configuration we use for the CPU is:
Intel Core i7-8700
After running the script using the CPU device, the result is:
Configuration
- device : cpu
- video file : video/boat.mp4

Number of frames: 320

Elapsed time
- full pipeline : 37.355 seconds
- reading : 3.327 seconds
- pre-process : 0.027 seconds
- optical flow : 32.706 seconds
- post-process : 0.641 seconds

Default video FPS : 29.97
Optical flow FPS : 9.75356
Full pipeline FPS : 8.53969
The configuration we use for GPU is:
Nvidia GeForce GTX 1080 Ti
And after running the script using the GPU device, we get:
Configuration
- device : gpu
- video file : video/boat.mp4

Number of frames: 320

Elapsed time
- full pipeline : 8.665 seconds
- reading : 4.821 seconds
- pre-process : 0.035 seconds
- optical flow : 1.874 seconds
- post-process : 0.631 seconds

Default video FPS : 29.97
Optical flow FPS : 170.224
Full pipeline FPS : 36.8148
That gives us a ~17x speedup of the optical flow calculation (32.706 s vs. 1.874 s) when we use CUDA acceleration! Unfortunately, we live in the real world, where not all of the stages of a pipeline can be accelerated. Because of that, for the whole pipeline we only got a ~4x speedup (37.355 s vs. 8.665 s).
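The gap between the two numbers is just Amdahl’s law at work. As a quick sanity check (a back-of-the-envelope calculation using only the CPU timings reported above), even an infinitely fast optical flow stage could not push the full pipeline past roughly 8x:

# CPU timings reported above, in seconds
full_cpu = 37.355
flow_cpu = 32.706

# time spent in the stages that stay on the CPU (reading, pre-/post-processing)
serial = full_cpu - flow_cpu  # ~4.65 s

# upper bound on the full-pipeline speedup if optical flow took zero time
max_speedup = full_cpu / serial
print("max speedup ~ {:.1f}x".format(max_speedup))  # ~8.0x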
Conclusion
In today’s post, we overviewed the OpenCV GPU module and wrote a simple demo to find out how Farneback’s optical flow algorithm can be accelerated. We looked at the API that OpenCV provides for this module, which you can reuse to try your hand at accelerating other OpenCV algorithms with CUDA as well.