Use OpenCV’s CUDA DNN module and YOLOv4 model to accelerate real-time object detection with GPU

5 minute read


Computer Vision is one of the main applications of current deep-learning-based AI wave. Compared to many AI research fields which are still in lab, CV is already largely deployed in production and used in all kinds of scenarios.

Originally developed by Intel and age 20 years now, OpenCV is a perfect tools for computer vision tasks.

In addition to traditional image processing use-cases such as image smoothing, edge detection, etc., OpenCV can also do Deep Neural Network inference. Thus, we can apply state-of-the-art computer vision NN models to project with the help of OpenCV.

Why choose OpenCV?

I have a hobby real-time object-detection project written in C++. After video frame is obtained, I’m facing the choice of frameworks suitable for object-detection on frame.

After investigation, I found that popular AI frameworks such as Tensorflow, Pytorch are stronger at training models. However, for an AI application, only inference part is important.

Since the project already uses OpenCV for other frame handling work, there is no need to import another AI framework and external project dependency is minimized.

Our friend of this task is DNN module. It provides support for deep learning inference. In the past, OpenCV only supports CPU inference which limits its usage, especially for real-time cases.

Fortunately, a sub-module named cuda4dnn is added recently which provides CUDA support for DNN. This sub-module is backed by NVIDIA’s cuDNN library, so inference performance on NVIDIA GPU shall be guaranteed.

code walk-through

The overall inference process is as follows:

1. load module

DNN supports multiple module formats including .pb(tensorflow), .onnx, etc. YOLO’s darknet model is also supported. To start, download pre-trained .weights file here.

Load the pre-trained weights with readNet:

cv::dnn::Net net_ = cv::dnn::readNet(weights_location, cfg_location);

2. set backend and target

DNN is a generalized inference engine supporting different backends including OpenCL, CUDA, FPGA, etc. Here we set CUDA as backend to use GPU for inference.


3. do inference

Use image blob as the input of DNN Net, do computation, then save the results in outs array:

cv::Mat blob = cv::dnn::blobFromImage(img, 1.0 / 255, cv::Size(320, 320), cv::Scalar(), true, false, CV_32F);
std::vector<cv::Mat> outs;
net_.forward(outs, net_.getUnconnectedOutLayersNames());



This is OpenCV world’s tensor/numpy-array, representing an n-dimensional array.

Data is wrapped inside Mat object and no manual garbage-collection is needed, just like std::vector. Data manipulation thus is much simpler than raw array.


The NCHW ordered 4-dimensional Mat transformed from input image. Blob served as the input of DNN Net.

4. NMS(Non-maximum Suppression) filtering

Object detection inference generates lots of similar candidate boxes. NMS is a technique that filters candidates. Here is what NMS process looks like: Explanation of NMS

5. For each detection, draw a bounding box on frame

The following code loops over all detections and draws a bounding box with classification name and confidence on the frame.

for (size_t i = 0; i < indices.size(); ++i)
    int idx = indices[i];
    cv::Rect2d box = boxes[idx];
    float conf = confidences[idx];
    int class_id = classIds[idx];
    add_bounding_box(class_names_, classIds[idx], confidences[idx], box.x, box.y, box.x + box.width, box.y + box.height, img);

We can use OpenCV to draw bounding boxes as well!

add_bounding_box(std::vector<std::string> &class_names, int classId, float conf, int left, int top, int right, int bottom, cv::Mat& frame)
    cv::rectangle(frame, cv::Point(left, top), cv::Point(right, bottom), cv::Scalar(0, 255, 0));

    std::string label = cv::format("%.2f", conf);
    label = class_names[classId] + ": " + label;
    int baseLine;
    cv::Size labelSize = cv::getTextSize(label, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);

    top = cv::max(top, labelSize.height);
    cv::rectangle(frame, cv::Point(left, top - labelSize.height),
            cv::Point(left + labelSize.width, top + baseLine), cv::Scalar::all(255), cv::FILLED);
    cv::putText(frame, label, cv::Point(left, top), cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar());

cv::rectangle and cv:putText are 2 useful functions to help us draw bounding boxes and labels.

How DNN module is implemented internally

Let’s deep dive into OpenCV’s source code to see how DNN is implemented.


This class represents an artificial neural network model.


This function loads model file into Net instance.


The actual inference is done by this function. Let’s see what it looks like:

//source file:
void Net::forward(OutputArrayOfArrays outputBlobs, const String& outputName)

setUpNet is used to construct DNN Net for computation. The forwardToLayer function is backed by specific backend node which does the concrete computation.


This class represents an abstract Neural Network Node. It is the building block of DNN Net.


Different backends inherits this class to provide different backend support of Layer.

How NVIDIA GPU is utilized by DNN module

As the above explanation indicates, cuda4dnn submodule provides CUDA backend support for DNN module. It implements abstract interfaces like BackendNode, and wrap concrete computation code inside.

Example: ReLU

ReLU is a typical activation function defined as: $f(x) = max(0,x)$

Let’s see how ReLU is implemented in cuda4dnn:


ReLUOp is an implementation of CUDABackendNode.

//source file: opencv/modules/dnn/src/cuda4dnn/primitives/activation.hpp
template <class T>
class ReLUOp final : public CUDABackendNode {
    void forward(
        const std::vector<cv::Ptr<BackendWrapper>>& inputs,
        const std::vector<cv::Ptr<BackendWrapper>>& outputs,
        csl::Workspace& workspace) override
        for (int i = 0; i < inputs.size(); i++)
            kernels::relu<T>(stream, output, input, slope);

When forward function is called, ReLUOp class proxies computation to kernels::relu. Definition of kernels::relu is as follows:

template <class T>
void relu(const Stream& stream, Span<T> output, View<T> input, T slope) {
    generic_op<T, relu_functor>(stream, output, input, slope);

CUDA kernel

A CUDA kernel of type relu_functor will then be generated by following function:

template <class T, class Functor, std::size_t N, class ...FunctorArgs>
__global__ void generic_op_vec(Span<T> output, View<T> input, FunctorArgs ...functorArgs) {
    using vector_type = get_vector_type_t<T, N>;

    auto output_vPtr = vector_type::get_pointer(;
    auto input_vPtr = vector_type::get_pointer(;

    Functor functor(functorArgs...);

    for (auto i : grid_stride_range(output.size() / vector_type::size())) {
        vector_type vec;
        v_load(vec, input_vPtr[i]);
        for (int j = 0; j < vector_type::size(); j++)
  [j] = functor([j]);
        v_store(output_vPtr[i], vec);

Launch kernel

Finally, the kernel will be launched by following function:

template <class Kernel, typename ...Args> inline
void launch_kernel(Kernel kernel, Args ...args) {
    auto policy = make_policy(kernel);
    kernel <<<policy.grid, policy.block>>> (args...);

ReLU Functor

Let’s see what the core ReLU logic looks like in relu_functor:

template <class T>
struct relu_functor {
    __device__ relu_functor(T slope_) : slope{slope_} { }
    __device__ T operator()(T value) {
        return value >= T(0) ? value : slope * value;

    T slope;

We can see that cuda4dnn module actually implements Leaky ReLU variant. The signal can be leaked “backward” when input is from negative direction.

Further Reading

  1. Sample source code