This is the third blog post in the OAK series. If you haven't read the previous posts on OAK, you can find them linked below.
In this post, we will look at how to run an existing pre-trained model on the OAK device and get inference results from it.
- A brief overview/recap of OAK and DepthAI
- Supported models
- Sources of pre-trained models
- Available Neural Network nodes overview
- Pipeline Overview
- Code: Running a pre-trained face detection model
- Output
- Conclusion
1. A Brief Overview of OAK-D and DepthAI
In the previous posts, we got an overview of OAK-D and saw how it offers different cameras to calculate disparity and depth.
OAK-D and OAK-D Lite are not just stereo cameras; they are also equipped with a Myriad X VPU onboard. The VPU, or Vision Processing Unit, allows OAK-D to perform multiple operations on the device itself, like image manipulation (warping, dewarping, resizing, cropping, edge detection, etc.), RGB-depth alignment, and object tracking, and you can even run custom computer vision functions.
The VPU supports the inference of neural networks (as long as the model is converted to the blob format). You can even run multiple AI models simultaneously, either in parallel or in series.
This ability of OAK makes it an all-in-one platform for your computer vision needs.
2. Supported Models
OAK cameras can run any AI model, even ones with a custom architecture, and they can run multiple AI models at the same time, either in parallel or in series.
Before using your custom-trained models, you need to convert them into the MyriadX blob file format so that they are optimized for inference on the MyriadX VPU processor.
Two conversion steps have to be taken to obtain a blob file:
- Use Model Optimizer to produce OpenVINO IR representation (where IR stands for Intermediate Representation)
- Use Model Compiler to convert IR representation into MyriadX blob
Higher-level solutions also exist so that you don't get stuck on the model conversion process. You can visit the online MyriadX Blob converter, which lets you specify different OpenVINO target versions and supports conversions from TensorFlow, Caffe, OpenVINO IR, and OpenVINO Model Zoo models.
For automated usage of the blob converter tool, there is the blobconverter PyPI package, which allows compiling MyriadX blobs both from the command line and directly from a Python script.
The latter is what we will be using in the example below.
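As a quick illustration, here is a minimal sketch of the Python route, assuming the blobconverter package is installed (pip install blobconverter); the model file paths below are placeholders, not part of this post's example.

import blobconverter

# Compile a model that is already in OpenVINO IR format (model.xml + model.bin)
# into a MyriadX blob. The paths here are placeholders.
blob_path = blobconverter.from_openvino(
    xml="model.xml",
    bin="model.bin",
    data_type="FP16",
    shaves=6,
)

# Or download and compile a model straight from a supported model zoo by name.
blob_path = blobconverter.from_zoo(
    name="face-detection-retail-0004",
    zoo_type="depthai",
    shaves=6,
)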
3. Sources of Pre-trained Models
Following are a few sources that provide ready-to-deploy pre-trained models for OAK.
Open Model Zoo
Open Model Zoo provides a wide variety of free, highly optimized, pre-trained deep-learning models that run blazingly fast on Intel CPUs, GPUs, and VPUs.
This repository contains over 200 neural network models for tasks including Object Detection, Classification, Image Segmentation, Handwriting Recognition, Text-to-Speech, Human Pose Estimation, and others.
There are two kinds of models.
- Intel’s Pre-Trained Models: The team at Intel has trained these models and optimized them to run with OpenVINO.
- Public Pre-Trained Models: These are models created by the AI community and can be easily converted to OpenVINO format using OpenVINO Model Optimizer.
Luxonis Model Zoo
The OpenCV AI Kit is quickly becoming the go-to embedded platform of choice for many developers of computer vision applications. To help users get to know the platform's capabilities better, Luxonis, the creators of OAK, have created the DepthAI Model Zoo. It is a growing collection of ready-to-use open-source models for the Luxonis OpenCV AI Kit platform.
You can find models for tasks such as Monocular Depth Estimation, Object Detection, Segmentation, Facial Landmark Detection, Text Detection, Classification, and many more as new models are added to the model zoo.
Modelplace.ai
Modelplace.AI is a marketplace for machine learning models and a platform for the community to share their custom-trained models.
It has a growing collection of OAK-compatible models for various Computer Vision tasks, be it Classification, Object Detection, Pose Estimation, or Text Detection.
It comes with a web interface to try out a model of your liking on your own images. You can also compare models that perform similar tasks against one another on standard benchmarks.
Don’t forget to check out the previous post for Top sources to find Computer Vision Models.
4. Available Neural Network Nodes Overview
NeuralNetwork
This node runs neural inference using the defined model on input data.
This node gives the raw output of the neural network, which means you have to decode the output yourself.
This is the node you want to use when you are implementing your custom models.
Input:
- Image to perform inference on
Outputs:
- Raw neural network output
- Input passthrough
Syntax:
pipeline = dai.Pipeline()
nn = pipeline.create(dai.node.NeuralNetwork)
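As a rough sketch of how this node is typically wired up (the blob path and stream name below are placeholders, not part of the face detection example later in this post):

import depthai as dai

pipeline = dai.Pipeline()

# Camera that feeds frames to the network
cam = pipeline.create(dai.node.ColorCamera)
cam.setPreviewSize(300, 300)
cam.setInterleaved(False)

# Neural network node running a custom blob; its raw output is decoded on the host
nn = pipeline.create(dai.node.NeuralNetwork)
nn.setBlobPath("path/to/custom_model.blob")  # placeholder path
cam.preview.link(nn.input)

# Stream the raw inference results to the host
nn_xout = pipeline.create(dai.node.XLinkOut)
nn_xout.setStreamName("nn_out")
nn.out.link(nn_xout.input)

On the host, the messages read from the "nn_out" queue are NNData objects, which you decode yourself (for example, with getFirstLayerFp16()).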
MobileNetDetectionNetwork
The MobileNetDetectionNetwork node extends the NeuralNetwork node. The only difference is that this node is specifically for MobileNet-based NNs, and it decodes the result of the NN on the device. This means that the output of this node is not a raw byte array but an ImgDetections message that can easily be used in your code.
Inputs:
- Image to perform detection on
Outputs:
- Detection output
- Input image passthrough
Syntax:
pipeline = dai.Pipeline()
mobilenetDet = pipeline.create(dai.node.MobileNetDetectionNetwork)
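Continuing the snippet above, a brief sketch of the extra configuration this node typically needs (the blob path is a placeholder):

# Point the node at a MobileNet-SSD style blob and filter weak detections
mobilenetDet.setBlobPath("path/to/mobilenet-ssd.blob")  # placeholder path
mobilenetDet.setConfidenceThreshold(0.5)

On the host, each message from mobilenetDet.out is an ImgDetections object whose detections carry a label, a confidence score, and a normalized bounding box (xmin, ymin, xmax, ymax), as we will see in the face detection example below.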
MobileNetSpatialDetectionNetwork
The MobileNetSpatialDetectionNetwork node works similarly to the MobileNetDetectionNetwork node, but along with the detection results, it also outputs the spatial location of each bounding box.
This network node mirrors the functionality of the SpatialLocationCalculator node on top of the MobileNet detection network node. The SpatialLocationCalculator node gives the average distance within an ROI of the depth frame.
Inputs:
- Image to perform detection on
- Depth frame
Outputs:
- Detection output
- Input image passthrough
- Depth passthrough
Syntax:
pipeline = dai.Pipeline()
mobilenetSpatial = pipeline.create(dai.node.MobileNetSpatialDetectionNetwork)
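Continuing the snippet above, a short sketch of the depth-related settings this node exposes (the blob path is a placeholder; we will use the depth thresholds again in the face detection example below):

mobilenetSpatial.setBlobPath("path/to/mobilenet-ssd.blob")  # placeholder path
mobilenetSpatial.setConfidenceThreshold(0.5)
mobilenetSpatial.setBoundingBoxScaleFactor(0.5)  # shrink the ROI used for depth averaging
mobilenetSpatial.setDepthLowerThreshold(100)     # ignore depth values closer than 100 mm
mobilenetSpatial.setDepthUpperThreshold(5000)    # ignore depth values farther than 5000 mm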
Similar to MobileNetDetectionNetwork and MobileNetSpatialDetectionNetwork, we have YoloDetectionNetwork and YoloSpatialDetectionNetwork nodes to get the decoded detection and spatial detection output from a YOLO network.
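The YOLO variants additionally need the model's decoding parameters. Below is a rough sketch using anchor values typical of a tiny-YOLO model; the blob path is a placeholder, and the class count, anchors, and masks must match how your particular YOLO model was trained.

pipeline = dai.Pipeline()

yoloDet = pipeline.create(dai.node.YoloDetectionNetwork)
yoloDet.setBlobPath("path/to/yolo.blob")  # placeholder path
yoloDet.setConfidenceThreshold(0.5)

# On-device YOLO decoding needs the model's own parameters
yoloDet.setNumClasses(80)
yoloDet.setCoordinateSize(4)
yoloDet.setAnchors([10, 14, 23, 27, 37, 58, 81, 82, 135, 169, 344, 319])
yoloDet.setAnchorMasks({"side26": [1, 2, 3], "side13": [3, 4, 5]})
yoloDet.setIouThreshold(0.5)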
5. Pipeline Overview
The pipeline we build in this post works as follows: the RGB camera preview is resized by an ImageManip node and fed into a MobileNetSpatialDetectionNetwork node, the left and right mono cameras feed a StereoDepth node whose depth output also goes into the detection network, and two XLinkOut nodes stream the camera preview and the detection results to the host.
6. Code
Import Libraries
import cv2
import depthai as dai
import time
import blobconverter
Define Frame size
FRAME_SIZE = (640, 360)
Define the NN model name and input size
Define the input size, the model name, and the zoo name from which to download the model (only 'depthai' and 'intel' zoos are supported at the time of writing).
Note: If you define the path to the blob file directly, make sure model_name and zoo_type are set to None.
For this demo, we use the "face-detection-retail-0004" face detection model from the DepthAI model zoo.
DET_INPUT_SIZE = (300, 300)
model_name = "face-detection-retail-0004"
zoo_type = "depthai"
blob_path = None
Create Pipeline
Start defining a pipeline
pipeline = dai.Pipeline()
Define a source – RGB camera
Get the RGB camera frame
cam = pipeline.createColorCamera()
cam.setPreviewSize(FRAME_SIZE[0], FRAME_SIZE[1])
cam.setInterleaved(False)
cam.setResolution(dai.ColorCameraProperties.SensorResolution.THE_1080_P)
cam.setBoardSocket(dai.CameraBoardSocket.RGB)
Define mono camera sources for stereo depth
mono_left = pipeline.createMonoCamera()
mono_left.setResolution(dai.MonoCameraProperties.SensorResolution.THE_400_P)
mono_left.setBoardSocket(dai.CameraBoardSocket.LEFT)

mono_right = pipeline.createMonoCamera()
mono_right.setResolution(dai.MonoCameraProperties.SensorResolution.THE_400_P)
mono_right.setBoardSocket(dai.CameraBoardSocket.RIGHT)
Create stereo node
stereo = pipeline.createStereoDepth()
Linking mono cam outputs to stereo node
mono_left.out.link(stereo.left)
mono_right.out.link(stereo.right)
Use blobconverter to get the blob of the required model
We use blobconverter to download and compile the model defined earlier from the selected model zoo, 'depthai' or 'intel'.
We also specify the 'shaves' parameter. This tells blobconverter to compile the model to run on the specified number of SHAVE vector cores; the higher the value, the faster the network can run.
if model_name is not None:
    blob_path = blobconverter.from_zoo(
        name=model_name,
        shaves=6,
        zoo_type=zoo_type
    )
What are SHAVES?
The SHAVES are vector processors in DepthAI/OAK.
Besides running the neural network, these SHAVES are also used for other tasks on the device, like handling the reformatting of images, doing some of the ISP work, etc.
So, there is a limit to how many SHAVES you can use at once. The higher the camera resolution, the more SHAVES are consumed.
- For 1080p resolution, 13 SHAVES (out of 16) are free for neural network operations.
- For 4K sensor resolution, 10 SHAVES are available for neural operations.
Define face detection NN node
face_spac_det_nn = pipeline.createMobileNetSpatialDetectionNetwork()
face_spac_det_nn.setConfidenceThreshold(0.75)
face_spac_det_nn.setBlobPath(blob_path)
face_spac_det_nn.setDepthLowerThreshold(100)
face_spac_det_nn.setDepthUpperThreshold(5000)
Define face detection input config
Preprocess the image frame for the Neural Network input. For that, we use the ImageManip node.
ImageManip is the node that can apply different transformations on the input image and give the transformed image as the output.
Here, this node is used to resize the image frame coming from the camera to the dimensions that our model accepts. We will learn more in-depth about this and other nodes in a later post on Creating a Complex Pipeline using DepthAI.
face_det_manip = pipeline.createImageManip()
face_det_manip.initialConfig.setResize(DET_INPUT_SIZE[0], DET_INPUT_SIZE[1])
face_det_manip.initialConfig.setKeepAspectRatio(False)
Linking
We link the RGB camera output to the ImageManip Node, the output of the ImageManip node to the Neural Network input, and the stereo depth output to the NN node.
cam.preview.link(face_det_manip.inputImage)
face_det_manip.out.link(face_spac_det_nn.input)
stereo.depth.link(face_spac_det_nn.inputDepth)
Create preview output
Create a stream to get the output from the camera
x_preview_out = pipeline.createXLinkOut()
x_preview_out.setStreamName("preview")
cam.preview.link(x_preview_out.input)
Create detection output
Create a stream to get the output from the Neural Network
det_out = pipeline.createXLinkOut()
det_out.setStreamName('det_out')
face_spac_det_nn.out.link(det_out.input)
Define display function
We define a function to display info on the image frame
def display_info(frame, bbox, coordinates, status, status_color, fps):
    # Display the bounding box and spatial coordinates if a face was detected
    if bbox is not None:
        cv2.rectangle(frame, bbox, status_color[status], 2)

        if coordinates is not None:
            coord_x, coord_y, coord_z = coordinates
            cv2.putText(frame, f"X: {int(coord_x)} mm", (bbox[0] + 10, bbox[1] + 20), cv2.FONT_HERSHEY_TRIPLEX, 0.5, 255)
            cv2.putText(frame, f"Y: {int(coord_y)} mm", (bbox[0] + 10, bbox[1] + 35), cv2.FONT_HERSHEY_TRIPLEX, 0.5, 255)
            cv2.putText(frame, f"Z: {int(coord_z)} mm", (bbox[0] + 10, bbox[1] + 50), cv2.FONT_HERSHEY_TRIPLEX, 0.5, 255)

    # Create a background for showing details
    cv2.rectangle(frame, (5, 5, 175, 100), (50, 0, 0), -1)

    # Display the detection status on the frame
    cv2.putText(frame, status, (20, 40), cv2.FONT_HERSHEY_SIMPLEX, 0.5, status_color[status])

    # Display the FPS on the frame
    cv2.putText(frame, f'FPS: {fps:.2f}', (20, 80), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255, 255, 255))
Define some variables that we will use in the main loop
# Frame count
frame_count = 0

# Placeholder fps value
fps = 0

# Used to record the time when we processed the last frame
prev_frame_time = 0

# Used to record the time at which we processed the current frame
new_frame_time = 0

# Set status colors
status_color = {
    'Face Detected': (0, 255, 0),
    'No Face Detected': (0, 0, 255)
}
Main Loop
We start the pipeline, acquire video frames from the "preview" queue, and get the NN outputs (detections) from the "det_out" queue.
Once we have the outputs, we display the spatial information and the bounding box on the image frame.
# Start pipeline
with dai.Device(pipeline) as device:

    # Output queue will be used to get the camera frames from the output defined above
    q_cam = device.getOutputQueue(name="preview", maxSize=1, blocking=False)

    # Output queue will be used to get nn detection data from the video frames
    q_det = device.getOutputQueue(name="det_out", maxSize=1, blocking=False)

    while True:
        # Get camera frame
        in_cam = q_cam.get()
        frame = in_cam.getCvFrame()

        bbox = None
        coordinates = None

        inDet = q_det.tryGet()

        if inDet is not None:
            detections = inDet.detections

            # If a face is detected
            if len(detections) != 0:
                detection = detections[0]

                # Correct bounding box
                xmin = max(0, detection.xmin)
                ymin = max(0, detection.ymin)
                xmax = min(detection.xmax, 1)
                ymax = min(detection.ymax, 1)

                # Calculate coordinates
                x = int(xmin * FRAME_SIZE[0])
                y = int(ymin * FRAME_SIZE[1])
                w = int(xmax * FRAME_SIZE[0] - xmin * FRAME_SIZE[0])
                h = int(ymax * FRAME_SIZE[1] - ymin * FRAME_SIZE[1])

                bbox = (x, y, w, h)

                # Get spatial coordinates
                coord_x = detection.spatialCoordinates.x
                coord_y = detection.spatialCoordinates.y
                coord_z = detection.spatialCoordinates.z

                coordinates = (coord_x, coord_y, coord_z)

        # Check if a face was detected in the frame
        if bbox:
            # Face detected
            status = 'Face Detected'
        else:
            # No face detected
            status = 'No Face Detected'

        # Display info on frame
        display_info(frame, bbox, coordinates, status, status_color, fps)

        # Calculate average fps over the last 10 frames
        if frame_count % 10 == 0:
            # Time when we finished processing the last 10 frames
            new_frame_time = time.time()

            # Fps will be the number of frames processed in one second
            fps = 1 / ((new_frame_time - prev_frame_time) / 10)
            prev_frame_time = new_frame_time

        # Capture the key pressed
        key_pressed = cv2.waitKey(1) & 0xff

        # Stop the program if Esc key was pressed
        if key_pressed == 27:
            break

        # Display the final frame
        cv2.imshow("Face Cam", frame)

        # Increment frame count
        frame_count += 1

cv2.destroyAllWindows()
7. Output


8. Conclusion
That is all about how you can incorporate a pre-trained model into your pipeline and run it on an OAK device.
The next post in this series will explore the other pipeline nodes available to us and how they can be used together to create complex pipelines.