Detecting small objects in aerial imagery, particularly for critical applications like sea rescue, presents unique challenges. Timely detection of people in the water can mean the difference between life and death. Our experiment focuses on fine-tuning Faster R-CNN, a robust two-stage object detector, to address this vital need.
Central to our study is the SeaDroneSee dataset, a vital collection of images for training models to identify seafarers in distress. We enhance the model’s learning by preprocessing images into patches, allowing it to focus on smaller, detailed regions and significantly improving detection accuracy. Additionally, we explore the synergy between this approach and the advanced slicing technique of SAHI, comparing their effectiveness.
Our approach emphasizes the importance of data preprocessing and advanced post-processing techniques. By tailoring these steps to the specific challenges of small object detection, we aim to achieve top-tier results and push the boundaries of aerial imagery analysis.
Join us as we explore this exciting application of fine-tuning Faster R-CNN for a life-saving cause!
- Why Fine-tuning Faster R-CNN in 2024?
- Understanding the Dataset
- Patch Creation: As a Preprocessing Technique
- Code Walkthrough : Fine-tuning Faster R-CNN
- Dataclass Preparation
- Training Configuration
- Predictions
- Combining SAHI with fine-tuned Faster R-CNN
- Comparison of Faster RCNN Detection with SAHI v/s Without SAHI v/s with Patches as Input
- Key Takeaways
- Conclusion
- References
Why Fine-tuning Faster R-CNN in 2024?
Despite the emergence of newer state-of-the-art detectors that are more accurate or lower in latency, Faster R-CNN remains a robust choice for detecting small, fine image details, which aligns well with our application.
Faster R-CNN utilizes a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enhancing the efficiency and accuracy of generating candidate object bounding boxes. This shared mechanism is particularly beneficial for capturing small objects because it allows the network to dedicate more processing power to subtle features and distinctions in smaller regions of interest. Consequently, this makes Faster R-CNN adept at handling scenarios where objects of interest are small and require precise localization, essential in scenarios like monitoring and detecting objects in expansive sea landscapes.
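As an aside, torchvision also lets you shrink the RPN anchor sizes when assembling a Faster R-CNN from a custom backbone, which is one more lever for small objects. Below is a minimal sketch following the torchvision custom-backbone pattern; the anchor sizes and MobileNetV2 backbone here are illustrative and are not the configuration we fine-tune later in this article.
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator
from torchvision.ops import MultiScaleRoIAlign

# Illustrative Faster R-CNN with smaller RPN anchors than the defaults,
# so the RPN proposes boxes better matched to tiny objects.
backbone = torchvision.models.mobilenet_v2(weights="DEFAULT").features
backbone.out_channels = 1280  # channels of the backbone's last feature map
anchor_generator = AnchorGenerator(sizes=((8, 16, 32, 64),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))
roi_pooler = MultiScaleRoIAlign(featmap_names=["0"], output_size=7, sampling_ratio=2)
small_object_frcnn = FasterRCNN(backbone, num_classes=6,
                                rpn_anchor_generator=anchor_generator,
                                box_roi_pool=roi_pooler)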
You can give the Faster R-CNN paper a quick read to learn more about how its Region Proposal Network (RPN) works.
At CVPR 2024, the R-CNN paper was recognized for its lasting impact [Source]:
- Longuet-Higgins Prize – Awarded to a paper that has withstood the test of time, the 2024 Longuet-Higgins Prize recognizes the CVPR paper from 2014 with the most impact.
Understanding the Dataset
Unmanned Aerial Vehicles (UAVs) are fast to deploy, relatively inexpensive, and pose much less risk compared to traditional methods. Equipped with various sensors, they provide a comprehensive overview of the scene and can cover large areas autonomously to search for objects or people.
This project aims to develop a UAV to assist in humanitarian Search and Rescue scenarios. It is a collaborative effort between Collins Aerospace and the University of Tübingen. Using onboard vision sensors and telemetry data, neural networks will aid in searching for objects of interest and reporting detected anomalies to operators at the ground station.
Below is an illustration of the components involved in this innovative solution.
Imagine a drone soaring over the ocean, searching for survivors. That’s the goal behind SeaDronesSee, a massive dataset designed to train computer vision systems for Search and Rescue (SAR) missions.
This dataset is like a training ground for embedded computer vision. It contains real-world video footage of maritime environments, where the challenge is to spot people in the water.
Intrigued by drone programming for computer vision? Check out our essential guide!
SeaDronesSee is split into three sections:
- Object Detection: This teaches the system to identify objects like people in the ocean’s vastness.
- Single-Object Tracking: Once a person is spotted, the system learns to follow them, even if they move around.
- Multi-Object Tracking: There might be multiple survivors in a real SAR mission. This section trains the system to track them all simultaneously.
By analyzing this data, drones become more adept at assisting SAR missions, making them smarter lifesavers.
This article focuses on the object detection v2 subset of the SeaDroneSee dataset, which contains:
- 8930 train images
- 1547 validation images
- 3750 test images
A key challenge in this type of dataset is achieving accurate identification with labels for objects, especially since many classes are quite small and difficult to detect.
Note that the image dimensions aren’t uniform throughout the dataset.
Here are the image sizes of the dataset (W, H):
- (5436,3632)
- (3840,2160)
- (1230,932)
- (1231,933)
- (3632,5456)
- (1920,1080)
Classes:
0: ‘ignored’, 1: ‘swimmer’, 2: ‘boat’, 3: ‘jetski’, 4: ‘life_saving_appliances’, 5: ‘buoy’
An “ignored” region contains objects that are difficult to annotate due to low resolution or crowding, or that are unwanted in the dataset.
We also observe that this dataset is imbalanced, with a significant discrepancy in class distribution, especially for small object classes such as swimmers, buoys, and life-saving appliances.
Patch Creation: As a Preprocessing Technique
In our dataset, the images are high resolution, going up to 4K and beyond. These high-resolution images pose challenges due to their sheer size, which can drive up compute and memory requirements. By dividing these images into patches and saving them, we can process those smaller sections independently, thus reducing the computational load and enabling the model to focus on finer details. This is particularly beneficial in our use case of detecting small objects, such as swimmers or boats at very distant points in the vast ocean.
In our approach, we utilize a patch overlap ratio of 0.2. This overlap ensures that no critical information is lost between patches. By having overlapping regions, the model can learn from multiple perspectives of the same area and learn the distinguishing features of these objects. Furthermore, patch creation allows for increased data augmentation, effectively increasing the number of training samples.
Now guess what? The small object detection problem has become a typical object detection problem. That sounds pretty intuitive, right?
Image patch creation mirrors the operation of Convolutional Neural Networks (CNNs) as both techniques involve processing localized areas of an image to extract and learn feature representations effectively.
To access the code featured in this article and try fine-tuning Faster R-CNN with Pytorch yourself, simply fill details in the “Download Source Code” banner.
Code Walkthrough : Fine-tuning Faster R-CNN
Let’s start by downloading our dataset from Kaggle with the following shell commands:
# !pip install -qq torch torchvision kaggle
#!sudo apt-get install unzip -y
!sudo apt-get install tree
!kaggle datasets download -d ubiratanfilho/sds-dataset
The downloaded dataset is structured like this:
compressed
├── annotations
│ ├── instances_train.json
│ └── instances_val.json
├── images
│ ├── train
│ └── val
└── test
instances_train.json and instances_val.json contain the image IDs along with metadata about the drone (the camera source), such as latitude, longitude, speed, etc. Alongside this metadata are the annotations we actually care about: bounding boxes and class (category) IDs.
"annotations": [{"id": 14579, "image_id": 3388, "bbox": [3619, 1409, 75, 38], "area": 2850, "category_id": 2}, {"id": 14581, "image_id": 3389, "bbox": [3524, 1408, 73, 37], "area": 2701, "category_id": 2},
Installing Dependencies
We will set up our model fine-tuning pipeline using the torchvision library, with torchmetrics and pycocotools for calculating the evaluation metrics.
# !pip install -qq torchvision
# !pip install -qq torch
!pip install -qq torchmetrics[detection]
!pip install -qq pycocotools
!pip install -qq tensorboard
To adapt our training code and utilities for torchvision object detection, we will simply clone the official torchvision repository.
!git clone https://github.com/pytorch/vision.git #Training Metric Utilities from Torchvision
Import Libraries
Then, necessary libraries are imported.
import os
import gc
import json
import math
import random
import requests
import zipfile
import numpy as np
from PIL import Image, ImageDraw, ImageFont, ImageOps, ImageStat
import PIL
import torch
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.utils.tensorboard import SummaryWriter
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from matplotlib.patches import Patch
import logging
from tqdm import tqdm
from torchmetrics.detection.mean_ap import MeanAveragePrecision
from dataclasses import dataclass
import torchvision
from vision.references.detection import utils
import torchvision.transforms as T
from torchvision.transforms import v2 as Tv2
from torchvision import tv_tensors
from torchvision.transforms import functional as F
from torchvision.transforms.functional import to_pil_image
import torchvision.models.detection as detection
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.transform import GeneralizedRCNNTransform
Let’s set the seed for reproducibility.
def set_seeds():
# fix random seeds
SEED_VALUE = 42
random.seed(SEED_VALUE)
np.random.seed(SEED_VALUE)
torch.manual_seed(SEED_VALUE)
if torch.cuda.is_available():
torch.cuda.manual_seed(SEED_VALUE)
torch.cuda.manual_seed_all(SEED_VALUE)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = True
set_seeds()
Downloading Patched Dataset
To save time and compute, you can directly download the patches we created by running the following script. Use it if you want to skip the patch-creation preprocessing step.
if not os.path.exists('SeaDroneSee'):
os.mkdir('SeaDroneSee')
!wget -O SeaDroneSee/SeaDroneSee.zip "https://www.dropbox.com/scl/fi/0oyv9pki57laqgmq7matd/SeaDroneSee.zip?rlkey=yasyxr0u3450dylv5musks1s0&st=q12t3tc3&dl=1"
!wget -O SeaDroneSee/SeaDroneSee_test.zip "https://www.dropbox.com/scl/fi/4qidpahgu9mogam33uxlz/SeaDroneSee_test.zip?rlkey=1gt6mebuppxg4ehzhicwqafav&st=5g01mcdb&dl=1"
def download_file(url, save_name):
if not os.path.exists(save_name):
# Handling potential redirection in requests
with requests.get(url, allow_redirects=True) as r:
if r.status_code == 200:
with open(save_name, 'wb') as f:
f.write(r.content)
else:
print("Failed to download the file, status code:", r.status_code)
def unzip(zip_file=None, target_dir=None):
try:
with zipfile.ZipFile(zip_file, 'r') as z:
z.extractall(target_dir)
print("Extracted all to:", target_dir)
except zipfile.BadZipFile:
print("Invalid file or error during extraction: Bad Zip File")
except Exception as e:
print("An error occurred:", e)
save_path = 'SeaDroneSee/SeaDroneSee.zip'
model_ckpt_url = 'https://www.dropbox.com/scl/fi/xmftrum0a8rgjp82j6n65/model_ckpt.zip?rlkey=aywwl28rbcbiejggdps0durfu&st=dda61bld&dl=1'
model_save_path = 'SeaDroneSee/Model_ckpt.zip'
download_file(model_ckpt_url, model_save_path)
unzip(zip_file=model_save_path, target_dir='SeaDroneSee') # Specify target directory for the model checkpoint
unzip(zip_file=save_path)
test_save_path= 'SeaDroneSee/SeaDroneSee_test.zip'
unzip(zip_file = test_save_path, target_dir='SeaDroneSee')
To create patches with your own choice of patch size, overlap ratio, and number of patches to be saved, work through the following patch-creation code sections.
Utilities
Let’s do a class mapping and assign a unique color for each label or class ID.
classes_to_idx = {
0: 'ignored',
1: 'swimmer',
2: 'boat',
3: 'jetski',
4: 'life_saving_appliances',
5: "buoy"
}
# Mapping category IDs to colors
category_colors = {
0: 'black', # ignored
1: 'red', # swimmer
2: 'orange', # boat
3: 'blue', # jetski
4: 'purple', # life saving appliances
5: 'yellow' # buoy
}
Understanding the dataset is a crucial step in any deep-learning task. Hence, we will spend a good amount of time here on some preprocessing techniques for this dataset.
To inspect and visualize the ground truth annotations, let’s define a draw_bounding_boxes utility.
One of the key aspects in object detection is the bounding box format, which can derail our Faster R-CNN fine-tuning pipeline if not handled properly. Since our dataset annotations are in XYWH format, we need to convert them to XYXY, the format expected by PIL’s ImageDraw rectangle function.
def draw_bounding_boxes(image, bboxes):
draw = ImageDraw.Draw(image)
font_size = int(min(image.size) * 0.02) # Adjust font size based on image size
font_path = "/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf"
font = ImageFont.truetype(font_path, font_size) if os.path.exists(font_path) else ImageFont.load_default()
for bbox, category_id in bboxes:
x, y, w, h = bbox
x1, y1, x2, y2 = x, y, x + w, y + h
color = category_colors.get(category_id, 'white') # Default to white if category_id is unknown
draw.rectangle([x1, y1, x2, y2], outline=color, width=4)
draw.text((x1, y1 - font_size), str(category_id), fill=color, font=font)
return image
The load_annotations utility takes the path of instances_train.json or instances_val.json, loads the file, and returns its annotations.
def load_annotations(annotation_path):
with open(annotation_path, 'r') as f:
annotations = json.load(f)
return annotations
After loading them, we will iterate over each bounding box and its image ID from the annotations. Then, using matplotlib, we will plot the training and validation images with their ground-truth annotations for an arbitrary number of samples of our choice.
def visualize_samples(image_dir, annotation_path, num_samples=5):
annotations = load_annotations(annotation_path)
images_info = annotations['images']
bboxes_info = annotations['annotations']
images_with_bboxes = {}
for bbox in bboxes_info:
image_id = bbox['image_id']
if image_id not in images_with_bboxes:
images_with_bboxes[image_id] = []
images_with_bboxes[image_id].append((bbox['bbox'], bbox['category_id']))
# Shuffle list of images
random.shuffle(images_info)
# Visualize samples
plt.figure(figsize=(15, num_samples * 5))
sample_count = 0
for image_info in images_info:
if sample_count >= num_samples:
break
image_path = os.path.join(image_dir, image_info['file_name'])
if not os.path.exists(image_path):
continue # Skip this image if the file does not exist
image = Image.open(image_path)
image_id = image_info['id']
# print(f"Img ID: {image_id} Image Dimension: {image.size}")
if image_id in images_with_bboxes:
bboxes = images_with_bboxes[image_id]
image = draw_bounding_boxes(image, bboxes)
plt.subplot(num_samples, 1, sample_count + 1)
plt.imshow(image)
plt.axis('off')
plt.title(f"Image ID: {image_id}")
sample_count += 1
plt.tight_layout()
plt.show()
Data Cleaning
Let’s plot the ground-truth annotations on every train and val image and save the results to disk for manual visual inspection.
# Directories
image_dir = 'compressed/images/train'
output_dir = 'compressed/train_gt/bbox_ann_images'
annotation_path = 'compressed/annotations/instances_train.json'
# Create the output directory if it doesn't exist
if not os.path.exists(output_dir):
os.makedirs(output_dir)
# Process all images according to annotations
def process_and_save_annotated_images(image_dir, output_dir, annotations):
image_annotations = {}
for annot in annotations['annotations']:
image_annotations.setdefault(annot['image_id'], []).append(annot)
for image_id, annots in image_annotations.items():
image_path = os.path.join(image_dir, f"{image_id}.jpg")
if os.path.exists(image_path):
image = Image.open(image_path)
            bboxes = [(a['bbox'], a['category_id']) for a in annots]  # draw_bounding_boxes expects (bbox, category_id) pairs
            annotated_image = draw_bounding_boxes(image, bboxes)  # Pass all annotations for the image
output_image_path = os.path.join(output_dir, f"annotated_{image_id}.jpg")
annotated_image.save(output_image_path)
print(f"Saved annotated image to {output_image_path}")
# Load annotations
annotations = load_annotations(annotation_path)
# Annotate and save images
process_and_save_annotated_images(image_dir, output_dir, annotations)
By closely inspecting all the saved images, we found that 73 training samples (plus one validation sample) of moving objects had offset or otherwise incorrect ground-truth bounding box annotations. We will remove them manually by filename, since keeping them would introduce noise into Faster R-CNN fine-tuning.
A famous principle in deep learning applies here: “Garbage In, Garbage Out” (the GIGO principle).
# List of file names to remove
train_file_remove_list = [
3391, 3392, 3393, 3413, 3414, 3415, 3416, 3417, 6952, 6957, 7002, 7000, 6999, 7023, 7046, 7093, 7092, 7091,
7527, 7558, 7611, 7631, 7987, 7988, 8091, 8097, 8098, 8099, 8113, 8114, 8422, 8438, 8441, 8443, 10246,
10260, 10263, 10264, 10265, 10266, 10269, 10271, 10330, 10348, 10369, 10368, 10379, 11785, 11814, 11828,
11862, 11865, 11869, 11877, 11887, 11891, 11908, 11910, 12001, 12003, 12195, 12312, 12327, 12332, 12417,
13035, 15809, 15808, 15913, 15914, 16140, 16270, 16271]
val_file_remove_list = [10465]
# Directory containing the images
image_dir = './compressed/images/train'
# Iterate over the file list and attempt to remove each file
for file_id in train_file_remove_list:
file_path = os.path.join(image_dir, f"{file_id}.jpg")
if os.path.exists(file_path):
try:
os.remove(file_path)
print(f"Removed: {file_path}")
except OSError as e:
print(f"Error removing {file_path}: {e}")
else:
print(f"File does not exist: {file_path}")
Data Preprocessing: Patch Creation
Some images contain masked (blacked-out) cutout regions, and we will ignore bounding box annotations that fall inside them. We do this by averaging the pixel colors within each bounding box; if the region is predominantly black, the box is excluded from the annotation file.
def is_bbox_ignored(image, bbox, threshold=10):
"""Check if the entire region inside a bounding box is predominantly black.
Args:
image (PIL.Image): The image to check.
bbox (list): The bounding box [x, y, width, height].
threshold (int): The threshold below which a region is considered black.
Returns:
bool: True if the region is predominantly black, False otherwise.
"""
x, y, w, h = bbox
cropped_image = image.crop((x, y, x + w, y + h))
stat = ImageStat.Stat(cropped_image)
avg_color = stat.mean # Average color (R, G, B)
# Check if all color channels are below the threshold
return all(channel < threshold for channel in avg_color)
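A quick, purely illustrative check of this heuristic on a synthetic all-black crop:
# The mean of every channel in an all-black region is 0, well below the threshold of 10.
black = Image.new("RGB", (100, 100), (0, 0, 0))
print(is_bbox_ignored(black, [10, 10, 20, 20]))   # True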
To maintain a consistent landscape orientation (width greater than height) across all images in the data loader, we will convert images whose height is greater than their width into landscape.
If an image’s height is greater than its width, we rotate it by 90 degrees counter-clockwise using PIL; the expand=True parameter enlarges the output canvas so the rotated image fits without cropping.
def check_and_rotate_image(image):
"""Rotate the image if its height is greater than its width and return the image and a flag indicating rotation."""
width, height = image.size
if height > width:
image = image.rotate(90, expand=True) # Rotates 90 counter-clockwise
return image, True
return image, False
The following is a crucial step in which the bounding boxes are adjusted according to the rotation of the image. Since we rotate the image counter-clockwise, the box coordinates are remapped into the rotated frame.
def adjust_bbox_for_rotation(bbox, image_width, image_height):
"""Adjust bounding boxes for 90 degree counter clockwise rotation."""
x, y, w, h = bbox
new_x = y
new_y = image_width - (x + w)
new_w = h
new_h = w
return [new_x, new_y, new_w, new_h]
def rotate_image_and_adjust_bbox(image, annotations, original_dims):
"""Rotate image and adjust bounding boxes accordingly."""
rotated_image = image.rotate(90, expand=True)
new_annotations = []
original_width, original_height = original_dims
for ann in annotations:
x, y, w, h = ann['bbox']
new_x = y
new_y = original_width - (x + w)
new_w = h
new_h = w
new_ann = ann.copy()
new_ann['bbox'] = [new_x, new_y, new_w, new_h]
new_annotations.append(new_ann)
return rotated_image, new_annotations
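As a sanity check of the mapping, here is a toy example (made-up numbers): a 100×200 (W×H) image with a box at [x, y, w, h] = [10, 20, 30, 40].
# new_x = y = 20, new_y = W - (x + w) = 100 - 40 = 60, and width/height swap.
print(adjust_bbox_for_rotation([10, 20, 30, 40], image_width=100, image_height=200))
# -> [20, 60, 40, 30]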
Next, the main aspect of our preprocessing is the patch creation logic. We understood this earlier intuitively; now, let’s implement it in code.
This function will create patches whose dimensions are half the size of the image with a 0.2 overlap ratio. By sliding over the images, we will get four patches. These patches are then saved, and their coordinate positions (left, top, right, bottom) are returned. This is essential to adjust the bounding boxes relative to each patch from the original image.
def create_patches(image, output_dir, image_filename, overlap_ratio=0.2):
"""Create image patches and handle image rotation if necessary."""
image, was_rotated = check_and_rotate_image(image)
width, height = image.size
patch_width = int(width / 2)
patch_height = int(height / 2)
overlap_width = int(patch_width * overlap_ratio)
overlap_height = int(patch_height * overlap_ratio)
patches = []
    for i in range(2):  # Two columns (horizontal positions)
        for j in range(2):  # Two rows (vertical positions)
left = i * (patch_width - overlap_width)
top = j * (patch_height - overlap_height)
right = left + patch_width
bottom = top + patch_height
patch = image.crop((left, top, right, bottom))
patch_filename = f'{os.path.splitext(image_filename)[0]}_{i}_{j}.jpg'
patch_path = os.path.join(output_dir, patch_filename)
patch.save(patch_path)
patches.append((patch_filename, left, top, right, bottom, was_rotated))
return patches
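To make the geometry concrete, the arithmetic below mirrors the loop above for a 3840×2160 frame (illustrative only); adjacent patches share an overlap of 20% of the patch size.
# Patch grid for a 3840x2160 image: half-size patches with a 0.2 overlap ratio.
W, H = 3840, 2160
pw, ph = W // 2, H // 2                      # 1920 x 1080 patches
ow, oh = int(pw * 0.2), int(ph * 0.2)        # 384 x 216 overlap
coords = [(i * (pw - ow), j * (ph - oh), i * (pw - ow) + pw, j * (ph - oh) + ph)
          for i in range(2) for j in range(2)]
print(coords)
# [(0, 0, 1920, 1080), (0, 864, 1920, 1944), (1536, 0, 3456, 1080), (1536, 864, 3456, 1944)]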
These bounding boxes are then adjusted with respect to the patch coordinates. We clamp the values to ensure the new annotations do not extend beyond the patch boundaries, and we make sure no bounding boxes with non-positive dimensions are saved.
def adjust_bbox_for_patch(bbox, patch_coords):
"""Adjust the bounding box to the coordinates of the patch with enhanced error handling."""
left, top, right, bottom = patch_coords
x, y, w, h = bbox
x1, y1, x2, y2 = x, y, x + w, y + h
logging.debug(f"Original bbox: {bbox}")
logging.debug(f"Patch coordinates: {patch_coords}")
# Ensure the bounding box intersects with the patch
if x2 <= left or x1 >= right or y2 <= top or y1 >= bottom:
# logging.warning("Bounding box does not intersect with the patch.")
return None # No intersection
# Clamp the bounding box to the patch boundaries
clamped_x1 = max(x1, left)
clamped_y1 = max(y1, top)
clamped_x2 = min(x2, right)
clamped_y2 = min(y2, bottom)
adjusted_width = clamped_x2 - clamped_x1
adjusted_height = clamped_y2 - clamped_y1
# Check for non-positive dimensions
if adjusted_width <= 0 or adjusted_height <= 0:
logging.warning("Adjusted bounding box has non-positive dimensions.")
return None
# Check if adjusted bounding box exceeds patch size
if adjusted_width > (right - left) or adjusted_height > (bottom - top):
logging.warning("Adjusted bounding box exceeds patch dimensions.")
return None
adjusted_bbox = [clamped_x1 - left, clamped_y1 - top, adjusted_width, adjusted_height]
logging.debug(f"Adjusted bbox: {adjusted_bbox}")
return adjusted_bbox
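An illustrative call with made-up values shows the clamping and the shift into patch-local coordinates (the patch here is the second one from the example grid earlier):
patch = (1536, 0, 3456, 1080)                       # (left, top, right, bottom)
print(adjust_bbox_for_patch([1500, 500, 100, 50], patch))
# -> [0, 500, 64, 50]: x is clamped to the patch edge and the width shrinks to the visible part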
The following function combines all the annotation utilities and returns the instance annotations within each patch.
def get_annotations_for_patches(image, annotations, patches, original_image_id):
"""Adjust annotations for each patch."""
patch_annotations = []
annotation_id = 0
for patch_filename, left, top, right, bottom, was_rotated in patches:
patch_coords = (left, top, right, bottom)
patch_annots = []
for ann in annotations:
if ann['image_id'] != original_image_id:
continue
bbox = ann['bbox']
if was_rotated:
bbox = adjust_bbox_for_rotation(bbox, right - left, bottom - top)
# Check if the bbox should be ignored
if is_bbox_ignored(image, bbox):
continue
adjusted_bbox = adjust_bbox_for_patch(bbox, patch_coords)
if adjusted_bbox:
new_ann = {
"id": annotation_id,
"image_id": patch_filename,
"bbox": adjusted_bbox,
"area": (adjusted_bbox[2] * adjusted_bbox[3]),
"category_id": ann['category_id']
}
patch_annots.append(new_ann)
annotation_id += 1
if patch_annots:
patch_annotations.extend(patch_annots)
return patch_annotations
Now, it’s time to integrate all these preprocessing steps. We will start by reading the annotation files, iterating over the images, rotating portrait images where needed, creating patches, and finally adjusting the bounding boxes and saving everything to an output directory for both the training and validation sets.
def process_images_and_annotations(base_dir):
annotation_files = ['instances_train.json','instances_val.json']
image_dirs = ['train','val']
all_new_annotations = {"annotations": []}
for annotation_file, image_dir in zip(annotation_files, image_dirs):
annotation_path = os.path.join(base_dir, 'annotations', annotation_file)
with open(annotation_path, 'r') as f:
annotations = json.load(f)
for image_info in annotations['images']:
image_filename = image_info['file_name']
image_path = os.path.join(base_dir, 'images', image_dir, image_filename)
if not os.path.exists(image_path):
continue
original_dims = (image_info['width'], image_info['height'])
image = Image.open(image_path)
if image_info['height'] > image_info['width']:
rotated_image, image_annotations = rotate_image_and_adjust_bbox(image.copy(), annotations['annotations'], original_dims)
else:
rotated_image = image.copy()
image_annotations = annotations['annotations']
output_dir = os.path.join(base_dir, 'output_patches', 'images', image_dir)
os.makedirs(output_dir, exist_ok=True)
patches = create_patches(rotated_image, output_dir, image_filename)
new_annotations = get_annotations_for_patches(rotated_image, image_annotations, patches, image_info['id'])
all_new_annotations["annotations"].extend(new_annotations)
annotation_dir = os.path.join(base_dir, 'output_patches', 'annotations')
os.makedirs(annotation_dir, exist_ok=True)
annotations_output_path = os.path.join(base_dir, 'output_patches', 'annotations', f'instances_patches_{image_dir}.json')
with open(annotations_output_path, 'w') as f:
json.dump(all_new_annotations, f, indent=4)
base_dir = './compressed/'
process_images_and_annotations(base_dir)
Now, within instances_patches_train.json and instances_patches_val.json, the annotations for the four patches from a single image look like this:
"annotations": [
{"id": 0,"image_id": "3390_0_1.jpg","bbox": 1863, 542,57, 36 ], "area": 2052, "category_id": 2},
{ "id": 0,"image_id": "3399_0_2.jpg","bbox": [1731,288, 70,35 ], "area": 2450, "category_id": 2 },
Dataclass Preparation
The following defines a dataclass named DatasetConfig for storing the configuration parameters of a dataset.
@dataclass
class DatasetConfig:
root: str
annotations_file: str
train_img_size: tuple
subset: str = 'train' # Default to 'train'
transforms: any = None
Here the CustomAerialDataset class is designed to handle aerial image datasets and perform tasks such as loading images, handling annotations, and preparing data for fine-tuning the Faster R-CNN model.
Below is a brief overview of its key functionalities:
- The class takes a DatasetConfig object that contains the root directory, image size, subset (train/val/test), and any transformations.
- It initializes paths for images and annotations and calls methods to load them.
class CustomAerialDataset(Dataset):
def __init__(self, config: DatasetConfig):
self.root = config.root
self.transforms = config.transforms
self.train_img_size = config.train_img_size
self.subset = config.subset
self.annotations_file = os.path.join(self.root, 'annotations', f'instances_patches_{self.subset}.json')
self.imgs = []
self.img_annotations = {}
self._load_images()
self._load_annotations()
def __len__(self):
return len(self.imgs)
- The _load_images method scans the specified subset directory and appends valid image file paths to the imgs list.
- Each image is given an empty annotation initially.
class CustomAerialDataset(Dataset):
...
def _load_images(self):
# Load all images from the subset directory
images_path = os.path.join(self.root, 'images', self.subset)
for image_filename in os.listdir(images_path):
image_path = os.path.join(images_path, image_filename)
if os.path.isfile(image_path) and image_path.endswith(('.png', '.jpg', '.jpeg')):
self.imgs.append(image_path)
image_id = os.path.basename(image_path)
# Initialize empty annotations for each image
if image_id not in self.img_annotations:
self.img_annotations[image_id] = {'boxes': [], 'labels': []}
- Then the _load_annotations method reads the JSON file containing the bounding box annotations.
- It matches each annotation to its corresponding image and stores the bounding box coordinates and category IDs.
class CustomAerialDataset(Dataset):
...
def _load_annotations(self):
with open(self.annotations_file, 'r') as f:
data = json.load(f)
for annotation in data['annotations']:
image_id = annotation["image_id"]
bbox = annotation["bbox"]
category_id = annotation["category_id"]
image_path = os.path.join(self.root, 'images', self.subset, image_id)
if image_id in self.img_annotations:
self.img_annotations[image_id]['boxes'].append(bbox)
self.img_annotations[image_id]['labels'].append(category_id)
- The __getitem__ method retrieves an image and its annotations by index.
- It handles a missing image file by returning a zero tensor, and an image without any instances by returning a dummy target. This is needed because, when iterating over a batch of images and targets, the loss calculation expects box tensors of shape (N, 4); passing an image with an empty target would raise an error such as "expected a tensor of shape (N, 4) but received torch.Size([0])".
- The image is then resized to the specified dimensions, and the bounding boxes are scaled accordingly, which reduces training time and GPU hours.
- The target is a dictionary holding the box tensor and the corresponding label tensor, of data types float32 and int64 respectively.
- Any transforms (from torchvision.transforms) are applied to both the image and the target, so that the augmentation stays valid and improves model performance.
class CustomAerialDataset(Dataset):
...
def __getitem__(self, idx):
img_path = self.imgs[idx]
if not os.path.exists(img_path):
# Return a default image (like a zero tensor) and a dummy target
            default_img = torch.zeros(3, self.train_img_size[1], self.train_img_size[0])  # (C, H, W), matching real samples
default_target = {'boxes': torch.tensor([[0, 0, 0, 0]], dtype=torch.float32),
'labels': torch.tensor([0], dtype=torch.int64)} # Background
return default_img, default_target
img = Image.open(img_path).convert("RGB")
orig_width, orig_height = img.size
scale_x = self.train_img_size[0] / orig_width
scale_y = self.train_img_size[1] / orig_height
img = img.resize(self.train_img_size, Image.BILINEAR)
img = F.to_tensor(img)
annotations = self.img_annotations[os.path.basename(img_path)]
if annotations['boxes']:
scaled_boxes = [[max(0, min(bbox[0] * scale_x, self.train_img_size[0])),
max(0, min(bbox[1] * scale_y, self.train_img_size[1])),
max(0, min((bbox[0] + bbox[2]) * scale_x, self.train_img_size[0])),
max(0, min((bbox[1] + bbox[3]) * scale_y, self.train_img_size[1]))]
for bbox in annotations['boxes']]
labels = annotations['labels']
else:
scaled_boxes = [[0, 0, 0, 0]]
labels = [0]
boxes = torch.tensor(scaled_boxes, dtype=torch.float32)
labels = torch.tensor(labels, dtype=torch.int64)
target = {'boxes': boxes, 'labels': labels}
if self.transforms:
img, target = self.transforms(img, target)
return img, target
If transformations are specified, they are applied to both the image and its annotations before returning.
def get_transform(train):
transforms = []
# if train:
# transforms.append(Tv2.RandomHorizontalFlip(0.5))
transforms.append(Tv2.ToDtype(torch.float, scale=True))
transforms.append(Tv2.ToPureTensor())
return Tv2.Compose(transforms)
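One caveat: our dataset returns the boxes as plain tensors, so if you enable geometric augmentations such as RandomHorizontalFlip, the v2 transforms will not know to update them. A minimal sketch of the fix, assuming torchvision 0.16 or newer, is to wrap the boxes as tv_tensors.BoundingBoxes so they are transformed together with the image:
# Boxes wrapped as BoundingBoxes are flipped along with the image by the v2 transforms.
boxes_demo = tv_tensors.BoundingBoxes(
    torch.tensor([[10., 10., 50., 50.]]), format="XYXY", canvas_size=(216, 384))
img_demo = torch.zeros(3, 216, 384)
img_out, boxes_out = Tv2.RandomHorizontalFlip(p=1.0)(img_demo, boxes_demo)
print(boxes_out)   # x-coordinates mirrored around the 384-pixel width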
The CustomAerialDataset class we defined provides a robust framework for preparing the data loader, ensuring that images and annotations are correctly loaded and formatted for model training.
Then, the train and validation configurations are initialized. We will resize the train and validation images to a size of (384, 216), i.e., (W, H).
root = "SeaDroneSee/output_patches"
# Configuration for training and validation datasets
train_config = DatasetConfig(root,
annotations_file='', # This is now set based on subset in the __init__
train_img_size=(384, 216),
subset='train',
transforms=get_transform(train=True))
val_config = DatasetConfig(root,
annotations_file='',
train_img_size=(384, 216),
subset='val',
transforms=get_transform(train=False))
train_dataset = CustomAerialDataset(train_config)
val_dataset = CustomAerialDataset(val_config)
print(f"Length of Train Dataset: {len(train_dataset)}")
print(f"Length of Validation Dataset: {len(val_dataset)}")
After patch creation, there are 35270 train images and 6188 validation images, which form the final set of inputs for fine-tuning the Faster R-CNN model.
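As a quick sanity check before training, we can pull one sample and confirm the resize and the target format (the index is arbitrary):
# One sample: the image is resized to (W, H) = (384, 216), i.e. a (3, 216, 384) tensor,
# with boxes in XYXY pixel coordinates of the resized patch.
img, target = train_dataset[0]
print(img.shape)                                        # torch.Size([3, 216, 384])
print(target["boxes"].dtype, target["labels"].dtype)    # torch.float32 torch.int64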
Now, let’s define a custom collate function to handle images without annotations. We still pass these empty images, since they help the model learn the background and reduce false positives (cases where background is misidentified as an object instance).
def collate_fn(batch):
imgs, targets = zip(*batch)
imgs = torch.stack(imgs, dim=0)
real_targets = []
for target in targets:
# Filter out dummy boxes
mask = target['boxes'].sum(dim=1) > 0
real_targets.append({'boxes': target['boxes'][mask], 'labels': target['labels'][mask]})
return imgs, real_targets
train_data_loader = DataLoader(train_dataset, batch_size=10, shuffle=True, collate_fn=collate_fn, num_workers=12)
val_data_loader = DataLoader(val_dataset, batch_size=10, shuffle=False, collate_fn=collate_fn, num_workers=12)
Let’s visualize samples from the train_data_loader to check whether our custom dataset class is properly defined.
def show_image_with_boxes(img, targets, ax, category_colors):
"""Plot an image with its bounding boxes on an axis object."""
# Convert tensor image to PIL for display if needed
if isinstance(img, torch.Tensor):
img = to_pil_image(img)
print(img.size)
ax.imshow(img)
# Check and plot each bounding box with class-specific color
if 'boxes' in targets and 'labels' in targets:
boxes = targets['boxes'].cpu().numpy()
labels = targets['labels'].cpu().numpy()
for bbox, label in zip(boxes, labels):
w = bbox[2]-bbox[0]
h = bbox[3]-bbox[1]
color = category_colors.get(label, 'gray') # Use gray for unmapped classes
rect = patches.Rectangle((bbox[0], bbox[1]), w, h, linewidth=2, edgecolor=color, facecolor='none')
ax.add_patch(rect)
ax.text(bbox[0], bbox[1], str(label), color='white', fontsize=12, bbox=dict(facecolor=color, alpha=0.5))
def visualize_samples(data_loader, category_colors, num_samples=20):
"""Visualize a specified number of samples from a DataLoader in a single column."""
num_rows = num_samples # All samples in a single column
num_cols = 1
fig, axs = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(15, 25 * num_rows // 4)) # Adjust height based on rows
samples_visualized = 0
for images, targets in data_loader:
for i, ax in enumerate(axs.flat):
if samples_visualized >= num_samples:
break # Stop after displaying the desired number of samples
show_image_with_boxes(images[i], targets[i], ax, category_colors)
ax.axis('off') # Turn off axis for cleaner look
samples_visualized += 1
# If enough samples visualized, break the loop to avoid extra iterations
if samples_visualized >= num_samples:
break
plt.tight_layout()
plt.show()
visualize_samples(train_data_loader, category_colors, num_samples=4)
We can see that everything looks good, and the corresponding bounding boxes are perfectly scaled. Now it’s time to move on from data preparation to model preparation, another crucial aspect of fine-tuning Faster R-CNN (or any deep learning training).
Training Configuration
We will fine-tune for 50 epochs, and best_map is initialized to -inf to guarantee that the first computed evaluation metric will always exceed this value, ensuring the first evaluated checkpoint is captured as the best baseline.
num_epochs = 50
best_map = -float('inf')
# print(best_map)
DEVICE = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
Torchvision offers four Faster R-CNN detection models pretrained on the COCO dataset that we can fine-tune. We will exploit these pretrained weights to achieve very good detection accuracy within fewer epochs.
However, there are other Object Detection architectures as well, such as SSD, RetinaNet, etc., if you would like to give them a try.
To fit in our Google Colab T4 GPU memory, we will choose a lightweight Mobilenet V3 Large backbone with around 19.4M parameters, 4.49 GFLOPS, and 32.8 Box mAP on the MSCOCO dataset.
As our dataset contains six classes, we will modify the pretrained classification head’s final layer to reflect the number of classes in SeaDroneSee. We will additionally use an SGD optimizer with a momentum of 0.9, an initial learning rate of 5e-4, and StepLR to adjust the learning rate every total_epochs/2 (i.e., at the 25th epoch for a total of 50 epochs).
def get_model(num_classes):
model = detection.fasterrcnn_mobilenet_v3_large_fpn(weights="DEFAULT")
#Get the number of input features for the classifier
in_features = model.roi_heads.box_predictor.cls_score.in_features
#Replace pretrained head with new one
model.roi_heads.box_predictor = FastRCNNPredictor(in_features,num_classes)
return model
num_classes = 6
model = get_model(num_classes)
model.to(DEVICE)
print(model)
params = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.SGD(params,lr=0.0005,momentum=0.9,weight_decay=0.0005)
# and a learning rate scheduler
lr_scheduler = torch.optim.lr_scheduler.StepLR(
optimizer,
step_size=num_epochs//2,
gamma=0.1
)
scaler = torch.cuda.amp.GradScaler()
To save compute and training time, we will use CUDA Automatic Mixed Precision (AMP) with torch.cuda.amp.GradScaler()
. This enables mixed precision by using lower precision (16-bit) for some computations while maintaining single precision (32-bit) for critical parts to ensure accuracy.
We will monitor all our training and validation metrics, as well as validation predictions, using TensorBoard via torch.utils.tensorboard with add_scalar and add_figure. For each batch in the data loader, images and targets are moved to the specified device (CUDA or CPU). The model is set to train mode, and predictions and losses are computed. The losses are then backpropagated, and the optimizer (and, if configured, the learning rate scheduler) is stepped. For multi-GPU training, losses are averaged across all GPUs. Our training pipeline also uses a metric logger from the torchvision utilities to display metrics at the end of each epoch.
# Initialize TensorBoard writer
writer = SummaryWriter(log_dir='runs/aerial_detection')
def train_one_epoch(model, data_loader, device, optimizer, print_freq, epoch, scaler=None):
model.train()
metric_logger = utils.MetricLogger(delimiter=" ")
metric_logger.add_meter("lr", utils.SmoothedValue(window_size=1, fmt="{value:.6f}"))
header = f"Training Epoch {epoch}:"
model.to(device)
with tqdm(data_loader, desc=header) as tq:
lr_scheduler = None
for i, (images, targets) in enumerate(tq):
images = [img.to(device) for img in images]
targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
with torch.cuda.amp.autocast(enabled=scaler is not None):
loss_dict = model(images, targets)
losses = sum(loss for loss in loss_dict.values())
loss_value = losses.item()
optimizer.zero_grad()
if scaler is not None:
scaler.scale(losses).backward()
scaler.step(optimizer)
scaler.update()
else:
losses.backward()
optimizer.step()
if lr_scheduler is not None:
lr_scheduler.step()
metric_logger.update(loss=losses, **loss_dict)
metric_logger.update(lr=optimizer.param_groups[0]["lr"])
# Update tqdm postfix to display loss on the progress bar
tq.set_postfix(loss=losses.item(), lr=optimizer.param_groups[0]["lr"])
# Log losses to TensorBoard
writer.add_scalar('Loss/train', losses.item(), epoch * len(data_loader) + i)
for k, v in loss_dict.items():
writer.add_scalar(f'Loss/train_{k}', v.item(), epoch * len(data_loader) + i)
print(f"Average Loss: {metric_logger.meters['loss'].global_avg:.4f}")
writer.add_scalar('Loss/avg_train', metric_logger.meters['loss'].global_avg, epoch)
Subsequently, we will define the evaluate function, setting the model to evaluation mode. Under torch.no_grad, no gradient calculation or weight update occurs. An object detection model is evaluated based on its mAP50 or mAP50-95 (Mean Average Precision); for this, the torchmetrics library’s MeanAveragePrecision class is useful. We pass the predictions and ground truth from the validation data loader to it.
For simplicity, Average Precision (AP) is the area under the precision-recall curve. Mean Average Precision (mAP) is the average of AP across all detected classes.
mAP = 1/n * sum(AP), where n is the number of classes.
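As a minimal, self-contained illustration of the torchmetrics API used below, here is the metric computed on a single dummy prediction/target pair (the boxes are made up, not from our dataset):
# One dummy prediction vs. one ground-truth box; their IoU exceeds 0.5, so mAP@0.5 comes out as 1.0.
metric_demo = MeanAveragePrecision(iou_type="bbox")
metric_demo.update(
    preds=[{"boxes": torch.tensor([[10., 10., 50., 50.]]),
            "scores": torch.tensor([0.9]),
            "labels": torch.tensor([1])}],
    target=[{"boxes": torch.tensor([[12., 12., 48., 48.]]),
             "labels": torch.tensor([1])}],
)
print(metric_demo.compute()["map_50"])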
def evaluate(model, data_loader, device, epoch, save_dir):
model.eval()
metric = MeanAveragePrecision(iou_type="bbox")
total_iou = 0
total_detections = 0
header = "Validation:"
total_steps = len(data_loader)
samples = []
with torch.no_grad(), tqdm(total=total_steps, desc=header) as progress_bar:
for i, (images, targets) in enumerate(data_loader):
images = [img.to(device) for img in images]
targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
outputs = model(images)
# Convert outputs for torchmetrics
preds = [
{"boxes": out["boxes"], "scores": out["scores"], "labels": out["labels"]}
for out in outputs
]
targs = [
{"boxes": tgt["boxes"], "labels": tgt["labels"]}
for tgt in targets
]
# Update metric for mAP calculation
metric.update(preds, targs)
# Collect samples for visualization (limit to 5)
if len(samples) < 5:
for img, out, tgt in zip(images, outputs, targets):
samples.append((img, out, tgt))
if len(samples) >= 5:
break
progress_bar.update(1)
# Visualize predictions
visualize_predictions([s[0] for s in samples], [s[1] for s in samples], [s[2] for s in samples], save_dir, epoch, writer)
results = metric.compute()
print("mAP results:")
print(results)
# Log mAP to TensorBoard
for k, v in results.items():
if v.numel() == 1: # Single element tensor
writer.add_scalar(f'mAP/{k}', v.item(), epoch)
else: # Multi-element tensor, log each element separately
for idx, value in enumerate(v):
writer.add_scalar(f'mAP/{k}_{idx}', value.item(), epoch)
return results
The best model is saved only if the current mAP is better than the previously saved best mAP.
Ready, set, train! Let’s monitor those metrics in TensorBoard.
save_dir = "./prediction_val"
os.makedirs(save_dir, exist_ok=True)
for epoch in range(num_epochs):
# Memory Cleanup.
torch.cuda.empty_cache()
gc.collect()
# train for one epoch, printing every 10 iterations
train_one_epoch(model, train_data_loader, DEVICE, optimizer, print_freq=50, epoch=epoch, scaler=scaler)
# update the learning rate
# lr_scheduler.step()
# evaluate on the validation dataset
results = evaluate(model, val_data_loader, DEVICE, epoch, save_dir='predictions')
# Save the model checkpoint if it's the best mAP
current_map = results['map'].item()
if current_map > best_map:
best_map = current_map
torch.save({
'epoch': epoch,
'model_state_dict': model.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'best_map': best_map,
'scaler': scaler.state_dict() if scaler is not None else None
}, f'best_model_checkpoint_epoch_{epoch}.pth')
print("That's it!")
writer.close()
Now let’s have a closer look at the logs. The metric results are pretty decent for a lightweight model like Mobilenet that fit within our Colab T4 GPU. From fine-tuning Faster R-CNN with a Mobilenet V3 Large backbone, we achieved the highest mAP50-95 (mAP) of 38.32 and mAP50 (mAP@IoU=0.5) of 71.24 within four epochs.
We have also fine-tuned Faster R-CNN with a Resnet50 v2 backbone variant (43.7M parameters, 280.37 GFLOPS) and achieved a highest mAP@IoU=0.5-0.95 of 49.88 and mAP@IoU=0.5 of 84.12 with the same training configuration and data loaders, on an RTX 3080Ti / i7-13700K machine with 12 cores.
The training metrics of both model variants are indeed very impressive.
Predictions
Let’s load the best model checkpoint and set box_nms_thresh to 0.3 as a post-processing step to suppress overlapping bounding boxes from instances of the same class. Since we will load our own fine-tuned Faster R-CNN weights, the pretrained=False argument is passed. The model’s state dictionary is loaded with strict key matching across all layers, and the model is set to eval mode for inference.
checkpoint_path = "SeaDroneSee/model_ckpt/Resnet_best_model_checkpoint_epoch_5.pth"
# Function to load the trained model
def load_model(checkpoint_path, device):
model = detection.fasterrcnn_resnet50_fpn_v2(pretrained=False, num_classes=len(classes_to_idx),box_nms_thresh=0.3)
checkpoint = torch.load(checkpoint_path, map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
model.to(device)
model.eval()
return model
# Load the trained model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = load_model(checkpoint_path, device)
Let’s define some utilities to visualize the predictions and save the results to the desired output directory. The show_image_with_boxes function takes the image tensor and the predicted bounding boxes, converts the tensor to a PIL image, and plots the boxes whose confidence score is at least 0.5.
def create_directory(path):
if not os.path.exists(path):
os.makedirs(path)
def show_image_with_boxes(img, targets, ax, category_colors):
"""Plot an image with its bounding boxes on an axis object."""
# Convert tensor image to PIL for display if needed
if isinstance(img, torch.Tensor):
img = to_pil_image(img)
ax.imshow(img)
# Check and plot each bounding box with class-specific color
if 'boxes' in targets and 'labels' in targets and 'scores' in targets:
boxes = targets['boxes'].cpu().numpy()
labels = targets['labels'].cpu().numpy()
scores = targets['scores'].cpu().numpy()
for bbox, label, score in zip(boxes, labels, scores):
if score >= 0.5: # Only show boxes with confidence score >= 0.5
w = bbox[2] - bbox[0]
h = bbox[3] - bbox[1]
color = category_colors.get(label, 'gray') # Use gray for unmapped classes
rect = patches.Rectangle((bbox[0], bbox[1]), w, h, linewidth=2, edgecolor=color, facecolor='none')
ax.add_patch(rect)
ax.text(bbox[0], bbox[1], f'{classes_to_idx[label]}', color='white', fontsize=8, bbox=dict(facecolor=color, alpha=0.5))
def visualize_samples(images, outputs, category_colors, num_samples=10):
"""Visualize a specified number of samples from a DataLoader in a single column."""
num_rows = num_samples # All samples in a single column
num_cols = 1
fig, axs = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(15, 25 * num_rows // 4)) # Adjust height based on rows
for idx, (img, output) in enumerate(zip(images, outputs)):
if idx >= num_samples:
break # Stop after displaying the desired number of samples
show_image_with_boxes(img.cpu(), output, axs[idx], category_colors)
axs[idx].axis('off') # Turn off axis for cleaner look
plt.tight_layout()
plt.show()
Some samples are chosen from a batch of the validation data loader, and the results are plotted.
We can see that the results of fine-tuning Faster R-CNN on patches are very good; it captures even very small instances.
Combining SAHI with fine-tuned Faster R-CNN
Traditional object detection models often struggle with small objects due to their limited size and the little contextual information they occupy in an image. That’s where SAHI comes into play. SAHI addresses this with slicing: during inference (and optionally during fine-tuning), images are sliced into smaller overlapping patches where small objects become relatively larger and easier to detect, and the per-slice predictions are then merged back onto the full image.
To learn more about Slicing Aided Hyper Inference (SAHI), bookmark this for later.
Let’s install and import SAHI dependencies.
!pip install -qq -U sahi
from sahi import AutoDetectionModel
from sahi.predict import get_sliced_prediction, predict, get_prediction
from sahi.utils.file import download_from_url
from sahi.prediction import visualize_object_predictions
from sahi.utils.cv import read_image
from IPython.display import Image
We will set the model type to torchvision with SAHI’s AutoDetectionModel module. Let’s initially set a confidence threshold of 0.7 and the image size to the input image’s longest dimension, as our images are rectangular.
detection_model = AutoDetectionModel.from_pretrained(
model_type='torchvision',
model=model, #Faster RCNN Model
confidence_threshold=0.7,
image_size=5436, #Image's longest dimension
device="cpu", # or "cuda:0"
load_at_init=True,
)
Using slice_height and slice_width, we can control the dimensions of the sliding window. Since our model was trained on patches half the size of the image dimensions, we will choose the slice width and height accordingly.
img_path = 'test/7882.jpg'
img_filename_temp = img_path.split('/')[1]
img_filename = img_filename_temp.split('.')[0]
# print(img_filename)
img_pil = PIL.Image.open(img_path)
W,H = img_pil.size
# print(W)
s_h,s_w = H/2,W/2
s_h ,s_w = int(s_h),int(s_w)
get_sliced_prediction returns a list of detected object instances with their bbox, score, and category id. Here the category id is correct, but the category name follows the COCO classes (id 1 is reported as ‘person’). We will fix this by defining custom utilities that map the category id to our class names and draw the bounding box that matches the category id.
result = get_sliced_prediction(
img_path,
detection_model,
slice_height=s_h,
slice_width=s_w,
overlap_height_ratio=0.2,
overlap_width_ratio=0.2,
)
result.object_prediction_list
[ObjectPrediction<
bbox: BoundingBox: <(1754.8331298828125, 1062.62841796875, 1823.0999755859375, 1103.5548362731934), w: 68.266845703125, h: 40.92641830444336>,
mask: None,
score: PredictionScore: <value: 0.9949936270713806>,
category: Category: <id: 1, name: person>>]
This custom draw_bounding_boxes() utility takes the image and the object_prediction_list from SAHI’s get_sliced_prediction and draws clean, correctly colored predictions.
def draw_bounding_boxes(image, object_prediction_list):
draw = ImageDraw.Draw(image)
font_size = int(min(image.size) * 0.008) # Adjust font size based on image size
font_path = "/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf"
font = ImageFont.truetype(font_path, font_size) if os.path.exists(font_path) else ImageFont.load_default()
for prediction in object_prediction_list:
bbox = prediction.bbox.to_xywh()
category_id = prediction.category.id
x, y, w, h = bbox
x1, y1, x2, y2 = x, y, x + w, y + h
color = category_colors.get(category_id, 'white') # Default to white if category_id is unknown
draw.rectangle([x1, y1, x2, y2], outline=color, width=6)
# draw.text((x1, y1 - font_size), str(classes_to_idx[category_id]), fill=color, font=font)
return image
# Draw bounding boxes
image_with_bboxes = draw_bounding_boxes(img_pil, result.object_prediction_list)
# Define the output path
output_directory = 'sahi_ouput_data'
output_path = os.path.join(output_directory, f'result_{img_filename}.png')
# Create the directory if it doesn't exist
os.makedirs(output_directory, exist_ok=True)
# Save the resulting image
image_with_bboxes.save(output_path)
# Display the image (optional, if running in an environment that supports it)
image_with_bboxes.show()
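Alternatively, SAHI can export its own visualization of the sliced predictions via export_visuals; keep in mind that the drawn labels then follow the COCO class names, which is exactly why we defined the custom drawing utility above.
# Let SAHI render and save the annotated image itself (labels use COCO names).
result.export_visuals(export_dir=output_directory)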
Comparison of Faster RCNN Detection with SAHI v/s Without SAHI v/s with Patches as Input
Comparison 1: Fine-tuned Faster R-CNN with Mobilenet v3 Large Backbone
Original Image Forward Pass
Let’s directly pass the original image, resized to the training image size of (384, 216), without SAHI and without patch creation, to our fine-tuned Faster R-CNN with the Mobilenet v3 Large backbone.
Here, we can see the model completely missed many instances and performed very poorly.
Faster R-CNN Mobilenet v3 Large Inference with passing Patches as input
Now, after performing the same preprocessing steps used during fine-tuning (patch creation and resizing to the training image size), the model manages to capture almost all instances, with only a few left undetected.
Faster R-CNN Mobilenet v3 Large Inference with SAHI
Adding SAHI on top results in very clean detections with tightly aligned bounding boxes. Notice that the swimmer instances detected with and without SAHI differ, which reflects the difference between the two techniques.
Comparison 2: Fine-tuned Faster R-CNN with Resnet50 v2 Backbone
Original Image Forward Pass
As in the Comparison 1 section, let’s directly pass the original image, resized to the training image size of (384, 216), without SAHI and without patch creation, to our fine-tuned Faster R-CNN with the Resnet50 v2 backbone and see the results.
Faster R-CNN Resnet50 v2 Inference with passing Patches as input
Faster R-CNN Resnet50 v2 Inference with SAHI
From fine-tuning Faster R-CNN with Resnet50 v2, we can see a significant improvement in the prediction results compared to our fine-tuned Faster R-CNN with Mobilenet. The additional parameters and higher mAP of the Resnet50 v2 variant definitely make it a strong contender even in 2024.
The results are impressive, right? SCROLL UP to learn more about practical code implementation.
Key Takeaways
- We split the training images into patches, letting the model focus on finer details. This trick helped it see tiny objects much better. Even without SAHI’s help, our fine-tuned Faster R-CNN is practically eagle-eyed, matching SAHI’s results almost perfectly.
- Additionally, integrating our fine-tuned Faster R-CNN model with SAHI significantly boosted detection accuracy. SAHI’s technique of slicing images into smaller sections was a game-changer, effectively reducing false positives and producing near-perfect bounding boxes. This combination showcases the powerful synergy of careful data preparation and a robust post-processing technique.
- Our experiments highlight the critical importance of data preparation and preprocessing and demonstrate the value of thoughtful data augmentation. It’s all about the groundwork!
Conclusion
The purpose of our experiment is to highlight the importance of meticulous data preparation in a challenging dataset like SeaDronesSee. Despite the high latency and GFLOPS of Faster R-CNN, it proved to be a worthwhile candidate even in 2024. This experiment can be further improved by exploring more lightweight models that offer less latency, real-time processing, and greater accuracy. The results of our study can be used by drone and robotics developers to refine and enhance their detection systems for critical missions.
The impact of our experiment? Potentially saving countless lives with better detection and response capabilities. Now that’s a true superhero rescue mission.
Indeed it is. Let’s make it happen, together!
Having a career in robotics is a “Pursuit of Happiness”. For a foundational understanding, explore our Comprehensive Robotics Beginners Guide.
References
1. Kaggle Dataset: https://www.kaggle.com/datasets/ubiratanfilho/sds-dataset
2. Torchvision Object Detection Tutorial: https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html
3. Torchmetrics: https://lightning.ai/docs/torchmetrics/stable/
4. Faster R-CNN: https://arxiv.org/abs/1506.01497
5. SeaDroneSee: https://arxiv.org/abs/2105.01922
6. Torchvision: https://github.com/pytorch/vision/tree/main/torchvision
7. DeciAI: https://deci.ai/blog/small-object-detection-challenges-and-solutions/