Pre-Label Videos with Mask R-CNN¶
This notebook shows how to use a task agent to automatically pre-label videos with predictions. It leverages the off-the-shelf Mask R-CNN model to generate initial annotations, streamlining the labeling process. Alternatively, if you want to train and deploy a containerised model for pre-labeling videos, please check out DETR-Video-labelling.
Requirements¶
This notebook guides you through the Workflow template and Ontology required.
For this notebook, you need:
- A Dataset containing videos in Encord.
- Access to Mask R-CNN.
Installation¶
Ensure that you have the encord-agents
library installed:
!python -m pip install encord-agents
# If you don't have torch installed (Colab does by default)
# Please install it by following the guide here: https://pytorch.org/get-started/locally/
Encord Authentication¶
Encord uses ssh-keys for authentication. The following is a code cell for setting the ENCORD_SSH_KEY
environment variable. It contains the raw content of your private ssh key file.
If you have not set up an ssh key, see our documentation.
💡 In Colab, you can set the key once in the secrets in the left sidebar and load it in new notebooks. If you are not running the code in a Colab notebook, you must set the environment variable directly.
import os
os.environ["ENCORD_SSH_KEY"] = "private_key_file_content"
# or you can set a path to a file
# os.environ["ENCORD_SSH_KEY_FILE"] = "/path/to/your/private/key"
[Alternative] Temporary Key¶
There's also the option of generating a temporary (fresh) ssh key pair via the code cell below. Please follow the instructions printed when executing the code.
# ⚠️ Safe to skip if you have authenticated already
import os
from encord_agents.utils.colab import generate_public_private_key_pair_with_instructions
private_key_path, public_key_path = generate_public_private_key_pair_with_instructions()
os.environ["ENCORD_SSH_KEY_FILE"] = private_key_path.as_posix()
Load Mask R-CNN Model¶
Next, we need to load the Mask R-CNN model and its image transform to enable predictions. The following code initializes the model and its associated image transformation.
import torch
import torchvision
import torchvision.models.detection
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor
from torchvision.transforms import v2 as T
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
def get_transform():
return T.Compose([T.ToImage(), T.ToDtype(torch.float, scale=True), T.ToPureTensor()])
def get_model_instance_segmentation():
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model = model.eval().to(device)
transform = get_transform()
return model, transform
model, transform = get_model_instance_segmentation()
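Before moving on, you can optionally sanity-check the model on a random image tensor to see the structure of its output, which the conversion utilities below rely on. The dummy tensor here is purely illustrative and not part of the labeling flow.
# Optional smoke test: Mask R-CNN takes a list of 3xHxW float tensors in [0, 1]
# and returns one dict per image with "boxes", "labels", "scores", and "masks".
dummy_image = torch.rand(3, 240, 320).to(device)
with torch.inference_mode():
    dummy_preds = model([dummy_image])
print(dummy_preds[0].keys())  # dict_keys(['boxes', 'labels', 'scores', 'masks'])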
Now, let's define some utility functions to:
- Convert the raw tensors from Mask R-CNN to Encord bitmask coordinates
- Apply non-maximum suppression (to avoid having many overlapping predictions)
- Convert the raw predictions to Encord ObjectInstances.
from encord.objects import Object as OntologyObject
from encord.objects import ObjectInstance
from encord.objects.bitmask import BitmaskCoordinates
from encord.ontology import OntologyStructure
from torchvision.ops import nms
def to_mask_coordinates(torch_mask: torch.Tensor, threshold: float = 0.5) -> BitmaskCoordinates:
"""
Convert torch mask to bitmask coordinates.
args:
- threshold: threshold at which to cut the mask floating point values. Higher values will yield smaller masks.
returns:
Encord bitmask
"""
binary_mask = (torch_mask > threshold).detach().cpu().numpy().squeeze().astype(bool)
return BitmaskCoordinates(binary_mask)
def apply_nms(pred, nms_iou_threshold: float):
"""
Apply non-maximum suppression to the mask-rcnn predictions.
The method retains the bounding boxes to make it easy to modify the code
to also work for bounding boxes.
"""
indices = nms(pred["boxes"], pred["scores"], nms_iou_threshold)
return {
"masks": pred["masks"][indices],
"boxes": pred["boxes"][indices],
"labels": pred["labels"][indices],
"scores": pred["scores"][indices],
}
def convert_predictions_to_encord(
predictions: dict[str, torch.Tensor],
ontology_map: dict[int, OntologyObject],
frame_idx: int = 0,
conf_threshold: float = 0.50,
nms_iou_threshold: float = 0.3,
) -> list[ObjectInstance]:
"""
Convert mask-rcnn prediction to Encord object instances.
Intended use in pseudo code:
```
preds = model(img)
instances = convert_predictions_to_encord(preds)
[label_row.add_object_instance(ins) for ins in instances]
```
Args:
- predictions: The output of mask-rcnn for one frame.
- ontology_map: The map between predicted labels and the Encord ontology objects.
- frame_idx: The frame number to associate the prediction with.
This is particularly important for videos.
- conf_threshold: The threshold at which we want to retain predictions.
- nms_iou_threshold: The IoU threshold above which overlapping predictions are suppressed during NMS.
Returns:
- The resulting object instances.
"""
# Apply non-maximum suppression
if nms_iou_threshold > 0:
predictions = apply_nms(predictions, nms_iou_threshold)
out: list[ObjectInstance] = []
for mask, label, conf in zip(predictions["masks"], predictions["labels"], predictions["scores"]):
if label.item() not in ontology_map or conf < conf_threshold:
continue
if ont_obj := ontology_map.get(label.item()):
ins = ont_obj.create_instance()
ins.set_for_frames(
frames=frame_idx,
coordinates=to_mask_coordinates(mask),
confidence=conf.item(),
)
out.append(ins)
return out
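To see what these utilities do before wiring them into an agent, here is a small standalone check on synthetic tensors. The boxes, scores, and masks below are made up for illustration only.
# Two heavily overlapping boxes plus one separate box; NMS keeps the best of the overlap.
fake_pred = {
    "masks": torch.rand(3, 1, 64, 64),
    "boxes": torch.tensor([[0.0, 0.0, 10.0, 10.0], [1.0, 1.0, 11.0, 11.0], [30.0, 30.0, 40.0, 40.0]]),
    "labels": torch.tensor([1, 1, 2]),
    "scores": torch.tensor([0.9, 0.8, 0.7]),
}
kept = apply_nms(fake_pred, nms_iou_threshold=0.3)
print(len(kept["boxes"]))  # 2: the lower-scoring overlapping box was suppressed
coords = to_mask_coordinates(kept["masks"][0])  # BitmaskCoordinates, ready for ObjectInstance.set_for_frames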
Next, let us put this to use in an agent. To do so, we need (i) a Project Ontology whose classes overlap with the Mask R-CNN (COCO) classes and (ii) a Project Workflow that allows hooking in a pre-labeling agent.
Set up your Ontology¶
Create an Ontology with BITMASK objects named after some of the following classes (those from COCO).
coco_class_names = [
'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign',
'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella',
'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'dining table',
'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book',
'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]
The agent code defined below matches these class names against the corresponding COCO indices and uses the pre-trained model to fill in labels according to this Ontology.
📖 Here is the documentation for creating Ontologies.
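If you prefer to create the Ontology programmatically rather than in the UI, a minimal sketch using the Encord SDK could look like the following. The title and the subset of class names are placeholders; pick whichever COCO classes you care about.
import os

from encord import EncordUserClient
from encord.objects.common import Shape
from encord.objects.ontology_structure import OntologyStructure

user_client = EncordUserClient.create_with_ssh_private_key(os.environ["ENCORD_SSH_KEY"])

# Build a structure with one BITMASK object per class you want to pre-label
structure = OntologyStructure()
for name in ["person", "car", "dog"]:  # placeholder subset of the COCO classes above
    structure.add_object(name=name, shape=Shape.BITMASK)

ontology = user_client.create_ontology(
    title="Mask R-CNN pre-labeling ontology",  # placeholder title
    structure=structure,
)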
Create a Workflow with a Pre-Labeling Agent Node¶
Create a Project in the Encord platform with a workflow that includes a pre-labeling agent node before the annotation stage. This node, called "pre-label," runs custom code to generate model predictions, automatically pre-labeling tasks before they are sent for annotation.
📖 Here is the documentation for creating Workflows in Encord.
Define the Pre-Labeling Agent¶
The following code provides a template for defining an agent that does pre-labeling. We assume that the Project only contains videos and that we want to do pre-labeling on all frames in each video.
You will have to update the three identifiers:
- <project_hash>: The Project hash of the Project that you want to apply the agent to.
- <agent_stage_name_or_uuid>: The workflow stage name (or UUID) that you want to run inference via.
- <pathway_name_or_uuid>: The pathway that the task should follow upon prediction.
Note that this code uses the dep_video_iterator
dependency to automatically load an iterator of frames as RGB numpy arrays from the video.
💡 Hint: If you want to predict only on, e.g., the first frame, consider using
from encord_agents.tasks.dependencies import dep_single_frame
instead. A sketch of that variant is shown after the main agent code below.
from typing import Iterable
from encord.objects.ontology_labels_impl import LabelRowV2
from encord.project import Project
from typing_extensions import Annotated
from encord_agents.core.data_model import Frame
from encord_agents.tasks import Depends, Runner
from encord_agents.tasks.dependencies import dep_video_iterator
BATCH_SIZE = 10
# a. Define a runner that will execute the agent on every task in the agent stage
runner = Runner(project_hash="<project_hash>")
# b. Define ontology map and prepare prediction function
coco_class_names = [
"__background__",
"person",
"bicycle",
"car",
"motorcycle",
"airplane",
"bus",
"train",
"truck",
"boat",
"traffic light",
"fire hydrant",
"N/A",
"stop sign",
"parking meter",
"bench",
"bird",
"cat",
"dog",
"horse",
"sheep",
"cow",
"elephant",
"bear",
"zebra",
"giraffe",
"N/A",
"backpack",
"umbrella",
"N/A",
"N/A",
"handbag",
"tie",
"suitcase",
"frisbee",
"skis",
"snowboard",
"sports ball",
"kite",
"baseball bat",
"baseball glove",
"skateboard",
"surfboard",
"tennis racket",
"bottle",
"N/A",
"wine glass",
"cup",
"fork",
"knife",
"spoon",
"bowl",
"banana",
"apple",
"sandwich",
"orange",
"broccoli",
"carrot",
"hot dog",
"pizza",
"donut",
"cake",
"chair",
"couch",
"potted plant",
"bed",
"N/A",
"dining table",
"N/A",
"N/A",
"toilet",
"N/A",
"tv",
"laptop",
"mouse",
"remote",
"keyboard",
"cell phone",
"microwave",
"oven",
"toaster",
"sink",
"refrigerator",
"N/A",
"book",
"clock",
"vase",
"scissors",
"teddy bear",
"hair drier",
"toothbrush",
]
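# Optional sanity check (not part of the original notebook): fail early with a clear
# message if an Ontology object name has no matching COCO class, since the
# coco_class_names.index(...) call below would otherwise raise a bare ValueError.
unmatched = [o.name for o in runner.project.ontology_structure.objects if o.name not in coco_class_names]
assert not unmatched, f"Ontology objects without a matching COCO class: {unmatched}"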
ont_map = {coco_class_names.index(o.name): o for o in runner.project.ontology_structure.objects}
# c. Define batch predict function
@torch.inference_mode()
def predict_batch(label_row: LabelRowV2, batch: list[Frame]) -> None:
"""
Utility to predict across a batch and store predictions on label row.
"""
input = list(map(lambda i: transform(i.content).to(device), batch))
predictions = model(input)
for frame, pred in zip(batch, predictions):
for ins in convert_predictions_to_encord(pred, ont_map, frame.frame):
label_row.add_object_instance(ins)
# d. Specify the logic that goes into the "pre-label" agent node.
@runner.stage(stage="<agent_stage_name_or_uuid>")
def run_something(
lr: LabelRowV2,
frames: Annotated[Iterable[Frame], Depends(dep_video_iterator)],
) -> str:
batch: list[Frame] = []
for frame in frames:
# Collect batch
batch.append(frame)
# Inference on full batch
if len(batch) == BATCH_SIZE:
predict_batch(lr, batch)
batch = []
# Inference on last "half" batch
if batch:
predict_batch(lr, batch)
lr.save()
return "<pathway_name_or_uuid>" # Tell where the task should go
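As mentioned in the hint above, you can predict on just the first frame of each video instead of iterating over all frames. The following is a minimal, untested sketch of that variant; use it instead of the run_something stage above, and note that the stage and pathway identifiers remain placeholders.
import numpy as np
from numpy.typing import NDArray

from encord_agents.tasks.dependencies import dep_single_frame


@runner.stage(stage="<agent_stage_name_or_uuid>")
def prelabel_first_frame(
    lr: LabelRowV2,
    frame: Annotated[NDArray[np.uint8], Depends(dep_single_frame)],
) -> str:
    # Run Mask R-CNN on the first frame only and attach the predictions to frame 0
    with torch.inference_mode():
        pred = model([transform(frame).to(device)])[0]
    for ins in convert_predictions_to_encord(pred, ont_map, frame_idx=0):
        lr.add_object_instance(ins)
    lr.save()
    return "<pathway_name_or_uuid>"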
Running the Agent¶
The runner object is callable, which means you can simply call it to execute the agent.
# Run the agent
# After every 5 label updates, tasks are moved along the workflow queue.
runner(task_batch_size=5)
Outcome¶
Your agent assigns labels to videos and routes them through the workflow to the annotation stage. As a result, each annotation task includes pre-labeled predictions.
💡 To run this as a command-line interface, save the code in an agent.py file and replace:

runner()

with:

if __name__ == "__main__":
    runner.run()

This lets you set parameters like the project hash from the command line:

python agent.py --project-hash "..."
If you've followed this successfully, or have another geometric pre-labelling use case and are thinking about how to deploy your model, please see DETR-Video-labelling for an example Dockerfile and container setup. This can make deploying and running your model easier.