Pre-label video with Mask-RCNN¶
This notebook demonstrates how to use a task agent to pre-label videos with predictions. In this case, we use the off-the-shelf Mask-RCNN model.
Before we start, let's get installations and authentication out of the way.
Step 1: Set up environment¶
Installation¶
Please ensure that you have the encord-agents library installed:
!python -m pip install encord-agents
# If you don't have torch installed (Colab has it by default),
# please install it by following the guide here: https://pytorch.org/get-started/locally/
Authentication¶
The library authenticates via SSH keys. Below is a code cell for setting the ENCORD_SSH_KEY
environment variable. It should contain the raw content of your private SSH key file.
If you have not yet set up an SSH key, please follow the documentation.
💡 Colab users: In Colab, you can set the key once under Secrets in the left sidebar and load it in new notebooks with:
from google.colab import userdata
key_content = userdata.get("ENCORD_SSH_KEY")
import os
os.environ["ENCORD_SSH_KEY"] = "private_key_file_content"
# or you can set a path to a file
# os.environ["ENCORD_SSH_KEY_FILE"] = "/path/to/your/private/key"
[Alternative] Temporary Key¶
You can also generate a temporary (fresh) SSH key pair via the code cell below. Please follow the instructions printed when executing the code.
# ⚠️ Safe to skip if you have authenticated already
import os
from encord_agents.utils.colab import generate_public_private_key_pair_with_instructions
private_key_path, public_key_path = generate_public_private_key_pair_with_instructions()
os.environ["ENCORD_SSH_KEY_FILE"] = private_key_path.as_posix()
Step 2: Load Mask-RCNN¶
Below, we load the Mask-RCNN model and its image transform so that we can use them for predictions.
import torch
import torchvision
import torchvision.models.detection
from torchvision.transforms import v2 as T

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def get_transform():
    return T.Compose([T.ToImage(), T.ToDtype(torch.float, scale=True), T.ToPureTensor()])


def get_model_instance_segmentation():
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
    model = model.eval().to(device)
    transform = get_transform()
    return model, transform


model, transform = get_model_instance_segmentation()
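As a quick sanity check (a hypothetical smoke test, safe to skip), you can run the model on a random tensor. Mask-RCNN takes a list of 3×H×W float tensors and returns one prediction dict per input:

# Hypothetical smoke test: one random 224x224 "image"
with torch.inference_mode():
    smoke_preds = model([torch.rand(3, 224, 224, device=device)])
print(sorted(smoke_preds[0].keys()))  # expect: boxes, labels, masks, scores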
Now, let's define some utility functions to:

- Convert the raw tensors from Mask-RCNN to Encord bitmask coordinates
- Apply non-maximum suppression (to avoid many overlapping predictions)
- Convert the raw tensors to Encord ObjectInstances
from encord.objects import Object as OntologyObject
from encord.objects import ObjectInstance
from encord.objects.bitmask import BitmaskCoordinates
from torchvision.ops import nms
def to_mask_coordinates(torch_mask: torch.Tensor, threshold: float = 0.5) -> BitmaskCoordinates:
    """
    Convert a torch mask to Encord bitmask coordinates.

    Args:
        - torch_mask: The mask tensor predicted by Mask-RCNN.
        - threshold: The threshold at which to cut the mask's floating point values. Higher values yield smaller masks.

    Returns:
        The Encord bitmask.
    """
    binary_mask = (torch_mask > threshold).detach().cpu().numpy().squeeze().astype(bool)
    return BitmaskCoordinates(binary_mask)
def apply_nms(pred, nms_iou_threshold: float):
    """
    Apply non-maximum suppression to the Mask-RCNN predictions.

    The method retains the bounding boxes to make it easy to modify the code
    to also work for bounding boxes.
    """
    indices = nms(pred["boxes"], pred["scores"], nms_iou_threshold)
    return {
        "masks": pred["masks"][indices],
        "boxes": pred["boxes"][indices],
        "labels": pred["labels"][indices],
        "scores": pred["scores"][indices],
    }
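# Hypothetical smoke test for apply_nms: two heavily overlapping detections;
# with an IoU of ~0.81, the lower-scoring one is suppressed at threshold 0.3.
_fake_pred = {
    "masks": torch.rand(2, 1, 10, 10),
    "boxes": torch.tensor([[0.0, 0.0, 10.0, 10.0], [1.0, 1.0, 10.0, 10.0]]),
    "labels": torch.tensor([1, 1]),
    "scores": torch.tensor([0.9, 0.8]),
}
assert len(apply_nms(_fake_pred, nms_iou_threshold=0.3)["boxes"]) == 1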
def convert_predictions_to_encord(
    predictions: dict[str, torch.Tensor],
    ontology_map: dict[int, OntologyObject],
    frame_idx: int = 0,
    conf_threshold: float = 0.50,
    nms_iou_threshold: float = 0.3,
) -> list[ObjectInstance]:
    """
    Convert Mask-RCNN predictions to Encord object instances.

    Intended use in pseudo code:

    ```
    preds = model(img)
    instances = convert_predictions_to_encord(preds)
    [label_row.add_object_instance(ins) for ins in instances]
    ```

    Args:
        - predictions: The output of Mask-RCNN for one frame.
        - ontology_map: The map between predicted labels and the Encord ontology objects.
        - frame_idx: The frame number to associate the prediction with.
            This is particularly important for videos.
        - conf_threshold: The confidence threshold above which we retain predictions.
        - nms_iou_threshold: The IoU threshold used during non-maximum suppression.

    Returns:
        - The resulting object instances.
    """
    # Apply non-maximum suppression
    if nms_iou_threshold > 0:
        predictions = apply_nms(predictions, nms_iou_threshold)

    out: list[ObjectInstance] = []
    for mask, label, conf in zip(predictions["masks"], predictions["labels"], predictions["scores"]):
        if conf < conf_threshold:
            continue

        ont_obj = ontology_map.get(label.item())
        if ont_obj is None:
            continue

        ins = ont_obj.create_instance()
        ins.set_for_frames(
            frames=frame_idx,
            coordinates=to_mask_coordinates(mask),
            confidence=conf.item(),
        )
        out.append(ins)
    return out
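Before wiring these utilities into an agent, here is a sketch of how they fit together on a single image. In this sketch, image is assumed to be an RGB numpy array, and label_row and ontology_map are assumed to be defined (as they will be in the agent below):

# Sketch: predict on one image and attach the instances to a label row
with torch.inference_mode():
    preds = model([transform(image).to(device)])
for ins in convert_predictions_to_encord(preds[0], ontology_map, frame_idx=0):
    label_row.add_object_instance(ins)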
Next, let us put this to use in an agent. In order to do so, we need i) a project ontology with classes that overlap with the Mask-RCNN (COCO) classes and ii) a project workflow that allows hooking in a pre-labeling agent.
Step 3: Set up your Ontology¶
Create an ontology with BITMASK objects named after some of the following classes (those from COCO).
coco_class_names = [
'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign',
'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella',
'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'dining table',
'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book',
'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]
Below is an example:
The code below matches these names against the corresponding COCO indices and uses the pre-trained model to fill in labels according to this ontology.
📖 Here is the documentation for creating ontologies.
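If you prefer to create the ontology with the SDK rather than in the app, a minimal sketch could look like the following. The title and the subset of class names are illustrative, and it assumes your SSH key is set in the environment as in the authentication step above:

from encord import EncordUserClient
from encord.objects.common import Shape
from encord.objects.ontology_structure import OntologyStructure

# Assumes ENCORD_SSH_KEY is set in the environment
user_client = EncordUserClient.create_with_ssh_private_key()
structure = OntologyStructure()
for name in ["person", "car", "dog"]:  # any subset of the COCO names above
    structure.add_object(name=name, shape=Shape.BITMASK)
user_client.create_ontology("Mask-RCNN pre-labeling", structure=structure)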
Step 4: Create a Workflow with a pre-labeling agent node¶
Create a project in the Encord platform that has a Workflow that includes a pre-labeling agent node before the annotation stage to automatically pre-label tasks with model predictions. This node is where we'll hook in Mask-RCNN to pre-label the data.
Notice how the workflow has a purple Agent node called "pre-label." This node will allow our custom code to run inference over the data before passing it on to the annotation stage.
📖 Here is the documentation for creating a workflow with Encord.
Step 5: Define the pre-labeling agent¶
The following code provides a template for defining an agent that does pre-labeling. We assume that the project only contains videos and that we want to do pre-labeling on all frames in each video.
You will have to update the three identifiers:

- <project_hash>: The project hash of the project that you wish to apply the agent to.
- <agent_stage_name_or_uuid>: The workflow stage name (or UUID) that you want to run inference via.
- <pathway_name_or_uuid>: The pathway the task should follow upon prediction.
Note that this code uses the dep_video_iterator
dependency to automatically load an iterator of frames as RGB numpy arrays from the video.
💡 Hint: If you only want to predict on, e.g., the first frame, consider using
from encord_agents.tasks.dependencies import dep_single_frame
instead (see the sketch after the full agent definition below).
from typing import Iterable
from encord.objects.ontology_labels_impl import LabelRowV2
from typing_extensions import Annotated
from encord_agents.core.data_model import Frame
from encord_agents.tasks import Depends, Runner
from encord_agents.tasks.dependencies import dep_video_iterator
BATCH_SIZE = 10
# a. Define a runner that will execute the agent on every task in the agent stage
runner = Runner(project_hash="<project_hash>")
# b. Define ontology map and prepare prediction function
coco_class_names = [
"__background__",
"person",
"bicycle",
"car",
"motorcycle",
"airplane",
"bus",
"train",
"truck",
"boat",
"traffic light",
"fire hydrant",
"N/A",
"stop sign",
"parking meter",
"bench",
"bird",
"cat",
"dog",
"horse",
"sheep",
"cow",
"elephant",
"bear",
"zebra",
"giraffe",
"N/A",
"backpack",
"umbrella",
"N/A",
"N/A",
"handbag",
"tie",
"suitcase",
"frisbee",
"skis",
"snowboard",
"sports ball",
"kite",
"baseball bat",
"baseball glove",
"skateboard",
"surfboard",
"tennis racket",
"bottle",
"N/A",
"wine glass",
"cup",
"fork",
"knife",
"spoon",
"bowl",
"banana",
"apple",
"sandwich",
"orange",
"broccoli",
"carrot",
"hot dog",
"pizza",
"donut",
"cake",
"chair",
"couch",
"potted plant",
"bed",
"N/A",
"dining table",
"N/A",
"N/A",
"toilet",
"N/A",
"tv",
"laptop",
"mouse",
"remote",
"keyboard",
"cell phone",
"microwave",
"oven",
"toaster",
"sink",
"refrigerator",
"N/A",
"book",
"clock",
"vase",
"scissors",
"teddy bear",
"hair drier",
"toothbrush",
]
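# NOTE: every object name in the ontology must exactly match one of the COCO
# class names above; otherwise `.index(...)` raises a ValueError.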
ont_map = {coco_class_names.index(o.name): o for o in runner.project.ontology_structure.objects}
# c. Define batch predict function
@torch.inference_mode()
def predict_batch(label_row: LabelRowV2, batch: list[Frame]) -> None:
    """
    Utility to predict across a batch and store predictions on the label row.
    """
    inputs = [transform(frame.content).to(device) for frame in batch]
    predictions = model(inputs)

    for frame, pred in zip(batch, predictions):
        for ins in convert_predictions_to_encord(pred, ont_map, frame.frame):
            label_row.add_object_instance(ins)
# d. Specify the logic that goes into the "pre-label" agent node.
@runner.stage(stage="<agent_stage_name_or_uuid>")
def run_something(
    lr: LabelRowV2,
    frames: Annotated[Iterable[Frame], Depends(dep_video_iterator)],
) -> str:
    batch: list[Frame] = []

    for frame in frames:
        # Collect batch
        batch.append(frame)

        # Inference on full batch
        if len(batch) == BATCH_SIZE:
            predict_batch(lr, batch)
            batch = []

    # Inference on last "half" batch
    if batch:
        predict_batch(lr, batch)

    lr.save()
    return "<pathway_name_or_uuid>"  # Tell the runner where the task should go
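If you only need predictions on the first frame of each video (as mentioned in the hint above), a variant of the agent could look like the following sketch. It would replace the video agent above, since both target the same stage, and it assumes that dep_single_frame supplies the first frame as an RGB numpy array:

import numpy as np
from numpy.typing import NDArray

from encord_agents.tasks.dependencies import dep_single_frame


@runner.stage(stage="<agent_stage_name_or_uuid>")
def prelabel_first_frame(
    lr: LabelRowV2,
    frame: Annotated[NDArray[np.uint8], Depends(dep_single_frame)],
) -> str:
    with torch.inference_mode():
        preds = model([transform(frame).to(device)])
    for ins in convert_predictions_to_encord(preds[0], ont_map, frame_idx=0):
        lr.add_object_instance(ins)
    lr.save()
    return "<pathway_name_or_uuid>"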
Running the agent¶
Now that we've defined the project, workflow, and the agent, it's time to try it out.
The runner object is callable, which means that you can simply call it to have it process your tasks.
# Run the agent
# After every 5 label updates, tasks are moved forward in the workflow queue.
runner(task_batch_size=5)
Your agent now assigns labels to the videos and routes them appropriately through the Workflow to the annotation stage. As a result, every annotation task should already have pre-existing labels (predictions) included.
💡 Hint: If you want to execute this as a Python script, you can turn it into a command line interface by putting the above code in an agent.py file and replacing the runner(...) call with
if __name__ == "__main__":
    runner.run()
This allows you to set, e.g., the project hash via the command line:
python agent.py --project-hash "..."