{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pre-Label Videos with Mask R-CNN\n",
"\n",
"This notebook shows how to use a task agent to automatically pre-label videos with predictions. It leverages the off-the-shelf MaskRNN model to generate initial annotations, streamlining the labeling process. If alteratively, you want to train a containerised approach to pre-labeling videos, please check out [DETR-Video-labelling](https://github.com/encord-team/encord-agents/tree/main/docker/detr-video-labeling)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Requirements\n",
"\n",
"This notebook guides you through the Workflow template and Ontology required. \n",
"\n",
"For this notebook, you need: \n",
" \n",
"- A Dataset containing videos in Encord. \n",
"- Access to Mark R-CNN."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Installation\n",
"\n",
"Ensure that you have the `encord-agents` library installed:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python -m pip install encord-agents[vision]\n",
"# If you don't have torch installed (Colab does by default)\n",
"# Please install it by following the guide here: https://pytorch.org/get-started/locally/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Encord Authentication\n",
"\n",
"Encord uses ssh-keys for authentication. The following is a code cell for setting the `ENCORD_SSH_KEY` environment variable. It contains the raw content of your private ssh key file.\n",
"\n",
"If you have not setup an ssh key, see our [documentation](https://agents-docs.encord.com/authentication/).\n",
"\n",
"> 💡 In colab, you can set the key once in the secrets in the left sidebar and load it in new notebooks. IF YOU ARE NOT RUNNING THE CODE IN THE COLLAB NOTEBOOK, you must set the environment variable directly.\n",
"> ```python\n",
"> os.environ[\"ENCORD_SSH_KEY\"] = \"\"\"paste-private-key-here\"\"\"\n",
"> ```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"os.environ[\"ENCORD_SSH_KEY\"] = \"private_key_file_content\"\n",
"# or you can set a path to a file\n",
"# os.environ[\"ENCORD_SSH_KEY_FILE\"] = \"/path/to/your/private/key\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### [Alternative] Temporary Key\n",
"There's also the option of generating a temporary (fresh) ssh key pair via the code cell below.\n",
"Please follow the instructions printed when executing the code."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ⚠️ Safe to skip if you have authenticated already\n",
"import os\n",
"\n",
"from encord_agents.utils.colab import generate_public_private_key_pair_with_instructions\n",
"\n",
"private_key_path, public_key_path = generate_public_private_key_pair_with_instructions()\n",
"os.environ[\"ENCORD_SSH_KEY_FILE\"] = private_key_path.as_posix()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Mask R-CNN Model\n",
"\n",
"Next, we need to load the Mask R-CNN model and its image transform to enable predictions. The following code initializes the model and its associated image transformation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"import torchvision\n",
"import torchvision.models.detection\n",
"from torchvision.models.detection.faster_rcnn import FastRCNNPredictor\n",
"from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor\n",
"from torchvision.transforms import v2 as T\n",
"\n",
"device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
"\n",
"\n",
"def get_transform():\n",
" return T.Compose([T.ToImage(), T.ToDtype(torch.float, scale=True), T.ToPureTensor()])\n",
"\n",
"\n",
"def get_model_instance_segmentation():\n",
" model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights=\"DEFAULT\")\n",
" model = model.eval().to(device)\n",
" transform = get_transform()\n",
" return model, transform\n",
"\n",
"\n",
"model, transform = get_model_instance_segmentation()"
]
},
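  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an optional sanity check, you can run the model on a random image tensor to inspect the prediction format that the conversion utilities below rely on: a dict of tensors keyed by `boxes`, `labels`, `scores`, and `masks`. The dummy input below is purely illustrative."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional smoke test on a dummy image (illustrative only).\n",
    "dummy = torch.rand(3, 480, 640)  # CHW float image with values in [0, 1]\n",
    "with torch.inference_mode():\n",
    "    preds = model([dummy.to(device)])\n",
    "# Each element of `preds` is a dict with \"boxes\", \"labels\", \"scores\", and \"masks\" tensors.\n",
    "print({k: tuple(v.shape) for k, v in preds[0].items()})"
   ]
  },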
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's define some utility functions to\n",
"\n",
"1. Convert the raw tensors from Mask-RNN to the encord bitmask coordinates\n",
"2. Apply non-maximum suppression (to avoid having many overlapping predictions)\n",
"3. Convert the raw tensors to Encord `ObjectInstance`s."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from encord.objects import Object as OntologyObject\n",
"from encord.objects import ObjectInstance\n",
"from encord.objects.bitmask import BitmaskCoordinates\n",
"from encord.ontology import OntologyStructure\n",
"from torchvision.ops import nms\n",
"\n",
"\n",
"def to_mask_coordinates(torch_mask: torch.Tensor, threshold: float = 0.5) -> BitmaskCoordinates:\n",
" \"\"\"\n",
" Convert torch mask to bitmask coordinates.\n",
"\n",
" args:\n",
" - threshold: threshold at which to cut the mask floating point values. Higher values will yield smaller masks.\n",
"\n",
" returns:\n",
" Encord bitmask\n",
" \"\"\"\n",
" binary_mask = (torch_mask > threshold).detach().cpu().numpy().squeeze().astype(bool)\n",
" return BitmaskCoordinates(binary_mask)\n",
"\n",
"\n",
"def apply_nms(pred, nms_iou_threshold: float):\n",
" \"\"\"\n",
" Apply non-maximum suppression to the mask-rcnn predictions.\n",
"\n",
" The method retains the bounding boxes to make it easy to modify the code\n",
" to also work for bounding boxes.\n",
" \"\"\"\n",
" indices = nms(pred[\"boxes\"], pred[\"scores\"], nms_iou_threshold)\n",
" return {\n",
" \"masks\": pred[\"masks\"][indices],\n",
" \"boxes\": pred[\"boxes\"][indices],\n",
" \"labels\": pred[\"labels\"][indices],\n",
" \"scores\": pred[\"scores\"][indices],\n",
" }\n",
"\n",
"\n",
"def convert_predictions_to_encord(\n",
" predictions: dict[str, torch.Tensor],\n",
" ontology_map: dict[int, OntologyObject],\n",
" frame_idx: int = 0,\n",
" conf_threshold: float = 0.50,\n",
" nms_iou_threshold: float = 0.3,\n",
") -> list[ObjectInstance]:\n",
" \"\"\"\n",
" Convert mask-rcnn prediction to Encord object instances.\n",
"\n",
" Intended use in pseudo code:\n",
"\n",
" ```\n",
" preds = model(img)\n",
" instances = convert_predictions_to_encord(preds)\n",
" [label_row.add_object_instance(ins) for ins in instances]\n",
" ```\n",
"\n",
" Args:\n",
" - predictions: The output of mask-rcnn for one frame.\n",
" - ontology_map: The map between predicted labels and the Encord ontology objects.\n",
" - frame_idx: The frame number to associate the prediction with.\n",
" This is particularly important for videos.\n",
" - conf_threshold: The threshold at which we want to retain predictions.\n",
" - nms_iou_threshold: The threshold that we wich to select above during nms.\n",
"\n",
" Returns:\n",
" - The resulting object instanesl.\n",
" \"\"\"\n",
"\n",
" # Apply non-maximum suppression\n",
" if nms_iou_threshold > 0:\n",
" predictions = apply_nms(predictions, nms_iou_threshold)\n",
"\n",
" out: list[ObjectInstance] = []\n",
" for mask, label, conf in zip(predictions[\"masks\"], predictions[\"labels\"], predictions[\"scores\"]):\n",
" if label.item() not in ontology_map or conf < conf_threshold:\n",
" continue\n",
"\n",
" if ont_obj := ontology_map.get(label.item()):\n",
" ins = ont_obj.create_instance()\n",
" ins.set_for_frames(\n",
" frames=frame_idx,\n",
" coordinates=to_mask_coordinates(mask),\n",
" confidence=conf.item(),\n",
" )\n",
" out.append(ins)\n",
" return out"
]
},
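  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see the NMS step in isolation, here is a small, self-contained check on synthetic tensors (all values are made up for illustration): two identical boxes overlap completely, so the lower-scoring duplicate is suppressed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Synthetic prediction dict mimicking the Mask R-CNN output format.\n",
    "fake_pred = {\n",
    "    \"masks\": torch.rand(3, 1, 4, 4),\n",
    "    \"boxes\": torch.tensor([[0.0, 0.0, 2.0, 2.0], [0.0, 0.0, 2.0, 2.0], [3.0, 3.0, 4.0, 4.0]]),\n",
    "    \"labels\": torch.tensor([1, 1, 2]),\n",
    "    \"scores\": torch.tensor([0.9, 0.8, 0.7]),\n",
    "}\n",
    "kept = apply_nms(fake_pred, nms_iou_threshold=0.3)\n",
    "# The duplicate of the first box is suppressed, leaving two predictions.\n",
    "print(len(kept[\"boxes\"]))"
   ]
  },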
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let us put this to use in an agent.\n",
"In order to do so, we need i) a project ontology which has classes overlapping with the MaskRCNN classes and ii) a project workflow which allows hooking in a pre-labeling agent."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Set up your Ontology\n",
"\n",
"Create an ontology with __BITMASK__ objects named by some of the following classes (those from COCO).\n",
"\n",
"```\n",
"coco_class_names = [\n",
" 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',\n",
" 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign',\n",
" 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',\n",
" 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella',\n",
" 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',\n",
" 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',\n",
" 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',\n",
" 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',\n",
" 'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'dining table',\n",
" 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',\n",
" 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book',\n",
" 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'\n",
"]\n",
"```\n",
"\n",
"Below is an example:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
" \n",
" Figure 1: Project ontology.\n",
""
]
},
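  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you prefer to create such an Ontology programmatically rather than in the platform UI, here is a minimal sketch using the Encord SDK. It assumes the `ENCORD_SSH_KEY` environment variable is set as above, and the class subset and title are illustrative examples to adapt."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Minimal sketch: create a BITMASK Ontology for a few example COCO classes.\n",
    "import os\n",
    "\n",
    "from encord.objects.common import Shape\n",
    "from encord.ontology import OntologyStructure\n",
    "from encord.user_client import EncordUserClient\n",
    "\n",
    "# Authenticate with the private key content from the environment variable set earlier.\n",
    "user_client = EncordUserClient.create_with_ssh_private_key(os.environ[\"ENCORD_SSH_KEY\"])\n",
    "\n",
    "structure = OntologyStructure()\n",
    "for class_name in [\"person\", \"car\", \"dog\"]:  # example subset of COCO classes\n",
    "    structure.add_object(name=class_name, shape=Shape.BITMASK)\n",
    "\n",
    "ontology = user_client.create_ontology(\n",
    "    title=\"Mask R-CNN pre-labeling ontology\",  # example title\n",
    "    structure=structure,\n",
    ")\n",
    "print(ontology.ontology_hash)"
   ]
  },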
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following code matches these against the right COCO indices and use the pre-trained model to fill in labels according to this Ontology.\n",
"\n",
"[📖 Here](https://docs.encord.com/platform-documentation/GettingStarted/gettingstarted-create-ontology) is the documentation for creating Ontologies."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create a Workflow with a Pre-Labeling Agent Node\n",
"\n",
"Create a Project in the Encord platform with a workflow that includes a pre-labeling agent node before the annotation stage. This node, called **\"pre-label,\"** runs custom code to generate model predictions, automatically pre-labeling tasks before they are sent for annotation.\n",
"\n",
"[📖 Here](https://docs.encord.com/platform-documentation/Annotate/annotate-projects/annotate-workflows-and-templates#creating-workflows) is the documentation for creating Workflows in Encord."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
" \n",
" Figure 2: Project workflow.\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define the Pre-Labeling Agent\n",
"\n",
"The following code provides a template for defining an agent that does pre-labeling.\n",
"We assume that the Project only contains videos and the we want to do pre-labeling on all frames in each video.\n",
"\n",
"You will have to update the three identifiers: \n",
"\n",
"- ``: The project hash of the project that you wish to apply the agent to.\n",
"- ``: The workflow stage name (or uuid) that you want to run inference via.\n",
"- ``: The pathway the the task should follow upon prediction.\n",
"\n",
"\n",
"Note that this code uses the [`dep_video_sampler` dependency](../../reference/task_agents.md#encord_agents.tasks.dependencies.dep_video_sampler) to automatically load a sampler of frames as RGB numpy arrays from the video. The sampler allows for either a fixed rate frame sampling method or taking a list of target frames to utilize.\n",
"\n",
"> 💡 Hint: If you want to only predict, e.g., on the first frame, consider using `from encord_agents.tasks.depencencies import dep_single_frame` instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from typing import Callable, Iterable, Sequence\n",
"\n",
"from encord.objects.ontology_labels_impl import LabelRowV2\n",
"from encord.project import Project\n",
"from typing_extensions import Annotated\n",
"\n",
"from encord_agents.core.data_model import Frame\n",
"from encord_agents.tasks import Depends, Runner\n",
"from encord_agents.tasks.dependencies import dep_video_sampler\n",
"\n",
"BATCH_SIZE = 10\n",
"\n",
"# a. Define a runner that will execute the agent on every task in the agent stage\n",
"runner = Runner(project_hash=\"\")\n",
"\n",
"# b. Define ontology map and prepare prediction function\n",
"coco_class_names = [\n",
" \"__background__\",\n",
" \"person\",\n",
" \"bicycle\",\n",
" \"car\",\n",
" \"motorcycle\",\n",
" \"airplane\",\n",
" \"bus\",\n",
" \"train\",\n",
" \"truck\",\n",
" \"boat\",\n",
" \"traffic light\",\n",
" \"fire hydrant\",\n",
" \"N/A\",\n",
" \"stop sign\",\n",
" \"parking meter\",\n",
" \"bench\",\n",
" \"bird\",\n",
" \"cat\",\n",
" \"dog\",\n",
" \"horse\",\n",
" \"sheep\",\n",
" \"cow\",\n",
" \"elephant\",\n",
" \"bear\",\n",
" \"zebra\",\n",
" \"giraffe\",\n",
" \"N/A\",\n",
" \"backpack\",\n",
" \"umbrella\",\n",
" \"N/A\",\n",
" \"N/A\",\n",
" \"handbag\",\n",
" \"tie\",\n",
" \"suitcase\",\n",
" \"frisbee\",\n",
" \"skis\",\n",
" \"snowboard\",\n",
" \"sports ball\",\n",
" \"kite\",\n",
" \"baseball bat\",\n",
" \"baseball glove\",\n",
" \"skateboard\",\n",
" \"surfboard\",\n",
" \"tennis racket\",\n",
" \"bottle\",\n",
" \"N/A\",\n",
" \"wine glass\",\n",
" \"cup\",\n",
" \"fork\",\n",
" \"knife\",\n",
" \"spoon\",\n",
" \"bowl\",\n",
" \"banana\",\n",
" \"apple\",\n",
" \"sandwich\",\n",
" \"orange\",\n",
" \"broccoli\",\n",
" \"carrot\",\n",
" \"hot dog\",\n",
" \"pizza\",\n",
" \"donut\",\n",
" \"cake\",\n",
" \"chair\",\n",
" \"couch\",\n",
" \"potted plant\",\n",
" \"bed\",\n",
" \"N/A\",\n",
" \"dining table\",\n",
" \"N/A\",\n",
" \"N/A\",\n",
" \"toilet\",\n",
" \"N/A\",\n",
" \"tv\",\n",
" \"laptop\",\n",
" \"mouse\",\n",
" \"remote\",\n",
" \"keyboard\",\n",
" \"cell phone\",\n",
" \"microwave\",\n",
" \"oven\",\n",
" \"toaster\",\n",
" \"sink\",\n",
" \"refrigerator\",\n",
" \"N/A\",\n",
" \"book\",\n",
" \"clock\",\n",
" \"vase\",\n",
" \"scissors\",\n",
" \"teddy bear\",\n",
" \"hair drier\",\n",
" \"toothbrush\",\n",
"]\n",
"ont_map = {coco_class_names.index(o.name): o for o in runner.project.ontology_structure.objects}\n",
"\n",
"\n",
"# c. Define batch predict function\n",
"@torch.inference_mode()\n",
"def predict_batch(label_row: LabelRowV2, batch: list[Frame]) -> None:\n",
" \"\"\"\n",
" Utility to predict across a batch and store predictions on label row.\n",
" \"\"\"\n",
" input = list(map(lambda i: transform(i.content).to(device), batch))\n",
" predictions = model(input)\n",
"\n",
" for frame, pred in zip(batch, predictions):\n",
" for ins in convert_predictions_to_encord(pred, ont_map, frame.frame):\n",
" label_row.add_object_instance(ins)\n",
"\n",
"\n",
"# d. Specify the logic that goes into the \"pre-label\" agent node.\n",
"@runner.stage(stage=\"\")\n",
"def run_something(\n",
" lr: LabelRowV2,\n",
" frame_sampler: Annotated[Callable[[float | Sequence[int]], Iterable[Frame]], Depends(dep_video_sampler)],\n",
") -> str:\n",
" batch: list[Frame] = []\n",
" for frame in frame_sampler(1 / 5):\n",
" # Collect batch\n",
" batch.append(frame)\n",
"\n",
" # Inference on full batch\n",
" if len(batch) == BATCH_SIZE:\n",
" predict_batch(lr, batch)\n",
" batch = []\n",
"\n",
" # Inference on last \"half\" batch\n",
" if batch:\n",
" predict_batch(lr, batch)\n",
"\n",
" lr.save()\n",
" return \"\" # Tell where the task should go"
]
},
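  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, below is a minimal sketch of the single-frame variant hinted at above, using `dep_single_frame` to fetch only the first frame as an RGB numpy array. It is an *alternative* to the handler above (register only one handler per stage) and reuses `model`, `transform`, `ont_map`, and the `<agent_stage_name>`/`<pathway_name>` placeholders."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Alternative sketch: pre-label only the first frame of each video.\n",
    "# Do not register this alongside the handler above for the same stage.\n",
    "import numpy as np\n",
    "from numpy.typing import NDArray\n",
    "\n",
    "from encord_agents.tasks.dependencies import dep_single_frame\n",
    "\n",
    "\n",
    "@runner.stage(stage=\"<agent_stage_name>\")\n",
    "def pre_label_first_frame(\n",
    "    lr: LabelRowV2,\n",
    "    frame: Annotated[NDArray[np.uint8], Depends(dep_single_frame)],\n",
    ") -> str:\n",
    "    with torch.inference_mode():\n",
    "        pred = model([transform(frame).to(device)])[0]\n",
    "    for ins in convert_predictions_to_encord(pred, ont_map, frame_idx=0):\n",
    "        lr.add_object_instance(ins)\n",
    "    lr.save()\n",
    "    return \"<pathway_name>\"  # pathway placeholder, as above"
   ]
  },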
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Running the Agent\n",
"\n",
"The `runner` object is callable, allowing you to use it to prioritize tasks efficiently."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Run the agent\n",
"# After 5 label updates, tasks will be moved in workflow queue.\n",
"runner(task_batch_size=5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Outcome\n",
"\n",
"Your agent assigns labels to videos and routes them through the workflow to the annotation stage. As a result, each annotation task includes pre-labeled predictions. \n",
"\n",
"> 💡 To run this as a command-line interface, save the code in an `agents.py` file and replace: \n",
"> ```python\n",
"> runner()\n",
"> ``` \n",
"> with: \n",
"> ```python\n",
"> if __name__ == \"__main__\":\n",
"> runner.run()\n",
"> ``` \n",
"> This lets you set parameters like the project hash from the command line: \n",
"> ```bash\n",
"> python agent.py --project-hash \"...\"\n",
"> ```\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you've followed this sucessfully or have another geometric pre-labelling usecase and are thinking about how to deploy your model, please see: [DETR-Video-labelling](https://github.com/encord-team/encord-agents/tree/main/docker/detr-video-labeling) for an example Dockerfile and container setup. This can make deploying and running your model easier."
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"gpuType": "T4",
"provenance": []
},
"kernelspec": {
"display_name": "encord-agents-tO19NJQ2-py3.11",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}