{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Multistage VLM video captioning\n",
"\n",
"Inspired by the approach used in the [CogX Video model](https://github.com/THUDM/CogVideo) this notebook:\n",
"\n",
"1. Uses a VLM to generate frame-specific captions.\n",
"2. Compile these captions into a textual summary using an LLM.\n",
"\n",
"> 💡 This method is most effective for shorter clips with minimal scene changes. \n",
"\n",
"CogVideoX leverages these captioned videos to train text-to-video models, and similar workflows can be applied to other multimodal generative AI applications."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1: Set Up Environment"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Installation\n",
"\n",
"Ensure that you install:\n",
"- The `encord-agents` library.\n",
"- The `openai` library."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python -m pip install encord-agents openai"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Authentication\n",
"\n",
"The Encord agents library authenticates using ssh keys, and OpenAI using an API key. The following code cell for setting the `ENCORD_SSH_KEY` and `OPENAI_API_KEY`environment variables. It must contain the raw content of your private ssh key file and OpenAI API key.\n",
"\n",
"> - Replace `private_key_file_content` with your Encord private key.\n",
"> - Replace `api_key_file_content` with your OpenAI API key.\n",
"\n",
"If you have not yet setup an ssh key, please follow the [documentation](https://agents-docs.encord.com/authentication/).\n",
"\n",
"> 💡 In the colab notebook, you can set the key once in the secrets in the left sidebar and load it in new notebooks. IF YOU ARE NOT RUNNING THE CODE IN THE COLLAB NOTEBOOK, you must set the environment variable directly.\n",
"> ```python\n",
"> import os\n",
"> from google.colab import userdata\n",
"> os.environ[\"ENCORD_SSH_KEY\"] = userdata.get(\"ENCORD_SSH_KEY\")\n",
"> os.environ[\"OPENAI_API_KEY\"] = userdata.get(\"OPENAI_API_KEY\")\n",
"> ```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"os.environ[\"ENCORD_SSH_KEY\"] = \"private_key_file_content\"\n",
"os.environ[\"OPENAI_API_KEY\"] = \"api_key_file_content\"\n",
"\n",
"# or you can set a path to a file\n",
"# os.environ[\"ENCORD_SSH_KEY_FILE\"] = \"/path/to/your/private/key\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2: Set up Encord environment"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.1: Set up an Ontology\n",
"\n",
"Set up a simple Ontology containing:\n",
"\n",
"- A text classification to summerise the entire task.\n",
"- A text classification to summerise a single frame."
]
},
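{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can create this Ontology in the Encord platform UI. Alternatively, the sketch below shows one way to create it programmatically with the Encord SDK. This is a minimal sketch rather than a required step: the Ontology title is a placeholder, and the two attribute titles must match the names referenced by the agent code later in this notebook (\"Frame by frame captioning\" and \"Summarisation of entire clip\")."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"from encord import EncordUserClient\n",
"from encord.objects.attributes import TextAttribute\n",
"from encord.objects.ontology_structure import OntologyStructure\n",
"\n",
"# Minimal sketch: build the two text classifications used by the agents below.\n",
"structure = OntologyStructure()\n",
"\n",
"# Text classification used to caption individual frames\n",
"frame_classification = structure.add_classification()\n",
"frame_classification.add_attribute(TextAttribute, \"Frame by frame captioning\")\n",
"\n",
"# Text classification holding the summary of the entire clip\n",
"clip_classification = structure.add_classification()\n",
"clip_classification.add_attribute(TextAttribute, \"Summarisation of entire clip\")\n",
"\n",
"user_client = EncordUserClient.create_with_ssh_private_key(os.environ[\"ENCORD_SSH_KEY\"])\n",
"ontology = user_client.create_ontology(\n",
"    \"Multistage video captioning\",  # Placeholder title\n",
"    structure=structure,\n",
")"
]
},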
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2.2: Create a Workflow Template\n",
"\n",
"Employ a simple Workflow with 2 separate Agent stages. The first \"Frame Captioning\" stage summerises individual frame, and the second \"Semating Summarization\" stage summerises the entire task."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3: Defining the Agents\n",
"\n",
"Both agents can be defined together. When using the runner, it iterates through the defined agents, fetching all tasks at their respective stages and processing them. The system follows the natural ordering of the Workflow Graph, prioritizing earlier tasks to ensure they progress to later stages efficiently."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from dataclasses import dataclass\n",
"\n",
"from openai import OpenAI\n",
"\n",
"\n",
"# Data class to hold predictions from our model\n",
"@dataclass\n",
"class ModelPrediction:\n",
" caption: str\n",
" conf: float\n",
"\n",
"\n",
"def caption_from_openai(base64_img_url: str) -> str:\n",
" openai_client = OpenAI()\n",
" prompt = \"Please summarise the following frame from a video. Be succinct and start immediately. Describe the frame and only the frame directly\"\n",
" response = openai_client.chat.completions.create(\n",
" model=\"gpt-4o-mini\",\n",
" messages=[\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\"type\": \"text\", \"text\": prompt},\n",
" {\"type\": \"image_url\", \"image_url\": {\"url\": base64_img_url}},\n",
" ],\n",
" }\n",
" ],\n",
" )\n",
" return response.choices[0].message.content or \"Failed to summarise\"\n",
"\n",
"\n",
"def caption_frame(base64_img_url: str) -> ModelPrediction:\n",
" caption = caption_from_openai(base64_img_url)\n",
" conf = 0.5 # One could try to ask VLM on their confidence of the individual caption\n",
" return ModelPrediction(caption=caption, conf=conf)\n",
"\n",
"\n",
"model = caption_frame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ensure that you replace `` with the unique ID of your Encord Project."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from typing import Iterable\n",
"\n",
"from encord.objects.classification import Classification\n",
"from encord.objects.classification_instance import ClassificationInstance\n",
"from encord.objects.ontology_labels_impl import LabelRowV2\n",
"from encord.project import Project\n",
"from typing_extensions import Annotated\n",
"\n",
"from encord_agents.core.data_model import Frame\n",
"from encord_agents.tasks import Depends, Runner\n",
"from encord_agents.tasks.dependencies import dep_video_iterator\n",
"\n",
"# a. Define a runner that executes the agent on every task in the agent stage\n",
"runner = Runner(project_hash=\"\")\n",
"\n",
"\n",
"# b. Specify the logic that goes into the \"pre-label\" agent node.\n",
"@runner.stage(stage=\"Frame Captioning\")\n",
"def caption_video(\n",
" lr: LabelRowV2,\n",
" project: Project,\n",
" frames: Annotated[Iterable[Frame], Depends(dep_video_iterator)],\n",
") -> str:\n",
" ontology = project.ontology_structure\n",
" captions: dict[int, ModelPrediction] = {}\n",
" # c. Loop over the frames in the video\n",
" ontology_element: Classification = ontology.get_child_by_title(\"Frame by frame captioning\")\n",
" for frame_idx, frame in enumerate(frames):\n",
" if frame_idx % 48 != 0: # For every 48th frame in the video\n",
" continue\n",
" frame_dense_summarisation: ClassificationInstance = ontology_element.create_instance()\n",
" frame_caption = model(frame.b64_encoding())\n",
" captions[frame_idx] = frame_caption\n",
" frame_dense_summarisation.set_answer(frame_caption.caption)\n",
" frame_dense_summarisation.set_for_frames(frame_idx, overwrite=True)\n",
" lr.add_classification_instance(frame_dense_summarisation, force=True)\n",
"\n",
" lr.save()\n",
" return \"Captioned\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we define the summarization component of the agent. This step takes the frame-tagged captions and processes them through an LLM (which does not need to be vision-based) to generate a concise summary of the entire video. \n",
"\n",
"The prompt used for this process is directly influenced by the CogX paper, ensuring effective aggregation of individual frame captions into a coherent summary."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_prompt(dict_frame_caption: dict[int, str]) -> str:\n",
" summarisation_prompt = f\"\"\"\n",
" We extracted several frames from this video and described\n",
" each frame using an image understanding model, stored\n",
" in the dictionary variable ‘image_captions: Dict[str: str]‘.\n",
" In ‘image_captions‘, the key is the frame at which the image\n",
" appears in the video, and the value is a detailed description\n",
" of the image at that moment. Please describe the content of\n",
" this video in as much detail as possible, based on the\n",
" information provided by ‘image_captions‘, including\n",
" the objects, scenery, animals, characters, and camera\n",
" movements within the video. \\n image_captions={dict_frame_caption}\\n\n",
" You should output your summary directly, and not mention\n",
" variables like ‘image_captions‘ in your response.\n",
" Do not include ‘\\\\n’ and the word ’video’ in your response.\n",
" Do not use introductory phrases such as: \\\"The video\n",
" presents\\\", \\\"The video depicts\\\", \\\"This video showcases\\\",\n",
" \\\"The video captures\\\" and so on.\\n Please start the\n",
" description with the video content directly, such as \\\"A man\n",
" first sits in a chair, then stands up and walks to the\n",
" kitchen....\\\"\\n Do not use phrases like: \\\"as the video\n",
" progressed\\\" and \\\"Throughout the video\\\".\\n\n",
" the content of the video and the changes that occur, in\n",
" chronological order.\\n Please keep the description of this\n",
" video within 100 English words.\n",
" \"\"\"\n",
" return summarisation_prompt\n",
"\n",
"\n",
"def get_summaerised_caption(dict_frame_caption: dict[int, str]) -> str:\n",
" prompt = get_prompt(dict_frame_caption)\n",
" openai_client = OpenAI()\n",
" resp = openai_client.chat.completions.create(\n",
" messages=[\n",
" {\"role\": \"user\", \"content\": prompt},\n",
" ],\n",
" model=\"gpt-4o\",\n",
" )\n",
" return resp.choices[0].message.content or \"Failed to get summarisation\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@runner.stage(stage=\"Semantic Summarization\")\n",
"def summarise_captions(lr: LabelRowV2, project: Project) -> str:\n",
" ontology = project.ontology_structure\n",
" ontology_element: Classification = ontology.get_child_by_title(\"Frame by frame captioning\")\n",
" classification_elements = lr.get_classification_instances(filter_ontology_classification=ontology_element)\n",
" dict_frame_caption: dict[int, str] = {}\n",
" for inst in classification_elements:\n",
" anno = inst.get_annotations()[0]\n",
" dict_frame_caption[anno.frame] = str(inst.get_answer())\n",
" summarised_caption = get_summaerised_caption(dict_frame_caption)\n",
" succint_ontology_element: Classification = ontology.get_child_by_title(\"Summarisation of entire clip\")\n",
" succ_inst = succint_ontology_element.create_instance()\n",
" succ_inst.set_answer(summarised_caption)\n",
" succ_inst.set_for_frames(frames=0)\n",
" lr.add_classification_instance(succ_inst)\n",
" lr.save()\n",
" return \"Summarised\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Running the Agent\n",
"\n",
"With the project, workflow, and agent defined, it's time to put everything into action. The `runner` object is callable, allowing you to execute it directly to prioritize and process your tasks efficiently."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Run the agent\n",
"runner()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Your agent now assigns labels to videos and routes them through the workflow to the annotation stage. As a result, each annotation task includes pre-existing labels (predictions). \n",
"\n",
"> 💡*Hint:* To run this as a Python script, place the above code in an `agents.py` file and change: \n",
"> ```python\n",
"> runner()\n",
"> ```\n",
"> to \n",
"> ```python\n",
"> if __name__ == \"__main__\":\n",
"> runner.run()\n",
"> ```\n",
"> This allows you to set parameters like the project hash using the command line: \n",
"> ```bash\n",
"> python agent.py --project-hash \"...\"\n",
"> ```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Outcome\n",
"\n",
"With these video summaries generated, you can now use them to train your model. As mentioned earlier, a notable example is the CogX paper, which presents an open-source generative text-to-video model, though other approaches are also viable. \n",
"\n",
"Consider leveraging your internal video corpus, using this workflow to generate captions, and fine-tuning your model for your specific use case."
]
}
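,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to feed these captions into a training pipeline, the sketch below shows one way to export the clip-level summaries with the Encord SDK. It is a minimal sketch, not part of the Workflow above: `<project_hash>` and the output file `clip_captions.jsonl` are placeholders, and it assumes the Ontology titles used earlier in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import os\n",
"\n",
"from encord import EncordUserClient\n",
"from encord.objects.classification import Classification\n",
"\n",
"# Minimal sketch: export the clip-level summaries produced by the agents to a JSONL file.\n",
"user_client = EncordUserClient.create_with_ssh_private_key(os.environ[\"ENCORD_SSH_KEY\"])\n",
"project = user_client.get_project(\"<project_hash>\")  # Replace with your Project's unique ID\n",
"summary_classification: Classification = project.ontology_structure.get_child_by_title(\"Summarisation of entire clip\")\n",
"\n",
"with open(\"clip_captions.jsonl\", \"w\") as f:\n",
"    for label_row in project.list_label_rows_v2():\n",
"        label_row.initialise_labels()\n",
"        for inst in label_row.get_classification_instances(filter_ontology_classification=summary_classification):\n",
"            record = {\"data_title\": label_row.data_title, \"caption\": str(inst.get_answer())}\n",
"            f.write(json.dumps(record) + \"\\n\")"
]
}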
],
"metadata": {
"colab": {
"provenance": [],
"toc_visible": true
},
"kernelspec": {
"display_name": "cogxvideodemo-yb5agPqU-py3.11",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}