Audio Transcription¶
This notebook walks you through using an AI-powered Agent to transcribe and diarize audio files. The Agent automatically converts speech to text while distinguishing between different speakers.
Example Workflow¶
The following workflow illustrates how audio files can be pre-labeled. The code in this notebook is for the Diarization agent.
Installation¶
Ensure that you install:

- The `encord-agents` library.
- `pyannote.audio`: A deep learning-based toolkit for speaker diarization, used to identify and differentiate speakers in an audio file.
- `transformers` & `accelerate`: Hugging Face libraries used for running and optimizing transformer models, which improve transcription accuracy and performance.
!python -m pip install -q encord-agents "pyannote.audio"
!python -m pip install --upgrade -q transformers accelerate
Hugging Face Authentication¶
Retrieve the Hugging Face API token for authentication when accessing models and datasets. An HF token (Hugging Face token) is an authentication key used to access Hugging Face's models, datasets, and APIs.
💡 In Colab, you can set the key once in the secrets in the left sidebar and load it in new notebooks. IF YOU ARE NOT RUNNING THE CODE IN THE COLAB NOTEBOOK, you must set the environment variable directly.
hf_token = "my-hf-token"
from google.colab import userdata
hf_token = userdata.get("HF_TOKEN")
Please log in to Hugging Face and accept the terms of use for these two models: pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0 (the latter is a dependency of the diarization pipeline). Otherwise, the models will fail to load in the cells below.
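Optionally, you can authenticate the whole session up front instead of passing hf_token to each call; a minimal sketch, assuming the huggingface_hub client that ships with transformers:

from huggingface_hub import login

# Optional: register the token for this session so the gated pyannote models can be downloaded.
login(token=hf_token)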
Defines data structures and imports necessary libraries for audio transcription and speaker diarization.

- `Segment`: Represents a time segment in an audio file.
- `Diary`: Stores diarization results, including the speaker and transcribed text.
- `Diarization`: A structured model to hold multiple diarized segments and retrieve unique speakers.
- Imports necessary libraries for deep learning, audio processing, and transcription.
from pathlib import Path
from typing import List, Optional, Union
import numpy as np
import requests
import torch
from encord.objects.frames import Range
from pyannote.audio import Pipeline
from pyannote.core.annotation import Annotation
from pydantic import BaseModel, RootModel
from torchaudio import functional as F
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from transformers.pipelines.audio_utils import ffmpeg_read
class Segment(BaseModel):
    start: float
    end: float

    @property
    def encord_range(self) -> Range:
        return Range(int(0.5 + self.start * 1000), int(0.5 + self.end * 1000))


class Diary(BaseModel):
    segment: Segment
    speaker: str
    text: str = ""


class Diarization(RootModel):
    root: List[Diary]

    @property
    def speakers(self) -> List[str]:
        return sorted(list(set([diary.speaker for diary in self.root])))
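As a quick sanity check, here is a minimal sketch of how these models fit together (the values are made up for illustration):

# Hypothetical example: one turn per speaker.
diaries = Diarization.model_validate(
    [
        {"segment": {"start": 0.0, "end": 1.5}, "speaker": "SPEAKER_00", "text": "Hello there."},
        {"segment": {"start": 1.8, "end": 3.2}, "speaker": "SPEAKER_01", "text": "Hi!"},
    ]
)
print(diaries.speakers)                      # ['SPEAKER_00', 'SPEAKER_01']
print(diaries.root[0].segment.encord_range)  # start=0, end=1500 (milliseconds)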
Defines the `Diarizer` class for speaker diarization and transcription.

- Initialization (`__init__`): Loads pretrained models for diarization and transcription, setting up processing pipelines.
- Preprocessing (`preprocess`): Converts audio files into a format suitable for diarization and transcription.
- Segment Merging (`prepare_segments`): Combines consecutive segments from the same speaker into a single segment.
- Transcription (`transcribe_segments`): Uses Whisper to transcribe diarized segments in batches.
- Full Pipeline (`diarize_and_transcribe`): Runs diarization and transcription on an audio file, returning structured speaker-labeled transcripts.
class Diarizer:
    def __init__(
        self,
        diarizer_model: str = "pyannote/speaker-diarization-3.1",
        transcription_model: str = "openai/whisper-medium",
    ):
        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

        # Diarization
        self.diarization_pipeline = Pipeline.from_pretrained(diarizer_model, use_auth_token=hf_token).to(self.device)

        # Transcription
        processor = AutoProcessor.from_pretrained(transcription_model)
        self.sampling_rate = processor.feature_extractor.sampling_rate
        self.whisper_pipeline = pipeline(
            "automatic-speech-recognition",
            model=transcription_model,
            chunk_length_s=30,
        )
    def preprocess(self, inputs):
        with open(inputs, "rb") as f:
            inputs = f.read()

        if isinstance(inputs, bytes):
            inputs = ffmpeg_read(inputs, self.sampling_rate).copy()

        if len(inputs.shape) != 1:
            print("We expect a single channel audio input for ASRDiarizePipeline so we downmix")
            inputs = np.mean(inputs, axis=0, keepdims=True)

        torch_batch_input = torch.from_numpy(inputs).to(torch.float32)[None]
        return inputs, {"waveform": torch_batch_input, "sample_rate": self.sampling_rate}
    @staticmethod
    def prepare_segments(diarization: Annotation) -> Diarization:
        """
        Diarizer output may contain consecutive segments from the same speaker (e.g. {(0 -> 1, speaker_1), (1 -> 1.5, speaker_1), ...})
        we combine these segments to give overall timestamps for each speaker's turn (e.g. {(0 -> 1.5, speaker_1), ...})
        """
        segments = []
        for segment, track, label in diarization.itertracks(yield_label=True):
            segments.append({"segment": {"start": segment.start, "end": segment.end}, "label": label})

        new_segments = []
        prev_segment = cur_segment = segments[0]
        for i in range(1, len(segments)):
            cur_segment = segments[i]
            if cur_segment["label"] != prev_segment["label"] and i < len(segments):
                new_segments.append(
                    {
                        "segment": {"start": prev_segment["segment"]["start"], "end": cur_segment["segment"]["start"]},
                        "speaker": prev_segment["label"],
                    }
                )
                prev_segment = segments[i]

        new_segments.append(
            {
                "segment": {"start": prev_segment["segment"]["start"], "end": cur_segment["segment"]["end"]},
                "speaker": prev_segment["label"],
            }
        )
        return Diarization.model_validate(new_segments)
    def transcribe_segments(self, diaries: Diarization, inputs: np.ndarray, batch_size: int = 10):
        batch = []
        start_index = 0
        for index, diary in enumerate(diaries.root):
            audio_segment_start = int(self.sampling_rate * diary.segment.start)
            audio_segment_end = int(self.sampling_rate * diary.segment.end)
            segment_audio = inputs[audio_segment_start:audio_segment_end]

            if len(batch) < batch_size:
                batch.append(segment_audio)
                continue

            # The batch is full: transcribe it, then start the next batch with the
            # current segment so that it is not skipped.
            predicted = self.whisper_pipeline(batch)
            for pred, segment in zip(predicted, diaries.root[start_index : start_index + len(batch)]):
                segment.text = pred.get("text", "")
            batch = [segment_audio]
            start_index = index

        if batch:
            predicted = self.whisper_pipeline(batch)
            for pred, segment in zip(predicted, diaries.root[start_index : start_index + len(batch)]):
                segment.text = pred.get("text", "")
        return diaries
    def diarize_and_transcribe(self, audio_file: Path | str):
        """
        Algo:
        1. Diarize => Sections of the audiofile with a "speaker stamp"
        2. For each section: Transcribe
        """
        # apply the pipeline to an audio file
        audio_file = audio_file if isinstance(audio_file, str) else audio_file.as_posix()
        transcription_input, diarization_input = self.preprocess(audio_file)
        diarization = self.diarization_pipeline(diarization_input)

        diary = self.prepare_segments(diarization)
        diary = self.transcribe_segments(diary, transcription_input)
        return diary
diarizer = Diarizer()
# diary = diarizer.diarize_and_transcribe("test_227.wav")
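To try the pipeline on a local audio file before wiring it into the agent, a minimal sketch (the file name is just the example referenced above; output depends entirely on your audio):

diary = diarizer.diarize_and_transcribe("test_227.wav")  # any local audio file works
for entry in diary.root:
    print(f"[{entry.segment.start:6.1f}s - {entry.segment.end:6.1f}s] {entry.speaker}: {entry.text}")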
Encord Authentication¶
Encord uses SSH keys for authentication. The following code cell sets the ENCORD_SSH_KEY environment variable. It must contain the raw content of your private SSH key file.
If you have not set up an SSH key, see our documentation.
💡 In Colab, you can set the key once in the secrets in the left sidebar and load it in new notebooks. IF YOU ARE NOT RUNNING THE CODE IN THE COLAB NOTEBOOK, you must set the environment variable directly.
os.environ["ENCORD_SSH_KEY"] = """paste-private-key-here"""
import os
from google.colab import userdata
os.environ["ENCORD_SSH_KEY"] = userdata.get("ENCORD_SSH_KEY")
Imports and Initialization¶
The Runner is initialized with a project hash, which allows interaction with an Encord project. Ensure that you replace `<project-hash>` with the ID of your Encord Project. The code filters objects from the project's ontology structure based on two criteria:

- The object must be of type Shape.AUDIO (audio object).
- The object's title must contain the word "speaker" (case-insensitive).

Each matching object should also have a nested text classification named "utterance", in which the transcript is stored. The objects that meet the filtering criteria are stored in the speakers list.
from pathlib import Path
from typing import Annotated
from encord.objects.attributes import TextAttribute
from encord.objects.common import Shape
from encord.objects.coordinates import AudioCoordinates
from encord.objects.frames import Range
from encord.objects.ontology_labels_impl import LabelRowV2
from encord_agents.tasks import Depends, Runner
from encord_agents.tasks.dependencies import dep_asset
runner = Runner(project_hash="<project-hash>")
speakers = [
    o for o in runner.project.ontology_structure.objects if o.shape == Shape.AUDIO and "speaker" in o.title.lower()
]
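Optionally, verify up front that every matched speaker object exposes the nested "utterance" text attribute the agent writes to; a minimal sketch reusing the same lookup used further below:

# Optional sanity check: each speaker object should carry an "utterance" text attribute.
for speaker_object in speakers:
    utterance = speaker_object.get_child_by_title("utterance", type_=TextAttribute)
    print(speaker_object.title, "->", utterance)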
Define functions¶
The following function creates the audio transcription annotations on the selected objects.
def annotate_transcription(diaries: Diarization, label_row: LabelRowV2) -> bool:
    speaker_lookup = dict(zip(diaries.speakers, speakers))
    added_any = False
    for diary in diaries.root:
        speaker_clf = speaker_lookup.get(diary.speaker)
        if speaker_clf is None:
            continue
        utterance_attr = speaker_clf.get_child_by_title("utterance", type_=TextAttribute)

        ins = speaker_clf.create_instance()
        ins.set_answer(diary.text, attribute=utterance_attr)
        ins.set_for_frames(coordinates=AudioCoordinates(range=[diary.segment.encord_range]))
        label_row.add_object_instance(ins)
        added_any = True
    return added_any
Define the diarization function for speaker identification and transcription, and call the annotate_transcription
function defined above.
@runner.stage("Diarization")
def do_diarization(label_row: LabelRowV2, asset: Annotated[Path, Depends(dep_asset)]):
diaries = diarizer.diarize_and_transcribe(asset)
if annotate_transcription(diaries, label_row):
label_row.save()
return "high confidence"
else:
return "low confidence"
Run the agent¶
Initialize the runner and set the task_batch_size to 1.
runner(task_batch_size=1)
💡 Hint: If you execute this as a Python script, you can run it as a command-line interface by putting the above code in an agent.py file and replacing runner() with if __name__ == "__main__": runner.run(). This allows you to set, for example, the Project hash from the command line:
python agent.py --project-hash "..."
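For reference, a minimal sketch of how that entry point could be laid out (the placeholder project hash is an assumption; the stage function is the one defined above):

# agent.py -- minimal sketch of the CLI entry point
from encord_agents.tasks import Runner

runner = Runner(project_hash="<project-hash>")

# ... Diarizer setup, annotate_transcription, and the @runner.stage("Diarization") function from above ...

if __name__ == "__main__":
    runner.run()  # exposes command-line options such as --project-hash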
Next steps¶
You can do sentiment analysis on your transcriptions.
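For example, a minimal sketch using the Hugging Face text-classification pipeline (the model choice is an assumption; diary is the Diarization object returned by diarize_and_transcribe above):

from transformers import pipeline

# Hypothetical follow-up: classify the sentiment of each transcribed utterance.
sentiment = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")

for entry in diary.root:
    result = sentiment(entry.text)[0]
    print(f"{entry.speaker}: {result['label']} ({result['score']:.2f}) - {entry.text}")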
What to do in case of installation errors¶
If a locale-related error such as NotImplementedError: A UTF-8 locale is required occurs during installation, try running the code cell below before installing again. It typically happens during the later installs.
import locale


def getpreferredencoding(do_setlocale=True):
    return "UTF-8"


locale.getpreferredencoding = getpreferredencoding