Audio Transcription¶
This notebook walks you through using an AI-powered Agent to transcribe and diarize audio files. Diarization is the practice of splitting an audio file into its individual speakers and transcribing what each of them said; it answers the question: "Who spoke when?". The Agent automatically converts speech to text while distinguishing between the different speakers.
Requirements¶
This notebook guides you through the Workflow template, Ontology and model selection required.
For this notebook, you need:
- A Dataset containing Audio files in Encord.
- A Hugging Face User Access Token to access the Diarization models.
Example Workflow¶
The following workflow illustrates how audio files can be pre-labeled. The code in this notebook is for the Diarization agent.
Workflow Description:¶
Following diarization, predictions with high confidence in the result are passed on for sentiment analysis, whilst those with low confidence are sent to an annotator for manual labeling. This minimises human annotator time and maximises the quality of the final transcripts.
Installation¶
Ensure that you install:
- The `encord-agents` library.
- `pyannote.audio`: A deep learning-based toolkit for speaker diarization, used to identify and differentiate speakers in an audio file.
- `transformers` & `accelerate`: Hugging Face libraries used for running and optimizing transformer models, which improve transcription accuracy and performance.
!python -m pip install -q encord-agents "pyannote.audio"
!python -m pip install --upgrade -q transformers accelerate
Hugging Face Authentication¶
Retrieve the Hugging Face API token for authentication when accessing models and datasets. A Hugging Face (User Access) token is an authentication key used to access Hugging Face's models, datasets, and APIs.
💡 In Colab, you can set the token once in the Secrets section of the left sidebar and load it in new notebooks. If you are NOT running the code in a Colab notebook, you must set the environment variable directly.
HF_TOKEN = "my-hf-token"
from google.colab import userdata
HF_TOKEN = userdata.get("HF_TOKEN")
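Optionally, you can also register the token with the Hugging Face libraries up front via huggingface_hub (installed alongside transformers). This is not strictly required here, since the token is passed explicitly to the diarization pipeline below, but it can make debugging access to gated models easier:

from huggingface_hub import login

# Optional: authenticate all Hugging Face downloads in this session with the token above
login(token=HF_TOKEN)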
Log in to Hugging Face and accept the terms of the two gated pyannote models used below: the pyannote/speaker-diarization-3.1 pipeline and the pyannote segmentation model it depends on. You may additionally need to accept the terms of another gated model if an error is raised below. Otherwise, the models used in the following scripts will not run.
The following cell defines the data structures and imports the libraries needed for deep learning, audio processing, and transcription:
- `Segment`: Represents a time segment in an audio file.
- `Diary`: Stores diarization results, including the speaker and transcribed text.
- `Diarization`: A structured model to hold multiple diarized segments and retrieve unique speakers.

A small usage example follows the class definitions below.
from pathlib import Path
from typing import List, Optional, Union
import numpy as np
import requests
import torch
from encord.objects.frames import Range
from pyannote.audio import Pipeline
from pyannote.core.annotation import Annotation
from pydantic import BaseModel, RootModel
from torchaudio import functional as F
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from transformers.pipelines.audio_utils import ffmpeg_read
class Segment(BaseModel):
start: float
end: float
@property
def encord_range(self) -> Range:
return Range(int(0.5 + self.start * 1000), int(0.5 + self.end * 1000))
class Diary(BaseModel):
segment: Segment
speaker: str
text: str = ""
class Diarization(RootModel):
root: List[Diary]
@property
def speakers(self) -> List[str]:
return sorted(list(set([diary.speaker for diary in self.root])))
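As a quick illustration of how these models fit together (the timestamps and speaker labels below are made up for the example), you can build a `Diarization` directly from plain dictionaries and read off the derived properties:

# Illustrative only: two hypothetical diary entries
example = Diarization.model_validate(
    [
        {"segment": {"start": 0.0, "end": 1.5}, "speaker": "SPEAKER_00", "text": "Hello"},
        {"segment": {"start": 1.5, "end": 3.0}, "speaker": "SPEAKER_01", "text": "Hi there"},
    ]
)
print(example.speakers)  # ['SPEAKER_00', 'SPEAKER_01']
print(example.root[0].segment.encord_range)  # Range from 0 to 1500 (milliseconds)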
The following cell defines the `Diarizer` class for speaker diarization and transcription.
- Initialization (`__init__`): Loads pretrained models for diarization and transcription, setting up the processing pipelines.
- Preprocessing (`preprocess`): Converts audio files into a format suitable for diarization and transcription.
- Segment Merging (`prepare_segments`): Combines consecutive segments from the same speaker into a single segment (see the small example after the class definition).
- Transcription (`transcribe_segments`): Uses Whisper to transcribe diarized segments in batches.
- Full Pipeline (`diarize_and_transcribe`): Runs diarization and transcription on an audio file, returning structured speaker-labeled transcripts.
class Diarizer:
def __init__(
self,
diarizer_model: str = "pyannote/speaker-diarization-3.1",
transcription_model: str = "openai/whisper-medium",
):
self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Diarization
self.diarization_pipeline = Pipeline.from_pretrained(diarizer_model, use_auth_token=HF_TOKEN).to(self.device)
# Transcription
processor = AutoProcessor.from_pretrained(transcription_model)
self.sampling_rate = processor.feature_extractor.sampling_rate
self.whisper_pipeline = pipeline(
"automatic-speech-recognition",
model=transcription_model,
chunk_length_s=30,
)
def preprocess(self, inputs):
with open(inputs, "rb") as f:
inputs = f.read()
if isinstance(inputs, bytes):
inputs = ffmpeg_read(inputs, self.sampling_rate).copy()
        if len(inputs.shape) != 1:
            # Downmix multi-channel audio to mono so both pipelines receive a single channel
            print("We expect single-channel audio input, so we downmix")
            inputs = np.mean(inputs, axis=0)
torch_batch_input = torch.from_numpy(inputs).to(torch.float32)[None]
return inputs, {"waveform": torch_batch_input, "sample_rate": self.sampling_rate}
@staticmethod
def prepare_segments(diarization: Annotation) -> Diarization:
"""
Diarizer output may contain consecutive segments from the same speaker (e.g. {(0 -> 1, speaker_1), (1 -> 1.5, speaker_1), ...})
we combine these segments to give overall timestamps for each speaker's turn (e.g. {(0 -> 1.5, speaker_1), ...})
"""
segments = []
for segment, track, label in diarization.itertracks(yield_label=True):
segments.append({"segment": {"start": segment.start, "end": segment.end}, "label": label})
new_segments = []
prev_segment = cur_segment = segments[0]
for i in range(1, len(segments)):
cur_segment = segments[i]
if cur_segment["label"] != prev_segment["label"] and i < len(segments):
new_segments.append(
{
"segment": {"start": prev_segment["segment"]["start"], "end": cur_segment["segment"]["start"]},
"speaker": prev_segment["label"],
}
)
prev_segment = segments[i]
new_segments.append(
{
"segment": {"start": prev_segment["segment"]["start"], "end": cur_segment["segment"]["end"]},
"speaker": prev_segment["label"],
}
)
return Diarization.model_validate(new_segments)
def transcribe_segments(self, diaries: Diarization, inputs: np.ndarray, batch_size: int = 10):
batch = []
start_index = 0
for index, diary in enumerate(diaries.root):
audio_segment_start = int(self.sampling_rate * diary.segment.start)
audio_segment_end = int(self.sampling_rate * diary.segment.end)
            # Always add the current segment to the batch before deciding whether to flush it;
            # otherwise every (batch_size + 1)-th segment would be skipped and left untranscribed.
            batch.append(inputs[audio_segment_start:audio_segment_end])
            if len(batch) < batch_size:
                continue
            predicted = self.whisper_pipeline(batch)
            for pred, segment in zip(predicted, diaries.root[start_index : start_index + len(batch)]):
                segment.text = pred.get("text", "")
            batch = []
            start_index = index + 1
if batch:
predicted = self.whisper_pipeline(batch)
for pred, segment in zip(predicted, diaries.root[start_index : start_index + len(batch)]):
segment.text = pred.get("text", "")
return diaries
def diarize_and_transcribe(self, audio_file: Path | str):
"""
Algo:
1. Diarize => Sections of the audiofile with a "speaker stamp"
2. For each section: Transcribe
"""
# apply the pipeline to an audio file
audio_file = audio_file if isinstance(audio_file, str) else audio_file.as_posix()
transcription_input, diarization_input = self.preprocess(audio_file)
diarization = self.diarization_pipeline(diarization_input)
diary = self.prepare_segments(diarization)
diary = self.transcribe_segments(diary, transcription_input)
return diary
diarizer = Diarizer()
# diary = diarizer.diarize_and_transcribe("test_227.wav")
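To make the segment-merging behaviour of `prepare_segments` concrete, here is a small, self-contained illustration with made-up timestamps (it does not require the models or an audio file):

from pyannote.core import Segment as PyannoteSegment

# Build a toy diarization result: two consecutive turns from SPEAKER_00 followed by one from SPEAKER_01
toy = Annotation()
toy[PyannoteSegment(0.0, 1.0)] = "SPEAKER_00"
toy[PyannoteSegment(1.0, 1.5)] = "SPEAKER_00"
toy[PyannoteSegment(1.5, 2.2)] = "SPEAKER_01"

# The two SPEAKER_00 turns are merged into a single 0.0 -> 1.5 s segment
print(Diarizer.prepare_segments(toy).model_dump())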
Encord Authentication¶
Encord uses ssh-keys for authentication. The following code cell sets the `ENCORD_SSH_KEY` environment variable, which must contain the raw content of your private ssh key file.
If you have not set up an ssh key, see our documentation.
💡 In Colab, you can set the key once in the Secrets section of the left sidebar and load it in new notebooks. If you are NOT running the code in a Colab notebook, you must set the environment variable directly.
os.environ["ENCORD_SSH_KEY"] = """paste-private-key-here"""
import os
from google.colab import userdata
os.environ["ENCORD_SSH_KEY"] = userdata.get("ENCORD_SSH_KEY")
Imports, Initialization, and Variable Names¶
The Runner is initialized with a project hash, which allows interaction with an Encord project. Ensure that you replace `<your-project-hash>` with the ID of your Encord Project. The code filters objects from the project's Ontology structure based on the following criteria:
- The object must be of type `Shape.AUDIO` (audio object).
- The object's title must contain the word "speaker" (`SPEAKER_INDICATOR`), case-insensitively.
- The object should have a nested text attribute named "utterance #transcript" (`UTTERANCE_NAME`), in which the transcript is stored.

The objects that meet the filtering criteria are stored in the `speakers` list.
⚠️ NOTE: Change these variables for your project!
UTTERANCE_NAME = "utterance #transcript"
PROJECT_HASH = "<your-project-hash>"
HIGH_CONFIDENCE_PATHWAY_NAME = "high confidence"
LOW_CONFIDENCE_PATHWAY_NAME = "low confidence"
SPEAKER_INDICATOR = "speaker"
DIARIZATION_WORKFLOW_STAGE_NAME = "Diarization"
Pre-execution Validation¶
To ensure that your project is in an appropriate form to run the agent, we perform pre-execution checks that verify the relevant Workflow stages, pathways, and Ontology objects are in place.
from pathlib import Path
from typing import Annotated
from encord.objects import Object, Shape
from encord.objects.attributes import TextAttribute
from encord.objects.coordinates import AudioCoordinates
from encord.objects.ontology_labels_impl import LabelRowV2
from encord.workflow.stages.agent import AgentStage
from encord_agents.tasks import Depends, Runner
from encord_agents.tasks.dependencies import dep_asset
def pre_execution_validation(runner: Runner) -> None:
project = runner.project
assert runner.project
diarization_stage = project.workflow.get_stage(name=DIARIZATION_WORKFLOW_STAGE_NAME, type_=AgentStage)
assert diarization_stage.pathways
assert {HIGH_CONFIDENCE_PATHWAY_NAME, LOW_CONFIDENCE_PATHWAY_NAME}.issubset(
{pathway.name for pathway in diarization_stage.pathways}
)
assert any(object.shape == Shape.AUDIO for object in project.ontology_structure.objects)
    # Use the same case-insensitive "speaker" filter that builds the `speakers` list below
    audio_objects = [
        object
        for object in project.ontology_structure.objects
        if object.shape == Shape.AUDIO and SPEAKER_INDICATOR in object.title.lower()
    ]
if len(audio_objects) > 1:
print(f"There are multiple Audio {SPEAKER_INDICATOR=} objects")
def is_acceptable_audio(object: Object) -> bool:
try:
object.get_child_by_title(UTTERANCE_NAME, type_=TextAttribute)
return True
except Exception:
return False
assert any(is_acceptable_audio(audio_obj) for audio_obj in audio_objects)
runner = Runner(project_hash=PROJECT_HASH, pre_execution_callback=pre_execution_validation)
speakers = [
o
for o in runner.project.ontology_structure.objects
if o.shape == Shape.AUDIO and SPEAKER_INDICATOR in o.title.lower()
]
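As an optional sanity check (purely illustrative), you can confirm which speaker objects were picked up and that each exposes the expected transcript attribute:

# Optional check: list the matched speaker objects and their transcript text attributes
for speaker_object in speakers:
    print(speaker_object.title, speaker_object.get_child_by_title(UTTERANCE_NAME, type_=TextAttribute))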
Define Functions¶
The following function creates the audio transcription annotations on the selected objects.
def annotate_transcription(diaries: Diarization, label_row: LabelRowV2) -> bool:
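    # Pair the diarized speaker labels (sorted: SPEAKER_00, SPEAKER_01, ...) with the Ontology speaker objects in order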
speaker_lookup = dict(zip(diaries.speakers, speakers))
added_any = False
for diary in diaries.root:
speaker_clf = speaker_lookup.get(diary.speaker)
if speaker_clf is None:
continue
utterance_attr = speaker_clf.get_child_by_title(UTTERANCE_NAME, type_=TextAttribute)
ins = speaker_clf.create_instance()
ins.set_answer(diary.text, attribute=utterance_attr)
ins.set_for_frames(coordinates=AudioCoordinates(range=[diary.segment.encord_range]))
label_row.add_object_instance(ins)
added_any = True
return added_any
Define the diarization function for speaker identification and transcription, and call the `annotate_transcription` function defined above.
@runner.stage(DIARIZATION_WORKFLOW_STAGE_NAME)
def do_diarization(label_row: LabelRowV2, asset: Annotated[Path, Depends(dep_asset)]):
diaries = diarizer.diarize_and_transcribe(asset)
if annotate_transcription(diaries, label_row):
label_row.save()
return HIGH_CONFIDENCE_PATHWAY_NAME
else:
return LOW_CONFIDENCE_PATHWAY_NAME
Run the Agent¶
Call the runner with `task_batch_size` set to 1. We encourage you to first try it out with `max_tasks_per_stage=1` to check that the Agent is working appropriately for your use case.
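For example, an initial trial run could look like this (the same call as at the end of this section, but capped at one task per stage):

# Trial run: process at most one task in the Diarization stage to verify the agent end to end
runner(task_batch_size=1, max_tasks_per_stage=1)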
💡 Hint: If you want to execute this as a Python script, you can run it as a command-line interface by putting the above code in an `agent.py` file and replacing `runner(...)` with `if __name__ == "__main__": runner.run()`. This allows you to set, for example, the Project hash from the command line:
python agent.py --project-hash "..."
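A minimal sketch of what the end of such an `agent.py` file could look like (everything above it stays the same):

# agent.py (end of file) — expose the runner as a CLI instead of calling runner(...) directly
if __name__ == "__main__":
    runner.run()

Otherwise, run the agent directly from the notebook: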
runner(task_batch_size=1, max_tasks_per_stage=None)
Outcome¶
You have created an AI-powered agent to transcribe and diarize audio files. The agent converts speech to text while distinguishing between different speakers, providing a clear, structured transcript that answers the question: "Who spoke when?" This automated process simplifies conversation analysis and enhances understanding of recorded audio.
Next Steps¶
You can now perform sentiment analysis on your transcriptions.
What to do in Case of Installation Errors¶
If an error related to the preferred encoding (a UTF-8 locale being required) occurs during installation, try running the code cell below and then run the installation again. It typically happens during the later installs.
import locale
def getpreferredencoding(do_setlocale=True):
return "UTF-8"
locale.getpreferredencoding = getpreferredencoding