Audio Sentiment Classifier¶
This notebook walks you through building an AI-powered agent that classifies the sentiment of single-speaker audio clips in an Encord Workflow. Sentiment refers to the emotion conveyed by the audio clip.
The agent in this example classifies sentiment into eight emotional categories: sad, happy, angry, neutral, disgust, fearful, surprised, and calm.
Requirements¶
For this notebook you will need:
- A Dataset containing seekable audio files in Encord
- A Hugging Face user access token.
Example Workflow¶
The following Workflow shows how audio files can be pre-labelled with a sentiment classification by the agent. After the sentiment classification is applied, each label is sent to a review stage. An example use case of this Workflow is checking the accuracy of a model before using it in a wider project.
Installation¶
Ensure the following libraries are installed:
- encord-agents
- transformers (a Hugging Face library)
- librosa
- torch
- numpy
!pip install encord-agents
!pip install transformers
!pip install librosa
!pip install torch
!pip install numpy
Hugging Face Authentication¶
Retrieve your Hugging Face user access token. This token authenticates you when accessing models on the Hugging Face Hub.
💡 If you are running in Colab, store the token as a secret named HF_TOKEN (the Secrets menu is in the left sidebar). If you are not running in a Colab notebook, set the environment variable directly with:
os.environ["HF_TOKEN"] = """paste-user-access-token-here"""
from google.colab import userdata
HF_TOKEN = userdata.get("HF_TOKEN")
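The model used below is publicly available on the Hub, so an explicit login is usually not required. If you do need authenticated access (for example, for a gated model), a minimal sketch using the huggingface_hub client that ships alongside transformers:
from huggingface_hub import login
# Log this session in to the Hugging Face Hub with the token retrieved above.
login(token=HF_TOKEN)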
Speech-Emotion Model Setup¶
The following code does two things:
- Imports the libraries needed for audio processing.
- Loads the pre-trained model, feature extractor, and label mapping that the agent uses to classify audio clips.
import librosa
import numpy as np
import torch
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification
model_id = "firdhokk/speech-emotion-recognition-with-openai-whisper-large-v3"
model = AutoModelForAudioClassification.from_pretrained(model_id)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id, do_normalize=True)
id2label = model.config.id2label
def preprocess_audio(audio_path, feature_extractor, max_duration=30.0):
    # Load the audio at the sampling rate the feature extractor expects
    audio_array, sampling_rate = librosa.load(audio_path, sr=feature_extractor.sampling_rate)
    # Trim or zero-pad the clip to exactly `max_duration` seconds
    max_length = int(feature_extractor.sampling_rate * max_duration)
    if len(audio_array) > max_length:
        audio_array = audio_array[:max_length]
    else:
        audio_array = np.pad(audio_array, (0, max_length - len(audio_array)))
    # Convert the waveform into the model's input features as PyTorch tensors
    inputs = feature_extractor(
        audio_array,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=max_length,
        truncation=True,
        return_tensors="pt",
    )
    return inputs
def predict_emotion(audio_path, model, feature_extractor, id2label, max_duration=30.0):
    inputs = preprocess_audio(audio_path, feature_extractor, max_duration)
    # Run on GPU when available, otherwise fall back to CPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    inputs = {key: value.to(device) for key, value in inputs.items()}
    # Forward pass without gradient tracking, then take the highest-scoring class
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    predicted_id = torch.argmax(logits, dim=-1).item()
    predicted_label = id2label[predicted_id]
    return predicted_label
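Optionally, you can sanity-check the model on a local clip before wiring it into an agent. The file path below is a hypothetical placeholder; swap in any short, single-speaker recording you have on disk:
sample_path = "example_clip.wav"  # hypothetical placeholder path
print(predict_emotion(sample_path, model, feature_extractor, id2label))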
Encord Authentication¶
The Encord platform uses SSH keys for authentication. To use it with agents, the ENCORD_SSH_KEY environment variable must be set to the raw content of your private key file.
If you do not have an SSH key, this documentation guides you through creating one.
If you are running in Colab, create a user secret called SSH_KEY and set it to your Encord SSH key.
If you are not executing this in Colab, set the environment variable directly with:
os.environ['ENCORD_SSH_KEY'] = """paste-ssh-key-here"""
import os
os.environ["ENCORD_SSH_KEY"] = userdata.get("SSH_KEY")
Imports & Variable Names¶
The following script imports the necessary dependencies from the encord and encord-agents libraries so the agent can interact with the Encord API.
from pathlib import Path
from typing import Annotated
from encord.objects.classification import Classification
from encord.objects.ontology_labels_impl import LabelRowV2
from encord.project import Project
from encord_agents.core.dependencies.models import Depends
from encord_agents.tasks import Runner
from encord_agents.tasks.dependencies import dep_asset
The following variables need to be set:
- PROJECT_HASH: The ID of your Project. This is how the runner connects to your Project later on.
- STAGE_ID: The ID of the agent stage in your Workflow. This is the stage where the function we create is run.
- SENTIMENT_CLASSIFIED_PATHWAY_NAME: The name of the pathway the task is routed along once classified. This does not need to be changed if you followed the Workflow example above.
PROJECT_HASH = "<YOUR_PROJECT_HASH>"
STAGE_ID = "<YOUR_STAGE_ID>"
SENTIMENT_CLASSIFIED_PATHWAY_NAME = "Sentiment Classified"
We use classifications rather than objects because we are labelling the whole clip, not sections of it.
Defining the Runner¶
The runner object is what we use to link our functions to our project.
runner = Runner(project_hash=PROJECT_HASH)
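As a quick sanity check, you can confirm the Runner is connected to the intended Project; runner.project exposes the underlying Project object used further below:
print(runner.project.title)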
Defining the Function¶
The function is what runs at the agent stage in our Workflow.
The decorator @runner.stage() registers classify_by_sentiment to run at that agent stage, and the runner injects the lr (label row) and asset (a local path to the audio file) parameters for us to use.
The function is explained in greater depth by the comments in the script.
# Common variables used to create Audio classifications in Encord
radio_ontology_classification = runner.project.ontology_structure.get_child_by_title(
title="emotion", type_=Classification
)
# Collects the attribute holding the classification's radio options, then builds a dictionary
# that maps each option label to an Option object for the API to accept.
attr = radio_ontology_classification.attributes[0]
dict_lbl_to_opt = {option.label: option for option in attr.options}
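# Optional sanity check: the option labels in the "emotion" classification should cover
# every label the model can emit (sad, happy, angry, neutral, disgust, fearful, surprised, calm).
print(sorted(dict_lbl_to_opt))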
@runner.stage(STAGE_ID, overwrite=True)
def classify_by_sentiment(lr: LabelRowV2, asset: Annotated[Path, Depends(dep_asset)]):
    # Use the model to predict the emotion of the downloaded audio file
    predicted_emotion = predict_emotion(asset, model, feature_extractor, id2label)
    # Prepares an instance used to add a classification to the audio.
    # As we're classifying audio, we need the range_only=True argument
    classification_instance = radio_ontology_classification.create_instance(range_only=True)
    # Determines which classification option is added to the asset,
    # e.g. set_answer(dict_lbl_to_opt["sad"], attr) - passing `attr` tells the API
    # to focus on top-level classifications only
    match predicted_emotion:
        case "sad":
            classification_instance.set_answer(dict_lbl_to_opt["sad"], attr)
        case "happy":
            classification_instance.set_answer(dict_lbl_to_opt["happy"], attr)
        case "angry":
            classification_instance.set_answer(dict_lbl_to_opt["angry"], attr)
        case "neutral":
            classification_instance.set_answer(dict_lbl_to_opt["neutral"], attr)
        case "disgust":
            classification_instance.set_answer(dict_lbl_to_opt["disgust"], attr)
        case "fearful":
            classification_instance.set_answer(dict_lbl_to_opt["fearful"], attr)
        case "surprised":
            classification_instance.set_answer(dict_lbl_to_opt["surprised"], attr)
        case "calm":
            classification_instance.set_answer(dict_lbl_to_opt["calm"], attr)
    # Indicates that the whole audio should be classified, and old classifications should be overwritten
    classification_instance.set_for_frames()
    lr.add_classification_instance(classification_instance)
    lr.save()
    return SENTIMENT_CLASSIFIED_PATHWAY_NAME  # Sentiment Classified pathway
Running the agent¶
To run the agent in our Workflow, we can call the runner object.
runner()
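By default, calling the runner processes the tasks currently queued at the agent stage and then returns. If you want it to keep polling for new tasks, recent versions of encord-agents expose a refresh interval; the parameter name below is an assumption, so verify it against the Runner signature in your installed version:
# Assumed parameter name - verify against your installed encord-agents version.
runner(refresh_every=3600)  # re-check the agent stage for new tasks every hour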
Outcome¶
This AI-powered agent classifies the sentiment of short, single-speaker audio clips. The Encord Workflow in this example shows how the model's accuracy can be evaluated through human review of the labels it generates.