Audio Sentiment Classifier¶
This notebook walks you through building an AI-powered agent that classifies the sentiment of single-speaker audio clips in an Encord Workflow. Sentiment refers to the emotion conveyed by the audio clip.
The agent in this example classifies sentiment into eight emotional categories: sad, happy, angry, neutral, disgust, fearful, surprised, and calm.
Requirements¶
For this notebook you will need:
- A Dataset containing seekable audio files in Encord
- A Hugging Face user access token.
Example Workflow¶
The following Workflow shows how audio files can be pre-labelled with a sentiment classification by the agent. After the sentiment classification is applied, each label is sent to a review stage. An example use case of this Workflow is checking the accuracy of a model before using it in a wider project.
Installation¶
Ensure the following libraries are installed:
- encord-agents
- transformers (a Hugging Face library)
- librosa
- torch
- numpy
!pip install encord-agents
!pip install transformers
!pip install librosa
!pip install torch
!pip install numpy
Hugging Face Authentication¶
Retrieve your Hugging Face user access token. This token authenticates you when accessing models on the Hugging Face Hub.
💡 If you are running in Colab, store the token as a secret named HF_TOKEN (the Secrets menu is in the left sidebar). If you are not running in a Colab notebook, set the environment variable directly with:
os.environ["HF_TOKEN"] = """paste-user-access-token-here"""
from google.colab import userdata
HF_TOKEN = userdata.get("HF_TOKEN")
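The model used below is publicly available on the Hub, so an explicit login is usually not required. If you do need authenticated access (for example, for a gated model), a minimal sketch using the huggingface_hub client that ships alongside transformers:
from huggingface_hub import login
# Log this session in to the Hugging Face Hub with the token retrieved above.
login(token=HF_TOKEN)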
Speech-Emotion Model Setup¶
The following code does two things:
- Imports the libraries needed for audio processing.
- Loads the pre-trained model, feature extractor, and label mapping that the agent uses to classify audio clips.
import librosa
import numpy as np
import torch
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification
model_id = "firdhokk/speech-emotion-recognition-with-openai-whisper-large-v3"
model = AutoModelForAudioClassification.from_pretrained(model_id)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id, do_normalize=True)
id2label = model.config.id2label
def preprocess_audio(audio_path, feature_extractor, max_duration=30.0):
    # Load the audio at the sampling rate the feature extractor expects
    audio_array, sampling_rate = librosa.load(audio_path, sr=feature_extractor.sampling_rate)
    # Trim or zero-pad the clip to exactly `max_duration` seconds
    max_length = int(feature_extractor.sampling_rate * max_duration)
    if len(audio_array) > max_length:
        audio_array = audio_array[:max_length]
    else:
        audio_array = np.pad(audio_array, (0, max_length - len(audio_array)))
    # Convert the waveform into the model's input features as PyTorch tensors
    inputs = feature_extractor(
        audio_array,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=max_length,
        truncation=True,
        return_tensors="pt",
    )
    return inputs
def predict_emotion(audio_path, model, feature_extractor, id2label, max_duration=30.0):
    inputs = preprocess_audio(audio_path, feature_extractor, max_duration)
    # Run on GPU when available, otherwise fall back to CPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    inputs = {key: value.to(device) for key, value in inputs.items()}
    # Forward pass without gradient tracking, then take the highest-scoring class
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    predicted_id = torch.argmax(logits, dim=-1).item()
    predicted_label = id2label[predicted_id]
    return predicted_label
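Optionally, you can sanity-check the model on a local clip before wiring it into an agent. The file path below is a hypothetical placeholder; swap in any short, single-speaker recording you have on disk:
sample_path = "example_clip.wav"  # hypothetical placeholder path
print(predict_emotion(sample_path, model, feature_extractor, id2label))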
Encord Authentication¶
The Encord platform uses SSH keys for authentication. To use it with agents, the ENCORD_SSH_KEY environment variable must be set to the raw content of your private key file.
If you do not have an SSH key, this documentation guides you through creating one.
If you are running in Colab, create a user secret called SSH_KEY and set it to your Encord SSH key.
If you are not executing this in Colab, set the environment variable directly with:
os.environ['ENCORD_SSH_KEY'] = """paste-ssh-key-here"""
import os
os.environ["ENCORD_SSH_KEY"] = userdata.get("SSH_KEY")
Imports & Variable Names¶
The following script imports the necessary dependencies from the encord and encord-agents libraries so the agent can interact with the Encord API.
from pathlib import Path
from typing import Annotated
from encord.objects.classification import Classification
from encord.objects.ontology_labels_impl import LabelRowV2
from encord.project import Project
from encord_agents.core.dependencies.models import Depends
from encord_agents.tasks import Runner
from encord_agents.tasks.dependencies import dep_asset
The following variables need to be set:
- PROJECT_HASH: The ID of your Project. This is how the runner connects to your Project later on.
- STAGE_ID: The ID of the agent stage in your Workflow. This is the stage where the function we create is run.
- SENTIMENT_CLASSIFIED_PATHWAY_NAME: The name of the pathway the task is routed along once classified. This does not need to be changed if you followed the Workflow example above.
PROJECT_HASH = "<YOUR_PROJECT_HASH>"
STAGE_ID = "<YOUR_STAGE_ID>"
SENTIMENT_CLASSIFIED_PATHWAY_NAME = "Sentiment Classified"
We use classifications rather than objects because we are labelling the whole clip, not sections of it.
Defining the Runner¶
The runner object is what we use to link our functions to our project.
runner = Runner(project_hash=PROJECT_HASH)
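As a quick sanity check, you can confirm the Runner is connected to the intended Project; runner.project exposes the underlying Project object used further below:
print(runner.project.title)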
Defining the Function¶
The function is what runs at the agent stage in our Workflow.
The decorator @runner.stage() registers classify_by_sentiment to run at that agent stage, and the runner injects the lr (label row) and asset (a local path to the audio file) parameters for us to use.
The function is explained in greater depth by the comments in the script.
# Common variables used to create Audio classifications in Encord
radio_ontology_classification = runner.project.ontology_structure.get_child_by_title(
title="emotion", type_=Classification
)
# Collects the attribute holding the classification's radio options, then builds a dictionary
# that maps each option label to an Option object for the API to accept.
attr = radio_ontology_classification.attributes[0]
dict_lbl_to_opt = {option.label: option for option in attr.options}
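# Optional sanity check: the option labels in the "emotion" classification should cover
# every label the model can emit (sad, happy, angry, neutral, disgust, fearful, surprised, calm).
print(sorted(dict_lbl_to_opt))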
@runner.stage(STAGE_ID, overwrite=True)
def classify_by_sentiment(lr: LabelRowV2, asset: Annotated[Path, Depends(dep_asset)]):
    # Use the model to predict the emotion of the downloaded audio file
    predicted_emotion = predict_emotion(asset, model, feature_extractor, id2label)
    # Prepares an instance used to add a classification to the audio.
    # As we're classifying audio, we need the range_only=True argument
    classification_instance = radio_ontology_classification.create_instance(range_only=True)
    # Determines which classification option is added to the asset,
    # e.g. set_answer(dict_lbl_to_opt["sad"], attr) - passing `attr` tells the API
    # to focus on top-level classifications only
    match predicted_emotion:
        case "sad":
            classification_instance.set_answer(dict_lbl_to_opt["sad"], attr)
        case "happy":
            classification_instance.set_answer(dict_lbl_to_opt["happy"], attr)
        case "angry":
            classification_instance.set_answer(dict_lbl_to_opt["angry"], attr)
        case "neutral":
            classification_instance.set_answer(dict_lbl_to_opt["neutral"], attr)
        case "disgust":
            classification_instance.set_answer(dict_lbl_to_opt["disgust"], attr)
        case "fearful":
            classification_instance.set_answer(dict_lbl_to_opt["fearful"], attr)
        case "surprised":
            classification_instance.set_answer(dict_lbl_to_opt["surprised"], attr)
        case "calm":
            classification_instance.set_answer(dict_lbl_to_opt["calm"], attr)
    # Indicates that the whole audio should be classified, and old classifications should be overwritten
    classification_instance.set_for_frames()
    lr.add_classification_instance(classification_instance)
    lr.save()
    return SENTIMENT_CLASSIFIED_PATHWAY_NAME  # Sentiment Classified pathway
Running the agent¶
To run the agent in our Workflow, we can call the runner object.
runner()
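By default, calling the runner processes the tasks currently queued at the agent stage and then returns. If you want it to keep polling for new tasks, recent versions of encord-agents expose a refresh interval; the parameter name below is an assumption, so verify it against the Runner signature in your installed version:
# Assumed parameter name - verify against your installed encord-agents version.
runner(refresh_every=3600)  # re-check the agent stage for new tasks every hour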
Outcome¶
This AI-powered agent classifies the sentiment of short, single-speaker audio clips. The Encord Workflow in this example shows how the model's accuracy can be evaluated through human review of the labels it generates.