A Practical Approach to Analyzing Emotions Over Time in Speech
With the upcoming release of advanced voice methods by OpenAI, we’d like to explore what open-source technology is currently available for speech emotion recognition. Specifically, we will use the SpeechBrain toolkit to recognize emotion in speech in real time. Although the toolkit is extensive, the open-source landscape still needs further development before it is ready for serious emotion recognition applications.
Nonetheless, to make this practical, we will use a short speech by Jerome Powell about interest rates. We will detect and visualize emotions over time. For those interested in trying this technology, an end-to-end implementation in Python will be provided. As you’ll notice, SpeechBrain offers a wide array of speech analysis technologies worth exploring further.
1. SpeechBrain
SpeechBrain is an open-source toolkit designed for speech processing, built on the PyTorch framework. It aims to facilitate the development and research of various speech-related applications, such as speech recognition, speaker recognition, speech enhancement, and multi-microphone signal processing.
Here are some key features and uses:
Speech Recognition: Converting spoken language into text.
Speaker Recognition: Recognizing and identifying individual speakers.
Speech Separation: Separating individual voices from mixed audio.
Speech Enhancement: Improving the quality and intelligibility of speech signals.
Interpretability: Making audio models more understandable.
Speech Generation: Creating new speech from data.
Text-to-Speech: Converting text into spoken language.
Vocoding: Synthesizing high-quality audio from speech features.
Spoken Language Understanding: Extracting meaning from spoken language.
Speech-to-Speech Translation: Translating spoken language from one language to another.
Speech Translation: Translating spoken language into text in another language.
Emotion Classification: Detecting emotions from speech.
Language Identification: Recognizing the language of spoken audio.
Voice Activity Detection: Detecting the presence of speech in audio.
Sound Classification: Classifying different types of sounds.
Self-Supervised Learning: Training models without labeled data.
Metric Learning: Learning meaningful representations of data.
Alignment: Aligning speech with text.
Diarization: Segmenting audio by speaker (who spoke when).
2. Python Implementation
We will implement SpeechBrain emotion recognition by first downloading a YouTube video and converting it to MP3. We then split the MP3 into chunks and classify the emotion of each chunk using a pre-trained wav2vec2 (base) model hosted on Hugging Face. Finally, we visualize the emotions over time in an output that combines the speech waveform with the per-chunk predictions.
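The per-chunk classification step can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not necessarily the implementation used here: the model name `speechbrain/emotion-recognition-wav2vec2-IEMOCAP` is an assumed choice of wav2vec2 (base) emotion model, and the helpers `load_emotion_classifier` and `classify_chunks` are hypothetical names. The classifier is passed in as a plain function so that the loop can be tried without downloading a model.

```python
from typing import Callable, List, Tuple


def load_emotion_classifier():
    """Load a pre-trained SpeechBrain emotion classifier.

    Assumption: the wav2vec2 (base) model fine-tuned on IEMOCAP that
    SpeechBrain publishes on Hugging Face; another model may be intended.
    """
    # Imported lazily so the rest of this sketch runs without SpeechBrain.
    from speechbrain.inference.interfaces import foreign_class

    return foreign_class(
        source="speechbrain/emotion-recognition-wav2vec2-IEMOCAP",
        pymodule_file="custom_interface.py",
        classname="CustomEncoderWav2vec2Classifier",
    )


def classify_chunks(
    chunk_paths: List[str],
    chunk_seconds: float,
    classify_file: Callable[[str], str],
) -> List[Tuple[float, str]]:
    """Return (start_time_in_seconds, label) for each chunk, in order."""
    timeline = []
    for i, path in enumerate(chunk_paths):
        # Chunk i starts at i * chunk_seconds because all chunks have equal length.
        timeline.append((i * chunk_seconds, classify_file(path)))
    return timeline
```

With the real model, `classify_file` could wrap the classifier's own `classify_file` method. On the model card that method returns a tuple whose last element holds the text label, so a wrapper such as `lambda p: classifier.classify_file(p)[3][0]` is one option, assuming that return shape.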
2.1 Setting Up the Environment
First, we install the necessary libraries: SpeechBrain for emotion recognition, PyAudio and MoviePy for audio and video processing, Pydub for audio manipulation, and yt-dlp for downloading YouTube videos. The ffmpeg system package is installed as well, since Pydub and yt-dlp rely on it to decode and convert media.
!pip install speechbrain
!pip install pyaudio
!pip install "moviepy<2"  # the moviepy.editor module imported below was removed in MoviePy 2.0
!pip install pydub
!pip install pytube
!apt-get install -y ffmpeg
!pip install yt-dlp
import torch  # PyTorch library for tensor computations and deep learning
import torchaudio  # PyTorch audio library for processing audio data
from speechbrain.inference.interfaces import foreign_class  # Loads the emotion recognition model through its custom interface
from IPython.display import Audio, HTML, Video, display  # Display audio, HTML, and video content in notebooks
import matplotlib.pyplot as plt  # Matplotlib for creating plots and visualizations
import numpy as np  # NumPy for numerical operations on arrays
import yt_dlp  # yt-dlp (a fork of youtube-dl) for downloading videos from YouTube
from moviepy.editor import VideoFileClip, concatenate_videoclips, CompositeVideoClip, clips_array  # MoviePy 1.x for video editing and compositing
import moviepy.editor as mpy  # Alias for MoviePy to use its functions
from matplotlib.animation import FuncAnimation  # Matplotlib's animation module for creating animations
import os  # OS module for interacting with the operating system, such as file operations
import shutil  # Shutil module for high-level file operations, such as copying and removing files
from pydub import AudioSegment  # Pydub for audio file manipulation, such as converting formats and segmenting
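Because Pydub and yt-dlp call out to ffmpeg, it can help to confirm that the command-line tools are available before going further. The following is a small sketch using only the standard library; the helper name `check_tools` is hypothetical.

```python
import shutil
from typing import Dict, Iterable


def check_tools(tools: Iterable[str] = ("ffmpeg", "ffprobe")) -> Dict[str, bool]:
    """Map each command-line tool name to whether it is found on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}


if __name__ == "__main__":
    for name, found in check_tools().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If either tool is reported as missing, the conversion to MP3 later on is likely to fail, so reinstalling ffmpeg first is the safer route.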
2.2 Download YouTube Video and Convert to MP3
Next, we download a YouTube video of Jerome Powell’s speech about interest rates:
# Define options for yt-dlp to download the video
video_opts = {
    'format': 'bestvideo+bestaudio/best',  # Best video and audio streams, merged (merging needs ffmpeg)
    'outtmpl': 'video.%(ext)s',  # Output name; the extension depends on the formats selected
}
# URL of the YouTube video
url = 'https://www.youtube.com/watch?v=_YgP6ocp_qs&ab_channel=CNBCTelevision'
# Download the video
with yt_dlp.YoutubeDL(video_opts) as ydl:
    ydl.download([url])
# Play the video (assumes the download was saved as video.webm)
display(Video('video.webm'))
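The output template `video.%(ext)s` leaves the file extension up to yt-dlp, so the download is not guaranteed to be named `video.webm`. One way to avoid hard-coding the name is to look the file up after the download. This is a sketch using only the standard library; `find_downloaded_video` is a hypothetical helper, not part of yt-dlp.

```python
import glob
import os
from typing import Optional


def find_downloaded_video(stem: str = "video", folder: str = ".") -> Optional[str]:
    """Return the most recently modified file named '<stem>.<ext>', or None."""
    candidates = [
        p for p in glob.glob(os.path.join(folder, stem + ".*"))
        if os.path.isfile(p) and not p.endswith(".part")  # skip partial downloads
    ]
    return max(candidates, key=os.path.getmtime) if candidates else None
```

The returned path can then be passed to `Video(...)` and `AudioSegment.from_file(...)` in place of the literal `'video.webm'`.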
We can then convert the video to MP3 so that its audio track can be analyzed by the pre-trained emotion recognition model.
# Load the audio track of the downloaded video
video = AudioSegment.from_file('video.webm')
# Export as MP3
video.export('audio.mp3', format='mp3')
# Display an audio player for the MP3
display(Audio('audio.mp3'))
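Once the MP3 exists, splitting it into chunks might look like the sketch below. The 5-second chunk length, the 16 kHz mono WAV output (the sampling rate wav2vec2 base was pre-trained on), and the helper names `chunk_bounds_ms` and `export_chunks` are assumptions for illustration rather than settings taken from this article.

```python
from typing import List, Tuple


def chunk_bounds_ms(total_ms: int, chunk_ms: int) -> List[Tuple[int, int]]:
    """Split [0, total_ms) into consecutive windows of at most chunk_ms."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    return [(s, min(s + chunk_ms, total_ms)) for s in range(0, total_ms, chunk_ms)]


def export_chunks(mp3_path: str, out_dir: str, chunk_ms: int = 5000) -> List[str]:
    """Write each chunk as a 16 kHz mono WAV file and return the file paths."""
    import os
    from pydub import AudioSegment  # imported lazily; needs ffmpeg to read MP3

    audio = AudioSegment.from_file(mp3_path).set_frame_rate(16000).set_channels(1)
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    # len(audio) is the duration in milliseconds, and slicing is in milliseconds too.
    for i, (start, end) in enumerate(chunk_bounds_ms(len(audio), chunk_ms)):
        path = os.path.join(out_dir, f"chunk_{i:04d}.wav")
        audio[start:end].export(path, format="wav")
        paths.append(path)
    return paths
```

For example, `export_chunks('audio.mp3', 'chunks')` would write `chunks/chunk_0000.wav`, `chunks/chunk_0001.wav`, and so on. The last chunk is simply shorter when the duration is not a multiple of the chunk length.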