Zero-shot audio classification tasks present a significant challenge in machine learning, particularly when labeled data is scarce. This article explores the application of Hugging Face’s open-source models, specifically the Contrastive Language-Audio Pretraining (CLAP) models, in addressing this task.
The CLAP models leverage contrastive learning to build joint audio-text representations, so they can classify audio without relying on task-specific labeled examples. The article covers setting up the working environment, building an audio classification pipeline, and considerations such as the sampling rate expected by transformer models. It also looks at the architecture and training process of the CLAP models and their effectiveness in zero-shot audio classification tasks.
Readers interested in zero-shot learning, audio classification, and leveraging pre-trained models for natural language and audio processing tasks will find this article informative and valuable for their research and practical applications.
Table of Contents:
Setting Up Working Environments
Build Audio Classification Pipeline
Sampling Rate for Transformer Models
Zero-Shot Audio Classification
1. Setting Up Working Environments
Let's start by setting up the working environment. First, we will install the packages we will use in this article: the transformers and datasets packages from Hugging Face, so we can download the model and the dataset we will work with, and soundfile and librosa to work with sound files in Python. Note that librosa may require ffmpeg to be installed; the librosa documentation provides installation instructions for ffmpeg.
!pip install transformers
!pip install datasets
!pip install soundfile
!pip install librosa
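If ffmpeg is not already available, one way to install it on a Debian-based environment such as Google Colab is shown below. This is only a sketch; the right command depends on your platform, so check the librosa documentation for other options.
# Install ffmpeg on a Debian-based system (assumption: apt is available)
!apt-get install -y ffmpeg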
Next, we will download the dataset we will be using from the Hugging Face Hub: the ESC-50 dataset, a labeled collection of 2,000 environmental audio recordings suitable for benchmarking methods of environmental sound classification.
The dataset consists of 5-second-long recordings organized into 50 semantic classes (with 40 examples per class), loosely arranged into 5 major categories:
Animals
Natural soundscapes & water sounds
Human, non-speech sounds
Interior/domestic sounds
Exterior/urban noises
from datasets import load_dataset, load_from_disk
# ESC-50 is a collection of 5-second environmental sound clips.
# To stream it directly from the Hugging Face Hub, you could use:
# dataset = load_dataset("ashraq/esc50",
#                        split="train[0:10]")
# Here we load a copy that was previously saved to disk:
dataset = load_from_disk("./models/ashraq/esc50/train")
Let's explore the first sound clip and see what the data contains. Each example includes the filename, the category label, the raw audio array, and the sampling rate of the sound clip.
audio_sample = dataset[0]
audio_sample
{'filename': '1-100032-A-0.wav',
 'fold': 1,
 'target': 0,
 'category': 'dog',
 'esc10': True,
 'src_file': 100032,
 'take': 'A',
 'audio': {'path': None,
  'array': array([0., 0., 0., ..., 0., 0., 0.]),
  'sampling_rate': 44100}}
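As a quick sanity check, we can recover the clip duration from the array length and the sampling rate; for an ESC-50 clip recorded at 44,100 Hz this should come out to roughly 5 seconds. The snippet below is a small illustration that assumes the audio_sample variable defined above.
# Duration in seconds = number of samples / samples per second
array = audio_sample["audio"]["array"]
sampling_rate = audio_sample["audio"]["sampling_rate"]
print(f"Duration: {len(array) / sampling_rate:.2f} seconds")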
You can use the code below to listen to this sound clip in a Jupyter notebook:
from IPython.display import Audio as IPythonAudio
IPythonAudio(audio_sample["audio"]["array"],
rate=audio_sample["audio"]["sampling_rate"])
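If you are working outside a notebook, one alternative is to write the clip to a WAV file with soundfile and play it with any audio player. This is just a sketch; the output filename dog_sample.wav is an arbitrary choice.
import soundfile as sf

# Write the raw array to disk at its original sampling rate
sf.write("dog_sample.wav",  # hypothetical output filename
         audio_sample["audio"]["array"],
         audio_sample["audio"]["sampling_rate"])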
2. Build Audio Classification Pipeline
For zero-shot audio classification, we will use the Contrastive Language-Audio Pretraining (CLAP) model. Contrastive learning has shown remarkable success in multimodal representation learning, and CLAP was proposed in the paper CLAP: Learning Audio Concepts From Natural Language Supervision.
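As a preview of how this comes together, below is a minimal sketch of building a zero-shot audio classification pipeline with the transformers pipeline API. The laion/clap-htsat-unfused checkpoint and the candidate labels are assumptions for illustration; the next sections walk through the full setup, including resampling the audio to the rate the model expects.
from transformers import pipeline

# Build a zero-shot audio classification pipeline.
# The checkpoint name here is an assumption for illustration.
zero_shot_classifier = pipeline(
    task="zero-shot-audio-classification",
    model="laion/clap-htsat-unfused")

# The pipeline scores the clip against free-text candidate labels.
# Note: the raw array should match the sampling rate the model expects,
# which is discussed in the "Sampling Rate for Transformer Models" section.
zero_shot_classifier(audio_sample["audio"]["array"],
                     candidate_labels=["Sound of a dog",
                                       "Sound of vacuum cleaner"])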