Zero-Shot Audio Classification Using HuggingFace CLAP Open-Source Model

Youssef Hosni
May 12, 2024

Zero-shot audio classification tasks present a significant challenge in machine learning, particularly when labeled data is scarce. This article explores the application of Hugging Face's open-source models, specifically the Contrastive Language-Audio Pretraining (CLAP) models, in addressing this challenge.

The CLAP models leverage contrastive learning techniques to learn representations of audio data without relying on labeled examples during training. The article covers the setup of working environments, building an audio classification pipeline, and considerations such as sampling rates for transformer models. It delves into the architecture and training process of the CLAP models, highlighting their effectiveness in zero-shot audio classification tasks. 

Readers interested in zero-shot learning, audio classification, and leveraging pre-trained models for natural language and audio processing tasks will find this article informative and valuable for their research and practical applications.

Table of Contents:

  1. Setting Up Working Environments

  2. Build Audio Classification Pipeline

  3. Sampling Rate for Transformer Models

  4. Zero-Shot Audio Classification 



1. Setting Up Working Environments

Let's start by setting up the working environment. First, we will install the packages we will use in this article: the transformers and datasets packages from Hugging Face, so we can download the model and the dataset we will work with, as well as soundfile and librosa for working with sound files in Python. Note that librosa may require ffmpeg to be installed; the librosa installation page provides instructions for installing ffmpeg.

    !pip install transformers
    !pip install datasets
    !pip install soundfile
    !pip install librosa

Next, we will download the dataset we will be using from the Hugging Face Hub: the ESC-50 dataset, a labeled collection of 2,000 environmental audio recordings suitable for benchmarking methods of environmental sound classification.

The dataset consists of 5-second-long recordings organized into 50 semantic classes (with 40 examples per class), loosely arranged into 5 major categories:

  • Animals

  • Natural soundscapes & water sounds

  • Human, non-speech sounds

  • Interior/domestic sounds

  • Exterior/urban noises

from datasets import load_dataset, load_from_disk

# ESC-50: a collection of 5-second environmental sound clips.
# If you do not have a local copy of the dataset, load it
# directly from the Hub instead:
# dataset = load_dataset("ashraq/esc50",
#                        split="train[0:10]")
dataset = load_from_disk("./models/ashraq/esc50/train")

Let's explore the first sound clip and see what the data contains. Each example includes the filename, the category label, the raw audio as an array of values, and the sampling rate of the clip.

audio_sample = dataset[0]
audio_sample

{'filename': '1-100032-A-0.wav',
 'fold': 1,
 'target': 0,
 'category': 'dog',
 'esc10': True,
 'src_file': 100032,
 'take': 'A',
 'audio': {'path': None,
  'array': array([0., 0., 0., ..., 0., 0., 0.]),
  'sampling_rate': 44100}}
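
As a quick sanity check, we can verify that the waveform stored in the array field matches the reported sampling rate: dividing the number of samples by the samples per second should give roughly the 5-second clip length that ESC-50 promises. This small snippet is illustrative and not part of the original notebook.

# Duration in seconds = number of samples / samples per second
array = audio_sample["audio"]["array"]
sampling_rate = audio_sample["audio"]["sampling_rate"]
print(len(array) / sampling_rate)  # ~5.0 seconds for an ESC-50 clip at 44.1 kHz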

You can use the code below to listen to this sound clip in a Jupyter notebook:

from IPython.display import Audio as IPythonAudio
IPythonAudio(audio_sample["audio"]["array"],
             rate=audio_sample["audio"]["sampling_rate"])

2. Build Audio Classification Pipeline 

For zero-shot audio classification, we will use the Contrastive Language-Audio Pretraining (CLAP) model. Contrastive learning has shown remarkable success in the field of multimodal representation learning, and CLAP was proposed in the paper CLAP: Learning Audio Concepts From Natural Language Supervision.
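
Before going further, here is a minimal sketch of what such a pipeline could look like with the transformers pipeline API. The checkpoint name laion/clap-htsat-unfused and the candidate labels below are illustrative assumptions rather than the exact values used later in this article.

from transformers import pipeline

# Build a zero-shot audio classification pipeline backed by a CLAP checkpoint.
zero_shot_classifier = pipeline(
    task="zero-shot-audio-classification",
    model="laion/clap-htsat-unfused")

# Candidate labels are free-form text; CLAP scores the audio against each one.
# Note: this checkpoint expects 48 kHz audio, so the 44.1 kHz ESC-50 clips may
# need to be resampled first (more on sampling rates in the next section).
zero_shot_classifier(
    audio_sample["audio"]["array"],
    candidate_labels=["Sound of a dog", "Sound of vacuum cleaner"])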
