Speech recognition is the task of converting spoken language into text. This article provides a comprehensive guide on building and deploying a speech recognition system using OpenAI’s Whisper model and Gradio.
The process begins with setting up the working environment, including the installation of the necessary packages: Hugging Face's transformers and datasets, as well as soundfile, librosa, and Gradio.
The dataset used is the LibriSpeech corpus, loaded from the Hugging Face dataset hub. Detailed instructions are provided for exploring and listening to the dataset samples.
Next, the article explains how to construct a Transformers pipeline utilizing the distilled version of the Whisper model, optimized for faster and smaller speech recognition tasks while maintaining high accuracy. The deployment section demonstrates how to create a user-friendly web application using Gradio.
This application allows for real-time speech transcription via microphone input or uploaded audio files. The final product is a robust, interactive interface for speech-to-text conversion, complete with step-by-step code examples and deployment instructions.
Table of Contents:
Setting Up the Working Environment
Preparing the Dataset
Building a Transformers Pipeline
Deploying the Application Demo with Gradio
1. Setting Up the Working Environment
Let's start by setting up the working environment. First, we will install the packages we will use in this article: the transformers and datasets packages from Hugging Face, so that we can download the model and the dataset we will work with.
We will also install soundfile and librosa to be able to work with sound files in Python. Note that librosa may require FFmpeg to be installed; the librosa documentation provides FFmpeg installation instructions.
Finally, we will install the Gradio package, which we will use at the end of this article to demo and deploy the application we build.
!pip install transformers
!pip install -U datasets
!pip install soundfile
!pip install librosa
!pip install gradio
Now that the packages and libraries we will use are ready, let's download the dataset we will use in this article.
2. Preparing the Dataset
LibriSpeech is a corpus of approximately 1000 hours of 16kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from reading audiobooks from the LibriVox project and has been carefully segmented and aligned.
We will load the dataset from the Hugging Face dataset hub using the command below:
from datasets import load_dataset
dataset = load_dataset("librispeech_asr",
split="train.clean.100",
streaming=True,
trust_remote_code=True)
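Because we pass streaming=True, the full corpus is not downloaded up front; instead, examples are streamed on demand as we iterate over the dataset, which is convenient for a corpus of roughly 1,000 hours of audio. The trust_remote_code=True argument permits the dataset's custom loading script to run.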
Next, let's explore the first three examples of the dataset using the following code:
example = next(iter(dataset))
dataset_head = dataset.take(3)
list(dataset_head)
[{'file': '374-180298-0000.flac',
  'audio': {'path': '374-180298-0000.flac',
   'array': array([ 7.01904297e-04,  7.32421875e-04,  7.32421875e-04, ...,
          -2.74658203e-04, -1.83105469e-04, -3.05175781e-05]),
   'sampling_rate': 16000},
  'text': 'CHAPTER SIXTEEN I MIGHT HAVE TOLD YOU OF THE BEGINNING OF THIS LIAISON IN A FEW LINES BUT I WANTED YOU TO SEE EVERY STEP BY WHICH WE CAME I TO AGREE TO WHATEVER MARGUERITE WISHED',
  'speaker_id': 374,
  'chapter_id': 180298,
  'id': '374-180298-0000'},
 {'file': '374-180298-0001.flac',
  'audio': {'path': '374-180298-0001.flac',
   'array': array([-9.15527344e-05, -1.52587891e-04, -1.52587891e-04, ...,
          -2.13623047e-04, -1.83105469e-04, -2.74658203e-04]),
   'sampling_rate': 16000},
  'text': "MARGUERITE TO BE UNABLE TO LIVE APART FROM ME IT WAS THE DAY AFTER THE EVENING WHEN SHE CAME TO SEE ME THAT I SENT HER MANON LESCAUT FROM THAT TIME SEEING THAT I COULD NOT CHANGE MY MISTRESS'S LIFE I CHANGED MY OWN",
  'speaker_id': 374,
  'chapter_id': 180298,
  'id': '374-180298-0001'},
 {'file': '374-180298-0002.flac',
  'audio': {'path': '374-180298-0002.flac',
   'array': array([-2.44140625e-04, -2.44140625e-04, -1.83105469e-04, ...,
           1.83105469e-04,  3.05175781e-05, -1.52587891e-04]),
   'sampling_rate': 16000},
  'text': 'I WISHED ABOVE ALL NOT TO LEAVE MYSELF TIME TO THINK OVER THE POSITION I HAD ACCEPTED FOR IN SPITE OF MYSELF IT WAS A GREAT DISTRESS TO ME THUS MY LIFE GENERALLY SO CALM',
  'speaker_id': 374,
  'chapter_id': 180298,
  'id': '374-180298-0002'}]
You can also view a specific example using the code below:
list(dataset_head)[2]
{'file': '374-180298-0002.flac',
 'audio': {'path': '374-180298-0002.flac',
  'array': array([-2.44140625e-04, -2.44140625e-04, -1.83105469e-04, ...,
          1.83105469e-04,  3.05175781e-05, -1.52587891e-04]),
  'sampling_rate': 16000},
 'text': 'I WISHED ABOVE ALL NOT TO LEAVE MYSELF TIME TO THINK OVER THE POSITION I HAD ACCEPTED FOR IN SPITE OF MYSELF IT WAS A GREAT DISTRESS TO ME THUS MY LIFE GENERALLY SO CALM',
 'speaker_id': 374,
 'chapter_id': 180298,
 'id': '374-180298-0002'}
You can also listen to one of the examples using the code below:
from IPython.display import Audio as IPythonAudio
IPythonAudio(example["audio"]["array"],
rate=example["audio"]["sampling_rate"])
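Optionally, since we installed soundfile earlier, we can save an example to disk and inspect it with any audio player. This is a minimal sketch; the output filename example.wav is arbitrary:
import soundfile as sf
# Write the raw audio array to a WAV file at the example's sampling rate
sf.write("example.wav",
         example["audio"]["array"],
         example["audio"]["sampling_rate"])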
3. Building a Transformers Pipeline
Now that the dataset is loaded, we will build a Transformers pipeline and load the Whisper model that will perform the speech recognition. We will use the distilled version of Whisper.
Whisper is a machine learning model for speech recognition and transcription, created by OpenAI and first released as open-source software in September 2022. It is capable of transcribing speech in English and several other languages and is also capable of translating several non-English languages into English.
Distil-Whisper is a distilled version of Whisper that is 6 times faster and 49% smaller, while performing within 1% word error rate (WER) of Whisper on out-of-distribution evaluation sets.
Distil-Whisper is currently only available for English speech recognition; its authors are working with the community to distill Whisper in other languages.
from transformers import pipeline

# Load the model from a local directory; alternatively, it can be loaded
# directly from the Hugging Face Hub with model="distil-whisper/distil-small.en"
asr = pipeline(task="automatic-speech-recognition",
               model="./models/distil-whisper/distil-small.en")
Next, we will retrieve the sampling rate required by the feature extractor component of the ASR pipeline, to ensure that any audio data processed by this pipeline matches the expected format and sampling rate.
asr.feature_extractor.sampling_rate
16000
The sampling rate expected by the speech recognition model is 16,000 Hz. We have to make sure that our audio data has the same sampling rate; otherwise, we will need to resample it first.
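If the sampling rates differed, a minimal sketch of the resampling step could look like this, using the librosa library we installed earlier (the audio_16k name is just illustrative):
import librosa
# Resample the raw audio array to the 16 kHz rate the model expects
audio_16k = librosa.resample(example["audio"]["array"],
                             orig_sr=example["audio"]["sampling_rate"],
                             target_sr=16000)
Now let's check the sampling rate of our example: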
example['audio']['sampling_rate']
16000
We can see that both have the same sampling rate, so there is no need to apply any transformation to the data.
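As a quick sanity check, we can run the pipeline on the example we loaded earlier; the pipeline accepts a raw NumPy array, which it assumes is at the expected sampling rate:
output = asr(example["audio"]["array"])
print(output["text"])
The printed transcription should closely match the reference text of the first example shown above. Finally, let's deploy the model using Gradio.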
4. Deploying the Application Demo with Gradio
Gradio is an open-source Python package that allows you to quickly build a demo or web application for your machine learning model, API, or any arbitrary Python function. You can then share a link to your demo or web application in just a few seconds using Gradio’s built-in sharing features.
Let's start first by importing the Gradio package and creating an instance of the Blocks class. Blocks is a class in Gradio that allows for the creation of complex web applications by defining a layout consisting of various interactive components (e.g., buttons, sliders, text boxes) and arranging them in blocks.
import os
import gradio as gr
demo = gr.Blocks()
Next, we will define the transcribe_speech function, which validates the audio input, runs it through the ASR pipeline, and returns the transcribed text. This function can then be plugged into a Gradio interface to provide an easy-to-use web application for speech-to-text transcription.
def transcribe_speech(filepath):
    if filepath is None:
        gr.Warning("No audio found, please retry.")
        return ""
    output = asr(filepath)
    return output["text"]
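Before wiring the function into a web interface, you can test it directly on a local audio file path, for example the example.wav file we saved earlier (assuming it exists):
transcribe_speech("example.wav")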
The following code sets up a Gradio interface that allows users to transcribe speech captured from their microphone. When the user records audio, it is saved as a file, and the path to this file is passed to the transcribe_speech function. The transcribed text is then displayed in a textbox labeled "Transcription".
mic_transcribe = gr.Interface(
    fn=transcribe_speech,
    inputs=gr.Audio(sources="microphone",
                    type="filepath"),
    outputs=gr.Textbox(label="Transcription",
                       lines=3),
    allow_flagging="never")
Next, we will set up a second Gradio interface that transcribes uploaded audio files. The path of the uploaded file is passed to the same transcribe_speech function, and the transcribed text is displayed in a textbox labeled "Transcription".
file_transcribe = gr.Interface(
    fn=transcribe_speech,
    inputs=gr.Audio(sources="upload",
                    type="filepath"),
    outputs=gr.Textbox(label="Transcription",
                       lines=3),
    allow_flagging="never",
)
Finally, we will create a Gradio-based web application that has two main functionalities:
Transcribing audio from a microphone.
Transcribing uploaded audio files.
It uses a tabbed interface to switch between these two functionalities. The application is launched with sharing enabled, and it listens on a specific port defined by an environment variable. This setup is ideal for providing an interactive and user-friendly interface for speech-to-text transcription tasks.
with demo:
    gr.TabbedInterface(
        [mic_transcribe, file_transcribe],
        ["Transcribe Microphone", "Transcribe Audio File"],
    )

demo.launch(share=True,
            server_port=int(os.environ['PORT1']))
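Note that this launch call assumes a PORT1 environment variable is set, as in some hosted notebook environments. If it is not, you can omit the server_port argument and Gradio will pick a free port. With share=True, Gradio also generates a temporary public link to the demo.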
The final application is a tabbed web interface with "Transcribe Microphone" and "Transcribe Audio File" tabs, each displaying the transcription of the provided audio in a textbox.