Text-to-speech (TTS) is a technology that converts written text into spoken words. This task involves generating natural-sounding speech from text input, allowing computers to “read” text aloud.
In classification tasks, there is typically only one correct label, or at most a few. Likewise, in automatic speech recognition (ASR), a single correct transcription corresponds to a given utterance.
In TTS, however, there are countless ways to articulate the same sentence, with variations in voice, dialect, and speaking style. Despite these challenges, some open-source models excel at this task. We will use two of them: the pre-trained VITS model from Kakao Enterprise to convert English text into speech, and the speecht5_tts_clartts_ar model from MBZUAI to convert Arabic text into speech.
Table of Contents:
Setting up the Working Environment
English Text to Speech using the VITS Model
Arabic Text to Speech using ArTST
My E-book: Data Science Portfolio for Success Is Out!
I recently published my first e-book, Data Science Portfolio for Success, a practical guide on how to build your data science portfolio. The book covers the following topics:
The Importance of Having a Portfolio as a Data Scientist
How to Build a Data Science Portfolio That Will Land You a Job?
1. Setting up the Working Environment
Let’s start by setting up the working environment. First, we will install the packages we will use in this article: the transformers and datasets packages from Hugging Face, which let us download the model and the dataset we will work with.
!pip install transformers
!pip install -U datasets
!pip install timm
!pip install inflect
!pip install phonemizer
Now that the packages and libraries we will use are ready, let's import them:
from transformers import pipeline
from datasets import load_dataset
from IPython.display import Audio as IPythonAudio
import soundfile as sf
import torch
You can check the text-to-speech models on the Hugging Face Hub to find the model that is suitable for your system.
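If you prefer to browse the Hub programmatically, here is a minimal sketch using the huggingface_hub package (installed together with transformers); the limit of 5 is just an example:
from huggingface_hub import list_models

# List a few text-to-speech models, most downloaded first
for model in list_models(filter="text-to-speech", sort="downloads", direction=-1, limit=5):
    print(model.id)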
2. English Text to Speech using the VITS Model
Variational Inference with adversarial learning for end-to-end Text-to-Speech (VITS) is an end-to-end speech synthesis model that predicts a speech waveform conditioned on an input text sequence. It is a conditional variational autoencoder (VAE) composed of a posterior encoder, a decoder, and a conditional prior.
A set of spectrogram-based acoustic features is predicted by the flow-based module, which is composed of a Transformer-based text encoder and multiple coupling layers.
The spectrogram is decoded using a stack of transposed convolutional layers, much in the same style as the HiFi-GAN vocoder. Motivated by the one-to-many nature of the TTS problem, where the same text input can be spoken in multiple ways, the model also includes a stochastic duration predictor, which allows the model to synthesize speech with different rhythms from the same input text.
The model is trained end-to-end with a combination of losses derived from variational lower bound and adversarial training. To improve the expressiveness of the model, normalizing flows are applied to the conditional prior distribution.
During inference, the text encodings are up-sampled based on the duration prediction module and then mapped into the waveform using a cascade of the flow module and HiFi-GAN decoder.
Due to the stochastic nature of the duration predictor, the model is non-deterministic and thus requires a fixed seed to generate the same speech waveform.
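Because of this randomness, you can fix the random seed before generation if you need reproducible output. Here is a minimal sketch using the set_seed helper from transformers (the value 555 is arbitrary):
from transformers import set_seed

set_seed(555)  # same seed + same text -> same waveform across runs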
There are two variants of the VITS model: one is trained on the LJ Speech dataset, and the other is trained on the VCTK dataset. The LJ Speech dataset consists of 13,100 short audio clips of a single speaker with a total length of approximately 24 hours.
The VCTK dataset consists of approximately 44,000 short audio clips uttered by 109 native English speakers with various accents. The total length of the audio clips is approximately 44 hours.
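If you would rather try the multi-speaker variant trained on VCTK, the sketch below loads it from the Hub; note that the checkpoint name kakao-enterprise/vits-vctk and the speaker_id value are assumptions you should verify against the model card:
from transformers import pipeline

# Multi-speaker VITS: pick one of the VCTK voices via speaker_id (value is arbitrary here)
narrator_vctk = pipeline("text-to-speech", model="kakao-enterprise/vits-vctk")
audio = narrator_vctk("Hello from the VCTK variant of VITS.",
                      forward_params={"speaker_id": 10})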
To use the VITS model to convert text to speech, we will utilize the Hugging Face pipeline to perform text-to-speech (TTS) using a specific model stored locally (./models/kakao-enterprise/vits-ljs). The text provided, which discusses the Israeli occupation of Palestine, is passed to the narrator pipeline, which converts it into speech, generating audio that narrates the provided text. The result, stored in the narrated_text variable, contains the audio data produced by the model. This allows the text to be listened to as spoken words, facilitating the accessibility and auditory presentation of the information.
narrator = pipeline("text-to-speech",
model="./models/kakao-enterprise/vits-ljs")
text = """
The Israeli occupation of Palestine began in 1967
during the Six-Day War when Israel captured the West Bank,
Gaza Strip, and East Jerusalem.
These areas, home to many Palestinians, have since been a
focal point of conflict. The international community generally views Israeli settlements there as illegal.
Efforts towards peace continue, with Palestinians seeking
independence and Israelis seeking security.
The situation remains highly complex and contentious.
"""
narrated_text = narrator(text)
You can then save this speech as a .wav file, or you can listen to it directly in your Jupyter notebook:
from IPython.display import Audio as IPythonAudio
IPythonAudio(narrated_text["audio"][0],
rate=narrated_text["sampling_rate"])
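Alternatively, to keep the narration as a file, you can write it out with the soundfile package imported earlier; a minimal sketch (the file name is just an example):
sf.write("english_speech.wav",
         narrated_text["audio"][0],
         samplerate=narrated_text["sampling_rate"])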
3. Arabic Text to Speech using ArTST
ArTST is a pre-trained Arabic text and speech transformer that supports open-source speech technologies for the Arabic language. The model architecture in this first edition follows the unified-modal framework SpeechT5, which was recently released for English. This edition focuses on Modern Standard Arabic (MSA), with plans to extend the model to dialectal and code-switched Arabic in future editions.
The model is pre-trained from scratch on MSA speech and text data and fine-tuned for the following tasks: automatic speech recognition (ASR), TTS, and spoken dialect identification. The SpeechT5 for Arabic TTS checkpoint starts from pre-trained ArTST weights and is fine-tuned, using the Hugging Face implementation of SpeechT5, on the Classical Arabic ClArTTS corpus for speech synthesis (text-to-speech).
To use this model to convert text to speech, we will use the Hugging Face pipeline to perform a text-to-speech (TTS) task with a specific model (MBZUAI/speecht5_tts_clartts_ar). We will also load speaker embeddings from a dataset (herwoww/arabic_xvector_embeddings) and select a particular embedding to simulate a specific speaker's voice. The selected text, which describes the Israeli occupation of Palestine, is converted to speech using this embedding. The generated speech audio is then saved to a file called "speech.wav" with the specified sample rate. Note that the model was trained on text without diacritics, so the input text is provided undiacritized and the model relies on the natural pronunciation of the text.
synthesiser = pipeline("text-to-speech", "MBZUAI/speecht5_tts_clartts_ar")
embeddings_dataset = load_dataset("herwoww/arabic_xvector_embeddings", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[105]["speaker_embeddings"]).unsqueeze(0)
# You can replace this embedding with your own as well.
text = """
بدأ الاحتلال الإسرائيلي لفلسطين في عام 1967 خلال حرب الأيام الستة عندما
احتلت إسرائيل الضفة الغربية وقطاع غزة والقدس الشرقية. أصبحت هذه المناطق، التي يعيش فيها العديد من الفلسطينيين، محورًا للصراع منذ ذلك الحين.
يرى المجتمع الدولي عمومًا أن المستوطنات الإسرائيلية هناك غير قانونية.
تستمر الجهود نحو السلام، حيث يسعى الفلسطينيون إلى الاستقلال ويسعى الإسرائيليون إلى الأمن.
لا تزال القضية معقدة للغاية ومثيرة للجدل.
"""
speech = synthesiser(text, forward_params={"speaker_embeddings": speaker_embedding})
# ArTST is trained without diacritics.
sf.write("speech.wav", speech["audio"], samplerate=speech["sampling_rate"])
Are you looking to start a career in data science and AI and do not know how? I offer data science mentoring sessions and long-term career mentoring:
Mentoring sessions: https://lnkd.in/dXeg3KPW
Long-term mentoring: https://lnkd.in/dtdUYBrM