Hugging Face is a platform that hosts a treasure trove of open-source models, making it an invaluable resource for anyone diving into the world of natural language processing.
In this guide, we will explore how to use Meta's open-source NLLB model through the HuggingFace Transformers package for machine translation tasks, and we will try it on different Arabic accents to see how it performs.
Table of Contents:
Setting Up Working Environment
Build a Translator Pipeline using HuggingFace Transformers
Translating from English to Arabic with Different Accents
My E-book: Data Science Portfolio for Success Is Out!
I recently published my first e-book, Data Science Portfolio for Success, which is a practical guide on how to build your data science portfolio. The book covers the following topics:
The Importance of Having a Portfolio as a Data Scientist
How to Build a Data Science Portfolio That Will Land You a Job?
1. Setting Up Working Environment
In this article, we will use the Transformers library, particularly the pipeline function. If you have not installed the required packages yet, you can do so with the commands below:
!pip install transformers
!pip install torch
Next, I will import the pipeline function from the Transformers library, along with torch.
from transformers import pipeline
import torch
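Optionally, you can run a quick sanity check to confirm that both libraries are installed and to see whether a GPU is available:

import transformers
import torch

# Print library versions and check whether PyTorch can see a GPU.
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())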
Now we have everything we need to create our machine translation system.
2. Build a Translator Pipeline using HuggingFace Transformers
The second step is building a translator pipeline using HuggingFace Transformers. We will be using Meta's open-source machine translation model, No Language Left Behind (NLLB).
No Language Left Behind-200 (NLLB-200) is a machine translation model primarily intended for research in machine translation, especially for low-resource languages. It allows for single-sentence translation among 200 languages. NLLB-200 is a research model and has not been released for production deployment. It is trained on general-domain text data and is not intended to be used with domain-specific texts, such as medical or legal documents.
The model is not intended for document translation. It was trained with input lengths not exceeding 512 tokens, so translating longer sequences might result in quality degradation. NLLB-200 translations cannot be used as certified translations.
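Because of that 512-token limit, it is worth checking the token length of your input before translating. Here is a minimal sketch using the checkpoint's own tokenizer (this assumes the facebook/nllb-200-distilled-600M checkpoint we load in the next step; the example text is only an illustration):

from transformers import AutoTokenizer

# Load the tokenizer that matches the NLLB checkpoint we will use below.
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

text = "Hugging Face hosts thousands of open-source models."
n_tokens = len(tokenizer(text)["input_ids"])

# NLLB-200 was trained on sequences of at most 512 tokens, so longer
# inputs should be split into sentences before translation.
if n_tokens > 512:
    print(f"Input is {n_tokens} tokens long; split it into shorter sentences.")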
Let's define the translator pipeline, using NLLB-200's distilled 600M variant of the model.
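Here is a minimal sketch of the pipeline call (the checkpoint name facebook/nllb-200-distilled-600M comes from the Hugging Face Hub; NLLB uses FLORES-200 language codes, so eng_Latn is English and arb_Arab is Modern Standard Arabic; the example sentence is only an illustration):

translator = pipeline(
    task="translation",
    model="facebook/nllb-200-distilled-600M",
    torch_dtype=torch.bfloat16,  # optional: half-precision weights to save memory
)

text = "Hugging Face is a community-driven platform for machine learning."

# src_lang and tgt_lang take FLORES-200 codes; passing them at call time
# lets us reuse the same pipeline for different target dialects later.
output = translator(text, src_lang="eng_Latn", tgt_lang="arb_Arab")
print(output[0]["translation_text"])

Because the language codes are passed at call time, we can reuse this same translator object for the different Arabic variants in the next section.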