Hugging Face is a platform that hosts a treasure trove of open-source models, making it an invaluable resource for anyone diving into the world of natural language processing.
In this guide, we will explore how to use Meta's open-source NLLB model through the HuggingFace Transformers package for machine translation tasks, and we will try it on different Arabic accents to see how it performs.
Table of Contents:
Setting Up Working Environment
Build a Translator Pipeline using HuggingFace Transformers
Translating from English to Arabic with Different Accents
My E-book: Data Science Portfolio for Success Is Out!
I recently published my first e-book, Data Science Portfolio for Success, which is a practical guide on how to build your data science portfolio. The book covers the following topics:
The Importance of Having a Portfolio as a Data Scientist
How to Build a Data Science Portfolio That Will Land You a Job?
1. Setting Up Working Environment
In this article, we will use the Transformers library, particularly the pipeline function. If you have not installed the required packages yet, you can do so with the commands below:
!pip install transformers
!pip install torch
Next, I will import the pipeline function from the Transformers library, along with torch.
from transformers import pipeline
import torch
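Optionally, you can run a quick sanity check to confirm that both libraries are installed and to see whether a GPU is available:

import transformers
import torch

# Print library versions and check whether PyTorch can see a GPU.
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())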
Now we have everything we need to create our machine translation system.
2. Build a Translator Pipeline using HuggingFace Transformers
The second step is building a translator pipeline using HuggingFace Transformers. We will be using Meta's open-source machine translation model, No Language Left Behind (NLLB).
No Language Left Behind-200 (NLLB-200) is a machine translation model primarily intended for research in machine translation, especially for low-resource languages. It allows for single-sentence translation among 200 languages. NLLB-200 is a research model and has not been released for production deployment. It is trained on general-domain text data and is not intended to be used with domain-specific texts, such as medical or legal documents.
The model is not intended for document translation. It was trained with input lengths not exceeding 512 tokens, so translating longer sequences might result in quality degradation. NLLB-200 translations cannot be used as certified translations.
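Because of that 512-token limit, it is worth checking the token length of your input before translating. Here is a minimal sketch using the checkpoint's own tokenizer (this assumes the facebook/nllb-200-distilled-600M checkpoint we load in the next step; the example text is only an illustration):

from transformers import AutoTokenizer

# Load the tokenizer that matches the NLLB checkpoint we will use below.
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

text = "Hugging Face hosts thousands of open-source models."
n_tokens = len(tokenizer(text)["input_ids"])

# NLLB-200 was trained on sequences of at most 512 tokens, so longer
# inputs should be split into sentences before translation.
if n_tokens > 512:
    print(f"Input is {n_tokens} tokens long; split it into shorter sentences.")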
Let's define the translator pipeline, using NLLB-200's distilled 600M variant of the model.
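Here is a minimal sketch of the pipeline call (the checkpoint name facebook/nllb-200-distilled-600M comes from the Hugging Face Hub; NLLB uses FLORES-200 language codes, so eng_Latn is English and arb_Arab is Modern Standard Arabic; the example sentence is only an illustration):

translator = pipeline(
    task="translation",
    model="facebook/nllb-200-distilled-600M",
    torch_dtype=torch.bfloat16,  # optional: half-precision weights to save memory
)

text = "Hugging Face is a community-driven platform for machine learning."

# src_lang and tgt_lang take FLORES-200 codes; passing them at call time
# lets us reuse the same pipeline for different target dialects later.
output = translator(text, src_lang="eng_Latn", tgt_lang="arb_Arab")
print(output[0]["translation_text"])

Because the language codes are passed at call time, we can reuse this same translator object for the different Arabic variants in the next section.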