Training GPT-2 From Scratch: A Step-by-Step Guide

Youssef Hosni
Aug 19, 2024
The GPT-2 model, a transformer-based language model developed by OpenAI, is renowned for its ability to generate coherent and contextually relevant text.  Training GPT-2 from scratch is an excellent practice for those looking to deepen their understanding of natural language processing and model training techniques. There are three critical components that play a pivotal role: dataset selection, model configuration, and the execution of the training loop.

This article provides a comprehensive, step-by-step guide to mastering these essential steps. It begins with the selection of a dataset that aligns with your specific use case, followed by the careful configuration of the model’s architecture, tailored to your available resources. 

The process culminates in the execution of the training loop, where all elements converge to effectively train the model. By following this guide, readers will gain a solid foundation in training their own language model, from data loading and architecture definition to scaling, training, and inference.

Table of Contents:

  1. Setting Up Working Environment

  2. Load Dataset from Deep Lake

  3. Loading the Model & Tokenizer

  4. Training the Model 

  5. Inference 



1. Setting Up Working Environment

We will start by installing the packages we will use in this article:

  • Transformers: For working with transformer-based models like GPT-2.

  • DeepLake: For managing large datasets.

  • WandB: For experiment tracking.

  • Accelerate: For optimizing and speeding up model training.

!pip install -q transformers==4.32.0 deeplake==3.6.19 wandb==0.15.8 accelerate==0.22.0

Next, we will log in to Weights & Biases for experiment reporting. You will need an account there and an API key.

!wandb login
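
If you prefer to authenticate from Python instead of the shell command, a minimal sketch is shown below; it assumes your API key is stored in the WANDB_API_KEY environment variable.

import os
import wandb

# Read the API key from an environment variable instead of hard-coding it.
wandb.login(key=os.environ.get("WANDB_API_KEY"))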

You will need an instance with 8x NVIDIA A100 GPUs (40 GB of memory each) for around 40 hours to fully train the model on the OpenWebText dataset.

2. Load Dataset from Deep Lake

During the pre-training process, we will use Activeloop's Deep Lake datasets to stream samples batch by batch, so the entire dataset never needs to be loaded into memory.

This greatly helps with resource usage: the dataset loads quickly, and streaming is handled automatically without any special configuration.

We will start by loading the OpenWebText dataset, a collection of web pages extracted from URLs shared in Reddit posts with at least three upvotes. This dataset is well-suited for acquiring the broad knowledge needed to build a general-purpose foundation model.

The code below instantiates dataset objects for both the training and validation sets. Afterward, we can print the training dataset to examine its characteristics.

import deeplake

ds = deeplake.load('hub://activeloop/openwebtext-train')
ds_val = deeplake.load('hub://activeloop/openwebtext-val')

print(ds)
print(ds[0].text.text())

Dataset(path='hub://activeloop/openwebtext-train', read_only=True, tensors=['text', 'tokens'])

“An in-browser module loader configured to get external dependencies directly from CDN. Includes babel/typescript. For quick prototyping, code sharing, teaching/learning — a super simple web dev environment without node/webpack/etc.\n\nAll front-end libraries\n\nAngular, React, Vue, Bootstrap, Handlebars, and jQuery are included. Plus all packages from cdnjs.com and all of NPM (via unpkg.com). Most front-end libraries should work out of the box — just use import / require() . If a popular library does not load, tell us and we’ll try to solve it with some library-specific config.\n\nWrite modern javascript (or typescript)\n\nUse latest language features or JSX and the code will be transpiled in-browser via babel or typescript (if required). To make it fast the transpiler will start in a worker thread and only process the modified code. Unless you change many files at once or open the project for the first time, the transpiling should be barely noticeable as it runs in parallel with loading a…”

The returned dataset consists of two tensors: text, containing the raw textual input, and tokens, representing the tokenized version of the content. We can also index into the dataset, access a column through the .text attribute, and convert a row to a string by calling its .text() method.
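
As a quick sanity check (not part of the original walkthrough), the sketch below inspects the dataset size, the available tensors, and the first few documents, assuming the ds object loaded above.

# Inspect the number of samples and the available tensor columns.
print(len(ds))
print(list(ds.tensors))  # expected to include 'text' and 'tokens'

# Print the first 200 characters of a few documents.
for i in range(3):
    print(ds[i].text.text()[:200])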

The next step is to make the Deep Lake dataset compatible with PyTorch by building data loaders that handle formatting and any desired preprocessing. In this instance, our objective is to tokenize the samples, so we will load the GPT-2 tokenizer from the Transformers library.

For this model, we need to set a padding token (this may not be required for other models). Here, we assign the end-of-sequence token (eos_token) to the tokenizer's pad_token attribute.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
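
As a quick check of the padding behaviour (a hedged sketch, not from the original article), we can tokenize a short string and confirm that padding to a fixed length reuses the EOS token id.

sample = tokenizer(
    "Hello world",
    truncation=True,
    max_length=16,
    padding="max_length",
    return_tensors="pt",
)

print(sample["input_ids"].shape)                         # torch.Size([1, 16])
print(tokenizer.pad_token_id == tokenizer.eos_token_id)  # True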

Next, we will create dataloaders from the Deep Lake datasets. In doing so, we also specify a transform that tokenizes the texts of the dataset on the fly.

# define transform to tokenize texts
def get_tokens_transform(tokenizer):
    def tokens_transform(sample_in):
        tokenized_text = tokenizer(
            sample_in["text"],
            truncation=True,
            max_length=512,
            padding='max_length',
            return_tensors="pt"
        )
        tokenized_text = tokenized_text["input_ids"][0]
        return {
            "input_ids": tokenized_text,
            "labels": tokenized_text
        }
    return tokens_transform

# create data loaders
ds_train_loader = ds.dataloader()\
    .batch(32)\
    .transform(get_tokens_transform(tokenizer))\
    .pytorch()

ds_eval_train_loader = ds_val.dataloader()\
    .batch(32)\
    .transform(get_tokens_transform(tokenizer))\
    .pytorch()
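
Before training, it is worth pulling a single batch to verify the loaders. The sketch below assumes the ds_train_loader defined above; note that fetching a batch streams data from Deep Lake, so it may take a moment.

# Fetch one batch and check that shapes match the batch size and max_length.
batch = next(iter(ds_train_loader))
print(batch["input_ids"].shape)  # expected: torch.Size([32, 512])
print(batch["labels"].shape)     # same shape as input_ids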

It is important to note that we have formatted the dataset so that each sample comprises two components: input_ids and labels. input_ids are the tokens the model receives as input, while labels are the tokens the model tries to predict.

Currently, both keys contain the same tokenized text. During training, the GPT-2 model in the Transformers library shifts the labels by one position internally when computing the loss, so each position learns to predict the next token.
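
For illustration, the shift applied inside the model's loss computation is equivalent to the simplified sketch below: logits at position t are paired with the label at position t + 1, which is how causal language modeling turns identical input_ids and labels into a next-token prediction task.

import torch.nn.functional as F

def causal_lm_loss(logits, labels):
    # Drop the last position's logits and the first position's labels
    # so that position t predicts the token at position t + 1.
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )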
