Step-by-Step Guide to Fine-Tuning Large Language Models for Summarization
Perform LLM Full Fine-Tuning & Parameter Efficient Fine-Tuning for a Summarization Task
Large Language Models (LLMs) have been demonstrating remarkable capabilities across various tasks for the last two years. However, to optimize their performance for specific applications, such as summarization, fine-tuning is often necessary. Fine-tuning adapts a pre-trained LLM to a particular domain or task, allowing it to generate more accurate and relevant outputs.
This tutorial begins by walking readers through the process of setting up their working environment, including downloading necessary dependencies and loading the required dataset and LLM. It then demonstrates how to test the model using zero-shot inferencing, establishing a baseline for comparison.
The guide proceeds to explore two distinct fine-tuning methodologies. First, it delves into full fine-tuning, covering dataset preprocessing, model training, and both qualitative and quantitative evaluation techniques. The evaluation process incorporates human assessment and the ROUGE metric to gauge the model’s summarization capabilities.
Subsequently, the post introduces Parameter Efficient Fine-Tuning (PEFT), focusing on the LoRA (Low-Rank Adaptation) method. Readers learn how to set up and train a PEFT adapter, followed by similar evaluation procedures to assess its performance.
By the end of this tutorial, readers will have gained hands-on experience in fine-tuning LLMs for summarization tasks and an understanding of the nuances of both full fine-tuning and PEFT approaches. This knowledge will enable them to make informed decisions when optimizing language models for specific summarization applications.
Table of Contents:
1. Setting Up Working Environment & Getting Started
1.1. Download & Import Required Dependencies
1.2. Load Dataset and LLM
1.3. Test the Model with Zero-Shot Inferencing
2. Perform Full Fine-Tuning
2.1. Preprocess the Dialog-Summary Dataset
2.2. Fine-Tune the Model with the Preprocessed Dataset
2.3. Evaluate the Model Qualitatively (Human Evaluation)
2.4. Evaluate the Model Quantitatively (with ROUGE Metric)
3. Perform Parameter Efficient Fine-Tuning (PEFT)
3.1. Setup the PEFT/LoRA model for Fine-Tuning
3.2. Train PEFT Adapter
3.3. Evaluate the Model Qualitatively (Human Evaluation)
3.4. Evaluate the Model Quantitatively (with ROUGE Metric)
1. Setting Up Working Environment & Getting Started
1.1. Download & Import Required Dependencies
The first step in setting up the working environment is to install the packages and frameworks we will be using in the tutorial.
%pip install --upgrade pip
%pip install --disable-pip-version-check \
torch==1.13.1 \
torchdata==0.5.1 --quiet
%pip install \
transformers==4.27.2 \
datasets==2.11.0 \
evaluate==0.4.0 \
rouge_score==0.1.2 \
loralib==0.1.1 \
peft==0.3.0 --quiet
Next, we import the main packages we will use throughout this tutorial:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np
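Optionally, you can run a quick sanity check to confirm the installation worked and whether a GPU is visible. Everything in this tutorial also runs on CPU, just more slowly; this snippet is an optional extra, not part of the original walkthrough.
# Optional sanity check: confirm PyTorch loaded and whether a GPU is available
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")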
1.2. Load Dataset and LLM
We are going to experiment with the DialogSum dataset from Hugging Face. It contains 10,000+ dialogues with corresponding manually labeled summaries and topics.
huggingface_dataset_name = "knkarthick/dialogsum"
dataset = load_dataset(huggingface_dataset_name)
dataset
Next, we load the pre-trained FLAN-T5 model and its tokenizer directly from Hugging Face. We will be using the base version of FLAN-T5 (google/flan-t5-base). Setting torch_dtype=torch.bfloat16
loads the model weights in bfloat16 precision, which roughly halves the memory footprint compared to the default float32.
model_name='google/flan-t5-base'
original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
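If a GPU is available, you can optionally move the model onto it. This is a minimal sketch and not part of the original code; if you use it, remember that input tensors must be moved to the same device before calling generate().
# Optional: move the model to GPU if one is available (CPU works too, just slower)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
original_model = original_model.to(device)
# If you do this, also move inputs with .to(device) before generation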
It is possible to pull out the total number of model parameters and find out how many of them are trainable. The following helper function does exactly that; at this stage, you do not need to go into its details.
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"
print(print_number_of_trainable_model_parameters(original_model))
trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.00%
Next, we will test the model on the summarization task without any fine-tuning.
1.3. Test the Model with Zero-Shot Inferencing
We first select a specific test example from the dataset at a given index (200 in this case) and extract its 'dialogue' and 'summary' fields. A prompt is created by inserting the dialogue into a string template that asks for a summary.
This prompt is then tokenized and passed to the model to generate a summary of at most 200 new tokens. The generated output is decoded into human-readable text, and the original prompt, the baseline human-written summary, and the model-generated summary are printed out, separated by dashed lines for clarity.
index = 200
dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']
prompt = f"""
Summarize the following conversation.
{dialogue}
Summary:
"""
inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"],
        max_new_tokens=200,
    )[0],
    skip_special_tokens=True
)
dash_line = '-' * 100
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')
----------------------------------------------------------------------------------------------------
INPUT PROMPT:
Summarize the following conversation.

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.

Summary:
----------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
----------------------------------------------------------------------------------------------------
MODEL GENERATION - ZERO SHOT:
#Person1#: I'm thinking of upgrading my computer.
Testing the model with zero-shot inferencing shows that it struggles to summarize the dialogue compared to the baseline summary. However, it does pull out some important information from the text, which indicates that the model can be fine-tuned for the task at hand. To improve the results, we will perform full fine-tuning on the model and compare the results.
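As a side note, the GenerationConfig class we imported earlier can bundle decoding settings instead of passing them to generate() directly. This is an equivalent sketch of the call above, not a change to the tutorial's method:
# Equivalent generation call with decoding settings bundled into a GenerationConfig
generation_config = GenerationConfig(max_new_tokens=200)
output = tokenizer.decode(
    original_model.generate(inputs["input_ids"], generation_config=generation_config)[0],
    skip_special_tokens=True
)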
2. Perform Full Fine-Tuning
2.1. Preprocess the Dialog-Summary Dataset
To fine-tune the model, we need to convert the dialogue-summary (prompt-response) pairs into explicit instructions for the LLM: prepend the instruction Summarize the following conversation.
to the start of each dialogue, and the cue Summary:
to the start of each summary, as follows:
Training prompt (dialogue):
Summarize the following conversation.

Chris: This is his part of the conversation.
Antje: This is her part of the conversation.

Summary:

Training response (summary):
Both Chris and Antje participated in the conversation.
Then we preprocess the prompt-response dataset into tokens and pull out their input_ids (one per token).
def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids
    return example
# The dataset contains three different splits: train, validation, and test.
# tokenize_function handles the data across all splits in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])
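One caveat worth knowing: with padding="max_length", the labels contain pad token ids, which the model will include in the loss unless they are replaced with -100 (the index Hugging Face's loss functions ignore). The original tutorial skips this step; here is a minimal, optional sketch of that masking:
# Optional: replace pad token ids in the labels with -100 so the loss ignores padding
def mask_pad_labels(example):
    example['labels'] = [
        [(token if token != tokenizer.pad_token_id else -100) for token in label]
        for label in example['labels']
    ]
    return example

tokenized_datasets = tokenized_datasets.map(mask_pad_labels, batched=True)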
To save time, we first take a small subset of the data (every 100th example) and then check the shapes of all three splits:
tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")
print(tokenized_datasets)
Shapes of the datasets:
Training: (125, 2)
Validation: (5, 2)
Test: (15, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 125
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 15
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 5
    })
})
The dataset is now ready for fine-tuning, so let's jump directly into fine-tuning the model with this processed data.
2.2. Fine-Tune the Model with the Preprocessed Dataset
Now we will use the built-in Hugging Face Trainer class, passing it the preprocessed dataset and the original model. The other training parameters were found experimentally, so there is no need to go into their details at the moment. Note that max_steps=1 limits this demo to a single optimization step so it finishes quickly; for a real fine-tuning run, you would raise it or remove it and rely on num_train_epochs.
output_dir = f'./dialogue-summary-training-{str(int(time.time()))}'
training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_steps=1,
    max_steps=1
)
trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)
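Because every example was padded to max_length during preprocessing, the Trainer's default collator is sufficient here. If you later switch to dynamic per-batch padding, the usual companion is Hugging Face's DataCollatorForSeq2Seq; the following is a hedged sketch of that alternative setup, not part of the tutorial's pipeline:
from transformers import DataCollatorForSeq2Seq

# Only needed with dynamic padding (i.e., without padding="max_length" above)
data_collator = DataCollatorForSeq2Seq(tokenizer, model=original_model)
trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    data_collator=data_collator
)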
Now we are ready to fine-tune the model with this simple command:
trainer.train()
The final step is to save the fine-tuned model so we can reload it later:
# Save the trained model
trained_model_dir = "./trained_model"
trainer.save_model(trained_model_dir)
# Load the trained model
trained_model = AutoModelForSeq2SeqLM.from_pretrained(trained_model_dir)
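Since we did not pass the tokenizer to the Trainer, it is worth saving it alongside the model if you want a self-contained checkpoint. You can also re-run the earlier prompt against the fine-tuned model as a quick sanity check; this sketch reuses the prompt variable from section 1.3:
# Save the tokenizer alongside the model for a self-contained checkpoint
tokenizer.save_pretrained(trained_model_dir)

# Quick sanity check: generate with the fine-tuned model on the earlier prompt
inputs = tokenizer(prompt, return_tensors='pt')
trained_output = tokenizer.decode(
    trained_model.generate(inputs["input_ids"], max_new_tokens=200)[0],
    skip_special_tokens=True
)
print(f'MODEL GENERATION - AFTER FINE-TUNING:\n{trained_output}')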
Now let's evaluate the fine-tuned model qualitatively via human evaluation and quantitatively with the ROUGE metric.
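As a preview of the quantitative step, the evaluate library we imported earlier exposes ROUGE roughly like this. This is a minimal sketch; the prediction and reference strings below are hypothetical placeholders, not actual model outputs:
# Load the ROUGE metric and compare a model summary against a human reference
rouge = evaluate.load('rouge')
results = rouge.compute(
    predictions=['#Person1# suggests upgrading the system.'],   # hypothetical model output
    references=['#Person1# teaches #Person2# how to upgrade.']  # hypothetical human summary
)
print(results)  # rouge1, rouge2, rougeL, rougeLsum scores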