Instruction Fine-Tuning LLM using SFT for Financial Sentiment: A Step-by-Step Guide

Sep 15, 2024


Instruction fine-tuning allows large language models (LLMs) to be adapted for specific tasks by guiding them with clear, task-oriented instructions. In this article, we focus on fine-tuning the facebook/opt-1.3b model for financial sentiment analysis using Supervised Fine-Tuning (SFT).

The process begins with setting up the environment and loading a financial dataset from Deep Lake, followed by the initialization of the model and training configuration.

We also explore how LoRA (Low-Rank Adaptation) can be combined with the OPT model to make the fine-tuning process more efficient. Finally, we demonstrate how to run inference, applying the fine-tuned model to real-world financial data.

This guide will be helpful for machine learning practitioners, data scientists, and developers who want to fine-tune large language models for domain-specific applications, especially in finance. Whether you’re new to model fine-tuning or looking for a hands-on approach to adapting models for sentiment analysis, this article covers the essential steps.

My New E-Book: LLM Roadmap from Beginner to Advanced Level

Youssef Hosni

June 18, 2024

I am pleased to announce that I have published my new ebook LLM Roadmap from Beginner to Advanced Level. This ebook will provide all the resources you need to start your journey towards mastering LLMs.


1. Setting Up the Working Environment

The first step is to set up the working environment by installing the packages and key libraries we will use to instruction fine-tune the LLM.

!pip install -q transformers==4.32.0 deeplake==3.6.19 trl==0.6.0 peft==0.5.0 wandb==0.15.8

!pip install -q: This is a command to install Python packages. The -q flag makes the installation process quieter, showing less output.
transformers==4.32.0: This is the Hugging Face Transformers library, a popular tool for working with pre-trained language models. We pin a specific version (4.32.0) to ensure compatibility with the other packages.
deeplake==3.6.19: Deep Lake is a data lake for machine learning. It helps manage and version datasets. Again, we pin version 3.6.19.
trl==0.6.0: This stands for Transformer Reinforcement Learning. It is a library of training utilities for language models, covering reinforcement learning methods as well as the dataset helper used later in this guide.
peft==0.5.0: PEFT stands for Parameter-Efficient Fine-Tuning. It's a library that provides methods for fine-tuning large language models more efficiently.
wandb==0.15.8: This is Weights & Biases, a tool for tracking and visualizing machine learning experiments.
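
Because the guide pins exact versions, it can help to confirm what is actually installed before going further. The following is a minimal sketch that uses only the Python standard library. The helper names `installed_version` and `check_pins` and the `PINNED` dictionary are our own illustration, not part of any of these packages.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in the install command above.
PINNED = {
    "transformers": "4.32.0",
    "deeplake": "3.6.19",
    "trl": "0.6.0",
    "peft": "0.5.0",
    "wandb": "0.15.8",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def check_pins(pins):
    """Map each package to (expected, found) so mismatches are easy to spot."""
    return {name: (expected, installed_version(name)) for name, expected in pins.items()}


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        status = "ok" if found == expected else "MISMATCH"
        print(f"{name}: expected {expected}, found {found} [{status}]")
```

Any line marked MISMATCH suggests that the notebook is running against a different release than the one this guide was written for.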

The next step will be loading the dataset you will be using to fine-tune the model.

2. Load the Deep Lake Dataset

For our project, we will use the FinGPT sentiment dataset. It is a collection of short financial texts, such as tweets and news headlines, each paired with a sentiment label. What makes it even more interesting is the instruction column. This column sets the stage for each data point, typically asking something like "What's the sentiment of this text? Pick from Positive, Negative, or Neutral."

Now, we could work with the full dataset, but for the sake of speed, we’re using a smaller slice. This subset comes from Activeloop’s collection of free public datasets, all neatly packaged in Deep Lake format. Our training set has 20,000 examples, with another 2,000 set aside for validation. It’s a good balance between having enough data to learn from and keeping our fine-tuning runs manageable.

Using the deeplake.load() function, we can create the Dataset objects and load the samples.

import deeplake

# Connect to the training and validation datasets
ds = deeplake.load('hub://genai360/FingGPT-sentiment-train-set')
ds_valid = deeplake.load('hub://genai360/FingGPT-sentiment-valid-set')
ds_valid

Dataset(path='hub://genai360/FingGPT-sentiment-valid-set', read_only=True, tensors=['input', 'instruction', 'output'])

Next, we will define the prepare_sample_text function which takes a single data point from our FinGPT dataset and combines the instruction, the financial tweet content, and the sentiment label into a structured string.

This formatting creates a consistent pattern for our language model to learn from, clearly separating the task instruction, input text, and expected output.

By applying this function to each example in our dataset, we’re setting up our model to effectively learn the relationship between financial text and sentiment labels. This standardized format helps maintain context and should improve our model’s ability to understand and predict financial sentiment.

def prepare_sample_text(example):
    """Prepare the text from a sample of the dataset."""
    text = f"{example['instruction'].text()}\n\nContent: {example['input'].text()}\n\nSentiment: {example['output'].text()}"
    return text

Here is a formatted input derived from an entry in the dataset.

What is the sentiment of this news? Please choose an answer from {negative/neutral/positive}
Content: Diageo Shares Surge on Report of Possible Takeover by Lemann
Sentiment: positive
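
To check the template without downloading anything, the function can be exercised on a stand-in record. The sketch below assumes only that each field exposes a `.text()` method, as the fields do in the function above. `FakeTensor` is a hypothetical helper written for this illustration and is not part of Deep Lake.

```python
class FakeTensor:
    """Hypothetical stand-in that mimics the .text() accessor used above."""

    def __init__(self, value):
        self._value = value

    def text(self):
        return self._value


def prepare_sample_text(example):
    """Same template as in the article: instruction, content, then sentiment."""
    return (
        f"{example['instruction'].text()}\n\n"
        f"Content: {example['input'].text()}\n\n"
        f"Sentiment: {example['output'].text()}"
    )


example = {
    "instruction": FakeTensor(
        "What is the sentiment of this news? "
        "Please choose an answer from {negative/neutral/positive}"
    ),
    "input": FakeTensor("Diageo Shares Surge on Report of Possible Takeover by Lemann"),
    "output": FakeTensor("positive"),
}

print(prepare_sample_text(example))
```

The printed string has a blank line between the three parts, since the template joins them with `\n\n`.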

Next, we will set up the tokenizer for our language model. We’re using the AutoTokenizer class from the Transformers library to load a pre-trained tokenizer specifically designed for the OPT-1.3B model, created by Facebook (now Meta).

The tokenizer’s job is to convert our text data into a format the model can understand — breaking down words and phrases into tokens. By using “facebook/opt-1.3b”, we’re ensuring our tokenizer matches the vocabulary and encoding scheme of the pre-trained model we’ll be fine-tuning.

This step is crucial because it allows us to properly prepare our financial text data for input into the model, maintaining consistency between how the original model was trained and how we’ll be using it for our sentiment analysis task.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")

The next step sets up our training dataset using the ConstantLengthDataset class from the TRL library. It takes our tokenizer, the raw dataset (ds), and our previously defined formatting function (prepare_sample_text).

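The ConstantLengthDataset produces input sequences with a fixed length of 1024 tokens. Rather than padding each example, it tokenizes the formatted samples, concatenates them with an end-of-sequence token in between, and cuts the resulting stream into 1024-token chunks. This packing avoids spending compute on padding tokens and keeps batches uniform during training.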

The ‘infinite=True’ parameter allows the dataset to be sampled indefinitely, which is useful for long training runs. Essentially, this code transforms our raw financial text data into a format that’s ready for input into our language model, handling the nitty-gritty details of tokenization and sequence length management.

from trl.trainer import ConstantLengthDataset

train_dataset = ConstantLengthDataset(
    tokenizer,
    ds,
    formatting_func=prepare_sample_text,
    infinite=True,
    seq_length=1024
)
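
To make the fixed-length behaviour concrete, here is a simplified pure-Python sketch of sequence packing as we understand ConstantLengthDataset to do it: tokenized examples are joined with an end-of-sequence token, and the stream is cut into chunks of `seq_length`, with any incomplete tail dropped. The function `pack_sequences`, the toy token ids, and `EOS_ID = 2` are assumptions made for illustration. The real class also buffers the raw text and tokenizes it for you.

```python
EOS_ID = 2  # assumed end-of-sequence id, used here only as a separator


def pack_sequences(tokenized_examples, seq_length, eos_id=EOS_ID):
    """Concatenate token lists with an EOS separator and yield fixed-size chunks."""
    stream = []
    for tokens in tokenized_examples:
        stream.extend(tokens)
        stream.append(eos_id)

    # Step through the stream in strides of seq_length; drop the incomplete tail.
    for start in range(0, len(stream) - seq_length + 1, seq_length):
        chunk = stream[start:start + seq_length]
        # For causal language modelling the labels are a copy of the inputs.
        yield {"input_ids": chunk, "labels": list(chunk)}


toy_examples = [[10, 11, 12], [20, 21], [30, 31, 32, 33]]
packed = list(pack_sequences(toy_examples, seq_length=4))
for row in packed:
    print(row["input_ids"])
```

Note that a chunk can contain the end of one example and the start of the next. Nothing is padded, and every chunk has exactly `seq_length` tokens.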

Finally, we can use the code below to take a peek at what our prepared dataset looks like:

iterator = iter(train_dataset)
sample = next(iterator)
print(sample)

{'input_ids': tensor([50118, 35212, 8913, ..., 2430, 2, 2]), 'labels': tensor([50118, 35212, 8913, ..., 2430, 2, 2])}

The final step is to apply ConstantLengthDataset to the validation dataset, just as we did with the training dataset.

eval_dataset = ConstantLengthDataset(
    tokenizer,
    ds_valid,
    formatting_func=prepare_sample_text,
    seq_length=1024
)
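
The validation dataset above omits `infinite=True`, so it ends after one pass over the data, while the training dataset keeps restarting. Here is a rough sketch of that difference, using a hypothetical generator `iterate` rather than TRL's own code:

```python
from itertools import islice


def iterate(samples, infinite=False):
    """Yield samples once, or keep restarting from the beginning if infinite."""
    while True:
        for sample in samples:
            yield sample
        if not infinite:
            return  # a finite dataset stops after a single pass


validation_like = list(iterate(["a", "b", "c"]))
# An infinite iterator must be bounded by the consumer, here with islice.
training_like = list(islice(iterate(["a", "b", "c"], infinite=True), 7))
print(validation_like)
print(training_like)
```

In practice, an infinite training stream is usually bounded by a step limit in the training configuration, whereas evaluation needs a stream that ends on its own.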

To Data & Beyond