Testing Prompt Engineering-Based LLM Applications

Hands-On Prompt Engineering for LLMs Application Development

Youssef Hosni
May 29, 2024

Once you have built an LLM-based application by prompting a model, how can you assess its performance? As you deploy it and users interact with it, how can you monitor its effectiveness, identify shortcomings, and continually enhance the quality of its responses?

In this article, we will explore and share best practices for evaluating LLM outputs and provide insights into the experience of building these systems. One key distinction between this approach and traditional supervised machine learning applications is the speed at which you can develop LLM-based applications. 

As a result, evaluation methods typically do not begin with a predefined test set; instead, you gradually build a set of test examples as you refine the system.

Table of Contents:

  1. Testing LLMs vs. Testing Supervised Machine Learning Models
    1.1. Incremental Development of Test Sets
    1.2. Automating Evaluation Metrics
    1.3. Scaling Up: From Handful to Larger Test Sets
    1.4. High-Risk Applications and Rigorous Testing

  2. Case Study: Product Recommendation System

  3. Handling Errors and Refining Prompts

  4. Refining Prompts: Version 2

  5. Testing and Validating the New Prompt

  6. Automating the Testing Process

  7. Further Steps: Iterative Tuning and Testing

  8. Conclusion



1. Testing LLMs vs. Testing Supervised Machine Learning Models

In the traditional supervised learning approach, collecting an additional 1,000 test examples when you already have 10,000 labeled examples isn’t too burdensome.

It’s common in this setting to gather a training set, a development set, and a test set, using them throughout the development process. However, when working with large language models (LLMs), you can specify a prompt in minutes and get results in hours. This makes pausing to collect 1,000 test examples a significant inconvenience, as LLMs don’t require initial training examples to start working.

1.1. Incremental Development of Test Sets

Building an application with an LLM often begins by tuning the prompts on a small set of examples, typically between one and five. As you continue testing, you’ll encounter tricky examples where the prompt or algorithm fails.

You can add these difficult examples to your growing development set, as sketched below. Eventually, manually running every example through the prompt each time you make a change becomes impractical.
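
Concretely, such a development set can start as nothing more than a short list of queries paired with the output you expect. The field names and example queries below are illustrative assumptions, not code from the article:

dev_set = [
    {
        "query": "Which TV can I buy if I'm on a budget?",
        "expected_category": "Televisions and Home Theater Systems",
    },
    {
        "query": "I need a charger for my smartphone.",
        "expected_category": "Smartphones and Accessories",
    },
    # Append each new failure case you discover while testing the prompt.
]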

1.2. Automating Evaluation Metrics

At this stage, you develop metrics to measure performance on your small set of examples, such as average accuracy. An interesting aspect of this process is that if your system is working well enough at any point, you can stop and avoid further steps.

Many deployed applications stop at this stage and perform adequately. However, if your hand-built development set doesn’t instill sufficient confidence in your system’s performance, you may need to collect a randomly sampled set of examples for further tuning.

This set continues to serve as a development or hold-out cross-validation set, as it’s common to keep tuning your prompt against it.
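
A minimal sketch of such an automated metric, assuming the dev_set structure above and a hypothetical classify_query function that wraps your prompt and returns a predicted category:

def evaluate(dev_set, classify_query):
    # Run every dev-set example through the system and report average accuracy.
    correct = 0
    for example in dev_set:
        prediction = classify_query(example["query"])
        if prediction == example["expected_category"]:
            correct += 1
    return correct / len(dev_set)

# Example usage, once classify_query is defined around your prompt:
# accuracy = evaluate(dev_set, classify_query)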

1.3. Scaling Up: From Handful to Larger Test Sets

If you require a higher fidelity estimate of your system’s performance, you might collect and use a hold-out test set that you do not look at while tuning the model.

This step is crucial when your system is achieving 91% accuracy and you aim to reach 92% or 93%. Measuring such small performance differences necessitates a larger set of examples.

To get an unbiased, fair estimate of your system’s performance, you’ll need to go beyond the development set and collect a separate hold-out test set.
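
One simple way to do this, assuming you have gathered a larger pool of labeled examples, is to shuffle the pool once and set aside a slice that you never look at while tuning. The 80/20 split below is an illustrative choice, not a fixed rule:

import random

labeled_examples = list(dev_set)  # stand-in for a larger, randomly sampled pool
random.seed(42)                   # fixed seed so the split is reproducible
random.shuffle(labeled_examples)

split = int(0.8 * len(labeled_examples))
dev_examples = labeled_examples[:split]    # keep tuning the prompt against these
test_examples = labeled_examples[split:]   # touch only for the final, unbiased estimate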

1.4. High-Risk Applications and Rigorous Testing

For many applications of LLMs, there is minimal risk of harm if the model provides a slightly incorrect answer. However, in high-risk applications where there is a risk of bias or harmful outputs, it is crucial to rigorously evaluate your system’s performance before deployment.

In these cases, collecting a comprehensive test set is necessary to ensure the system performs correctly. Conversely, if you’re using the LLM for low-risk tasks, such as summarizing articles for personal use, you can afford to stop early in the process without the expense of collecting larger data sets for evaluation.


2. Case Study: Product Recommendation System

Let’s take a case study in which we will build a product recommendation system based on the user’s input query. We will use the OpenAI Python library to access the OpenAI API. You can install this library with pip like this:

pip install openai

Next, we will import openai and set the OpenAI API key, which is a secret key you can get from the OpenAI website. It is better to set it as an environment variable to keep it safe if you share your code. We will use OpenAI’s GPT-3.5 Turbo model and the Chat Completions endpoint.

import os
import sys
import openai
from dotenv import load_dotenv, find_dotenv

sys.path.append('../..')  # make the shared utils module importable
import utils

_ = load_dotenv(find_dotenv())  # read the local .env file
openai.api_key = os.environ['OPENAI_API_KEY']  # keep the secret key out of the source code

Finally, we will define a helper function to make it easier to use prompts and look at generated outputs. This function, get_completion_from_messages, takes in a list of messages and returns the model’s completion for them.

def get_completion_from_messages(messages, model="gpt-3.5-turbo", temperature=0, max_tokens=500):
    # temperature=0 keeps outputs close to deterministic, which helps when
    # comparing results across prompt versions; max_tokens caps the reply length.
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
    )
    return response.choices[0].message["content"]
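
For example, the helper can be called with a list of role-tagged messages; the system and user text below are placeholders for illustration:

messages = [
    {"role": "system", "content": "You are a helpful product recommendation assistant."},
    {"role": "user", "content": "Which laptop would you recommend for a student on a budget?"},
]
print(get_completion_from_messages(messages))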

We will then use a helper function from the utils module to get the list of products and categories. There is a list of categories and, for each category, a list of products: the Computers and Laptops category contains a list of computers and laptops, the Smartphones and Accessories category contains a list of smartphones and accessories, and so on for the other categories.
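
The helper itself is not shown here, but the data it returns presumably looks roughly like the dictionary below, mapping each category name to a list of product names; the entries are invented placeholders:

products_and_category = {
    "Computers and Laptops": ["UltraBook Pro 15", "Gaming Laptop X"],
    "Smartphones and Accessories": ["SmartPhone Z", "Wireless Charging Pad"],
    # ... one entry per category in the catalog
}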
