This data was originally featured in the November 8, 2023 newsletter found here: https://www.trustinsights.ai/blog/2023/11/inbox-insights-november-8-2023-solving-problems-with-predictive-analytics-testing-ai-models/.
This week, let’s talk about setting up testing of large language models. Testing models is important to determine which models will do what you want them to do. Before we dive headfirst into any new technology, it’s a good idea to know if it’s a good fit for our organization and our desired outcome. So how do we do this? As always, we start with the Trust Insights 5P framework, the best place to start for any new project.
Purpose: the purpose of model testing is to determine whether a model is suitable for a task specific to your organization or not.
People: who’s doing the testing? This particular testing method is suitable for people of any skill level as long as they’re capable of basic digital tasks such as copying and pasting. However, what’s more important is who you are testing on behalf of. If it’s yourselves, that’s easy enough. If it’s on behalf of others in your organization, then that will dictate your testing conditions.
Process: what methodology will you use? Testing models is like QA testing anything else – we want to keep a detailed log of what we’re testing, how we’re testing it, and what the results are. A shared spreadsheet of some kind is useful for the organization to compare results. We’ll also want to develop methods for testing that others can use so that they can test models in their own roles; the test that someone in sales uses should differ from the test that someone in HR uses. We also should ensure we’re using a consistent method for testing; the RACE framework is a good place to start.
Platform: which models you test will be governed largely by which tools you have access to. Almost everyone has access to at least one free large language model, such as Claude 2, GPT-3.5, Bard, Bing, or open-source models like the LLaMa 2 family. For the purposes of testing, you want to use as many as practical and as many as your team has access to.
Performance: which measures will we test? There are tons of scientific measures of large language models, such as the Stanford HELM benchmark, but many of these tests don’t measure real-world results. For any given task, we’d want to track things like:
- Who ran the test?
- When did they run the test?
- What model did they test?
- What category of task was it?
- Was the model able to complete the task?
- How thorough was the model output?
- How correct was the model output?
- What was the prompt used?
- What was the output?
Let’s look at a practical example of this. Suppose we want to use a large language model to write a decline notice for unsolicited pitches on LinkedIn. We’d start with a base prompt that goes like this:
You are a world class executive assistant to TrustInsights.ai Chief Data Scientist Christopher Penn. You have knowledge of executive communications, delegation, administration, executive assistance, administrative assistance. You are tough, firm, and brook neither dissent nor argument. You speak in a professional tone of voice. Your primary directive is to protect the time of your manager, Christopher. Your name is Grace Parker Thompson, executive assistant to Christopher Penn. Reply to this message declining the services offered and instruct them not to contact Christopher again.
We then use one of the many, many, many unsolicited pitches we all get on LinkedIn. The names have been changed to anonymize the original sender.
Subject: PMP training
Wished to check if you would be interested in achieving PMP –Project Management Professional training for your career enhancement.
•4 Day full-time intensive PMP Training. •35 PDUs offered – PMI Approved •100 Percent Money Back Guarantee •1000+ PMP exam prep practice questions. •10+ years highly experienced trainers •Course completion certificate
If you are interested in this course can I have your Email ID / Phone Number ?
Look forward to hearing from you.
This is a consistent prompt and outcome to test. We have a good idea of what we want – a firm but professional decline of the proffered service.
We’ll test using several of the OpenAI family of models first:
And then we can add in a test using an open source model, LLaMa 2:
What we see is that the GPT-4 and LLaMa 2 models performed about the same:
GPT-4-Turbo was slightly more thorough. GPT-3.5-Turbo was a significantly less complete, less thorough response.
With these test results, we can look back at our 5P framework and ascertain that all these models were able to complete the task, but some did better than others. We’d go back now and revise our user stories if we wanted to take into account things like cost – some models are more expensive to operate than others. But the key takeaway is that a clear, complete testing plan based on the 5P framework will let us do apples to apples comparisons of AI and find the right tool for the job.
Need help with your marketing AI and analytics?
You might also enjoy:
Get unique data, analysis, and perspectives on analytics, insights, machine learning, marketing, and AI in the weekly Trust Insights newsletter, INBOX INSIGHTS. Subscribe now for free; new issues every Wednesday!
Want to learn more about data, analytics, and insights? Subscribe to In-Ear Insights, the Trust Insights podcast, with new episodes every Wednesday.