In-Ear Insights: Measuring ChatGPT Performance

In this week’s In-Ear Insights, Katie and Chris discuss how to set up a testing plan for measuring ChatGPT and whether it is working for you or not. From time savings to risk mitigation, think about the different ways large language models could be improving your business, then develop a measurement plan to see if the tools are delivering on their promises.



Machine-Generated Transcript

What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for listening to the episode.

Christopher Penn 0:00

In this week’s In-Ear Insights, I can’t go a day without seeing 100 different headlines and 100 different posts from marketers all in on ChatGPT, which is fine, it’s cool.

Because the tool, when you use it properly, is highly effective, and when you use it improperly, not so much.

But the thing that stuck out to me, Katie, and I want to get your take on this, is everyone is all in on this technology, large language models with an easy chat interface.

But no one’s really talking about sort of the last part of the five P’s, which is: how do we know this thing is working better than what we’ve been doing? The danger is, if you’re unsophisticated in the usage of it, it tends to deliver average content because it is mathematically an average of the content that gets ingested.

And so if your content previously was better than average, and for speed’s sake or for convenience’s sake you started having it do literally all your work, that quality may slip down to average.

And we know from Google’s search quality rating guidelines that one of the most damning sentences in the entire thing for medium quality content is “nothing wrong, but nothing special.”

So why aren’t people thinking about that final P? Why are people not thinking about how do we measure whether this thing is worth it or not? And how would we measure it?

Katie Robbert 1:30

Well, it’s definitely a shiny object at the moment.

And that’s not to say that it’s not going to stick around; it obviously is. It’s like when, you know, email first started, or the internet first started, or, you know, I think we’ve gone through what, Clubhouse and NFTs and Web3, and, you know, all the different fun new tools and technologies that people get super excited about. Like, I’m with you, I am tired of seeing, every single day, everything about ChatGPT. In one of the, you know, communities I saw that has nothing to do with marketing,

somebody was using ChatGPT to, you know, ask it to write like a Wikipedia entry about, like, what the community was about.

I’m like, why can I not escape this thing? You know, a friend of mine, her husband is using ChatGPT like every two seconds to, like, you know, write songs and limericks about the types of work that they do, which has no bearing.

And I asked her, does he know that, you know, it actually costs money to run this thing? Like, not millions of dollars, but does he know that he’s spending money and wasting money?

And so I think that first, in terms of measurement, that’s a question that I ask you a lot, Chris: what does this cost me? Like, how much are we spending on this thing? And so, you know, I think we talked about, you know, true ROI last week, and it’s, you know, expenses and spend and return and all that stuff that goes into it.

And my first question in terms of measuring ChatGPT is, how much time and money are we spending using this thing? And I would want to sort of have at least a ballpark of what that looks like.

So, you know, Chris, you’ve spent a lot of time, you know, developing and fine-tuning prompts, and prompt engineering.

And so that in and of itself takes time, like that’s, you know, your R&D time of, you know, trial and error, but every time you run it through the system, it costs money. You know, not a lot of money, but it still costs money.

And so that would be the first thing that I would look at: what is the time we are investing into this thing? How much time are we spending fine-tuning? And then, of all the things that come out, how much of it are we actually using? You know, so that’s part one.

Part two, then, is all of the things that we created with ChatGPT.

How much of it is making it into production, the content, the summaries, the headlines, the whatever it is, the tweets, how much of that is then making it into production? And of those things, depending on the context, whether it’s for awareness or purchase, or, you know, whatever the phase, what are those things doing? So, at least to me, it’s not as simple as, oh, well, I used ChatGPT.

It saved me five minutes.

Well, great, but did it actually do the thing? Because to your point, Chris, if it’s mediocre, then it’s not doing the thing.

It’s not doing better than you could have done without using the system.
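The two parts Katie describes (what the tool costs in time and money, and how much of its output actually reaches production) can be kept in a very small log. The following is a minimal sketch, not anything Trust Insights uses: the class name, field names, hourly rate, and all figures are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class UsageLog:
    """One period of AI-tool usage. All fields are illustrative assumptions."""
    subscription_usd: float   # hard-dollar cost for the period
    hours_spent: float        # time spent prompting, tuning, and correcting
    hourly_rate_usd: float    # assumed loaded hourly rate of the person doing the work
    outputs_generated: int    # drafts, summaries, headlines the tool produced
    outputs_published: int    # how many of those made it into production

    def total_cost(self) -> float:
        # Hard-dollar cost plus the soft-dollar cost of people's time.
        return self.subscription_usd + self.hours_spent * self.hourly_rate_usd

    def production_rate(self) -> float:
        # Share of generated outputs that were actually used.
        if self.outputs_generated == 0:
            return 0.0
        return self.outputs_published / self.outputs_generated

    def cost_per_published(self) -> float:
        # What each piece that reached production really cost.
        if self.outputs_published == 0:
            return float("inf")
        return self.total_cost() / self.outputs_published


# Hypothetical month: placeholder numbers only.
log = UsageLog(subscription_usd=20.0, hours_spent=6.0, hourly_rate_usd=100.0,
               outputs_generated=40, outputs_published=10)
print(log.total_cost(), log.production_rate(), log.cost_per_published())
```

A log like this only answers the cost side of the question; whether the published pieces did their job still has to come from performance data.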

Christopher Penn 4:38


And this is where, you know, good project governance is important.

For example, we were doing some client work last week on some code.

And that’s an example where I use it heavily, right, because your code has to be functional, it has to run.

But there is no special skill advantage for writing 80% of the code.

I was having it write user interface components for a Shiny web app.

And it worked great.

It saved me easily a couple of hours by just generating the pieces. The couple of things that it got wrong, I was able to fix immediately because I know how to code that particular app; I just didn’t feel like doing the manual typing out of all the pieces.

So that was an example of where there was a very strong time savings.

The service level we’re at is $20 a month.

So it is a $20 a month expense.

So it’s a very nominal hard-dollar cost. It’s the soft-dollar cost, like you said, that can cost most organizations a decent amount of money, in particular if you’re making, like, song limericks instead of doing actual work.

For things like YouTube summaries, we use it extensively, right? We use it for YouTube summaries, which do not go into production.

We do use it for some transcript cleanup here and there, but not consistently.

The thing that stood out to me when we’re talking about this stuff is that we’re not, and this is just industry-wide, we’re not really looking at even a simple A/B test, right? Here is content created by the service, here’s content created by us: which content performs better?

And you’d want to do this, I would think, in a very straightforward manner: if you have 10 pieces of content you’re publishing each week, five of them should be machine generated, five of them should be human generated.

And then, after a month’s time, compare the numbers.

Let’s see, okay, what did better? Ideally, they’re all, you know, on or about a related set of topics, so that you can measure the performance of the machine’s skills within that topic.

And if we were doing this for a client, we would probably do it with something like a binary classification model where the machine’s writing is the treatment.

And it’s, you know, it’s straight out of biostatistics.

The treatment is applied to one group, and you have a control group.

And you see, okay, what is the lift? Because when we post content, just like anything in marketing, you’re always gonna get some expected baseline of performance.

So the question is, is there a lift above and beyond what you would have gotten anyway, had you just posted content normally?

So that’s, that’s the, the type of testing structure I would like to see that I don’t see folks using.
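The treatment-versus-control idea can be written down in a few lines. This is a minimal sketch under stated assumptions: lift is taken here as the relative difference in mean performance between the machine-written (treatment) group and the human-written (control) group, the function name is made up for illustration, and the numbers are placeholders, not real results.

```python
from statistics import mean


def lift(control: list[float], treatment: list[float]) -> float:
    """Relative lift of the treatment group over the control baseline."""
    baseline = mean(control)
    if baseline == 0:
        raise ValueError("control baseline is zero; relative lift is undefined")
    return (mean(treatment) - baseline) / baseline


# Hypothetical sessions per post; placeholders only.
human_posts = [100, 120, 110, 90, 80]     # control group
machine_posts = [130, 110, 120, 100, 90]  # treatment group
print(f"lift: {lift(human_posts, machine_posts):+.1%}")
```

A positive value means the machine-written group beat the baseline, a negative value means it fell short. With only a handful of posts per group, treat the number as directional rather than conclusive.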

Katie Robbert 7:27

Well, and I don’t want to go down a, you know, rabbit hole.

But I do want to go back to something you said, Chris, when you were talking about the time savings of ChatGPT, and you were using it to help you write code. You know, had you not been aware of the issues, it actually could have created more work for you and wasted more of your time, because I would wager that a lot of people are not fact checking the information that comes out of a system like ChatGPT.

That’s not to say that, you know, the system is going out of its way to get it wrong.

But it’s not perfect.

It’s just like us, we get things wrong.

And the system is using information generated by humans.

So the information is likely to be wrong from time to time.

And so, you know, if you’re relying so heavily on a system like ChatGPT, or any, you know, content creation system like that,

you run the risk of actually creating more work for yourself and spending more money and wasting more time.

And so in your case, you said it saved you a couple of hours, but had you not known where it had gotten the code wrong, it actually could have created another few days of work for you trying to troubleshoot this code.

And so, you know, that, again, sort of goes back to the performance piece of, like, is it truly time savings?

And so now you’re talking about A/B testing to see whether or not the content created by ChatGPT is better. I think that that’s an excellent way to test the performance and the effectiveness.

You know, I would imagine if you’re using it to write code: is the ChatGPT code better than the code created by a human? You know, does it run, does it even run? I think that these things are excellent ways to test the performance before you say, okay, we’re going all in.

We’re going to lay off our whole writing staff and, you know, everything’s going to be written by this computer.

You have to be able to measure the performance first, because your tone, your personality, the way in which you communicate things is going to take a long time for this machine to learn, because you have inherently been you for your whole life, and this machine has known you for all of, you know, well, Chris, for you, a couple of years, but that’s still not a long time.

To be able to say I can write exactly the same way that Chris Penn would write.

Christopher Penn 10:05

And here’s the thing about the software.

This is something I talk about a lot in relation to it.

It is best used by subject matter experts.

If you’re going to use it for generation, it is best used by subject matter experts. It should not be used by the intern, right, or the lowest paid member on your team. I’m thinking back to our agency days, when all the grunt work was shunted off to the account coordinators, the junior-most people with the lowest billing rates.

And they were good kids, they were nice people.

They had no professional experience, they were literally right off the boat from college, you know, and having them say, okay, now you’re gonna go write five news releases on, you know, SaaS-based firewall appliances.

Of course, it took a lot of time for the back and forth with the client, because you had somebody who had no idea what they were doing trying to get smart and write about a topic very quickly. The machine is in many ways very similar to that.

It has a wider body of knowledge and better writing skills than about half of the team members we knew, but it is still not going to include the pieces of subject matter expertise, you would need to use the software well.

And so, you know, that has been something that people don’t want to hear: that it is best used by subject matter experts.

When you look at the ChatGPT statistics from OpenAI,

45% of its usage is net new generation, whereas the things that it’s really good at, like summarization and rewriting, are only about 10%, even though that’s the best possible use of the tool.

Taking, for example, the words that we are saying on this podcast, we’ve got all sorts of weird language things: we say “you know,” or “this,” or “right.”

And that’s stuff that doesn’t need to be in the final copy. It’s really good at taking that stuff out

but preserving our unique point of view and words. It doesn’t have to learn our tone of voice, because it’s just rewriting the content that’s already there to improve it grammatically and structurally.

And yet 45% of people are just using it for net new stuff.

And again, I would strongly caution that, you know, if you’re going to use it for generation, use it with a subject matter expert. Otherwise, like you said, you could very well be handing out very plausible-sounding, completely wrong information.

Katie Robbert 12:36

Well, it’s like my husband likes to say, if you say anything with authority, people will believe you,

Christopher Penn 12:41

that’s 100% the truth.

And to be fair, he

Katie Robbert 12:45

doesn’t say anything that’s like, you know, damaging or dangerous.

It’s more of, you know, just telling people the wrong information about like, what they’re supposed to be doing that they it’s their job, or just sort of messing with them.

But, you know, it can be a dangerous thing.

I mean, I think it might have been last week or the week before; the days all kind of blur together.

But, you know, the rise of misinformation being shared across news outlets, because people are using ChatGPT.

And they’re not fact checking, and they’re just using it for new content generation to try to get, you know, hundreds of thousands of pieces of content out so that they’re basically saturating the internet with their brand, their name, their byline.

And that is definitely, you know, a huge concern.

And so when you’re talking about measuring performance, you know, you’re talking about the time savings, the cost investment, the A/B testing performance. But then you’re also talking about things like brand reputation. You know, is the content, or whatever you’re using ChatGPT to generate, helping or hurting your brand because of the type of information that it’s putting out? Are you spending the time to make sure that it’s correct and accurate and aligns with, you know, your vision and your mission and your values? Because, Chris, to your point, if left unchecked, and if, you know, being used by the wrong resource in your company, that could be hugely problematic, because maybe the person who’s using it is not the subject matter expert.

And so they don’t understand the nuance of, you know, some of the phrasing, or, you know, maybe they don’t understand the history that you maybe had with a certain client, so you can’t name them, or whatever the situation is.

And so, these things have to have human intervention.

And that has to be part of the performance: you know, even before it goes out to production, like, did ChatGPT even get it right?

And if not, then why are you using it because then you’re just wasting time?

Christopher Penn 14:53


So here’s a fun example on that topic.

I told it, write a description for a ribeye steak cut sold at Whole Foods,

and it comes up with this thing.

So, you know, “We work with our trusted farmers to ensure our beef is sustainably raised, ethically sourced, so you can feel good.”

I don’t know if that’s actually true or not.

I would ask you to out your husband.

Katie Robbert 15:13

Fair, like he has no control over that.

But he does work at Whole Foods.

And what I do know is that since being bought out by Amazon, the standards, the food standards have changed.

They aren’t what they were before in terms of, you only get, you know, food sourced from, like, 20 miles in any direction from the individual store.

That’s just not how it works anymore.

So there’s definitely some inaccuracies in here.

Christopher Penn 15:41


But I don’t know that.

And, you know, I’m not a dummy.

Like I’m a reasonably educated consumer.

This sounds pretty decent.

I mean, we could talk about, you know, marbling scores and stuff like that for a cut of steak.

But again, would this copy sell? Probably. You know, would it be convincing to somebody like me as a layperson? Probably. Is it right? No.

And that’s a part of the measurement, right? It’s either just general reputational damage, which is one part, or you’re getting sued, which is just an entirely different can of worms

we don’t have time to go into today.

But that’s part of your measurement strategy: if you’re going to use this tool, like any tool, have you done the risk mitigation? Do you have risk mitigation practices in place, so that you can take those factors out of the cost? Because otherwise, you know, that million dollar lawsuit is going to really impact your ROI.

Katie Robbert 16:47

It absolutely would. I would not be a fan of getting slapped with one of those.

And that’s not to say that humans won’t make the same mistake and put out the same wrong information.

When you introduce a piece of software like ChatGPT, you’re introducing yet another checkpoint, another layer of potential risks with misinformation.

And so, you know, internally, your process should be such that you’re able to have time to do that quality assurance check on any piece of content, do the fact check. And ChatGPT may be able to write a first draft for you.

But if nobody’s checking the information, then let’s say, Chris, let’s just play with the scenario that, you know, I asked you to, you know, write up something about ribeye steak.

So you use ChatGPT, and I don’t bother to check it, and it just goes out.

And customers start going up to the meat counter going, I saw your piece on ribeye steaks, I want that thing, then the people who are standing behind the counter are going to have to say, well, we don’t have that that’s not a thing.

That’s not true.

Or, you know, whatever the information is that needs to be corrected.

Now you have a bunch of angry customers going well, you lied to me, it was a bait and switch, you put out this information to tell me that I could get this thing and now I’m hearing you’re telling me I can’t get it? Well, now I’m going to go online.

And I’m going to tell people how you lied to me, regardless of the fact that like, it was an honest mistake, and nobody did the digging to fact check it.

Now you have a horde of angry customers who are going to ruin your reputation on Facebook groups, and Yelp reviews, and Google My Business, and, you know, stand outside and picket the Whole Foods because they didn’t get the marbled ribeye that they were promised in this one innocent little piece of content. It can have a far-reaching, you know, domino effect if you’re not doing your due diligence.

And this all again goes into performance.

And so, you know, we’re sort of getting a little bit ranty and, like, off topic, but if we bring it back to performance, and how are we measuring the effectiveness, I would ask people who are using ChatGPT to generate new content to kind of keep track of how many things the system gets wrong, how many things you have to correct: not just editing in terms of grammar and tone, that kind of thing, but the actual data and the facts that it is getting wrong.

Because that to me says this is not the right system for you to be using.

If it’s consistently getting information wrong about the type of business that you’re running,
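Keeping track of how many facts the system gets wrong can be as simple as a running tally per reviewed piece. The sketch below assumes a reviewer records, for each draft, how many factual claims were checked and how many had to be corrected; the function name and the example counts are hypothetical.

```python
def factual_error_rate(reviews: list[tuple[int, int]]) -> float:
    """Share of checked factual claims that had to be corrected.

    Each review is (claims_checked, claims_wrong). Grammar and tone edits
    are deliberately left out; only factual corrections count.
    """
    checked = sum(c for c, _ in reviews)
    wrong = sum(w for _, w in reviews)
    if checked == 0:
        return 0.0
    return wrong / checked


# Hypothetical review tallies for three drafts; placeholders only.
reviews = [(10, 2), (8, 0), (12, 4)]
print(f"factual error rate: {factual_error_rate(reviews):.0%}")
```

What counts as too high is a judgment call for each business; the point of the tally is to make the trend visible instead of relying on impressions.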

Christopher Penn 19:22

right, or you need to spend a whole bunch of time learning prompt optimization, which is a whole sub discipline of using it.


Katie Robbert 19:30

Again, the trade-off, like, that’s the time it’s taking.

So then just write it yourself.

Christopher Penn 19:37

There is the whole write-it-yourself thing, but I would say that the advantage of the tool is, it’s kind of like writing code: once you’re done with the code, then it requires relatively minimal tweaking to continue to have it run.

So that is a sustainable advantage of any large language model: once you’ve got the code in place, then it’s just running that code over and over again, with little variations from time to time.

The other thing that, again, people haven’t given enough thought to is, when you’re doing the prompt creation, are you using measurements to inform that? So, for example, if you’re a more advanced user, and you’re fine-tuning the model for your business, are you feeding it all your content? Are you feeding it just your best content? Right, because there’s the piece that the intern wrote.

And then there’s the piece that your co-founder or your CEO wrote. Which do you want to inform the model, or inform the prompts that you feed the model to generate? If you are asking it to rewrite a piece of content that maybe was from an interview, do you want to train it on the interview content itself? Do you want to refine the entire interview? Or do you want to refine just certain parts, because certain parts may or may not need it? And that has to come from measurement, right?

So if you are looking at top performing content, you should be looking at the metrics around that to say, these are the things that we want to use to guide the model to begin with, so that it doesn’t generate mediocre content.

Because again, without enough specificity, it’s just going to spit out the mathematical average of what it knows about a given topic.

You have to be incredibly specific and use very specific jargon.

It’s like SEO in a lot of ways.

It’s like doing keyword optimization for SEO, except it’s prompt optimization.

So what keywords are you going to use to have the language model trigger the proper associations to generate high quality content?

And this, again, is where your research and your data has to inform it. If you are trying to rank for marketing consulting, semantically, what are the highest rated or top performing terms that surround marketing consulting, from a language perspective, in your SEO tool? Those should also be in that prompt, so that it generates language that is appropriate to create that high performing content. Not just “write a blog post about marketing consulting,” right? You’re gonna get average content. Instead: “write a blog post about the application of Porter’s Five Forces to a marketing consulting perspective on SaaS-based software companies with revenues above $500 million.”

Right? That would be a much better prompt, if you were to look at the semantics around it.
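The difference between the generic prompt and the specific one can be made repeatable by assembling the prompt from research inputs. This is only a sketch: the function, its parameters, and the related terms are hypothetical stand-ins for whatever an SEO tool actually reports, and no particular model or API is assumed.

```python
def build_prompt(task: str, angle: str, topic: str, audience: str,
                 related_terms: list[str]) -> str:
    """Compose a specific prompt from research-backed pieces."""
    prompt = f"{task} about the application of {angle} to {topic}, focused on {audience}."
    if related_terms:
        # Terms would come from keyword research; these are supplied by the caller.
        prompt += " Use these related terms where they fit naturally: "
        prompt += ", ".join(related_terms) + "."
    return prompt


# Placeholder inputs modeled on the example in the conversation.
prompt = build_prompt(
    task="Write a blog post",
    angle="Porter's Five Forces",
    topic="marketing consulting",
    audience="SaaS-based software companies with revenues above $500 million",
    related_terms=["competitive analysis", "marketing strategy"],  # hypothetical
)
print(prompt)
```

Keeping the pieces separate also makes it easier to record which prompt variations were used for which published content when measuring results later.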

Again, I don’t see that happening.

Katie Robbert 22:28

But then you still have to measure the effectiveness of that content: did it do exactly what you wanted it to do?

And so when we talk about the performance, the fifth P in the five P structure, you’re right, Chris, I don’t know that companies have gotten to the point where they’re, well, let me take a step back.

ChatGPT is still very much a shiny object to a lot of companies, to a lot of people. They’re still just trying to figure out what the heck it does, how reliable it is, where it fits into their marketing.

And this has been the conversation about AI: will AI take my job? A lot of people are nervous that ChatGPT will take their job, especially if their sole purpose at the company is to write content, write headlines, you know, whatever that piece is.

And so if you’re concerned that ChatGPT is doing a better job than you, then this is where you start to do that A/B test to really measure the performance.

This is where you start to learn really good prompt engineering, so that you’re telling ChatGPT what to write, and then you can take it to the finish line, and then measure the effectiveness of that piece of content.

And so in terms of measurement, keep track of the time that you’re investing into the system in terms of learning it, tweaking it, and how many things it’s getting wrong that you’re having to correct. Do some A/B testing between human generated content and AI generated content. And then, you know, just start to understand: is it enhancing your brand? Is it damaging your brand? All of those things, you know, go into understanding the performance of introducing a tool like this into your ecosystem.

Christopher Penn 24:10


And the A/B test, to be real clear, is not a complicated thing.

Gather the URLs from your content, go into Google Analytics or your web analytics software, get the sessions for each piece of content, and then go in and tag, you know, this was AI, this was not AI. Just go down the list of pieces of content.

You want it to be at about a 50/50 split.

So you want half the content to be purely human generated, half the content to be AI generated, and then sort your table by AI generated or not and take the median number of sessions for each. And if they’re comparable, great.

If one is, say, more than one standard deviation away from the other, then you know, okay, that one clearly has a statistical advantage over the other. And you can look at, you know, obviously things like time per page to see if the content was engaging and kept people on it, you can look at things like bounce rate for that page, you can look at things like whether it participated in a conversion within that session. Those will all be additional metrics.

But the basic A/B test is: did you even get people to read the content, right? Do the machine’s headlines and body copy outperform the human generated? And I would encourage you to do this for 30 days, right? Try and publish new pieces of content every day, because the machine’s good at writing content, and try and get 10 of each: 10 machine made, 10 human made.

And then, at the end of 30 days, look at your numbers and say, is it, not qualitatively but quantitatively, better? If it is, then every quarter rerun that test to make sure that its quality is not changing, and that you also have not lost the ability to write.
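The test described here (tag each URL as AI or not, take the median sessions per group, and see whether the gap exceeds one standard deviation) fits in a short script. This is a sketch under assumptions: the conversation does not say which standard deviation to use, so this version uses the sample standard deviation of all sessions pooled together; the function name, URLs, and numbers are hypothetical, and the rows would normally come from a web analytics export.

```python
from statistics import median, stdev


def compare_sessions(rows: list[tuple[str, int, bool]]) -> dict:
    """rows: (url, sessions, is_ai) for each published piece of content."""
    ai = [sessions for _url, sessions, is_ai in rows if is_ai]
    human = [sessions for _url, sessions, is_ai in rows if not is_ai]
    if len(ai) < 2 or len(human) < 2:
        raise ValueError("need at least two pieces of content in each group")

    spread = stdev(ai + human)          # assumption: pooled sample std dev
    gap = median(ai) - median(human)    # positive means AI content led

    if abs(gap) <= spread:
        verdict = "comparable"
    elif gap > 0:
        verdict = "ai"
    else:
        verdict = "human"
    return {"ai_median": median(ai), "human_median": median(human),
            "stdev": spread, "verdict": verdict}


# Placeholder data; real values would come from your analytics software.
rows = [
    ("/post-1", 100, True), ("/post-2", 110, True), ("/post-3", 120, True),
    ("/post-4", 300, False), ("/post-5", 310, False), ("/post-6", 320, False),
]
print(compare_sessions(rows))
```

The median is used rather than the mean so that one viral post does not decide the outcome. With 10 pieces per group this is a rough screen, not a formal significance test.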

Katie Robbert 26:09

All right, well, sounds like some people gotta get their act together and start to put together a measurement plan for their shiny new object of ChatGPT.

Christopher Penn 26:18

It doesn’t, you know, almost all of our content is human generated.


I’m thinking it might be time for us to do the reverse and say, okay, well, we have our list of blog posts we’ve done human versions of; maybe, you know, for the next 30 days, we’ll try putting up ChatGPT versions. We’ll label them, you know, ChatGPT generated.

But to see how they perform, you know, well, I think that’d be kind of a fun test.

Because we know we’ve got human based content that did well, we can we can set up some social shares and share each post.

You know, same day two posts get shared once human and machine generated.

See, see what happens.

I think that’d be a fun experiment to try.

Katie Robbert 26:59

You got a cloning machine back there, too.

Christopher Penn 27:01

I’m not good yet.

But well, hey, come on, shut up.

It’s gonna save us all the time.

So we fine.

Katie Robbert 27:09

Okay, you let me know how that goes.

Christopher Penn 27:11

I will, what day of the week to let me put these up.

But no, I think I think that would be a good live example of like, here’s how we test this thing out.

Katie Robbert 27:21

Alright, I will add that to our list of upcoming live stream.

So for those of you listening, stay tuned for that our live stream is every Thursday at 1pm Eastern on our YouTube channel.


Christopher Penn 27:35

All right.

Well, I think that’s about it for this week.

If you’ve got some ChatGPT measurement stories or expertise that you would like to share,

pop on over to our free Slack group. Go to trust for marketers, where you and 3,000 other human marketers are asking and answering each other’s questions every single day.

And wherever it is you watch or listen to the show if there’s a platform that you would rather have it on instead, you can find at trust podcast and while you’re there, please leave us a rating or review.

It does help share the show.

Thanks for tuning in.

And we will talk to you soon

