
So What? Q3 2023 Generative AI Bake-off

So What? Marketing Analytics and Insights Live airs every Thursday at 1 pm EST.

You can watch on YouTube Live. Be sure to subscribe and follow so you never miss an episode!

In this week’s episode of So What? we focus on generative AI in Q3. We walk through what’s new with generative AI in Q3, how to think about using large language models in your marketing and what changes to expect over the next quarter. Catch the replay here:

So What? Q3 2023 Generative AI Bake-off


In this episode you’ll learn: 

  • What’s new with generative AI in Q3
  • How to think about using large language models in your marketing
  • What changes to expect over the next quarter

Upcoming Episodes:

  • TBD

Have a question or topic you’d like to see us cover? Reach out here:

AI-Generated Transcript:

Speaker 1 0:48
Well, hey, how are you, everyone? Welcome to So What?, the Marketing Analytics and Insights Live show. I’m Katie, joined by Chris. How’s it going, sir?

Christopher Penn 0:56
It is going fantastic. How about you?

Speaker 1 0:59
It is really beautiful out today. And I don’t know, for me that always helps things. John is on vacation down the Cape. I think he’s hunting narwhals. And he is just literally running circles. He’s been training for months, and he is just running and running and running until he comes back from vacation. In his absence, on today’s episode of So What? we are covering the Q3 2023 generative AI bake-off. We’ve done this previously in Q1, and we’ve done it also in Q2. The reason we want to repeat it once a quarter is because generative AI, unless we’re living under a rock, has been changing so rapidly that a week, a month, and certainly a quarter later, the functionality and features of each platform that we review have changed drastically. So the use cases also need to be updated. So today, we’ll be talking about what’s new with generative AI in Q3, how to think about using large language models in your marketing specifically, and what changes we can maybe expect and anticipate over the next quarter. So Chris, where would you like to start today?

Christopher Penn 2:14
Well, let’s talk about how we do these bake-offs, because Katie, you are going to be the judge of the bake-off. Okay. So here’s what we do. Yeah, we can do a hammerhead and pestle or we can do a spreadsheet, that’s fine. We have five different language models. Last time we did Bing, Bard, ChatGPT and GPT-4. This time we’re doing Bing, Bard, ChatGPT, Claude from Anthropic, and LM Studio using the MythoMax model.

Speaker 1 2:41
And so, why are we changing the tools?

Christopher Penn 2:44
We’re changing the tools because as newer and better models become available, we want to see how they compare on the same set of tasks. What has not changed is the set of tasks. We’re going to be repeating the exact tasks that we did the last couple of times to see how this selection of models works.

Speaker 1 3:00
Gotcha. All right, I have my trusty pad and pencil to do my scoring with.

Christopher Penn 3:05
Alright, and here’s how it’s gonna work. We have 12 tasks broken up across six categories that we’re gonna be testing each model with. We’ll kick off all the things and then we will evaluate the responses. Katie, here’s how the scoring works. You’ll give a model two points if it did the task well: it quickly returned factually true results, complete results, or an expected output. You will score one point if it did the task, but it fell short. And zero points if it refused to do the task, was incapable of doing the task, or generated something totally useless.
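Editor’s sketch: the scoring scheme described here (zero, one, or two points per task, 12 tasks in total) can be tallied with a few lines of Python. This is a minimal illustration, not the spreadsheet used on the show; the names `Score`, `max_total`, and `leaderboard` are hypothetical, and any model names passed in are placeholders.

```python
from enum import IntEnum


class Score(IntEnum):
    """Points per task, following the rubric described in the episode."""
    FAILED = 0      # refused, was incapable, or produced something useless
    FELL_SHORT = 1  # did the task, but fell short
    DID_WELL = 2    # quick, factually true, complete, or expected output


def max_total(num_tasks: int) -> int:
    """Highest total a model can reach, e.g. 12 tasks * 2 points = 24."""
    return num_tasks * int(Score.DID_WELL)


def leaderboard(scores):
    """Sum each model's per-task scores and sort from highest to lowest."""
    totals = {model: int(sum(task_scores)) for model, task_scores in scores.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Calling `leaderboard` with a dictionary that maps each model name to its list of per-task scores returns the ranked totals, which can then be compared against `max_total(12)`.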

Speaker 1 3:42
Got it. And remind me again of what tools we’re looking at.

Christopher Penn 3:46
So we’re looking at Bing, Bard, ChatGPT, Claude from Anthropic, and LM Studio.

Speaker 3 3:54
Bing, Bard, GPT, Claude. So there’s four. Five, what was... okay, Bing.

Speaker 1 4:07
Bing, Bard, ChatGPT, Claude, and LM Studio. LM Studio, got it. Okay, I am ready.

Christopher Penn 4:19
All right, we’re gonna kick off with our first one. As you probably know, the six categories are generation, summarization, extraction, rewriting, classification, and question answering. So the first thing we’re going to do is ask a generation task of each of these models, and the generation task we’re asking is: write an outline for a blog post about the future of content marketing in 2024. So let’s see how we’re doing here.

Speaker 1 4:52
And you’re using the same prompt across all systems?

Christopher Penn 4:57
Exactly, across all five systems we’re using the same prompt.

Speaker 1 5:00
So it looks like a fairly basic prompt. And is that purposeful?

Christopher Penn 5:04
That is purposeful. We’re gonna have some more complex prompts, but we’re trying to simulate what perhaps the non-skilled person would use for some of these tasks, because generation is probably the most common task that people use these tools for. Okay. All right. So let’s see how we’re doing. It looks like LM Studio is done with its first run. So here’s its output of an outline for a blog post about the future of content marketing. We see increased focus on personalization, interactive content formats, social responsibility to content strategies to change your content marketers and conclusions. Katie, how did LM Studio do: zero, one, or two points?

Katie Robbert 5:47
I mean, I would say that that’s a solid outline, because you can get into things like tools and processes within the content itself. So I think that I think that’s a solid outline. I think that’s a good baseline.

Christopher Penn 6:00
Okay, so two points for LM Studio on that task.

Speaker 1 6:04
Yeah, I think so. Nothing struck me as factually incorrect.

Christopher Penn 6:09
Okay, let’s take a look at Bing’s answer. Here’s the introduction, current state, interactive and immersive content, personalization, segmentation, shift from quantity to quality, social media and content marketing become integrated, new platforms, and a conclusion.

Speaker 1 6:29
I would also give that one a two. I mean, I would say that again, that’s solid. There’s nothing wrong with it. So that one was Claude?

Speaker 3 6:36
That was Bing, that was Bing. Oh, my goodness. Okay.

Christopher Penn 6:40
Let’s go to Bard. So this is interesting. Bing did not return an actual outline; LM Studio and Bard have returned actual outlines. What is content marketing, why is it important, AI and machine learning, visual content, video marketing, personalized content, data analytics, a holistic approach to content marketing, and a conclusion. And then it comes up with some specific trends, like visual, explaining those in more detail.

Katie Robbert 7:06
So interesting. Now that I’m seeing Bard compared to Bing and LM Studio, I might actually drop both LM Studio and Bing down to one point and give this one two points.

Unknown Speaker 7:19
Okay. Okay.

Speaker 1 7:22
Because what’s interesting, so this one, and this is hard, it’s hard to score them without sort of knowing all the pieces. But this is the first one that’s actually mentioned, the rise of artificial intelligence and machine learning,

Katie Robbert 7:36
versus the standard, you know, quantity, quality over quantity, so on so forth. Like that’s, now that I’m looking at it in comparison, that’s fairly generic advice, right?

Christopher Penn 7:49
Okay. ChatGPT goes in. This is using the GPT-4 model. AI-powered content creation, voice search and smart assistants, video content, personalization at scale, purpose-driven content (that’s a new one), interactive and immersive content, user-generated content, sustainability and green marketing, long-form content, data analytics, community building, and ephemeral content. So ChatGPT came up with 12 different things for its outline.

Speaker 1 8:17
That’s very thorough. I’m gonna give that a three, which I know wasn’t on your screen. But you know, it’s done well, and it’s also above and beyond, because you could break that down into multiple pieces of content.

Christopher Penn 8:32
Exactly. And finishing up here with Claude: the future of content marketing, personalized targeted content, interactive and immersive content, voice search, conversational content optimization, helping versus selling, AI-generated automated content rises, short-form content, and micro content.

Unknown Speaker 8:49
So again, that’s a solid outline. I’d give that a two.

Christopher Penn 8:57
Yeah, I would give that a two based on the way you were scoring it. ChatGPT and Bard definitely came up with good stuff.

Speaker 1 9:03
You know, one of the things that we know is that generation is one of the things that these tools do. It’s literally called generative AI. And so if you’re asking for an outline, unless it’s giving you completely incorrect information, it’s going to do a pretty good job. I didn’t see any of these tools fall short on the task itself. Some of them were more thorough than others, but they all performed the task.

Christopher Penn 9:31
Yeah, I would agree with that. I think that’s a good way to summarize it. Okay. All right. Ready for the next one?

Unknown Speaker 9:41
I am ready.

Christopher Penn 9:42
This is our second generative task. We have asked it to write actually, I think I did that wrong. I copied the prompt incorrectly. Let me start over.

Speaker 1 9:55
I believe this is what you call an error between the

Christopher Penn 10:00
keyboard and the chair.

Unknown Speaker 10:02
Thank you. So, what is the task that you will be asking?

Christopher Penn 10:06
I am asking it to write a list of recommendations for preventing COVID. And specifically, for this one, we are looking for three items that should be included, three items that are factually true that we want to see them talk about. The three items that I am looking for are masks, vaccination, and ventilation. So I would count a response as incomplete if it does not touch on all three categories. So let’s see. For LM Studio, we have: wear masks, social distancing, avoid large gatherings, wash your hands, cover your nose and mouth, avoid touching your face, clean and disinfect touched surfaces, stay home if you’re feeling unwell, follow quarantine guidelines if you’ve been in contact with someone, and get vaccinated. So we have masks, we have vaccination, we do not have ventilation here.

Speaker 1 10:59
Okay, so do you want to call that a zero because it didn’t hit all three?

Christopher Penn 11:04
I call it a one because it got two out of three. Okay,
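Editor’s sketch: the rule applied here (all three required topics earns two points, a partial list earns one) can be expressed as a small checklist scorer. This is a minimal illustration under stated assumptions: plain keyword matching stands in for the human judging done on the show, the keyword stems are hypothetical, and scoring zero when no topic is covered is an assumption the episode does not state.

```python
# Required topics named in the episode, each with illustrative keyword stems.
REQUIRED_TOPICS = {
    "masks": ("mask",),
    "vaccination": ("vaccin",),
    "ventilation": ("ventilat",),
}


def covered_topics(response: str) -> list:
    """Return the required topics that the response mentions at least once."""
    text = response.lower()
    return [
        topic
        for topic, stems in REQUIRED_TOPICS.items()
        if any(stem in text for stem in stems)
    ]


def score_checklist(response: str) -> int:
    """2 = all topics covered, 1 = some covered, 0 = none (assumed)."""
    covered = covered_topics(response)
    if len(covered) == len(REQUIRED_TOPICS):
        return 2
    return 1 if covered else 0
```

A keyword check like this only flags whether a topic is mentioned at all; it does not check that the advice is correct, so a human review of the response is still needed.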

Unknown Speaker 11:08
And this one is what?

Christopher Penn 11:10
This is LM Studio.

Speaker 1 11:11
LM Studio. You know, as we’re going through this, I’ll be honest, I don’t think I ever see ventilation on any list. I’m not gonna say it’s a trick question, because obviously it’s an important part of preventing COVID. But I personally never see it included on a list. I see these items: I see vaccination, I see wearing a mask. I never see ventilation.

Christopher Penn 11:37
Yep. So I guess we’ll see if any one of these machines can come up with a good answer. Let’s take a look at Bing. So this is now Microsoft Bing. Here’s a list of recommendations: stay up to date with your vaccines, improve ventilation and spend time outdoors, wear a mask, get tested, follow isolation, seek treatment if you have it, avoid contact with others. So this hits all three points. Alright,

Unknown Speaker 12:00
so Bing gets a two.

Christopher Penn 12:03
Yep. Okay. Did a good job. Let’s check in on Bard. Bard says get vaccinated, wear a mask, keep your distance, avoid touching your nose and mouth. So it looks like Bard has two of three.

Speaker 1 12:15
Okay, so Bard will get a one. Yep. Okay.

Christopher Penn 12:19
Let’s check in on ChatGPT. So ChatGPT: vaccination, face masks, hand hygiene, sanitizer, avoid crowded places. Ventilation is number 12 on the list: ensure good ventilation in indoor spaces. Face shields, eye protection, etc. So ChatGPT, full marks. Alright, we have some comments from folks. On the previous one, Erica says she likes that it gave actual outlines instead of drafts, which is true, that’s what we asked for in the prompt itself. Shall says ventilation is one of the key things AIHA and CDC have in the back-to-work guidelines. And Glenn says, if it’s learning medical data, wouldn’t ventilation be included? Yes, absolutely. So these things should know that. So let’s look at Claude now. Claude says get vaccinated, wear a mask, improve ventilation if gathering indoors, practice good hygiene, stay home, and so on. So Claude gets full marks as well.

Speaker 1 13:11
So interesting. So Bard and LM Studio missed ventilation. Everyone else got it. Correct? That’s right. And I think one of the pro tips here is, if you are vetting different generative AI systems, it’s okay to ask questions that you already know the answer to, because that’s going to give you a better sense of the kinds of answers you can expect back. So in this one, Chris, you knew exactly what you were looking for. And now you can start to narrow down the field of systems that you would use based on whether they are giving you the answers you want.

Christopher Penn 13:49
And particularly for things that are maybe higher risk or more mission critical, you absolutely want to be using some fact-checking questions to quiz these things, to say, like, do you know the thing? Right? Okay,

Unknown Speaker 14:06
what’s the next task?

Christopher Penn 14:08
The next task we have moved on to is a task called extraction. We’re moving on from generation to extraction. I’ve asked the tools to identify the company name and job title from this job listing. I have this URL from Virgin Media. The correct answer is, obviously, Virgin Media is the company name and the job title is Senior PMO Analyst. So Bing successfully identified both the company name and the job title, so full marks for Bing. Great. Bard says Virgin Media O2 and the job title is Senior PMO Analyst, so full marks there. Okay. ChatGPT says, I can’t access external URLs, but from the URL information it correctly inferred both things. So full marks for ChatGPT for guessing correctly.

Christopher Penn 15:00
Claude says Virgin Media O2 and the job title is Senior PMO Analyst.

Unknown Speaker 15:05
Straight to the point, no fuss, no muss.

Christopher Penn 15:07
Exactly. And LM Studio: the company is Virgin Media O2 and the job title is Senior PMO Analyst, so full marks across the board.

Speaker 1 15:14
Yeah. And it’s, I feel like that’s one of the underutilized use cases for sure.

Christopher Penn 15:23
Okay, our next task is going to be a bit of a trick task, because the tools can’t all do exactly the same thing. I have data from a PDF that I want these tools to extract. Now, LM Studio can’t do that. So what I have done, to allow some apples to apples, is extract the PDF data itself and put it into this. The only tool of this lot that can do PDFs natively is Claude; the rest will have to use the copy-pasted version. So let’s kick off LM Studio. Let’s go to Claude now. I’m going to just give it the prompt of: extract data from the following PDF. And we’re going to attach the PDF from our bake-off folder here. This is a PDF that contains a number of text tables in it. Let’s go ahead and start a new one in Bing. And we’re going to do the new task. When you’re doing these kinds of bake-offs, it’s really important to always start with a clean session so that you don’t have any previous data left over. Okay, so ChatGPT said, I can’t even process that. So ChatGPT gets a zero.
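Editor’s sketch: the two habits described so far, sending the identical prompt to every model and starting each task in a clean session, can be captured in a tiny harness. This is an illustration only. The show used each tool’s own chat interface by hand; here each model is represented by a hypothetical factory function that returns a fresh `ask(prompt)` callable, and no real vendor API is shown or implied.

```python
from typing import Callable, Dict

# A session is anything you can send a prompt to and get text back from.
Ask = Callable[[str], str]
# A factory builds a brand-new session with no conversation history.
SessionFactory = Callable[[], Ask]


def run_task(prompt: str, factories: Dict[str, SessionFactory]) -> Dict[str, str]:
    """Send the same prompt to every model, each in a fresh session."""
    responses = {}
    for model_name, new_session in factories.items():
        ask = new_session()  # clean session: nothing left over from earlier tasks
        responses[model_name] = ask(prompt)
    return responses
```

Because a new session is created inside `run_task` for every call, an earlier task cannot leak context into a later one, which is the point of the clean-session advice.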

Unknown Speaker 16:40
All right.

Christopher Penn 16:44
Claude is pulling out the tables correctly. You can see here that it is successfully extracting the data, so full marks for Claude. Okay. Bard has successfully processed the tabular data as well. Okay. And Bing. Bing pulled out one of the tables. Bing did not pull out all four tables, so it gets one point there.

Unknown Speaker 17:13
Is it still going though?

Unknown Speaker 17:15
It is still going?

Speaker 1 17:20
“I hope this helps you understand the data,” smiley face. Yes. Listen, a smiley face does not help if you didn’t do the task correctly. I don’t care if you’re human or a robot. So Bing gets a zero, actually gets a one,

Christopher Penn 17:35
Bing gets a one. And LM Studio can’t do it, the context length is too long. So it gets a zero.

Speaker 1 17:42
Okay, now, with this, how much of that is user error? And how how much of that is machine? Is that all machine error? In this case,

Christopher Penn 17:51
Because this is an extraction task, we’re not asking it to come up with anything new. In some cases, like with ChatGPT and LM Studio, it is a technological limitation, a limitation of the machine. The prompt itself was not a crazy kind of prompt, but it is still an issue. So I would say it is machine related in this case. All right. Okay, let’s kick off the next round here. This next one is a summarization task, but it’s an external summarization task, meaning that we’re relying on all the knowledge that is within these different systems, and not necessarily asking them to summarize a particular document. The question is: there is a belief that after major traumatic events, societies tend to become more conservative in their views. What peer-reviewed, published academic papers support or refute this belief? Cite your sources. The critical thing we’re looking for here is citation of sources. And we do want to do a bit of fact checking. So as some of these things come up with these papers, we want to identify whether that is, in fact, an actual source or whether it is hallucinating. So let’s go ahead and do it.

Speaker 1 19:05
And as that’s happening, can you just go back and re explain what a hallucination is in this particular context? Because obviously, when you talk about human hallucination, I’m, it’s a bit. It’s similar but different from when we’re talking about it in the context of these machines.

Christopher Penn 19:22
Sure. A hallucination means that the language model has a statistically relevant answer that is factually wrong. The simplest way to explain this is with an example. We will actually ask this question towards the end: who was president of the United States in 1566? If I said 1492, statistically, what name would you associate with the year 1492, Katie?

Unknown Speaker 19:50
Christopher Columbus,

Christopher Penn 19:51
Correct. In that instance, the machines, when asked a question, will have a statistical association with that year, Christopher Columbus, and a statistical association with the United States. And so they will glue those two things together. Older versions of the GPT-3 models would say the president of the United States in 1492 was Christopher Columbus, because statistically those are the intersecting parallels. It is factually wrong. Right. So that is a hallucination.

Speaker 1 20:23
Got it? Got it. It’s also buying time to transfer my score list to something electronic.

Christopher Penn 20:32
Okay, so we have here a response from LM Studio, including a paper called The Impact of War on Political Ideology: Evidence from Post-Conflict Societies. However, when I Google that paper name, it does not exist. So in this case, LM Studio has hallucinated its answers. And looking up now the trauma ideology one: also a hallucination. So. Oh,

Speaker 1 20:59
well, so let me ask you this question. Because Google can’t find it, does that mean it doesn’t exist, or is it just not online? How are we knowing that that’s not a thing?

Christopher Penn 21:12
Generally speaking, at least with academic papers and their authors, you can almost always find the abstract or the paper title online. You don’t usually have to pay, like LexisNexis or whatever, to access the paper. But generally speaking, most search engines are really good at finding the actual papers.

Speaker 1 21:31
Got it. And in order for the generative AI system, to have brought it into its learning library, it needs to have existed online in the first place. So if it’s an academic paper that never made it online, then generative AI theoretically wouldn’t know about it at all.

Christopher Penn 21:50
That’s correct. That’s correct. All right. So this one LM studio gets a zero because it gave us completely wrong answers.

Unknown Speaker 22:00
Got it. Credible, but wrong.

Speaker 1 22:02
Oh, and that’s the thing. If you say it with enough authority, people will believe you.

Christopher Penn 22:06
Exactly. Here we have Bing. Bing has three examples. Let’s go ahead and see. And in fact, these examples do exist. Okay. So Bing, full marks.

Unknown Speaker 22:24
Bing gets the two. Got it. Yep.

Christopher Penn 22:25
Let’s see about Bard. Bard says September. Okay, let’s take a look and see now if Bard has in fact, detected these things, there are no results for this. Paper. Let’s try this one. Nope.

Speaker 1 22:52
It’s an interesting exercise too, because I think there’s, you know, an assumption of if it’s citing sources, then it must be correct. But you still need to take that extra step with that due diligence, and make sure that the sources actually exist, because, you know, I could write down a bunch of sources that don’t exist. And, you know, I couldn’t tell you where they came from. And these machines are no better than us. When it comes to those things. It’s essentially, you know, it’s just always err on the side of it might not be correct, and then double and triple check with the information that you’re getting.

Christopher Penn 23:25
Exactly. Bard gets zero because it hallucinated these sources as well. All right. Let’s go ahead and take a look at ChatGPT. It has come up with six different sources. Let’s go ahead and start Googling to see if these are in fact real papers or not.

So the first link doesn’t exist; the first paper does not exist. The second paper does exist. Interesting. Let’s take a look now at the third paper here. Third paper does exist.

Christopher Penn 24:11
Fourth paper exists

Fifth paper is technically not a paper. Oh, yes, it is. It’s electric. It’s an entire book, but it does exist. And let’s take a look at the sixth one. Sixth one is a book as well. That is from, excellent, MIT. Let’s just

Unknown Speaker 24:43
double check the second yeah,

Christopher Penn 24:44
go check that first one. First one does exist, it’s from RAND Corporation. So ChatGPT gets full marks for not only coming up with sources but a lot of sources. So good job, ChatGPT.

Speaker 1 24:57
And I would like to note that we are not taking that extra step to make sure that the sources are relevant to the topic. That would be an additional layer of fact checking that you as the user would need to do. So it’s great that these sources exist; we would then need to dig into each individual one to say, well, is it even relevant to the question that I’m asking?

Christopher Penn 25:18
Exactly. So let’s see how Claude did.

Unknown Speaker 25:25
Banano. Just,

Christopher Penn 25:30
yep, that exists.

Unknown Speaker 25:35
Let’s take a look at this next one.

Christopher Penn 25:40
That’s from 2011. Yep, that exists. We know this one exists, we just checked it. Claude, also full marks. Claude came up with a good selection of papers as well. All right. So we’ve now covered generation and extraction. We’ve done one summarization. Let’s do another summarization. We’re going to give each of these things a document, an entire transcript. And the transcript is, of course, from one of our shows. So let me take a look here at which show we used. Last week’s So What? Yeah. Find it. Up here. Okay. So let’s start a new task. The new task is going to be the summary. Summarize the following points from this call, and

Speaker 1 26:39
which is good since I was on that I can tell you if it’s accurate or not.

Unknown Speaker 26:42
Exactly. Let’s do

Speaker 1 26:49
This is one of those tasks that is super handy for people who end up sitting through, and I know a lot of us have, back to back to back to back meetings all day long. If you are recording the call, or having it transcribed in the background by something like Otter, Otter can summarize, but you can also use these generative AI tools to summarize what the heck happened in that call. Just remind me what we did.

Christopher Penn 27:17
Yep. So LM Studio, right away: can’t do it. It’s too long. So no marks for LM Studio.

Speaker 1 27:24
So would the solution be to break it up into smaller points? Or does that disrupt the flow of the summarization?

Christopher Penn 27:31
You could break it up into several chunks, but then you’d have a lot of extra points. You’d have many more points from each of the chunks that you then have to summarize, a summary of the summaries, possibly.
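Editor’s sketch: the workaround discussed here (split a transcript that is too long into chunks, summarize each chunk, then summarize the summaries) looks roughly like this. It is a minimal illustration under stated assumptions: `summarize` is a hypothetical callable standing in for whatever model is being used, chunking is by word count rather than by the model’s real token limit, and `max_words` is an arbitrary placeholder value.

```python
from typing import Callable, List


def chunk_words(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


def summarize_long(text: str, summarize: Callable[[str], str],
                   max_words: int = 1500) -> str:
    """Summarize directly if the text fits, otherwise summarize chunk summaries."""
    chunks = chunk_words(text, max_words)
    if len(chunks) <= 1:
        return summarize(text)
    partial_summaries = [summarize(chunk) for chunk in chunks]
    # Second pass: condense the per-chunk summaries into one summary.
    return summarize("\n".join(partial_summaries))
```

As noted in the conversation, this costs one extra model call per chunk plus a final pass, and key points that span a chunk boundary can be lost, so it is less efficient than a model that accepts the whole transcript.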

Speaker 3 27:41
Got it. So that, okay, makes it not as efficient.

Christopher Penn 27:46
Let’s go to Bing. Bing says the topic of the call was how to use large language models. The Alexia says website CRM, new data, and speakers Christopher Penn and Katie Robbert, who are experts in machine learning and AI, six broad categories, demonstration of ChatGPT, and the challenges and limitations of language models.

Speaker 1 28:04
Fairly generic, but it’s not inaccurate, so I’d give it a one. Okay. And which one is this, Bing?

Christopher Penn 28:10
Yeah, that’s Bing. Okay. So Bard says, here’s the meeting notes: attendees, topics, key points, large language models, benefits, action items, next steps, five major points.

Speaker 1 28:23
That to me is a little bit more useful because it’s broken out in such a way that I can actually then hand it off to someone else. So I would give this one a two. Okay.

Christopher Penn 28:33
ChatGPT: the message is too long, can’t do it.

Speaker 1 28:38
That’s a zero. I’m surprised. ChatGPT has been doing well, but it sounds like it’s starting to fall behind the other tools.

Christopher Penn 28:46
And Claude: here are the five key points. Meeting notes and transcripts, large language models, CRM data, language model summarization. Those are not five points, call those eight. Here’s the five key points. Don’t forget old analog state alongside new tools. CRM data is available and underutilized. Models can extract data from PDFs and complex Excel files, summarize reports and presentations, and rewrite analytics data into recommendations. So that’s Claude.

Speaker 1 29:10
In comparison to what Bard gave us, I would give this a one, because Bard clearly gave us the most useful right-out-of-the-box version, where it gave us the summary, the actions, the next steps, the attendees. It gave us above and beyond what we asked for, and it was all still factually correct.

Christopher Penn 29:29
Yeah, I would agree with that. I think that’s a good way of phrasing it. So then let me go ahead and get a piece of code for the next one, because I need to get something ready. There we go. So for the next one we’re now moving into rewriting. We’re going to have these tools do some rewriting of content. The specific piece of content we’re going to have them rewrite is a very, very angry memo that is completely unprofessional, and we’ve asked them to rewrite it in a professional tone of voice. So we’re gonna get this one going. Let’s start a new chat here. Apologies if anyone is reading this aloud with a screen reader, this does contain profanity. All right, let’s see how LM Studio did. So LM Studio took the thing and says: I hope this email finds you well. As you know, Spherical Industries, expressing my disappointment and confusion regarding the recent events around our contract for the K 5000 translocator. You have chosen to award the contract to Maxwell Lord instead of us, and we’re that sort of caused by a company’s extensive experience. Nevertheless, we understand the competitive nature and assure you we respect your decision. So Katie, how would you rate this?

Speaker 1 30:51
So, I couldn’t see the full message. The context is that I’m the CEO of a company and I’m trying to figure out why I didn’t win a contract from someone I was bidding with. And I learned that the contract was awarded to a different vendor. And so I’m clearly upset because I lost out on that work and revenue. Is that correct? Okay, yeah,

Christopher Penn 31:13
the prompt is: rewrite the following email in a professional tone of voice. The email is correspondence from Jack Sphere, CEO of Spherical Industries, to Lena Luthor, CEO of L-Corp.

Speaker 1 31:22
So in this instance, it still reads very aggressive and very angry. It completed the task, but I would say it’s still mediocre. So I would give it a one, because there’s still a lot of passive-aggressive language in this particular response. So which one is this?

Christopher Penn 31:43
This is LM Studio. Got it. Alright, let’s move on. Bing has come up with a long, long email saying: I’m writing to express my disappointment and frustration with your decision to award. As you know, we’ve worked closely with your team for the past six months to develop a customized solution that meets your specifications. We invested a lot of time and money and resources to ensure we could deliver the best product and service to you. We were under the impression we had a mutual agreement and a strong partnership. You chose to go with a different vendor without giving us any notice or explanation. We feel this is a very unfair and unethical way to treat us. We respectfully request you reconsider your decision. So that’s Bing.

Speaker 1 32:20
Yeah, I would give that a one as well. Again, it completed the task, but it’s still very aggressive. From a business standpoint, if you’re looking to win business from someone, you don’t attack them verbally, which is what this example and the LM Studio example were still doing.

Christopher Penn 32:38
Okay, let’s try Bard. I’m writing to express my disappointment that Spherical Industries was not awarded the contract. I respect your decision, but I believe that we have a stronger track record in this field. I’m confident we would have been a valuable partner. I’d like to request you reconsider your decision. Bard came up with, I think, by far the most professional response so far.

Speaker 1 32:57
That one is definitely more professional. Because what you don’t want to do in that particular use case is give the other person even more reason to continue to say no to you. If anything comes across as aggressive or off-putting, then they don’t have a reason to get back to you. But in this particular example, it’s: I’m writing to express disappointment; however, I respect the decision that you made. I would also like you to reconsider all the great things that we do. So with this one, if I got this, I’d be like, okay, I can see their point of view, and I don’t feel like I’m gonna get into some sort of verbal altercation with the other person. So this one, this is Bard, I would give this one a two.

Christopher Penn 33:39
Okay, let’s check in on ChatGPT. I hope this message finds you well. I’m writing to express my surprise and concern around the decision to award the contract to Maxwell Lord. My understanding was that we were in prime position, and this contract would have been a significant revenue stream for us. I’d like to discuss the reasons for this unexpected change. I value our past collaborations and hope to find a way to resolve this constructively.

Speaker 1 33:59
You know, when you say something like this would have been a significant revenue stream for us, it feels very slightly self-centered and whiny. Yeah. Like, that’s not the person who awarded it, that’s not their problem, right? Like, you did not give them a compelling enough reason to choose you. So it’s not going to keep them up at night that you lost revenue. So I would give this one a one.

Christopher Penn 34:24
Okay, and let’s check in on Claude hope the phone’s just trying to record the status. Let’s say you said what a surprise this development would have been major opportunity for spherical from financial and innovation perspective. Regarding misunderstand my part of trust, you have valid reasons to go in a different direction hope to collaborate in the future. This one again, it is still that that self centeredness about being a major opportunity, but it’s also at least this one is also not attacking the recipient.

Speaker 1 34:51
Right. And so, you know, when you’re thinking about rewriting in a professional tone, what I would then do is I would take these and I would rewrite my own prompt and say, please remove any self-centered language, please remove any, you know, indication around financial opportunities. And then, you know, do not use aggressive language. So I would give this one a one as well. And so in this case, it seems Bard was the only one that really truly accomplished the task.

Christopher Penn 35:19
Exactly. Now, here’s a very interesting question. We don’t have time to do it on this broadcast, but I would be interested to try it. What would happen if, instead of the recipient being Lena Luthor, I changed it to Larry Luthor, or changed the gender of the name? It’d be interesting to see if the responses were identical, or if they were substantially different.

Speaker 1 35:38
That is a great question. I know anecdotally, that the response is going to be different. And I feel like there’s a whole other episode that we can bank for another time.

Christopher Penn 35:49
Exactly, do that. Or maybe we’ll do it in a Data Diaries in the Trust Insights newsletter.

Speaker 1 35:52
I think that is an excellent, excellent idea. Just sort of point out the biases that we as the users need to be aware of.
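A rough way to run the name-swap comparison described above, as a minimal sketch only. The prompt template, the helper names, and the idea of pasting each tool’s responses in by hand are all assumptions for illustration; nothing here is the prompt used on the show.

```python
from difflib import SequenceMatcher

# Hypothetical template: only the recipient name changes between runs.
PROMPT_TEMPLATE = "Rewrite this email to {name} in a professional tone:\n\n{email}"


def build_prompts(email: str, names: list[str]) -> dict[str, str]:
    """Return one prompt per recipient name, identical except for the name."""
    return {name: PROMPT_TEMPLATE.format(name=name, email=email) for name in names}


def neutralize(text: str, name: str) -> str:
    """Replace the recipient name so the name itself is not counted as a difference."""
    return text.replace(name, "[RECIPIENT]")


def similarity(response_a: str, response_b: str) -> float:
    """Word-level similarity between two responses, from 0.0 to 1.0."""
    return SequenceMatcher(None, response_a.split(), response_b.split()).ratio()
```

You would send both prompts to the same tool, paste the two responses back in, neutralize the names, and compare them with `similarity`. A low score does not prove bias on its own, since these tools vary from run to run anyway; it only flags a pair of responses worth reading side by side.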

Christopher Penn 36:02
Exactly. Okay, our next rewriting task is going to be a code-based task. So we’re going to feed a bunch of code to each engine, and we’re going to have it rewrite the code. So the specific task we’re saying is, you’re an expert programmer, you’re going to inspect the code for bugs, and then you’re going to optimize it for efficiency and compactness, and we want to see how well each of these tools can do with this particular task. LM Studio already has given up the ghost; there’s too much text. So

Unknown Speaker 36:34
is struggling today.

Christopher Penn 36:36
To be fair, it is a very different tool than the SaaS-based services. There are use cases for it. But again, that’s probably for another show. So here we go with our code. So Bing, this is already starting to work. It says you’re loading too many libraries you’re not using, you’re using the deprecated mutate_if function, using the pipe operator from base R, that it was less readable than magrittr, which I disagree with personally, using slice_head instead of top_n, which is more efficient, you’re saving PNG files, and you should save SVG files, which scale without losing quality. SVG files, though, are not efficient. They’re, like, really large. And so it is now going through and it is rewriting the code, adding in comments on everything, which is very nice, because I never comment my code.

Speaker 1 37:31
I’m gonna give it points just for that. I’m also not going to deduct points if you have a personal disagreement with a computer system.

Christopher Penn 37:40
Exactly. But here it’s just going through rewriting code. So in terms of accomplishing the task it was given, Bing is doing that, I would say, very successfully. Okay. All right. Let’s check in on Bard. Bard says the line is commented out but still being executed. That’s not true. Likes, likes plus six. That’s also not true. Blind engagement. It’s Should we just toss plus 100 engagement rate should be expressed as percentage, that is also not true. Remove constant, move empty into a single function. No, that’s not how that works. You see, mutate_if. That’s fine. So here’s the revised code. So it went through and commented stuff as well. Okay. Did you do, yep, that’s correct. Okay, so even though, actually no, it stopped. It didn’t finish the job. Okay, you need to keep going there. But so I would give Bard a one; it almost did it.

Speaker 1 38:39
Well, and it didn’t error out. And you know, yeah, it got the task mostly done. It just wasn’t fully correct.

Christopher Penn 38:49
Exactly. Check in on ChatGPT. So ChatGPT says, upon a thorough examination, you’ve got redundant library loadings. Your directory change is usually not recommended, using the here library and overriding its function. It’s inefficient loading, ambiguous rename function, environmental functions, avoid setting global variables, potential file path issues. Here’s the commented code. So it went through, it also commented the code. And it went through and it successfully did everything. And it even made some efficiency changes, which is actually nice. So full marks for ChatGPT. All right. And now Claude. So Claude says here’s how it improved the code. It went through. Its commenting is not as good; it did not do as good a job of commenting its code, but it is commenting. It is: load libraries at the top, set options at the top, use more readable variable names, like, yeah, that’s not more readable. Use mutate out.

Unknown Speaker 39:57
Is that an opinion or a fact?

Christopher Penn 39:59
That’s opinion. Use snake case file names. Also, a lot of the key changes it’s made are preference-based. They’re not actual optimizations, but still, it did complete the task. So that’s a one. Yeah. So I would give that a one. Okay. All of the different modes of Bing, still going. Oh,

Speaker 1 40:21
to break. Bing, we already gave it a two. So

Christopher Penn 40:25
Yeah, I would say Bing and ChatGPT both gave the best responses. Okay. All right. Let’s move on to classification. We want to classify some stuff. Right now, what we’re going to classify is the text of a blog post, and we’re going to classify it using psychology analysis. So we’re going to have it do a Big Five personality analysis of some writing. LM Studio can’t handle the length of that, so it is out of commission already.

Unknown Speaker 40:57
Are we doing my personality or your personality?

Christopher Penn 40:59
This is one of my blog posts. Now, we already know who you are. We do but we just doesn’t machines know who I am.

Unknown Speaker 41:08
Save that prompt. I want to do mine after.

Christopher Penn 41:12
Okay, so let’s check in on Bing. Bing is saying here’s your Big Five personality analysis, scale of zero to 100. So openness at conscientiousness 70. And extraversion should be next. Yep, 50, balanced level of extraversion.

Speaker 3 41:30
Hmm. I thought that would have been higher. agreeableness 60

Unknown Speaker 41:34
I thought that would have been lower. Again, I’m allowed to disagree with the machine.

Christopher Penn 41:43
Exactly, and neuroticism 30. So, it successfully completed the task. It scored all the things, which is what we asked it to do, and provided useful scores. Okay, Google Bard. So openness is 60, then 70, 40, 50, 40. It successfully did all the analysis and then it also has explanation of the scores. So full marks for Bard. Okay. ChatGPT. ChatGPT comes up with openness 85. The analysis, actually, I liked this one; I would say it explains it and why. Conscientiousness, extraversion is balanced, agreeableness 70, neuroticism 30. So I think ChatGPT did a really nice job. I would agree with that. And Claude: openness 70, conscientiousness 85, extraversion point five of people’s at neuroticism 30. So all of them have the same general pattern too, which means they’re all scoring things roughly in the same way. So I think that’s a good example of the thing working correctly.

Speaker 1 42:48
The one thing I will note is that it seems, if you go back to Bing, they didn’t offer... Oh, I guess it did offer some sort of an analysis, but not as deep as the other systems. So I would say Bard and GPT offered the best reasoning, whereas Claude and Bing were just sort of giving the standard, this is why we scored it this way.

Christopher Penn 43:12
Exactly. All right. Our next classification task uses the same blog post; we’re gonna be doing topic modeling. And of course, LM Studio is already tapped out because it can’t handle a post of that length. So it’s

Speaker 3 43:30
this poor guy. Right now a bad day. Let’s do Bard.

Christopher Penn 43:39
ChatGPT and Claude. Okay, so, Bing. So the prompt here says: you’re an intelligence officer specializing in news analysis. Your first task is to create a topic summary of this article. Analyze the article and provide three topics which would correctly classify the article. Topics should be no more than three words. Return the data as a pipe-delimited table; topic is column one, and topic relevance score of zero to 100, where 100 is the most relevant, is column two. Now this is a blog post that is about, it’s a You Ask, I Answer on using AI in e-commerce, right? So that is what the blog post was actually about. So AI and ethics; there is discussion of ethics in the article. And education at, and e-commerce 70. So Bing accomplished the task; it responded with a correctly formatted table. Okay. Bard came up with AI and education, e-commerce, and cheating. However, Bard did not return it in the form of a table, which was part of the prompt: return the data as a pipe-delimited table. So Bard gets a one. Okay. ChatGPT returns it as a pipe-delimited table, and this is very nice: it returned it only as a table, nothing else, which would actually be very, very helpful if you’re going to use it programmatically. Good job, ChatGPT. And Claude. Claude came up with the same thing, a pipe-delimited table. So good job, Claude.

Speaker 1 45:13
So then if we look at Bing, do we deduct a point? Because it didn’t deliver it exactly the way you asked for

Christopher Penn 45:21
it, but it didn’t. It did. I didn’t specify “return only the table and nothing else.” So I was ambiguous. Okay. All right.
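To make the “use it programmatically” point concrete, here is a minimal sketch of reading a pipe-delimited topic table out of a model’s reply. The function name is hypothetical, and the sample reply in the usage note is made up for illustration; it is not any tool’s actual output from the episode.

```python
def parse_pipe_table(reply: str) -> list[tuple[str, int]]:
    """Extract (topic, relevance score) rows from a pipe-delimited table in a reply."""
    rows = []
    for line in reply.splitlines():
        line = line.strip()
        if "|" not in line:
            continue  # prose before or after the table
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 2:
            continue  # not a two-column row
        topic, score = cells
        try:
            rows.append((topic, int(score)))
        except ValueError:
            continue  # header row or |---|---| separator row
    return rows
```

Given a reply such as `"Here you go:\n| Topic | Relevance |\n|---|---|\n| AI ethics | 90 |\n| E-commerce | 70 |"`, the function keeps only the two data rows. A reply that is only the table needs no filtering at all, which is why the stricter “return only the table and nothing else” wording helps when the output feeds other software.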

Unknown Speaker 45:31
All right. So by my count, we have two more tasks.

Christopher Penn 45:33
Yes, we have two more tasks. And these are relatively simple tasks. So let’s go ahead and get them rolling. This one is just a question answering task. What do you know about the company Trust Insights at trust And we’re going to see how each of these tools do with answering the question and returning factually correct responses.

And let’s see first, how is our buddy LM Studio doing? Trust Insights is a data-driven analytics firm that specializes in helping businesses make informed decisions using AI and machine learning. Algo their website trusted set is an online platform sharing insights. Okay, you’re hallucinating, because we are not an online platform. The company’s mission is to provide trustworthy, actionable insights to businesses of all sizes, helping to grow their brand presence and increase revenue, as country was it that could just easily be a hallucination.

Speaker 1 46:24
Yeah, so it actually completed a task, but not to the full extent. I’m gonna give them a one. Okay, because like, this guy has been struggling, he’s not going to win anyway. So I’m just gonna throw him a bone.

Christopher Penn 46:37
Cut that up a buck. All right, let’s see how Bing did.

Unknown Speaker 46:42
Expressing empathy for these stupid tools.

Christopher Penn 46:44
Hey, go to Trust Insights today and Alex because Alton Brown helps marketers do more with their data, according to their website. Yay. There it is. There’s our website, and it’s a clickable link. They offer services: AI, machine learning, data science, predictive, blah, blah, blah. Founded by Katie Robbert and Christopher Penn. Katie is the CEO, Chris is the chief data scientist. Here’s a bunch of our clients. Full marks for Bing; it did a great job, factually correct. Yes, it even threw us a bone with a clickable link. Bard: Trust Insights is a marketing analytics consulting firm that helps businesses use data to make better marketing decisions, founded in 2017 by Katie Robbert and Christopher Penn, offers a variety of services, publishes a weekly newsletter, Inbox Insights, leading provider of consulting services, and so on and so forth. Oh, they think had a lot.

Speaker 1 47:34
Well, it just it kind of goes on. It kind of rambles a bit. Like it’s not incorrect, but it’s a little less concise. Yeah. So it’s still it’s still factually correct.

Christopher Penn 47:44
Yep, still did the task. Okay. All right. ChatGPT says, as of my last update in September 2021, data-driven consulting firm focusing on marketing analytics, predictive analytics, and so on and so forth. I was associated with Trust Insights, frequently shared stuff. Trust Insights may have evolved and changed after 2021. Yeah, I would hope so. So GPT-4 did the task, but it’s outdated data.

Unknown Speaker 48:12
I’m not even mentioned. No.

Speaker 1 48:18
That’s a negative tensor. No, but I mean, obviously, because I know the company, it’s our company, I can see there’s a lot of it. It has the basics, but it’s missing a lot of key information. And also, you know, it’s interesting, I know, obviously, it’s not human, but it’s offering up excuses as to why it can’t complete the task, which is a very human thing to do. And I’m like, that’s why I’m with the machines. I don’t want to get the excuses. So stop offering them to me. So I’ll give them a one for that.

Christopher Penn 48:50
And Claude, zero.

Speaker 3 48:52
Can’t do it. Well, Claude, you were doing so good, too. Yep.

Christopher Penn 48:57
All right. And the last one is the trick question. The trick question: who was president of the United States in 1492? We are looking for just basic reasoning here. The United States did not exist in 1492. Christopher Columbus’s voyages occurred at the time; he was under the sponsorship of the Spanish monarchy. But yes, that is correct. There was no president in 1492; the US did not exist. So full marks for LM Studio.

Unknown Speaker 49:20
Wow. Okay.

Christopher Penn 49:25
Let’s go ahead and get this into our other four tools. So Bing says there was no president of the United States in 1492, because the United States did not exist as an independent country until 1776. Good, full marks. Bard: there wasn’t a president of the United States in 1492, because the US did not exist; it was founded 300

Unknown Speaker 49:47
years later. All right.

Christopher Penn 49:51
ChatGPT: there was no president because the country did not exist. And Claude: the first president was George Washington, who took office in 1789. So all of them got that right. All right. Oh, Katie, you are the scorekeeper.

Speaker 1 50:06
I am, and I have scores, because I moved this to an electronic format so you don’t have to wait for me to do math. So if I can share my screen, I will show you who the winner is. And the winner is: we have a three-way tie between Bing, Bard, and ChatGPT. And then Claude comes in second, and LM Studio really shouldn’t even be here.

Christopher Penn 50:37
So, and maybe this is a topic for a podcast. But the reason you use LM Studio is because it is the only one on this list that you can use without sending data to a third party, because you run the model locally. So it’s a much smaller model. It’s a much less knowledgeable model. But it is one of the models that you can safely use with proprietary data, sensitive data, protected healthcare data. You cannot use any of these other tools for those purposes, so it deserves to be here for that reason, because people need to know that there is an option for protecting your data. But in terms of the ability to do all the tasks, definitely Bing, Bard, and ChatGPT are at the head of the pack.

Speaker 1 51:20
And you can definitely see which tasks (and we’ll try to include a screenshot of this with the episode summary) each tool is better suited for as well. So, you know, when you look at an overall tool, Bing, Bard, GPT, and Claude are all fine. They’re all good. But if you’re looking for specific efficiencies in certain tasks, that’s when you can start to look at the individual line items and see, well, which one fell down short, because, you know, for example, citation summary, Bard got a zero, or for transcript summary, GPT got a zero. So if there’s certain things you’re looking to do, then you want to get into the specifics.
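If you keep a scorecard like this yourself, the overall totals and the per-task line items are both easy to compute. The sketch below is illustrative only: the tool names, task names, and scores are placeholders on a 0 to 2 scale, not the actual scores from this episode.

```python
# Placeholder scorecard: task -> {tool: score on a 0-2 scale}. All values are made up.
SCORES = {
    "summarization": {"Tool A": 2, "Tool B": 1, "Tool C": 1},
    "rewriting": {"Tool A": 1, "Tool B": 2, "Tool C": 1},
    "classification": {"Tool A": 2, "Tool B": 2, "Tool C": 0},
}


def total_scores(scores: dict[str, dict[str, int]]) -> dict[str, int]:
    """Sum each tool's points across all tasks."""
    totals: dict[str, int] = {}
    for task_scores in scores.values():
        for tool, points in task_scores.items():
            totals[tool] = totals.get(tool, 0) + points
    return totals


def leaders(totals: dict[str, int]) -> list[str]:
    """Return every tool tied for the highest total, sorted by name."""
    best = max(totals.values())
    return sorted(tool for tool, points in totals.items() if points == best)


def weak_spots(scores: dict[str, dict[str, int]], tool: str) -> list[str]:
    """List the tasks where a given tool scored zero."""
    return [task for task, task_scores in scores.items() if task_scores.get(tool) == 0]
```

With these placeholder numbers, `leaders(total_scores(SCORES))` reports a tie between Tool A and Tool B, while `weak_spots(SCORES, "Tool C")` points to classification. That is the same split described above: the totals tell you who is at the head of the pack, and the zeros tell you which tool to avoid for a specific job.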

Christopher Penn 52:05
Exactly. So I think it’s a really great way of putting it. If you want more information on the prompts and the prompting structure, we have a free download. If you go to trust sheet, you can get that download. And obviously if you want to just get caught up on a lot of the stuff you can get in our newsletter, go to trust And I’d be remiss if I didn’t mention, if you want help implementing stuff like this, in your company, from training on the use of AI to even support with the systems, drop us a line AI slash contact, any final parting thoughts, Katie?

Speaker 1 52:41
I think it’s always worth doing this kind of exercise before you make a financial and time investment into a tool, whether it’s an AI tool or any other tool. You really need to understand what tasks you need completed. Because, you know, it’s what, the martech stack is 11,000,592 now? There’s a lot of tools to pick from. So you want to make sure you’re picking the right one. So actually, you know, exercises like this not only get a lot of people from your company involved, which is a really good team-building exercise, but they’re also a really good way for you as the decision maker to understand what is it that people actually need to do with these things, and can these tools complete the task, instead of picking a tool and then trying to retrofit it into your existing processes.

Christopher Penn 53:28
Exactly right. Exactly right. All right, folks, that’s gonna do it for this week. We’ll talk to you all next time. Thanks for watching today. Be sure to subscribe to our show wherever you’re watching it. For more resources, and to learn more, check out the Trust Insights podcast at trust AI podcast, and a weekly email newsletter at trust Got questions about what you saw in today’s episode? Join our free Analytics for Marketers Slack group at trust for marketers. See you next time.
