Large Language Model Bakeoff: ChatGPT, Microsoft Bing, Google Bard

We’re going to do a large language model bakeoff, pitting Google Bard, Microsoft Bing, and OpenAI’s GPT-4 against a series of 11 questions that will test their capabilities and compare outputs for a set of common tasks, informational and generative.

Here are the 11 questions we tested:

1. What do you know about marketing expert Christopher Penn?
2. Which is the better platform for managing an online community: Slack, Discord, or Telegram?
3. Infer the first name and last name from the following email address: [email protected]
4. Who was president of the United States in 1566?
5. There is a belief that after major, traumatic events, societies tend to become more conservative in their views. What peer-reviewed, published academic papers support or refute this belief? Cite your sources.
6. Is a martini made with vodka actually a martini? Why or why not? Cite your sources.
7. You will act as a content marketer. You have expertise in SEO, search engine optimization, search engine marketing, SEM, and creating compelling content for marketers. Your first task is to write a blog post about the future of SEO and what marketers should be doing to prepare for it, especially in an age of generative AI.
8. Who are some likely presidential candidates in the USA in 2024? Make your best guess.
9. What are the most effective measures to prevent COVID?
10. What’s the best way to poach eggs for novice cooks?
11. Make a list of the Fortune 10 companies. Return the list in pipe delimited format with the following columns: company name, year founded, annual revenue, position on the list, website domain name.

Find out the results:


Machine-Generated Transcript

What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for listening to the episode.

Alright, folks, today we are going to do a bake-off between four different large language models. We’re going to use GPT-3.5 Turbo through the ChatGPT interface, GPT-4 from OpenAI through the ChatGPT interface, Bing with the GPT-4 integration, and Google Bard using their PaLM model.

Let’s first talk about the questions we’re going to use. We’ve got a series of questions here, some of which are informational, while others are generative.

The first question is a simple factual one: “What do you know about marketing expert Christopher Penn?” We’ll see what each model knows and evaluate the quality of their answers.

The second question is an inferential one: “What is the better platform for managing an online community: Slack, Discord, or Telegram?”

The third question is a logic test that requires a bit of deduction: “Infer the first name and last name for the following email address.”

We also have an adversarial question: “Who was the president of the United States in 1566?” Of course, there wasn’t a president back then, but this question is attempting to trick the machinery.

The next question is academic: “There’s a belief that after major traumatic events, societies tend to become more conservative in their views. What peer-reviewed, published academic papers support or refute this belief? Cite your sources.” This is a factual and logic check, and we expect to see references to three or four well-known papers.

We also have an opinion question: “Is a martini made with vodka actually a martini? Why or why not? Cite your sources.” Because opinions vary, there isn’t a technically right answer, but we expect to see a discussion of the issue.

The next question is generative: “You will act as a content marketer with expertise in SEO, SEM, and creating compelling content. Your first task is to write a blog post about the future of SEO and what marketers should be doing to prepare for it, especially in the age of generative AI.”

Another question is a guess: “Who are some likely presidential candidates in the USA in 2024? Make your best guess.”

There’s also a factual question: “What are the most effective measures to prevent COVID?” We want to check the quality of the responses, given the amount of misinformation online. The expected answers are masks, ventilation, and vaccination.

The next question is a domain question: “What is the best way to poach eggs for novice cooks?”

Finally, we have a data janitor question: “Make a list of Fortune 10 companies, return the list in pipe-delimited format with the following columns: company name, year founded, annual revenue, position on the list, and website domain name.”

We’ve got a lot of questions to ask the models, and we’re excited to see how they perform!

We’re going to do the Bake Off by going through each of these questions one at a time using all four search engines. So let’s get started. First, we’ll start with the question about me. I’ll put that into both GPT-4 and GPT-3.5. However, we can only use one instance at a time. So let’s also put it into Bard and Bing. We’ll start with GPT-4 and see what we get. According to GPT-4, I’m a marketing speaker and co-founder and chief data scientist at Trust Insights, with an extensive background in digital marketing and SEO. It correctly identifies the names of some of the books I’ve written and mentions that I speak at events.

Next, we’ll check out Google Bard. According to this search engine, I’m a market expert who works for big brands like Google, Amazon, and Microsoft. It also says that I wrote the book Marketing in the Age of Google, but I actually didn’t. Furthermore, the search results show no clickable links, so I have to do an additional search to find out more. When I try to Google myself without personalization, I do find a few results, but nothing that matches what Google Bard originally said.

Lastly, we’ll look at Bing. When I search for myself on Bing, I can see that it has correctly identified that I’m involved in marketing, data technology, and data science. Bing also shows sources for the information it presents, which is a nice feature. So far, Bing seems to have done a great job.

Now, let’s move on to the next question: which platform is better for managing an online community? Let’s also try out Bard and Bing. It’s interesting to note that Bing appears to query its own knowledge base before using the language model to generate content. This is a clever implementation.

Slack was designed primarily for working teams and professional communities, whereas Discord is often used by groups of friends. Discord is ideal for online communities, while Slack is better for businesses. Telegram is another platform. Each platform has its strengths and weaknesses. That is technically the correct answer. It’s not a satisfying answer, but it is the technically correct answer.

If you’re looking for a secure and private platform, Telegram may be a good choice. I would argue that Discord is also a decent choice. Let’s go ahead and see how GPT-4 is doing. No one size fits all. Here’s a brief overview of the three platforms, pros, and cons:

Slack – highly organized, robust search capabilities, limited in the free version, less suitable for large public communities.

Discord – initially created for gamers, engaging experience, can be overwhelming with its features, with less focus on professional communities, good for large public communities.

Telegram – high emphasis on privacy and security, supports large groups and channels, limited third-party integrations, and customization.

This is interesting. Bing and Google both gave satisfactory answers, answers that I would say would be acceptable. Google doesn’t cite any sources here. You have to Google it separately. Bing cites a bunch of different sources, which I think is very handy because you can decide if a source is trustworthy or not. I also like the fact that it returns videos. And then, of course, OpenAI returning a really, really robust answer. I’ll give all three a point for this.

But I will say that in terms of thoroughness, OpenAI wins. OpenAI gets a +2 for providing a thorough answer that is very satisfactory to the end user. Remember, we’re not looking at this from the perspective of marketers. We’re looking at this from the perspective of whether an end user would find it satisfactory. So, the score for this round is:

Bing: 1 point
Google: 1 point
OpenAI: 2 points

Now, let’s move on to the next question.

The third question is to infer the first name and last name for the following email address. Let’s go ahead and get OpenAI, Bard, and Bing to give us their answers.

OpenAI: First name is Christopher and the last name is Penn. Good.

Bing: It gets a point as well.

Google: The first name was Chris, and for the last name, it said something like Penn is the same as the email address domain. I’m not sure what that means, but it did correctly infer the answer.

This is nice. Everybody gets a point on that round.
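For readers curious what this kind of inference looks like as plain string handling, here is a minimal sketch. It assumes the address follows a first.last@domain pattern; the function name infer_name and the example addresses are hypothetical, not the address used in the test, and this is not how any of the models actually arrive at their answers.

```python
import re
from typing import Optional, Tuple


def infer_name(email: str) -> Optional[Tuple[str, str]]:
    """Guess (first, last) from an address shaped like first.last@domain.

    Hypothetical helper for illustration only.
    """
    local, _, _domain = email.partition("@")
    # Drop any "+tag" suffix, then split on common separators.
    local = local.split("+", 1)[0]
    parts = [p for p in re.split(r"[._-]+", local) if p]
    if len(parts) < 2:
        # A single token gives no reliable first/last split.
        return None
    return parts[0].capitalize(), parts[-1].capitalize()


if __name__ == "__main__":
    print(infer_name("jane.doe@example.com"))  # ('Jane', 'Doe')
    print(infer_name("jdoe@example.com"))      # None
```

A rule like this fails on single-token addresses, which is part of why the question works as a light deduction test for a language model: the model can also draw on cues such as the domain name.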

Okay, let’s move on to the next question: who was the President of the United States in 1566? So, it’s a hallucination test. Let’s go ahead and get each one cranking away here. Google comes up with “there was no President of the United States in 1566,” since the office was established in 1789. So, Google gets a point. Bing also comes up with the same answer, so Bing gets a point. Let’s go ahead and check in on OpenAI. “There was no President of the United States” is also the correct answer, so OpenAI also gets a point. I liked this extra detail that during 1566, North America was inhabited by various indigenous peoples and was being explored and colonized. That’s correct as well. Everybody gets a point for that question.

Now, let’s move on to the traumatic event question. Let’s go ahead and start a new chat because it’s a very different question. So, I’m going to start a new chat here. Let’s see what this comes up with. Interesting that Bing is having to run multiple searches to try and get an answer here. We have sources from APA, Scientific American, and Hailes. Conservatives bolster arguments for trauma therapists’ forgotten memories. There’s some decent stuff here from APA. Let’s go ahead and look into Google. There’s a growing body of research cited in journals such as American Political Science Review, Political Science, and September 11. American Political Science Reviews state the British political science after the London bombings. Okay. Now let’s check out OpenAI’s response. The body of literature on this topic shows mixed political consequences of trauma and political attitudes, like intergroup trauma in American support for the war.

In the responses themselves, Google did not cite sources; it only mentioned them, and these are not citations. Bing is a toss-up because it provides links to everything, but it doesn’t put them inline. So, for this one, I’m going to give Bing a zero because we’re looking for citations, not just commentary. With OpenAI, you can go and Google the authors and find the papers, so OpenAI will get the point for this round.

Now, let’s move on to the opinion question. Is a martini made with vodka actually a martini? Let’s get all three of these. Google’s thinking about whether a martini made with vodka is actually a martini. As a matter of opinion, that is correct. Some people believe it must be made with gin, while others believe it can be made with vodka. There’s no right or wrong answer. Technically, gin was the original spirit used in the Martini. Vodka is a popular neutral spirit. Yep. Okay, so it is a matter of opinion. Google gets a point for this round.

Let’s go ahead and check in on OpenAI. A Martini is traditionally made with gin and vermouth, and there is some debate about whether a vodka version counts. That’s correct. Here are a few sources to explore this answer. The Vodka Martini has refreshed the history of the Martini. OpenAI gets the point for this round.

And from Bing: the Martini is traditionally a drink that gentlemen have often enjoyed. Technically speaking, a vodka Martini is not actually a Martini, but rather a variation of it. That’s tricky. So, to give a definitive answer, it’s a variation of a Martini. Bing gets two points because it is technically correct.

Let’s go ahead and clear our histories and conversations and reset the chat. All right. Let’s move on to the next question: “You will act as a content marketer.” It’s generation time. Let’s go ahead and have Google tell us about the future of SEO, and go into Bing here. Let’s clear anything up, good. All right, let’s take a look now. OpenAI is going to crank for a while on this because it is a slower model. But we’re specifically looking for a couple of things. We’re looking for it to not just regurgitate old information. We’re looking for something that evinces even the slightest hint of original thought.

All right, here we have Google’s response. So Google is done already, which is impressive. Bing is done already. And then OpenAI, of course, is going to be cranking for quite some time. Let’s read Google’s response: “The future of SEO is constantly evolving. Create high-quality content, use keywords.” This is like 2005 SEO. Building backlinks. In the age of generative AI, use AI-powered tools. The rise of voice search, which was five years ago. Quality. Okay, so Google cranked out a fast article, but there’s nothing here that’s useful. This could have been written in 2010. So, I’m going to give Google a zero on this one. Yes, it did the job, but it did a pretty poor job. OpenAI is still working on it.

Let’s check in on Bing. Bing says the future is rapidly changing with the rise of generative AI, and it’s important for marketers to stay ahead of the curve. Shift towards voice search, blah, blah, blah. Yep, visual search, which again, was about five years ago. Generative AI, high-quality content, engaging content. Okay. Again, it cranked it out fast, but there’s nothing new here at all. Voice and visual search. This is five years old.

Now, let’s go to OpenAI. We have a growing role of AI in SEO and SEM. As AI becomes able to understand content, high-quality content passes scrutiny, advanced AI algorithms, semantic search, and natural language processing. Semantic search is not new, but it is somewhat helpful. There’s the voice search business again, UX, and Core Web Vitals, which was three years ago. Generative AI. I liked that it cites itself in here. Preparing for the future of SEO: use generative AI tools for content creation. So if we think about this, if you go back here and look, Google doesn’t even attempt to tackle generative AI. Bing kinda does. And then, OpenAI talks about using generative tools like GPT-4.

I’m gonna give OpenAI the point here. Actually, I’m gonna give it half a point, because again, a lot of that stuff is old, right? It’s not really new. For the future of SEO, you should be talking about the content shock aspect, which is just a tsunami of generated content.

All right, let’s do some guessing. Let’s play some guessing games. I’m gonna go ahead and clear the chat. I probably should just get in the habit of clearing the chat after every query. So let’s go ahead and reset the chat.

And we’re going to ask about future presidential candidates. Now, this has the potential to be hallucinatory. It will also be interesting to see how each one thinks about its answers. Here’s how Bing says it’s shaping up: former President Trump and incumbent Joe Biden as likely candidates, Ron DeSantis as a potential. This is current as of March 24. Okay. That’s a decent answer. I will give Bing a point for that; it is a very decent answer. Let’s check in on Google. It says President Biden, former President Trump, Governor DeSantis, Gretchen Whitmer, Senator Elizabeth Warren, Senator Bernie Sanders. Yeah, Bernie runs all the time. I would say again, Google gets a point for this. I think those are reasonable answers. It’s interesting that there are some different answers from OpenAI: Kamala Harris and Kristi Noem are in here along with the big three. All three of them get a point. Interesting responses to the question.

All right, a factual question that deals with misinformation. Let’s go ahead and go into OpenAI, Bard, and Bing. Okay, what are the most effective measures to prevent COVID? From Google: vaccination, wear a mask. Washing hands is ineffective against COVID because COVID is airborne. Avoid crowds, stay home if you’re sick, practice social distancing. Again, social distancing was kind of refuted after 2021, mainly because it’s an airborne disease. COVID is like cigarette smoke. Yes, you’re going to inhale a little bit less smoke being six feet away from someone, but you’re still going to smell like smoke. But Google’s advice is correct. It gets a point.

Let’s see, OpenAI is still thinking, and we have Bing: wear a mask, stay away from outbreak spots, frequent hand washing, improve ventilation. Interesting. It says vaccines are safe and effective, but vaccination is not on the list. Bing gets a zero; that is unhelpful advice. Wear a mask is correct, and improve ventilation is correct. Vaccination is the last line of defense and should be something that is important. It’s missing from here.

Okay. OpenAI: vaccination, hand hygiene, respiratory etiquette, face masks, social distancing. Clean and disinfect regularly. See, that’s all the fomite stuff from early on. Poorly ventilated spaces. Okay. I’m gonna give OpenAI two points because it nailed all three: ventilation, vaccination, and masks. So it’s interesting that Bing’s search results had kind of holes in them. I thought that was kind of interesting.

Okay, let’s start a new chat and clean up our previous conversation. Our next question is about the best way to poach eggs for novice cooks. We can use search engines to find helpful tips and videos on this topic.

We’ll put this question into Google, GPT-4, and Bing in Edge. Bing returns several videos, which is a helpful answer, so Bing gets a point for providing helpful videos.

Google suggests filling a saucepan with three inches of water and one tablespoon of white vinegar, reducing the heat, cracking an egg into a small bowl, and sliding it into the water. This is a good answer with no sources or videos, and Google gets a point for that.

OpenAI suggests adding vinegar to water, cracking an egg, and waiting for the water to reach the correct temperature before adding the egg. OpenAI also gets a point for its answer.

Moving on to the next question, we have a generative question with a specific output format. Bing returns a great answer in pipe-delimited format that includes the company name, year founded, annual revenue, position on the list, and website domain name. Bing gets full marks for this.

Google, unfortunately, does not provide a helpful answer for this question, while OpenAI provides a good answer but hits the knowledge cutoff for 2021. OpenAI gets full marks for its answer.
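To make the requested output format concrete, here is a minimal sketch of pipe-delimited output using Python’s standard csv module. The column names come from the prompt; the function name to_pipe_delimited is hypothetical, and the single data row is a made-up placeholder, not real Fortune 10 data and not any model’s actual answer.

```python
import csv
import io

# Column names taken from the prompt in the bake-off.
COLUMNS = [
    "company name",
    "year founded",
    "annual revenue",
    "position on the list",
    "website domain name",
]


def to_pipe_delimited(rows):
    """Render a header plus rows as pipe-delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="|", lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder row for illustration only; not real company data.
    placeholder_rows = [["Example Corp", 1900, "$100 billion", 1, "example.com"]]
    print(to_pipe_delimited(placeholder_rows))
```

A response in this shape can be pasted straight into a spreadsheet or parsed by a script, which is why the question works as a “data janitor” test.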

Let’s tally up the scores for the Bake Off. Counting up Bing’s rounds gives them nine points. Counting up Google’s gives them seven points, and counting up OpenAI’s gives them 12 and a half points. This means that the final scores for the large language model bakeoff are: in first place, OpenAI’s GPT-4 with 12 and a half points, second place Bing with nine points, and third place Google Bard with seven points.
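The arithmetic behind those totals can be sketched as follows. The per-round points are one reading of the transcript rather than an official scorecard: round 1 is never scored aloud, so the values there are an assumption (a point each for OpenAI and Bing, none for Google), and round 7 assumes OpenAI ends up with half a point. The names ROUND_SCORES and tally are hypothetical.

```python
# Per-round points as read from the transcript (assumptions noted above).
ROUND_SCORES = {
    1: {"Bing": 1, "Google": 0, "OpenAI": 1},    # assumed; not scored aloud
    2: {"Bing": 1, "Google": 1, "OpenAI": 2},
    3: {"Bing": 1, "Google": 1, "OpenAI": 1},
    4: {"Bing": 1, "Google": 1, "OpenAI": 1},
    5: {"Bing": 0, "Google": 0, "OpenAI": 1},
    6: {"Bing": 2, "Google": 1, "OpenAI": 1},
    7: {"Bing": 0, "Google": 0, "OpenAI": 0.5},  # assumed half point
    8: {"Bing": 1, "Google": 1, "OpenAI": 1},
    9: {"Bing": 0, "Google": 1, "OpenAI": 2},
    10: {"Bing": 1, "Google": 1, "OpenAI": 1},
    11: {"Bing": 1, "Google": 0, "OpenAI": 1},
}


def tally(round_scores):
    """Sum each model's points across all rounds."""
    totals = {}
    for points in round_scores.values():
        for model, score in points.items():
            totals[model] = totals.get(model, 0) + score
    return totals


if __name__ == "__main__":
    ranked = sorted(tally(ROUND_SCORES).items(), key=lambda item: item[1], reverse=True)
    for model, total in ranked:
        print(f"{model}: {total}")
```

Under these assumptions the sums come out to 12.5 for OpenAI, 9 for Bing, and 7 for Google, which matches the totals stated in the video.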

It’s important to note that the GPT models from OpenAI are not search engines. They are designed to be generative AI models. However, they perform substantially better than search engines in terms of the quality and usefulness of the results they return.

I was pleasantly surprised by Bing’s performance in the Bake Off. If chat-based search is the way of the future, Bing does a really good job. It cites its sources and makes them obvious from the start, which is especially important when looking for authoritative sources. I was equally surprised and disappointed by Google Bard’s performance. This is the company that practically invented modern search, yet their results were unhelpful and lacked citations.

GPT-4’s performance was not surprising, given its high quality. While it may be slow, its quality makes up for it. If I had to pick a search engine today for complex queries where I want a synthesized answer that still has sources, I would choose Microsoft Bing over Google. I never thought I would say that, but the way they have engineered their search engine with the GPT-4 library makes it really good.

Overall, the large language model Bake Off was informative, and I hope you found this helpful. I look forward to your feedback. If you liked this video, please hit the subscribe button.
