
So What? Introduction to Natural Language Processing

So What? Marketing Analytics and Insights Live airs every Thursday at 1 pm EST.

You can watch on Facebook Live or YouTube Live. Be sure to subscribe and follow so you never miss an episode!

In this week’s episode of So What? we focus on what Natural Language Processing is, why you should care, and how to use it. Catch the replay here:

In this episode you’ll learn: 

  • The basics of what natural language processing does
  • The technologies that perform NLP
  • Common marketing applications

Upcoming Episodes:

  • Consumer Trends – 8/12/2021
  • Email Metrics – TBD
  • LinkedIn Algorithms – TBD

Have a question or topic you’d like to see us cover? Reach out here:

AI-Generated Transcript:

Unknown Speaker 0:12
Well, happy Thursday. Welcome to So What? The Marketing Analytics Insights... Marketing Insights and Analytics? Where’s my banner? I don’t remember what we are today. We are the Marketing Analytics and Insights Live show. This is all, you know, actually live.

Unknown Speaker 0:28
Oh, my goodness. Well, welcome. I am joined by Chris and John. Today we are talking about an introduction to natural language processing. We’re talking about this to give you a feel for what Chris will be talking about at MAICON, which is the Marketing AI Conference. It’s completely virtual this year, on September 13 and 14. So you can go to marketingaiinstitute.com and learn more about that from our friends over there.

Unknown Speaker 0:57
And I think Chris, now you, you and I both have discount codes. So if you’re interested in signing up for their

Unknown Speaker 1:04
conference or Chris’s session, then just let us know. And we can provide that information for you. So anywho.

Unknown Speaker 1:11
In talking about natural language processing, people generally have a lot of questions. What is it? How do I use it? So today, we’re going to talk about the basics, the technologies that currently perform natural language processing (so things that you may already have in your martech stack), and common marketing applications. So knowing not just what it is, but what to do with it, which is the “so what” portion of the show. So Chris, what is natural language processing?

Unknown Speaker 1:38
Well, let’s do this. Before we get into what it is, let’s talk about why anyone should care about it. Because as much as I love to sit here and foam at the mouth endlessly about all the cool technology, most people don’t have that same compulsion.

Unknown Speaker 1:55
Katie, something that you said, I think is really important is that data isn’t just numbers, right?

Unknown Speaker 2:04
If we think about let’s see if we can bring this up here. Oh, look, that’s pretty cool. Let’s do it. This one, I was gonna say, I don’t mean, John, that’s not cool.

Unknown Speaker 2:14
If we think about,

Unknown Speaker 2:17
let’s say a very popular thing in consulting these days, the voice of the customer, right? What’s in the voice of the customer? You have your major categories, like advertising and marketing, PR, sales, and customer service. But then there are all these, I guess, channels or modalities or methods by which we would talk to customers and, more importantly, listen to what customers had to say. And so if we think about the voice of the customer, on the left-hand side is all that data, right? Chatbots, CRM data, emails, focus groups, the customer service inbox. You know, the list is extensive.

Unknown Speaker 2:57
And to your point, Katie, not a lot of this data is numeric. In fact, very little of it is numeric. And yet a huge amount of it is really valuable. So think about what’s in your customer service inbox. When people are reaching out to you, it’s generally not because they’re bored. I mean, maybe it is, but probably not. Generally, they’ve got some kind of problem. And you would be well served to pay attention to your customers and get an understanding of what it is they’re complaining about, what it is that they need help with. And you can’t do that because what’s in your customer service inbox really is just a big pile of text. And so we care about natural language processing because we’re sitting on so much of this stuff. We’ve got emails coming out the wazoo. We’ve got reviews, right? As everyone who’s set up a Google My Business page knows, it’s almost all text; there are very few things that are not text. There are

Unknown Speaker 4:03
the tweets and social media posts that we deal with, even in SEO, right? I know a lot of folks spend a lot of time in SEO, but even SEO data is substantially language, right? What is natural language processing? How does it work? How to learn? These are all these different things. And the best example I can give of the importance of natural language is this.

Unknown Speaker 4:27
And I want you two to try and guess this. Here is

Unknown Speaker 4:32
the quantitative data only. What is this? This is a recipe.

Unknown Speaker 4:41
It’s a recipe.

Unknown Speaker 4:46
See, cups, tablespoon teaspoon, so it’s something you’re baking. Uh huh. So it’s some sort of like a bread or a cake, probably. I’m gonna go with some sort of a cake with a frosting. Okay, all right.

Unknown Speaker 5:00
She’s going with that. I’m going with 10 pieces, so I’m gonna say it’s cupcakes. Cupcakes, okay. You’re correct that it is a baked good of some kind, right? It is Boston cream pie.

Unknown Speaker 5:13
So, with 10 pieces, that’s not a pie at my house, that’s for sure.

Unknown Speaker 5:23
So where’s my pie?

Unknown Speaker 5:25
For four pieces.

Unknown Speaker 5:30
So what you’re saying John is at the next company outing, everyone has to bring their own pie, right? It’s like, we got eight pies, one for everyone and seven for John.

Unknown Speaker 5:43
But you can see how the qualitative data, the language, the words, make a huge difference, right? This is not helpful, even though it’s all numbers. And as somebody who likes numbers and likes working with numbers, I’m a big fan of them. But it’s not enough, right? We need that language in order to make sense of it. And obviously, doing this with a recipe would just be silly.

Unknown Speaker 6:06
So why we care about NLP is because we are sitting on a goldmine of stuff, right? We’ve got so much rich stuff. And if we think about marketing analytics as data only, as numbers only, I should say, then it’s like bringing a kid’s shovel to the goldmine. You’ll get some gold, maybe, particularly if it’s one of those things that’s set up for kids.

Unknown Speaker 6:31
But it’s not going to be the most effective. So that’s, that’s why we care about this stuff.

Unknown Speaker 6:38
Let’s talk about what it is. Fundamentally, under the hood, natural language processing is about getting machines to process language in ways that it can be used for one of two tasks: either classification or prediction. So if you get a big pile of text, we want to classify it and say, okay, what were the words used? What were the sentences used? Things like that. And prediction is: okay, given a series of data points, what’s the next one?

Unknown Speaker 7:13
When people ask for the most simplified explanation possible, it’s this, right? It’s turning one into the other. If you can do this, you’re doing natural language processing.

Unknown Speaker 7:26
But you’re doing it manually is the problem.

Unknown Speaker 7:29
A lot of people do it manually. And that’s fine if you have, like, one document. When you have dozens or hundreds or thousands or millions of documents, then you start running into scaling issues. There’s simply not enough time in the day, or people, to be able to do that. Now, here’s the important thing that people miss.

Unknown Speaker 7:52
Machines can’t read. Machines cannot comprehend; they cannot understand what it is they’re reading. If you go to the most advanced systems out there, like the GPT transformers, and you type in the sentence (not the numbers, but the sentence) “two plus two equals,” and you say predict.

Unknown Speaker 8:10
Every one of us is going to say the next word is four.

Unknown Speaker 8:13
The machine, in a lot of cases when we try this, doesn’t do that. It spits out something completely different, because it has no comprehension; it has no understanding of what it’s reading. All it is doing is statistical probability, guessing what the probability of the next word in the sentence is going to be. It doesn’t know what it’s reading. So one of the most important, I guess, warnings about natural language processing is that the machines don’t understand what they’re saying. Even if it feels like they do, they don’t actually understand.

Unknown Speaker 8:46
And I think, you know, we’ve talked about this and joked about it a lot: the machines aren’t really sentient. They’re not making decisions based on information they don’t have. So if you have never given it examples of two plus two equals four, it’s not going to guess four; you have to feed that information into it. And so, Chris, to your point, when we talk about this on a different topic, like ethics and bias in AI, it can only feed back to you what you feed into it. So, you know, there have been really bad examples of AI bots on social media that have gone poorly, because they are just replicating the information they’ve been given. They’re not making decisions, going, oh, I’m going to use my judgment and say that’s probably, you know, not a politically correct comment, or it’s a racist comment, so I’m not going to say that. It’s going to say, well, this is what you gave me, so this is what I’m going to say.

Unknown Speaker 9:47
Exactly. And so

Unknown Speaker 9:50
we have to know that no matter how good the technology is right now or in the near future, it cannot replace humans, because it can’t

Unknown Speaker 10:00
make decisions. I think that’s a really good way of putting it: the machines can’t make decisions about what they’re writing when they’re creating content, or what they’re analyzing, or even understand, when you feed it a big pile of text, what’s in there. So let’s dig into how this stuff works a little bit. We’re not going to get super technical here, because honestly, the tools that are out there do a lot of this behind the scenes. But it’s still good to understand it because, you know, it governs how you think about what you give the machines. So fundamentally,

Unknown Speaker 10:31
the machines do take in text. Right now, this is a standard sentence. Already we can start. Remember, the goal is this: turn words into numbers, right? So we’ve already got some information here. We’ve got 44 letters, we’ve got nine words, it’s the English language. So we can already start doing some natural language processing; all we’re doing is trying to turn words into numbers. Our next step is to say, what if we take this sentence and break it up into pieces? It’s a process called tokenization. You’ll hear that a lot. You’ll see it in people’s code when they’re writing the code, and you will see it in vendor brochures talking about their advanced tokenization techniques. All that means is: take something big and break it up.

Unknown Speaker 11:20
In this case,

Unknown Speaker 11:22
if you chose your token, your unit of analysis, as a single word, you have nine tokens. If you chose phrases, you’d actually have eight, because “the quick” could be a phrase, “quick brown” would be a phrase, “brown fox” would be a phrase, and so on and so forth as you walk through this. And then if you chose multiple words, something called n-grams, “the quick brown” would be one, “quick brown fox” would be number two, and so on and so forth. All the machinery is doing is trying to break up text into the chunks that we tell it to, based on things like spaces. This is why natural language processing has a really hard time with some languages, like Japanese, for example, because spaces aren’t necessarily going to be there. The same goes for some forms of Chinese, and there are other languages where how words are broken up

Unknown Speaker 12:14
doesn’t have an easy, machine-readable way yet to actually understand the context. And so, Katie, to your point earlier,

Unknown Speaker 12:22
if you’re a company that works internationally, a lot of tools are trained and tuned and built on the English language. And there’s an expression in the language world that English is the language of the rich, right? English is the language of the wealthy countries.

Unknown Speaker 12:38
And so if you are doing business internationally, one of the things you have to be very careful about is that you’re working with models that have been trained on more than just English. Well, the English language, without getting too far into a digression, is

Unknown Speaker 12:53
problematic. Yes. Yeah. on its own, when you think of all of the grammar rules, and then the rules that break the rules, and you know, the things that should be past tense and present tense and things that should be plural and present, like, it’s a mess. And so

Unknown Speaker 13:12
Yeah, it saddens me to know that that’s how we’re training these machines, because we as native English speakers can sometimes barely speak our own language, because it is such a hot mess. Oh, yeah, absolutely. And there are some forms of tokenization that break words into subwords, like by certain character endings,

Unknown Speaker 13:34
which could be really problematic with English. Think about it. Why is this?

Unknown Speaker 13:41
Why is it that the words naked and baked, which are spelled essentially the same way, are pronounced completely differently? If you were to break them up into individual components, you would assume they’re equivalent from a machine perspective. They’re not equivalent. They’re not pronounced the same, and they obviously have a different context. So again, you have to be real careful how these machines are trained.

Unknown Speaker 14:00
It’s true.

Unknown Speaker 14:02
Well, and, you know, then you factor in things like regional slang and the way that people pronounce things based on an accent. So you have the people who say “day-ta” and the people who say “dah-ta.”

Unknown Speaker 14:18
And it’s not that you’re trying to teach the machine how to pronounce it necessarily, but you might be, that might be part of the output that you’re putting together. And so it’s, it’s incredibly tricky.

Unknown Speaker 14:32
It is, like read and read, spelled exactly the same way, different, different meanings.
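Editor’s sketch: the tokenization step described above can be illustrated with a few lines of Python. This is a minimal example that assumes English text split on whitespace; the helper names `tokenize` and `ngrams` are hypothetical, not from any tool mentioned on the show, and real tokenizers handle punctuation, subwords, and languages without spaces far more carefully.

```python
def tokenize(text):
    """Split text on whitespace, strip edge punctuation, and lowercase each word."""
    cleaned = (word.strip(".,!?;:").lower() for word in text.split())
    return [word for word in cleaned if word]


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined into a phrase."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "The quick brown fox jumps over the lazy dog."
words = tokenize(sentence)      # single-word tokens
bigrams = ngrams(words, 2)      # two-word phrases
trigrams = ngrams(words, 3)     # three-word n-grams

print(len(words), words)
print(len(bigrams), bigrams)
print(len(trigrams), trigrams)
```

A nine-word sentence yields nine single-word tokens, eight two-word phrases, and seven three-word phrases, because each window slides one word at a time.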

Unknown Speaker 14:38
So the next step after tokenization that machines do is what’s called vectorization. All that means is we convert whatever our unit is, our token is, into numbers, unique IDs essentially. So here we’ve taken “the quick brown fox jumps over the lazy dog,” and it’s, you know, 1, 2, 3, 4, 5, 6, 1, 7, 8. And the reason for that is because “the”

Unknown Speaker 15:00
is the same token, it’s the same word, right? So it’s not uniquely numbered; you would assign the same thing. And this is now how you start to get into probability. Because if you’re trying to predict what the next word is, if you see “the”

Unknown Speaker 15:14
in this very simple toy example, there’s a probability that the word after that could be “lazy.” But there’s also a probability the word after that could be “quick.” It could be number two, or it could be number seven. So now you’re starting to understand that machines don’t read, they just compute; they just predict the probability of the next number in the sequence. And so again, to the point of, you know, read and read: “I read the book,” “you read to me.” Because that’s the same token but has a different use, present and past tense, the probabilities can get mixed up. Right, so part of what we have to think about when we’re talking, when we’re thinking about

Unknown Speaker 15:56
the kind of data we need to use natural language processing, we have to give some serious thought to how we clean it, how we prepare it. To your point, Katie, about dialects: if your customers come from different regions of the country, or different regions of the world, you’re going to get very different usages of English. And if you are trying to process it to understand what’s in the box or, even worse, generate new language, you could start creating word salad. That’s true. I mean, just a very simple example of that.

Unknown Speaker 16:29
A lot of

Unknown Speaker 16:31
people refer to soda as pop. But pop is a noun and a verb. Yes, and it’s also an adverb-y thing. And so it’s a lot of different things, and it’s one word. And so in the wrong context, this machine is going to get very confused. Exactly. I mean, if you’re from New York City or the South Shore, you know, dropping an F bomb is essentially your noun, verb, adjective, punctuation. It’s everything.

Unknown Speaker 17:01
I like how you said New York City or the South Shore, meaning me.

Unknown Speaker 17:06
Hey, I keep it clean when we’re live.

Unknown Speaker 17:11
Again, that’s, that would be one of those things, where if you think about if you’re trying to do natural language processing, for generating, like, say, a chatbot? Do you want your chatbot to be dropping the F bomb on customers?

Unknown Speaker 17:25
Maybe I’m gonna, I’m gonna say no, I mean, I was debating it, but I’m gonna say no.

Unknown Speaker 17:31
brand, like if it’s part of your brand, kicks into beast mode when things get bad.

Unknown Speaker 17:40
long as the customer swears first, exactly.
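Editor’s sketch: the vectorization and next-word probability idea from the fox example can be shown in Python as below. This is a toy illustration under stated assumptions: IDs are assigned in order of first appearance, probabilities come from simple counts of which ID follows which, and the function names `vectorize` and `next_token_probabilities` are hypothetical. Production language models are far more sophisticated than this.

```python
from collections import Counter, defaultdict


def vectorize(tokens):
    """Give each distinct token an integer ID, in order of first appearance."""
    vocab = {}
    ids = []
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab) + 1
        ids.append(vocab[token])
    return ids, vocab


def next_token_probabilities(ids):
    """Estimate P(next ID | current ID) by counting adjacent pairs."""
    follows = defaultdict(Counter)
    for current, following in zip(ids, ids[1:]):
        follows[current][following] += 1
    return {
        current: {nxt: count / sum(counts.values()) for nxt, count in counts.items()}
        for current, counts in follows.items()
    }


tokens = "the quick brown fox jumps over the lazy dog".split()
ids, vocab = vectorize(tokens)
probs = next_token_probabilities(ids)

print(ids)                    # "the" gets the same ID both times it appears
print(probs[vocab["the"]])    # what tends to follow "the" in this one sentence
```

Because “the” appears twice and is followed once by “quick” and once by “lazy,” the sketch gives each of those an even chance. The machine never sees the words, only the IDs.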

Unknown Speaker 17:44
And tokenization and vectorization occur at the token level. As we said, there are different kinds of tokens, right? There are the individual ones, then there are sentences, paragraphs, documents, and corpuses. And this is important, especially if you’re into SEO, because for, what, 20 years,

Unknown Speaker 18:05
we thought SEO was about the token, the word, maybe the phrase.

Unknown Speaker 18:11
The most advanced models being used by companies like Google now operate much further along the spectrum. We think of SEO as that very first thing there, just the individual word unit. When we look at the way Google uses language, it is actually operating all the way up at the document level, being able to say: this page that is authoritative, and this page which is new, that I’ve never seen before, have a lot of document-level similarity. So I’m going to give an authority score to this page that’s relatively new, that mirrors this one that I already know, because it can read and compare the documents at the document level. And so that changes your SEO strategy from, you know, I’ve got to make sure I say the words “management consulting” six times in this blog post and in the first two words in the title, to: does it look like an authoritative source? Does it read like something that McKinsey or Accenture wrote? Because that’s how it’s thinking. And what’s going to happen in the next couple of years, and this is where things get really hairy, is that last box in the sequence: the corpus, the collection of documents. Google announced at their I/O show this year that they’ve got this brand new model called the Multitask Unified Model, or MUM, as they’re calling it. And it’s essentially an evolution from document-level recommendations to corpus level, to say: okay, I’m going to look at everything I know and synthesize something and give that to you as your search results. Which means that we as marketers have to be sitting down going: what does our corpus of work read like, look like? You know, Katie, you and I were talking about the content on our blog. Do we have the right content on our blog? We’ve been looking at pages and topics. And now we have to be thinking: does our work

Unknown Speaker 20:00
As a whole represent a corpus that is authoritative and similar to other things that Google would say, Yep. You’re an expert.
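Editor’s sketch: one simple way to get a feel for “document-level similarity” is cosine similarity over word counts, shown below in Python. This is an assumption for illustration only and is not how Google scores pages; search engines use far richer models. The sample page texts and the names `word_counts` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def word_counts(text):
    """Represent a document as a bag of lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0 = nothing shared)."""
    dot = sum(a[word] * b[word] for word in set(a) & set(b))
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical page texts, invented for this example.
known_page = "management consulting firms advise clients on strategy and operations"
new_page = "our consulting firm advises clients on strategy and marketing operations"
unrelated_page = "knitting patterns for warm winter scarves and hats"

similar = cosine_similarity(word_counts(known_page), word_counts(new_page))
different = cosine_similarity(word_counts(known_page), word_counts(unrelated_page))

print(round(similar, 3), round(different, 3))
```

The new page shares much more vocabulary with the known page than the knitting page does, so its score is higher. The same comparison extends to a corpus by combining the counts of all its documents.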

Unknown Speaker 20:08
Yeah, well, then the question I have is, who’s telling Google what things are authoritative or not? It’s people. You know, Google’s not making those decisions. The AI is not saying, that’s authoritative, that’s not. It’s all fed from people saying, I’ve ranked this as authoritative or not. And so, you know, it’s definitely something to keep in mind. I’m not saying that the rankings that Google has are wrong, but know that they are people-managed. And so there may be some unconscious bias in there as well. Because you mentioned McKinsey and Accenture: there’s a reputation and there’s a perception. And I’m not picking on them, but just as an example, are they really authoritative? And so those are the things that, as you are making your judgment of the

Unknown Speaker 21:02
other corpus core by I don’t know glorious corpus corks corpus.

Unknown Speaker 21:09
See, English language is hard.

Unknown Speaker 21:12
The other corpuses that you want to be compared to: you have to factor that in. Like, this is what Google says is authoritative. But you have to make the decision for yourself, for your brand: who do you most want to be like, and what do you want to represent?

Unknown Speaker 21:29
Yep, actually, I guess, corpus focuses Latin.

Unknown Speaker 21:36
I was thinking about core pi or core, you know, core pi would be the correct plural in Latin, because you only do the us with an S plural for Greek source was a campus corpus is Latin. So tell me again, how the English language should be the one that this is being trained?

Unknown Speaker 21:56
Exactly. Okay. So that’s, in effect, more or less how the nuts and bolts of natural language processing work. Now, once you have converted everything to numbers, then you’re doing math on it, right? Though you’re not; the machines are doing math on it. And so, let’s look at a couple of really simple examples of some of the math that we do. So Katie, what I’m gonna do is pull up a chart here that I just generated. This is the most frequent phrases in your followers on Twitter, right? This covers all the people who follow @katierobbert. If you don’t, please go follow @katierobbert on Twitter; you can see the handles on screen here.

Unknown Speaker 22:39
These are the bios. So this helps you understand. We used some natural language processing and kept it real simple. I said I want two-word phrases,

Unknown Speaker 22:48
two words, bigrams, and then just raw numbers: how many times do we see this in all these bios? And so this gives you now an idea of who’s in your audience. So what do you see, Katie? Does this resonate with the people that you interact with on a regular basis? It does, it absolutely does. And so the pattern that I see when I start to mentally categorize these is it’s primarily digital marketers, whether it’s a social media marketer, content marketer, or media marketer, and other co-founders and other CEOs who are mostly following me. And then you have a smattering of other things. There’s, you know, some data science in there, which makes sense. That’s what you would want to see, since those are some of the organizations that I belong to. And then there are a couple of things that I would want to look into, such as “best selling,” “full service,” and “follow us,” which, to me, that’s fine, but that doesn’t really tell me who those people are.

Unknown Speaker 23:48
Exactly. So now with this information, what would you do with this? What’s the so what for you? So what for me is that, you know, as I am

Unknown Speaker 23:59
putting out posts on social media, I want to make sure that they are, if this if this is the right audience, for me, which I believe it is, that then I want to make sure that I am putting out content that resonates with this kind of an audience so that I’m mentioning things about social media and how that and how what we do influences social media, how what we do influences digital marketing, and how I as a co founder, you know, what my experiences are, so that I’m speaking to the audience and having similar conversations that they are so that they feel like it’s something that they see themselves in and that they want to engage with.

Unknown Speaker 24:40
Makes sense. Makes sense. And so you probably would not publish a whole bunch of tweets about knitting.

Unknown Speaker 24:46

Unknown Speaker 24:47
You could. Well, here’s the thing: if I wanted to suddenly attract social media marketers of knitting circles, if that’s a thing (I’m sure it might be), then I would. But that’s

Unknown Speaker 25:00
exactly it. I would want to find people in those demographics that would benefit the most from my services, but also that I can then learn from.

Unknown Speaker 25:13
Exactly. Now, if we look at the company one, it’s similar, but you’ll see things like data science being higher up, and you’ll see machine learning be higher up. Obviously, that makes sense: your personal Twitter account reflects the things you talk about, and this reflects more of what the company tends to share. You do see some interesting role things, like “wife and mom,” “husband and dad,” “husband and father” towards the bottom. So you’re starting to also get a sense of who the audience is. You can make the somewhat logical inference,

Unknown Speaker 25:47
unproven (you’d want to get some quantitative data to back it up), that these are probably not super young kids, probably not immediate college graduates, but people slightly further along in their careers. So this is a real simple example of how you would use basic natural language processing to guide a strategy.
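Editor’s sketch: the bio analysis described above, counting the most frequent two-word phrases across follower bios, can be approximated in Python as follows. The sample bios are invented for illustration, the helper name `two_word_phrases` is hypothetical, and this is an assumption about the general approach rather than the actual code behind the chart shown on the stream.

```python
import re
from collections import Counter


def two_word_phrases(text):
    """Lowercase the text, pull out words, and pair each word with the next one."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(pair) for pair in zip(words, words[1:])]


# Hypothetical follower bios, invented for this example.
bios = [
    "Social media marketer and co-founder",
    "Digital marketing strategist. Social media fan.",
    "Data science lead, husband and father",
    "Co-founder and CEO. Digital marketing and data science.",
]

phrase_counts = Counter()
for bio in bios:
    phrase_counts.update(two_word_phrases(bio))

# Raw counts: how many times each phrase appears across all bios.
for phrase, count in phrase_counts.most_common(5):
    print(count, phrase)
```

Note that this simple version ignores sentence boundaries and keeps filler pairs like “husband and,” so a real analysis would usually filter those out before charting.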

Unknown Speaker 26:10
And, you know, I’ve definitely seen

Unknown Speaker 26:14
people do this manually, where they pull the list of Twitter followers, and then they start to eyeball the bios. You know, what I was just doing was sort of mentally categorizing things. But what this has already done is taken it that step, to categorize the data into a smaller set, and then I can start to cluster, rather than categorize, similar terms, such as social media, social media marketer, digital marketer. Those things all belong, at least in my brain, in the same kind of cluster. Exactly. A more sophisticated example of using this: this is an extract of 5,000 job postings from one of the projects we did for a client, a recruiting agency for truck drivers. We pulled all 5,000 of their job posts and said, okay, what are the words and phrases you’re using? You can see pretty clearly, they’re talking a lot about the type of commercial driver’s license you have to have, the type of vehicle you drive, your age, etc. Makes sense, right? Those are all the things you would want, as a company, to specify: this is what we’re looking for in the job. But when we took 17,000 calls from their call center and transcribed them, what we ended up with was a very different set of conversations. What’s the pay? What’s your starting cents per mile? What’s your driving record? Are you home on the weekends? Is there a signing bonus?

Unknown Speaker 27:45
All sorts of things that the candidates cared about, that weren’t in the job ads. And when we brought this to the client, we said, okay, you’re talking about one thing, your customers are talking about another thing. You’re not even dining at the same restaurant, much less the same table. You need to fix your job ads and at least answer some of these obvious questions that they just didn’t get to, because it was all trapped in audio format. And when they changed their job ads, they saw immediate improvement. There was an instant, like, 40% increase in conversion for their job ads, just because they were talking about the things that their customers want. And now, to your point, Katie, you could absolutely do this if you just listened in to five or six calls; it probably would not have taken very long to get a sense of this. But this really helps quantify, like,

Unknown Speaker 28:34
big, up front: you’ve got to tell somebody, this is 17 cents per mile, that’s what this job is. So the person goes, oh, I’m in, I’m making 16 cents a mile, that’s a better deal. Well, and that makes a lot of sense. And so you would talk about the scalability of it. If you are doing voice of the customer research, and you’re talking with a lot of customers and getting their feedback and have the audio recordings and are listening to customer support calls, you as one individual person will really only be able to do a sampling of the data, which is good. But, you know, you don’t know if you’re sampling the outliers or not: that one person who said “I care about this thing” might not represent the rest of the customers. Whereas using a more automated methodology with natural language processing, you can get to more of the data to get, you know, those more detailed insights. Exactly. So the question that everybody has at this point is, okay, how do we do this? How do you get started? Well, there are two different ways you can do this. If you are not necessarily technologically inclined, there are pieces of software out there that can do some level of the processing, maybe not to the extent of the sort of stuff we show.

Unknown Speaker 30:00
But you could at least do some of it. Probably the one that I know best is a company called Talkwalker. They do social media and news media monitoring. But one of the things that they also do is let you upload your own data into the system and say, this is the text I want analyzed. It’ll make word clouds and stuff like that, but it will do a lot of that basic text mining for you.

Unknown Speaker 30:25
Obviously, the advantage of that is you don’t have to code anything; you can just upload your data, and it will make cool visualizations for you. On the other hand, if you have a use case that’s more complicated, like analyzing call center calls, you’re probably gonna have to either code it yourself or hire someone or an agency to code it for you, because

Unknown Speaker 30:50
a lot of this stuff is so new, that

Unknown Speaker 30:54
there aren’t really good commercial applications for it yet. Or if there are, they’re really priced.

Unknown Speaker 31:01
reassuringly expensive.

Unknown Speaker 31:04
As Tom Webster puts it. You know, it reminds me: things that are not new are word clouds. And so I want to bring up the word cloud, because a lot of social listening tools provide word clouds, and a lot of non-social-listening tools provide word clouds as an output. And a lot of people (this is a broad-strokes generalization) want to see them in their reporting, like, “But I just want a word cloud. Show me the word cloud.”

Unknown Speaker 31:33
I guess I have a couple of questions. One, the word cloud is not new. So you’re saying that natural language processing is a newer technology? So what’s the disconnect there? And two? If I come to you, Chris, and say, I want a word cloud, but I can’t tell you what I’m gonna do with it. Like, what is? What does someone do with a word cloud?

Unknown Speaker 31:57
And is that going too far off topic? No, actually, you’re at the heart of data visualization. And most folks in the natural language processing space, particularly in academia, will tell you that if you ask for word cloud, you’re going to get stabbed, right.

Unknown Speaker 32:13
But we’re not in academia, though.

Unknown Speaker 32:16
I mean, with most academics it’s mostly ire anyway. I don’t think they’re really well armed.

Unknown Speaker 32:23
Word clouds are bad because there’s no quantification. There’s no sense of scale, right? The biggest word in the cloud might be 10 times or 100 times more important than the smallest word in the cloud, right? Most of them are scaled just to make them easier to read. And so they’re a really bad visualization tool. They’re cool. They’re fun to look at. They look very meditative, like a mandala. You could just sit there and meditate on it. But they’re not really good at

Unknown Speaker 32:50
drawing quantitative insights. And to your point, yes, things like counting words are as old as words themselves, right? That is definitely not new. And a lot of basic tools, like word clouds, are just counting words. And there’s a place for that. You know, obviously, the Twitter bio analysis was just counting words, really easy.
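Editor’s sketch: the word counting described above is small enough to show in a few lines of Python. This is a minimal illustration, not the analysis from the show. The bios are made up, and `count_words` is a hypothetical helper name. It also shows the point about scale: a table of counts gives the numbers that a word cloud hides.

```python
import re
from collections import Counter


def count_words(text):
    """Lowercase the text, pull out the words, and count each one."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Hypothetical Twitter-style bios, invented for this illustration.
bios = [
    "Marketing analytics nerd. Coffee lover.",
    "Coffee, data, and marketing.",
    "Dad. Marketing. Coffee. More coffee.",
]

counts = count_words(" ".join(bios))

# Print the two most frequent words with their actual counts,
# which is the quantification a word cloud does not show.
for word, n in counts.most_common(2):
    print(f"{word}: {n}")
```

A word cloud would draw both of those top words large. The counts make it clear exactly how far apart they are.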

Unknown Speaker 33:12
Where you start to get more complicated is when you want to start doing stuff that is beyond counting. So there is a methodology called term frequency-inverse document frequency (TF-IDF) that essentially says there’s a whole bunch of words in every page of text. Most of those words don’t add any value. You know, “the” or “or” or “this” or “also” doesn’t really tell us anything.

Unknown Speaker 33:39
And so if you were to take all those words that are common to all these documents, and just remove them,

Unknown Speaker 33:46
and then do a frequency count, essentially, of what’s left that’s kind of unique, that isn’t so common, then you’re starting to get into, okay, these are the phrases that are probably more important, that set this document apart. And for the last five years or so, that’s actually been one of the secret weapons for more advanced SEO practitioners: using it as a way to winnow down keyword lists and stuff. Obviously, Google moving to its BERT model a few years ago was kind of their golf club to the knee for a lot of SEO folks, but it’s actually still better than nothing.
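Editor’s sketch: a minimal version of term frequency-inverse document frequency in plain Python, assuming the common textbook weighting (term frequency multiplied by the log of total documents over documents containing the word). Strictly speaking, TF-IDF down-weights words that appear everywhere rather than deleting them, which has a similar effect to the removal described above. The three documents and the helper names `tokenize` and `tf_idf` are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(documents):
    """Return one {word: score} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: how many documents contain each word?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            # Frequent in this document, rare across documents = high score.
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


# Hypothetical documents, invented for this illustration.
docs = [
    "the plumber fixed the leaking pipe",
    "the baker sold the fresh bread",
    "the florist arranged the spring tulips",
]

scores = tf_idf(docs)
top_word = max(scores[0], key=scores[0].get)
print(top_word, round(scores[0][top_word], 3))
```

Because “the” appears in every document, its score is zero in all three, even though it is the most frequent word in each one. A plain word count would rank it first.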

Unknown Speaker 34:23
Where natural language processing is different from a word cloud is that you’re adding more and more of these techniques in to try and understand

Unknown Speaker 34:32
what’s happening behind the scenes, what’s happening,

Unknown Speaker 34:37
that you can’t eyeball, that you just would not be able to understand, because the data doesn’t lend itself to that kind of analysis.

Unknown Speaker 34:51
That makes sense. So then the other question I have for you is around predictive text. And so you had mentioned at the start of this conversation,

Unknown Speaker 35:00
One of the applications is for predictive text. Now, obviously, we’re talking about large bodies of data going in so that the machine can learn more about predictive text. But one of the, I’ll call it a game, for lack of a better term, on social media is people will say, start this sentence and see what your predictive text comes up with. And it’s usually just a jumble of random things that you typically say. We all have cell phones, and we mostly all have social media, and even on Gmail, predictive text is part of those services. So can you explain a little bit about how that works, and how natural language processing is a part of that or not?

Sure. So what a lot of things in that space are using is actually called an LSTM, long short-term memory. They’re predicting the next word based on the previous words. If you remember when we were talking about the process, the autocomplete on your phone is operating right here at the token level. As your phone learns, because there’s a lot of onboard AI in today’s modern phones, it gets used to seeing certain phrases. So if you and I, for example, typed the word “trust”, the autocomplete suggestion is going to be “insights”, because it’s the name of our company and we use it an awful lot. It might not be for somebody else. And so what that’s doing prediction-wise is essentially what we were talking about earlier: if I know that, on average, “quick” occurs after “the”, then when you type “the”, statistically “quick” is likely to be one of the logical things for it to choose. And so that’s kind of what predictive text looks like at the word level.

Now, what state of the art looks like these days is at the document level, right? We were talking earlier about tokens, sentence, paragraph, document, and corpus. Here’s an example.
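Editor’s sketch: before the document-level example, here is the word-level idea from the answer above (“quick” tends to follow “the”, so suggest “quick”) as a tiny Python program. This is a plain bigram count, not an LSTM. An LSTM learns from longer stretches of preceding text, but the goal of predicting the next word is the same. The typing history and the helper names `train_bigrams` and `suggest` are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count which word follows which across the typing history."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            following[prev][nxt] += 1
    return following


def suggest(following, word, k=3):
    """Return up to k of the words most often seen after `word`."""
    candidates = following.get(word.lower(), Counter())
    return [w for w, _ in candidates.most_common(k)]


# Hypothetical typing history, invented for this illustration.
history = [
    "trust insights is our company",
    "we work at trust insights",
    "trust me on this",
    "the quick brown fox",
    "the quick answer",
    "the slow answer",
]

model = train_bigrams(history)
print(suggest(model, "trust", k=1))
print(suggest(model, "the", k=2))
```

Someone with a different typing history would get different suggestions for the same word, which is the personalization described in the answer.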
So I pulled this press release from Champion Plumbing in Oklahoma City. I’ve no idea who these people are. It’s like, why are we getting random inquiries from

Unknown Speaker 37:13
And what I did is I cut this press release in half. There’s the first half, which is there, so that’s, you know, the first half of the release. And then there’s the second half. And I took it, and I fed it to a model called GPT-J-6B, which is a generative pre-trained transformer. And what it does is, instead of trying to autocomplete by word, which is what you’re doing with autocomplete on your phone, it’s going to try and autocomplete at the document level. And so

Unknown Speaker 37:40
it came up with this. So the first half of this text, the bold part, is what was the original press release. The second half is what it created, right? And I’ll make this a little bit bigger.

Unknown Speaker 37:53
It created bullet points saying these are ways to avoid plumbing problems. Now, it’s funny, if you go back to the original release,

Unknown Speaker 38:06
it says kind of the same thing. But it’s a lot more boring. No offense, sorry, guys from Champion Plumbing, but your release was, you know, a standard press release. It was not super exciting content. The AI came up with, I think, a better press release, a better way of saying, okay, these are the things that are actually relevant,

Unknown Speaker 38:25
that would actually be of interest, and it generated document-level text. And so that’s where prediction is getting really advanced now. It’s getting away from

Unknown Speaker 38:37
what’s the next word, which is computationally really easy, to what’s the next sentence, what’s the next paragraph, what’s the next document. And where marketers, I think, will benefit pretty strongly in the next couple of years is

Unknown Speaker 38:54
using these tools to build first drafts, right? So this is pretty good. I would want to fact-check it, like, is this actually true? Are magnolias a problem when it comes to... I don’t know if that’s true or not. It reads like it’s true, but I don’t know.

Unknown Speaker 39:08
But this is coherent enough that you could say, okay, let’s forget having a junior account coordinator spend 70 hours writing a blog post draft, and say, okay, machine, you take the first shot, and then let’s send it straight to editing. Let’s send it straight to cleanup so that we can get it out the door faster. And I think there’s been a lot of cases where

Unknown Speaker 39:29
this is good enough, right, to get started. In one of my writing groups, I actually submitted for one of the monthly contests a piece that was entirely machine written. I gave it the first few sentences, and it completed the rest of the document, and it scored in the middle of the pack. It was not last in the voting, right? Which means that it was good enough that people in this group, who didn’t know it was machine written, said, okay, that piece of short fiction didn’t suck terribly. It wasn’t like you rolled your face across the keyboard.

Unknown Speaker 40:00
It wasn’t the best either, but it was okay.

Unknown Speaker 40:06
So let’s say I’m looking for this. I’m a marketer, I need help writing the first drafts. How do I get started? Because you’re showing me all these cool things that I want to have at my disposal. Are there things that I can buy off the shelf? Are there consultancies who do this for you that don’t really just have, you know, a bunch of underpaid content writers in the background while they’re calling it AI? So how do I get started with something like this? Because this is cool.

There are two choices here. There are vendors that do have machinery that puts together decent first drafts. Our friends over at MarketMuse are an example: with their service, you can request machine-generated first drafts, and then, yeah, you’re going to spend some time cleaning them up. But it’s not bad. The other thing is, if you are technically capable of doing so, you can actually take this model that this company has published and run it in a Google Colaboratory environment. Google has their Colab demo environment that can do this stuff. You have to be able to read, write, and execute Python, right? So those are your choices: you can buy it from a company like MarketMuse, or, if you don’t want to buy it and you have the skills, you can just run it yourself. So those would be the two forks in the road, I would say, that people have right now. These models are so new. They’re like two months old.
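Editor’s sketch: for readers taking the run-it-yourself fork, this is roughly what the press release experiment could look like in Python. It is not the notebook used on the show. It assumes the Hugging Face `transformers` library and the model identifier `EleutherAI/gpt-j-6B`, neither of which is named in the episode, and the helper names are hypothetical. The model is a very large download and needs substantial memory, so the loading call is left commented out.

```python
def first_half(text):
    """Return roughly the first half of a document, split on whitespace."""
    words = text.split()
    return " ".join(words[: len(words) // 2])


def load_generator(model_name="EleutherAI/gpt-j-6B"):
    """Build a text-generation pipeline (assumes `transformers` is installed)."""
    # Imported here so the rest of the sketch runs without the library.
    from transformers import pipeline

    return pipeline("text-generation", model=model_name)


def draft_second_half(text, generator, max_new_tokens=200):
    """Feed the first half of `text` to the model and return (prompt, continuation)."""
    prompt = first_half(text)
    outputs = generator(prompt, max_new_tokens=max_new_tokens, do_sample=True)
    full_text = outputs[0]["generated_text"]
    # By default the pipeline returns the prompt plus the new text.
    # Keep only the machine-written part when the prompt is echoed back.
    if full_text.startswith(prompt):
        return prompt, full_text[len(prompt):]
    return prompt, full_text


# Usage (commented out: downloads a multi-gigabyte model):
# generator = load_generator()
# prompt, draft = draft_second_half(press_release_text, generator)
# print(draft)
```

Whatever comes back is a first draft only. As noted above, it still needs fact-checking and editing before it goes out the door.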

Unknown Speaker 41:32
It will not be long before you see a whole host of vendors show up using these models, using the compute power, finding ways to make them more efficient and faster, and offering them at lower prices than in the market today. But if you want to do it right now, those would be the two choices.

All right, John, you and I are out, because neither of us can code Python.

Oh, dude, I’ve got some old-school Python skills. It would take me a few days of pain to get it to work. But

Unknown Speaker 42:00
yeah, I’m not doing that.

Unknown Speaker 42:03
You’d be better off just writing the draft yourself.

Unknown Speaker 42:06
Oh, I do have to chip in, after validation: corpus is actually a neuter Latin noun, so the plural is corpora, with an A.

Unknown Speaker 42:19
Way to go, English language grammar police. So the grammar police are not going to jump on us today.

Unknown Speaker 42:26
I appreciate that. John.

Unknown Speaker 42:33
I think that’s a really important point. Yeah, it’s something I said recently.

Unknown Speaker 42:38
You don’t have to use AI if you have a human-sized problem, right? If you’ve got human-sized data, yeah, you’re probably better off doing it yourself.

Unknown Speaker 42:51
If you start getting into the realm of, hey, summarize this book, or summarize, you know, 10,000 tweets, then you’re starting to get into the range of, yeah, you probably want a machine to do it. Like, we just did a project for a client where we had to look at six competitors’ websites. Could we have just looked at the top 10 pages on those websites? Sure. That’s a human-sized problem. But we wanted to look at all 80,000 pages. That’s no longer a human-sized problem.

Unknown Speaker 43:16
Yeah, I think it definitely is going to be interesting to see. I mean, AI for the past few years has been, and still remains, sort of the shiny object of: I need it, I want it, get it to me yesterday. But we would always try to slow people down in that decision making: what really is the problem that you’re trying to solve? And so with everything that you’re describing, I can see people saying, that’s cool, I want it, I want it yesterday. But do you really need it? Do you have people on your team who actually are decent writers, who are just not being given the opportunity to do that thing? I would venture a guess that that’s a less expensive thing at the moment than implementing AI, because implementing AI takes a lot of change management and can take a lot of software and skill sets. It’s not an inexpensive thing, especially if you can’t just buy it off the shelf. And if you can buy it off the shelf, there’s still a learning curve. There’s still training, there’s still maintenance, you have to train the AI itself. And so

Unknown Speaker 44:22
I think one of the, you know, caveats with all of this, Chris, is that even if you can afford to buy software off the shelf, the English language is hard. You can’t just plug it in and set it and forget it. There’s still a lot of upkeep that goes into it, a lot of setup. And so, you know, definitely explore your options and see what the real problem is that you’re trying to solve when you’re bringing AI into your teams.

Exactly. At the end of the day, if you need to apply math to a lot of data, you have something you need AI for. If you don’t have a math problem, you should definitely take that step back.

So

Unknown Speaker 45:00
Any other final comments before we roll on out of here?

Unknown Speaker 45:06
Python for everyone.

Unknown Speaker 45:10
Alright folks, thanks for tuning in. We’ll see you next week.

Unknown Speaker 45:15
Thanks for watching today. Be sure to subscribe to our show wherever you’re watching it. For more resources and to learn more, check out the Trust Insights podcast at Trust slash ti podcast and our weekly email newsletter at Trust slash newsletter. Got questions about what you saw on today’s episode? Join our free Analytics for Marketers Slack group at Trust slash analytics for marketers. See you next time.
