So What? The Marketing Analytics and Insights Show

So What? How do I clean and prep my data for analysis – part 3

So What? Marketing Analytics and Insights Live airs every Thursday at 1 pm EST.

You can watch on Facebook Live or YouTube Live. Be sure to subscribe and follow so you never miss an episode!


In this week’s episode of So What? we focus on bringing the data set into machine learning software. We walk through more advanced feature engineering and algorithm selection. Catch the replay here:

So What? How do I clean and prep my data for analysis - Part 3

In this episode you’ll learn: 

  • More advanced feature engineering
  • Obvious and non-obvious errors in your data
  • Choosing an analytical approach for your data

Upcoming Episodes:

  • Prep your data for analysis part 4 – 2/18/2021
  • How do you benchmark a website’s performance? – TBD
  • Understanding your knowns and unknowns – TBD

Have a question or topic you’d like to see us cover? Reach out here:

AI-Generated Transcript

Katie Robbert 0:06

Well, hey, we’re live.

This is So What? The Marketing Analytics and Insights Show. Today we’re on part three of “How do I clean and prep my data for analysis?” Over the past two episodes, what we’ve covered is really going through that exploratory data analysis process.

That’s so you can start to get organized and do those business requirements.

Last week, we covered actually extracting the data and figuring out what questions you need to ask.

That way you can go back and see: where was my hypothesis? Am I answering the question being asked? And this week, what we want to cover is actually getting your data into some sort of machine learning analysis.

So we’re going to cover more advanced feature engineering, and some obvious and non-obvious errors in your data.

And then choosing an analytic approach.

You’re probably thinking, what do you mean, an analytic approach? You just analyze the data, right? Well, Chris is going to cover what that actually means.

So Chris, John, take it away.

All right.


Christopher Penn 1:11

As Katie said, we’ve been looking at the exploratory data analysis process.

And one of the things that we highlighted in previous weeks is that at any point in the process, if you are doing stuff and you realize, oh, this is not working, you can jump back a step, jump back several steps, and sometimes even jump back all the way to the beginning.

And so, this week, we’re gonna be looking at the data we’ve worked on for the last two weeks, which is our Twitter analytics data set.

And we’re gonna find, spoiler alert, some problems with the data, and see what we can do with it.

So to recap, what we’ve done so far, is we have gotten the data from Twitter itself from

And we’ve done some initial cleaning, making sure we know what’s in the data.

So let’s take a look at it just very quickly here.

Inside the data, we have the tweet ID, the permalink, the text of my tweets, the time, the number of impressions, engagements, the engagement rate, retweets, replies, and so on and so forth, all the fun of what comes out of Twitter.

From there, we’ve done some basic feature engineering, things like making sure all of our text is in lowercase.

That’s because when you’re doing any kind of text analysis, most algorithms are case sensitive.

So even something as simple as forcing your text to lowercase is a good thing to do.
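To illustrate why the lowercase step matters, here is a minimal Python sketch. The sample tweets and the `count_term` helper are made up for this example; the show itself works in R, and this is not the script used on screen.

```python
# Hypothetical sample tweets; not taken from the real data set.
tweets = [
    "SEO tips for marketers",
    "Why seo still matters",
    "Data Science basics",
]


def count_term(texts, term):
    """Count exact, case-sensitive occurrences of a whole word across texts."""
    return sum(text.split().count(term) for text in texts)


# A case-sensitive count misses the upper-case "SEO".
raw_count = count_term(tweets, "seo")

# Forcing everything to lowercase first makes both spellings match.
lowered = [text.lower() for text in tweets]
clean_count = count_term(lowered, "seo")

print(raw_count, clean_count)
```

The same term is counted once before normalization and twice after it, which is the kind of silent undercount the lowercase step avoids.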

And then we know, generally what I tweet about.

So we’ve put together some counts to see how many times, in a tweet, I reference SEO or data and datasets, and that creates more columns in the table for those numbers. Now, here’s where things get interesting.
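A rough sketch of how per-topic count columns like these could be built is shown below, in Python. The topic dictionary, its search terms, and the `tag_topics` function are assumptions made for illustration; the actual topic list used on the show is not given in the transcript.

```python
# Hypothetical topic dictionary: column name -> search terms.
TOPICS = {
    "seo": ["seo", "serp"],
    "data_science": ["data science"],
    "analytics": ["analytics"],
}


def tag_topics(text, topics=TOPICS):
    """Return one count per topic for a single tweet.

    Uses simple substring matching on lowercased text, so a term can
    also match inside a longer word; a real pipeline may want word
    boundaries instead.
    """
    lowered = text.lower()
    return {
        name: sum(lowered.count(term) for term in terms)
        for name, terms in topics.items()
    }


row = tag_topics("SEO and analytics: why data science needs SEO")
print(row)
```

Each key in the returned dictionary would become one extra column in the table, one row per tweet.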

Let’s go ahead and start running a quick topic check, just to see how many times I’m talking about each of these things and what the frequencies of these words are. When we run this, and it really is just a word count, we see a lot of things like “book a consultation for free,” right? Well, one of the things that we do in our Twitter scheduling is have repeated calls to action. We try to do that for anywhere between 10 and 20% of our social media content, to make sure that we’re getting people’s attention.

This is good for business.

But it’s bad for analysis, because that doesn’t really help us understand what we’re talking about.

Right? So one of the things that we should probably do is take that down to just one.

So instead of keeping every single promotional tweet, which is not a great idea, what if we take that and turn it into just one of each of the promotional tweets? So let’s go ahead and do that.

We’ll call this tweet text.
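The effect of collapsing repeated promotional tweets before counting words can be sketched as follows. This is a Python illustration with made-up tweets, not the R code run on the show; `word_counts` is a hypothetical helper.

```python
from collections import Counter

# Made-up tweets: the call to action repeats three times.
tweets = [
    "book a consultation for free",
    "new post on data science and analytics",
    "book a consultation for free",
    "marketing analytics for beginners",
    "book a consultation for free",
]


def word_counts(texts):
    """Plain word-frequency count over a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


before = word_counts(tweets)

# dict.fromkeys keeps the first occurrence of each tweet, in order.
unique_tweets = list(dict.fromkeys(tweets))
after = word_counts(unique_tweets)

print(before["consultation"], after["consultation"])
```

Before deduplication the promotional wording dominates the counts; afterwards each promotional tweet contributes only once, so the real topics are easier to see.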

Katie Robbert 3:59

Oh, Chris, while you’re doing that, I just, I apologize if I missed it.

But I have a question.

So you said that you’re doing this word count, so that you can try to figure out what you tweet about the most.

Now I know, with text mining, a lot of times you have to provide a dictionary of sorts.

So you have to like prebake, in the words that it’s looking for.

Did you do that, in the sense that you said, I want to look for SEO, I want to look for data science? Or is this script actually just finding and counting all of the similar words?

Christopher Penn 4:32

That’s exactly what this step is: to look at the frequency of those terms.

And now, for example, the “book a consultation” stuff has largely disappeared, right?

Katie Robbert 4:43

My question was, did you give it a dictionary beforehand to tell it what to look for? Got it.

Okay, so that is the difference here.

You’re not exactly doing text mining; you’re actually just doing frequency.

Christopher Penn 4:54

That’s what it is.

And it’s a form of text mining, but it’s the simplest possible form.

This is just classification: it’s a question of what’s in the box, what words and phrases are in the box.

And going through this now, we can see that our instincts and our judgment about the topics that we’ve pre-tagged were on target, right? Marketing and data and analytics and data science: that makes logical sense. Obviously, I would hope that I would know what I’m tweeting about. If I don’t, that’s a different strategic problem.

But this is one of those things where that’s a non-obvious problem. You may look at your data set, and in that initial peek at the data, you go, oh, this looks fine, right? These are all my tweets, and I assume there are no problems with this data.

But when you then do even just that little bit of analysis, you go, huh, that’s a problem, right? That’s something that is kind of messy.

So we’ve now gone through and we’ve cleaned that up, and we should have, let me just make sure I am where I think I am.

Yep, we should have 59 different variables that we could use for modeling.

Now, again, and this is why it’s so important to have some experience, to grow some experience in data science: folks who are early on in their journey will rush right into, okay, now let’s feed this to, you know, IBM Watson and the AutoAI feature and see what comes out of the box, right? They kind of rush in and do stuff.

And it’s a really bad idea here.

Because this dataset is still not ready for primetime. If data is an ingredient, there’s a lot of extra stuff in here that we don’t necessarily want in our final recipe, right? Like plastic spoons in the cake mix, because we left the plastic spoons in the box.

So we’ve got to fish all that stuff out.

So this is now where we start getting into more advanced types of cleaning.

So let’s take a look at the first one.

In numeric data, there’s this concept of the near-zero variable.

This is a variable that doesn’t change a whole lot: either it is zero, or it’s very close to zero and doesn’t fluctuate very much.

And that’s not really helpful.

It doesn’t really predict anything, right? In my case, the number of promoted tweets is zero, because I don’t spend any money on Twitter. So having those fields in here doesn’t make any sense.

But we have to know that they’re there and clean them out.

So in our next set of steps, we’re gonna say, okay, let’s get rid of all those fields that are near-zero fields, because that’s just a bad thing to have.

And what we end up with is, out of the 59 fields we started with, we’re down to 20. We just chopped away 39 columns in our spreadsheet, just cut them off because they’re full of zeros.

And that includes some of the topics, too. In this corpus of tweets, it looks like I have not really tweeted about SEO, because it’s just gone.
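One way to sketch a near-zero filter like the one described above is shown here in Python. The transcript does not say which function was used on the show, so this is an assumption: the rule below is modeled loosely on the frequency-ratio heuristic found in tools such as R's caret package (ratio of the most common value to the second most common, plus the share of distinct values). Column names and data are invented.

```python
from collections import Counter


def is_near_zero_variance(values, freq_cut=95 / 5, unique_cut=0.10):
    """Flag a column that is constant or dominated by a single value.

    freq_cut:   most-common / second-most-common count must exceed this.
    unique_cut: distinct values / total rows must fall below this.
    Both thresholds are illustrative defaults, not values from the show.
    """
    counts = Counter(values).most_common()
    if len(counts) == 1:
        return True  # a constant column carries no information
    freq_ratio = counts[0][1] / counts[1][1]
    unique_fraction = len(counts) / len(values)
    return freq_ratio > freq_cut and unique_fraction < unique_cut


# Hypothetical columns with 40 rows each.
columns = {
    "promoted_impressions": [0] * 40,    # never promoted: all zeros
    "replies": [0] * 39 + [1],           # almost always zero
    "impressions": list(range(100, 140)),  # varies on every row
}

kept = [name for name, vals in columns.items()
        if not is_near_zero_variance(vals)]
print(kept)
```

Only the column that actually varies survives, which mirrors how the 59 fields shrank once the near-zero fields were removed.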


So at this point, we have to go back to our requirements and say, is this a data problem? I could have sworn I tweeted about SEO a lot in the last month. Is this a data problem? Is this a processing problem? What kind of problem is this? If you know that something’s not there that you thought was supposed to be there, you go, huh?

Katie Robbert 8:32

So let me ask you this question.

Because this, this might be part of the challenge that you’re running into.

So obviously, if I’m understanding correctly, this data, it’s only reading the tweet itself.

So if you are tweeting about SEO, but the title of the article, or the tweet itself, doesn’t contain the term SEO, it’s going to appear as if you’re not actually tweeting about SEO.

So there might be a little bit of a mismatch in the copy that you’re coming up with to put on social media, even though the article or the blog post, or whatever it is, is on the right topic.

Is that right?

Christopher Penn 9:12

That’s correct.

And that really goes into this section here on feature engineering, if we think there’s a problem and we validate the data itself is correct.

We have to do augmentation, right? We’d have to say, okay, if I’m sharing links, I know that they’re from, you know, Search Engine Journal, which is exclusively about SEO.

And it looks like I’m not tweeting about it, because I’m pulling in the article titles and the article title doesn’t have, you know, common terms like structured data, SEO, ranking, SERP, etc.

Then maybe I need to augment my dataset with a copy of the actual article so that we can check the articles themselves for those terms.

We’re not going to do that today, because that would take a lot longer than we have on the show.

But that is the kind of question we need to be asking.

When you’re doing data analysis, it’s the old Rumsfeld thing, right? The known knowns, known unknowns, unknown knowns and unknown unknowns, the two by two matrix.

And this is a case where you know that something’s supposed to be there.

And it’s missing.

Because it’s an unknown unknown.

Like, why is this happening? And it’d be one of the things we’d want to try and fix.

So that’s what we’ve done.

Just in this stage: clean up the near-zero variables, then quality check, look at the data set, and go,

yeah, that doesn’t seem right.

It seems like something’s off.

Katie Robbert 10:34

I do remember.

This is one of those like weird moments.

I do remember exactly where I was.

The first time I heard you explain that two by two matrix, the known knowns, the known unknowns, the unknown knowns and the unknown unknowns (I probably got that wrong), I just remember sitting there wide-eyed, saying, what the heck did he just say? What is happening? This is all gibberish.

I don’t know what this means.

But now, getting more familiar with it,

I understand what you’re talking about.

But I remember very distinctly exactly where I was, what conference room I was sitting in, what chair I was sitting in, because I was that little emoji of my brain going, what just happened? I do think, you know, if anyone has questions about that two by two matrix, feel free to stick them in the comment section.

But I feel like in a future episode, we should probably cover that two by two matrix, because until you’re really into it and can really break it down,

that’s a tough concept.

That’s a tough concept.

John Wall 11:42

Is that the Rumsfeld thing? Was he the first guy to throw that out?

Christopher Penn 11:46

Yeah, I don’t know that he’s the first guy.

But he was the one who was most prominent for, for the interesting explanation of it.

John Wall 11:54

So So net lifetime,

Katie Robbert 11:56

Before we move on, and this is something that we had talked about at the end of the show last week, we do have a comment.

It says, you know, Chris, put this stuff in Excel, more people understand what all of that is.

And so I do think that, while this seems a little tongue in cheek, it’s probably worth mentioning that, Chris, you can definitely put some of this in Excel and do it with more advanced Excel commands.

However, it will take you longer and the room for error is higher.

And so a lot of what you’re doing is automating those processes.

But, you know, there is a bit of this that you can do in Excel; you don’t have to do this in R.

Christopher Penn 12:37


And in Excel, if you’re going to do that, you would probably end up having to write some custom functions yourself, which you can do. Excel has a really good built-in programming language.

So as long as you’re comfortable coding inside of Excel’s environment, you will be able to do a fair bit of this. You will reach a point fairly soon where you can’t, where you do need some kind of statistical language.

And that would be now.

Katie Robbert 13:04

And that is now. Exactly.

Christopher Penn 13:08

The next check we want to do anytime we’re dealing with numeric data is look for things, especially when we know we’re trying to find an answer.

In this case, we’re trying to find what gets me traffic from Twitter, right? What drives Twitter clicks is the question that we’ve been working on the last two weeks.

There’s this concept called linear combinations, or linear dependencies, where a piece of data is embedded within another piece of data.

So in this case, engagements, for example, combines retweets and likes and comments, and so on and so forth.

And so if you’re building any kind of model where you want to figure out what is driving this outcome, you do not want linear combinations, because it will totally hose your model.

And so in this case, we ran a quick check on it, and we see that there are none that the software is able to detect.
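A check for linear combinations can be sketched with plain Gaussian elimination, as below. This is a Python illustration under stated assumptions: the transcript does not name the routine used on the show, the `find_linear_combos` function is hypothetical, and the toy columns are built so that `engagements` is exactly the sum of the other three (unlike the real data, where no combinations were detected).

```python
from fractions import Fraction


def find_linear_combos(columns):
    """Return names of columns that are exact linear combinations
    of the columns that come before them.

    Each column is reduced against the independent columns seen so
    far; if nothing is left over, it adds no new information.
    Fractions keep the arithmetic exact.
    """
    basis = []       # (pivot_index, reduced_vector) pairs
    dependent = []
    for name, values in columns.items():
        vec = [Fraction(v) for v in values]
        for pivot, bvec in basis:
            if vec[pivot] != 0:
                factor = vec[pivot] / bvec[pivot]
                vec = [a - factor * b for a, b in zip(vec, bvec)]
        pivot = next((i for i, v in enumerate(vec) if v != 0), None)
        if pivot is None:
            dependent.append(name)
        else:
            basis.append((pivot, vec))
    return dependent


# Toy data: engagements = retweets + likes + replies on every row.
columns = {
    "retweets": [1, 0, 2, 3],
    "likes": [4, 1, 0, 2],
    "replies": [0, 2, 1, 1],
    "engagements": [5, 3, 3, 6],
}

print(find_linear_combos(columns))
```

A flagged column would normally be dropped before modeling, since it only repeats information already carried by the others.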

And then the last checkpoint is to ask ourselves, are there any too-high correlates: things with a correlation, a Spearman correlation, above 0.95, which means the two values march in lockstep. They’re essentially the same thing.

You see this a lot with Facebook data in particular. Facebook will give you things like post views, total views, organic views, and stuff.

And those are like 98, 99% correlation, because they’re essentially all the same thing.

There are variations on, you know, some Facebook accounts, but for the most part, they’re pretty much the same thing, like organic post reach and post reach.

Unless you’re spending a crap ton of money on Facebook, your numbers are probably going to be fairly close.

If you’re doing only organic social media on Facebook, those numbers will be identical.

So we want to eliminate those correlates.

Because again, that’s one of the things that will screw up your model.
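The correlate check can be sketched as a pairwise Spearman comparison with a 0.95 cutoff. The Python below is a hand-rolled illustration (rank the values, then take the Pearson correlation of the ranks); the function names and the Facebook-style numbers are made up, and constant columns are assumed to have been removed already, since they would cause a division by zero.

```python
from itertools import combinations
from math import sqrt


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def high_correlates(columns, threshold=0.95):
    """Pairs of columns whose absolute Spearman correlation exceeds the cutoff."""
    return [(a, b) for a, b in combinations(columns, 2)
            if abs(spearman(columns[a], columns[b])) > threshold]


# Invented numbers: the two reach columns rise and fall together.
columns = {
    "post_reach": [100, 250, 180, 400, 320, 90],
    "organic_reach": [98, 240, 181, 390, 300, 95],
    "comments": [5, 1, 9, 2, 7, 3],
}

print(high_correlates(columns))
```

For each flagged pair, one of the two columns would typically be dropped; which one to keep is a judgment call that depends on the question being asked.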

Katie Robbert 14:57

So Chris, now that we’re getting to the point where you’re going to put this into some sort of machine learning software: if I recall correctly, a little while back, IBM had some sort of software where you could put in your social media data, and it would, in broad strokes, predict what would lead to engagement, but you had to know what you were doing.

You had to know, you know, what was contained within your data, you couldn’t just throw raw Twitter data in there and say, what leads to engagement?

Christopher Penn 15:33

That’s correct.

That’s IBM AutoAI.

And everything that we’ve done up to this point creates a clean data set for prediction.

So you do have the option at this point to say, do I want to do it in a coding environment? If so, continue on. Do I want to change languages? Once you’ve made the data set, at this point you can say, okay, I’m ready to take this data set and do something else with it.

That includes tossing it into IBM Watson, the AutoAI software. Because the AutoAI software, and this is true of all the software on the market, not just IBM’s, is still very naive, in the sense that it can do some basic feature engineering, try and figure out what’s going on, and work towards your objective.

But it lacks that human touch. It lacks knowing, for example, which correlates to get rid of. It may highlight, hey, you’ve got some tightly strung correlates, or you’ve got near-zero variables here, but it won’t delete them for you.

Because the software’s designers said, yeah, we don’t want to go messing with your data without your permission.

So it can highlight problems, but it can’t fix them.

And so you still need to go back and do all this tuning, no matter which platform you use. If you just try and throw your raw Twitter data in, like you said, it’s not gonna get anywhere. And particularly the feature engineering, like when we’re trying to find topics, what am I tweeting about? It can’t do that at all.

So that’s something that still has to be done.

So there’s still a long way to go for automated AI.

Katie Robbert 17:05

Got it? Now.

So you’re using social media data.

And I feel like social media data is that currency that a lot of people understand because it has those general terms of likes, engagements, and reshares, and tweets and all those things.

You know, I’m thinking about other, you know, datasets that you could potentially use to do this kind of prediction.

So, you know, for example, you and John have the Marketing Over Coffee podcast. Could you do something similar to find out what is likely to cause someone to download an episode, or subscribe to the newsletter? Could you do something along the lines of what you’re describing with this Twitter data?

Christopher Penn 17:49

Absolutely, you could.

So if you were to go over to marketingovercoffee.com

and look at just things like the show titles, you could easily detect, you know, there typically is a topic, a “with,” and then a proper noun.

And those are the shows with guests, so you could easily differentiate a guest show versus a non-guest show. It would be interesting to do that classification.

And to see, does having a guest predict more traffic or less traffic on Marketing Over Coffee? That would be a good strategic question to ask. You have the show descriptions, right?

So you can obviously do the same type of topic tagging

to see, you know, is this a show about sales, is this a show about brand, things like that. You have all the things you’d expect, like date, the amount of traffic that the individual post gets, and the number of downloads each episode gets.

And so there’s a lot of different ways to do that kind of extraction. You have to figure out, again, going back to our exploratory data analysis process, what is the thing that we’re trying to find? And what are the requirements we need to define it?
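The guest versus non-guest classification described above could be sketched with a simple pattern match on the title: the word “with” followed by a capitalized word. This Python snippet is only an illustration; the `has_guest` function and the episode titles are invented, not real Marketing Over Coffee episodes.

```python
import re

# "with" as a whole word, followed by a capitalized word (a likely name).
GUEST_PATTERN = re.compile(r"\bwith\s+[A-Z]")


def has_guest(title):
    """Heuristic flag: does the title look like 'Topic with Proper Noun'?"""
    return GUEST_PATTERN.search(title) is not None


# Invented titles for illustration only.
titles = [
    "Email Deliverability with Jane Doe",
    "Year in Review",
    "Dealing with change in marketing",
]

flags = {title: has_guest(title) for title in titles}
print(flags)
```

This only works if titles are written consistently, and it would misfire on titles where “with” happens to precede a capitalized non-name, so spot-checking the flagged episodes is still worthwhile.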

Katie Robbert 19:03

Makes sense.

So kudos, John, for being consistent in your naming conventions.

That I mean, but in all seriousness, that probably helps this type of analysis go a long way, if you’re being consistent.

And so if you’re thinking about using this on your own content, or you know, something similar, you know, set yourself up for success when you’re developing these things and try to stay consistent with your naming conventions, because when you try to do this kind of analysis, it’s going to be harder to get to the root of the answer, because you have to clean out all the junk and normalize everything.

Christopher Penn 19:37


And again, even with Marketing Over Coffee, with podcast data, there’s going to be external data you may want to bring in, you may want to bring in Neil, for example.

With a lot of podcasts, particularly longer running podcasts like Marketing Over Coffee, people aren’t going to go to the website. They’re going to go to Apple, they’re going to go to Google Podcasts, stuff like that.

They’re going to search for the show’s name in search results, and they’re not going to click on the website. They’re going to go to the recommended platform that Google will highlight.

In fact, let’s try that, because I’m curious. Let’s do “marketing over coffee podcast.”

And so you can even see, like this, you can play it right from the search, right? So you don’t have to go anywhere.

Google has done that.

So in a case like this, you’d want to augment with Google Search Console data, just to see how many people are searching for the brand name, and then how often the site comes up.

But you also would want to use an SEO tool of some kind, like Ahrefs, just to get the number of searches for the brand where your site doesn’t come up.

If your podcast is named something pretty obvious, you know, the marketing podcast, whatever, you’re going to want to track that branded search, because your site may not come up at all. If it doesn’t come up, it won’t show up in Search Console.

So you want that search volume from a more broad looking SEO tool.

And so you again, have to augment some of that data.

Katie Robbert 21:09

Which makes sense.

So John, I hope you were taking copious notes. I expect this report on my desk tomorrow.

But Chris, to your point, to go back to the Twitter data that you were starting with, there might come a point where the tweet itself isn’t enough information.

So you then have to augment it with the actual text of the content that you’re linking to as well, provided that’s what you’re doing.

If you are only sending tweets with no links, and you’re not including those keywords in your tweets, well, that’s a whole different issue.

Christopher Penn 21:42


Likewise, if you have enough of the data, you could be bringing in things like opens and clicks from your newsletters, right? Because you have a newsletter promoting the show.

I know John has been doing a lot of promotion of the show through text messaging, right?

So bring in some of that data. Whatever data you can get your hands on, you’ll want to bring in, and then run these initial analyses, the near-zero variables and stuff, and get rid of the things that obviously don’t matter.

That way you have a clean data set of all the things that could predict your final outcome.

Katie Robbert 22:18

So where do we go from here?

Christopher Penn 22:20

We go straight into running into machine learning.

Actually, that’s not true.

Where we go next is a really important part and is the analytic approach.


And this is where subject matter expertise and domain expertise within data science really matters, because at this point, we’ve done pretty much all this stuff, right? We have to decide what kind of model we want to build. We know this is numeric data.

So there are, you know, four fundamental types of models you can build, right? There’s regression for numeric, and there’s classification

for non-numeric. There is dimension reduction for non-numeric unsupervised, and then there’s just a hot mess of other, smaller models.

But fundamentally, we know we have a supervised learning model and it’s numeric, so we’re going to run some kind of regression model.

Inside that landscape, there are probably a dozen different ways to do regression modeling, right? There’s ridge and lasso and straight linear, logistic, there’s gradient descent, gradient boosting. Which one do you choose? Which one is right for the data? The answer to that question is where a tool like IBM Watson AutoAI will really come in handy, because you can, and people do, test this out manually.

You build a model against, you know, 15 or 16 different versions.

And you say, okay, I’m going to try every single model, and see which one, at least from just the first pass, is able to give me something that vaguely resembles an answer.

That’s pretty inefficient.

All things considered, it takes a long time to do that.

What you want to do is have machines do at least the first part of that research: to say, okay, machine, you go through and at least tell me which of these 50 models is the best general fit for my data.
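The idea of letting a machine run the first pass over candidate models can be sketched as a loop: fit each candidate on training data, score it on held-out data, and keep the best. The Python below is a toy version with only two hand-written candidates and invented numbers; it does not reproduce what Watson AutoAI does internally, and all names in it are hypothetical.

```python
from math import sqrt


def fit_mean(xs, ys):
    """Baseline: always predict the average outcome."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_linear(xs, ys):
    """Ordinary least squares with a single predictor."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def rmse(model, xs, ys):
    """Root mean squared error of a fitted model on a data set."""
    return sqrt(sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys))


# Candidate algorithms to compare (a real search would have many more).
candidates = {"mean_baseline": fit_mean, "linear": fit_linear}

# Invented data: impressions (x) versus URL clicks (y).
train_x, train_y = [100, 200, 300, 400, 500, 600], [2, 4, 5, 9, 10, 13]
test_x, test_y = [150, 350, 550], [3, 7, 11]

scores = {name: rmse(fit(train_x, train_y), test_x, test_y)
          for name, fit in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores)
```

Scoring on data the models did not see during fitting is the design choice that matters here; ranking candidates by their fit to the training data alone would favor whichever model memorizes best.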

So if we were to look in IBM Watson AutoAI, let’s go ahead and skip the tour.

I know how to use it.

Katie Robbert 24:38

If you don’t know how to use it, then taking the tour is probably a good idea.

John Wall 24:42

It is seven.

There’s some new features here, I’m sure.

Katie Robbert 24:48

But I think, Chris, what you’re continuing to reiterate is, you know, people want to jump ahead to the quote unquote fun stuff, which is running the machine learning model, but there’s so much prep ahead of that. That’s really what we’re trying to help the audience understand: yeah, you can jump right to the machine learning model, but you’re going to get, you know, garbage in, garbage out.

If you take the time upfront, you can continue to run these models over and over again, way more efficiently and more cost effectively.

Christopher Penn 25:17


So in this case, all that prep that we did creates an output file from our environment. We can then put that cleaned up, ready to go file in any system, right? Python.

If you want to try and do some of these regression analyses in Excel, you could do that.

Or continue with it in R. In this case, I’ve stuck it in IBM’s Watson AutoAI system.

And it asks, okay, what do I want to predict? Well, we remember from a while back, we said we want to predict URL clicks.

And so it gives you the option if you want to do additional tuning here.

Or you could say, okay, go, and you go and get a sandwich.

And later, it will come back and say, okay, here’s all the things I’ve tested, here is every possible variation. I’m gonna try and do some numeric feature optimization, I’m going to try and do hyperparameter optimization, choose a dozen different algorithms, and it will eventually spit out something.

I do this with a data set just to see which algorithms it thinks, in general, are the best fit.

And then there’s some additional tuning that I want to do that I know the platform isn’t designed to do right now.

One of the things you can do with Watson AutoAI is you can then say, okay, send this back to a Jupyter Notebook where you can tune the code yourself.

Unfortunately, it does it in Python, and my Python skills are terrible.

So I’ll use this as a diagnostic tool and then go back into R, which is the language I choose to work in, and take its recommendations and use them in R.

Katie Robbert 26:47

So I’m gonna stop for a second, because you just threw out a lot of terms, and I just want to make sure that everyone is on the same page.

So what is a Jupyter notebook?

Christopher Penn 27:03

A Jupyter Notebook is an interactive environment where you can write and run Python code right within a page.

Katie Robbert 27:09

And Python is a programming language similar to R; it’s just a matter of preference. They both roughly do the same thing.

Christopher Penn 27:17

No, Python is a much bigger language.

Python is a general purpose programming language.

So you can like build video games, and stuff like that.

Whereas R is a statistical computing language. It is optimized and tuned for stats and machine learning only. You cannot build a video game in R, right? You can absolutely do that in Python.

Python is a much more flexible language.

It is easier for beginners to pick up.

The challenge I have with it is I don’t like its syntax.

I am old and I grew up in C and C++, and Python syntax does not resemble that in any way, shape, or form.

Katie Robbert 27:50

That was going to be my next question: what is it most similar to in terms of programming? So C and C++. If you’re familiar with programming languages, then that reference will make sense to you.

It is definitely not worth us digging into the details of what all of those things mean.

But I just wanted to make sure that there was a couple of terms that were explained.

That might be newer, but you’re seeing on the screen, Chris, that there’s definitely some terms that you’ve already talked about, such as feature engineering, was just on the screen relationship map.

That to me, if I had no idea what that meant, I would say Oh, that might mean the relationship between variables, and which variables are likely to influence one another, for example, I might be wrong.

But I think that those are the kinds of things that, if you go through the tour of AutoAI, will be explained.

And so definitely don’t just jump into using Watson or any other AI machine learning software, without really knowing what it is you’re looking at.

Christopher Penn 28:56


And that really is the critical point: it’s going to do a lot of stuff here for you.

But you still have to have some general idea of what it’s doing.

So right now, it’s picking algorithms, right? So here’s one: there’s an extra trees regressor, which is a decision tree method.

And here, it’s got another one, random forest. Those are the top two algorithms that it’s going to use to try and classify this data.

It’s going to create pipelines, which are just ways of saying, okay, I’m going to try making different sets of the data and running tests on those sets and see what they come up with.

And then from there, it tries all these different techniques to do the feature engineering, so sigmoids, transforms, cubes and cube roots and stuff, to try and figure out whether there is some mathematical combination of mixing and matching the data together that gets to the best possible outcome, based on the objective we’ve given it and the measurement of accuracy that we care about.

And again, this is why a tool like this is so valuable, because this would literally take you 10 hours to do by hand, right?

And here we are only four minutes in, and it’s already on the eighth pipeline, and it’s doing hyperparameter optimization on half these pipelines so far.

Number six is the winner so far, so let’s take a look at number six and dig into it.

So this is a random forest.

We have all of our different measures.

And then we have our variables, right.

And so from this analysis, Watson is saying the things that predict URL clicks the best so far are impressions (how many people see my tweets), detail expands (how many people clicked on a piece of media in the tweet), and then user profile clicks (people who clicked on my profile).

Now, here’s the problem.

This is on a scale of zero to one, right? It’s almost like a percentage.

And, like most forms of regression, anything below 0.25 is kind of noise.

0.25 to 0.5 is moderate correlation.

And 0.5 and above is a strong signal.

All of these are way below 0.25.

So what this is telling us is that there isn’t anything in this data, even with all the feature engineering we’ve done so far, that’s a strong enough predictor that would say, yes, you can make a decision based on this data.
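The rule of thumb above (below 0.25 is noise, 0.25 to 0.5 is moderate, 0.5 and above is strong) can be written down as a small helper. This Python sketch uses invented importance scores; the transcript only says the real ones were all well below 0.25, and the `signal_strength` name is hypothetical.

```python
def signal_strength(score):
    """Bucket a 0-to-1 importance score using the cutoffs from the show."""
    if score < 0.25:
        return "noise"
    if score < 0.5:
        return "moderate"
    return "strong"


# Invented scores for illustration; the actual values are not in the transcript.
importances = {
    "impressions": 0.12,
    "detail_expands": 0.08,
    "user_profile_clicks": 0.05,
}

labels = {name: signal_strength(score) for name, score in importances.items()}
usable = [name for name, label in labels.items() if label != "noise"]
print(labels, usable)
```

An empty list of usable predictors is the "we struck out" outcome: nothing in the data set clears the bar for making a decision.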

So we’ve done a lot of work.

But the answer is not there.


And so as you do data science, this is one of those things you have to be aware of: sometimes you’re going to swing and miss, you’re going to strike out. In this case, with this data set as it stands, we have struck out. None of this is good enough to take to the bank.

Katie Robbert 31:54

Man, I mean, three weeks of this build up, and then this is what we get.

So the question is, Chris, and this probably happens more often than not, even if you don’t take it as far as bringing it into machine learning, the data that you have doesn’t answer the question.

So where do you go from here? What do you do now?

Christopher Penn 32:15

We start over.

So the key is this: the goal and the strategy are still good, right? We know we’re still trying to figure out what makes people click on tweets, right? So we have a clear goal, we have a clear strategy, we’ve collected a reasonable amount of data, we’ve done a classification and done our initial analysis.

And what we’re finding now is that we don’t have the data that we need, right? There’s nothing in this data set that fulfills our requirements.

So that means we have to go back to data collection and say, okay, what other data could we bring in? Or what other data could we engineer that would answer these questions? So earlier, we were talking about, what about the text of the articles themselves? Even if the tweet doesn’t have our keywords in it, maybe it’s something in the articles themselves. There are also things like the parts of speech that we have not brought into this at all.

So for example, how long is the tweet? How many words? Does it use adjectives or adverbs more than other parts of speech? These are things we could engineer into this dataset from the data that we have. For example, if you tend to use a lot of adverbs and adjectives, you might be using more descriptive language.

So does descriptive language matter?

We all know those wonderful clickbait headlines, right? You know, five things that predict this, or this one thing will drive him crazy, or whatever.

All language is decomposable.

So we could extract that information and build on it as well.
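Editor's note: as a rough sketch of engineering features from the tweet text itself, the snippet below computes length, word count, and two crude signals. It uses only the standard library, so the "-ly" count is a very rough stand-in for adverbs, not real part-of-speech tagging; a proper version would use an NLP library. The `tweet_features` function and its field names are hypothetical.

```python
def tweet_features(text: str) -> dict[str, int]:
    """Derive simple numeric features from the raw text of a tweet."""
    # Strip surrounding punctuation and lowercase each token.
    words = [w.strip(".,!?:;\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    # Crude adverb proxy: longer words ending in "ly" (an assumption,
    # it will miss many adverbs and catch words like "family").
    ly_words = sum(1 for w in words if len(w) > 3 and w.endswith("ly"))
    return {
        "char_count": len(text),
        "word_count": len(words),
        "ly_word_count": ly_words,
        # Listicle-style openers such as "5 things ..." start with a digit.
        "starts_with_digit": int(text[:1].isdigit()),
    }


features = tweet_features("5 things that really predict clicks")
```

Each of these columns could then be fed back into the same modeling process alongside the existing engagement metrics.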

But we also have to acknowledge the possibility that this may not be an answerable question, right? At least not with the tools and techniques we have right now.

It is possible that there may not be an answer that we can get to today.

And after a few more attempts at this, if we keep doing this, we may find, yeah, we can’t answer that question today.

Not without some data we don’t have, or more data we’re not willing to invest in.

You could, for example, spend a lot of money and run ongoing surveys of your Twitter audience and say, you know, what kinds of topics do you want me to tweet about, and stuff.

I’m not going to do that for this project, because it’s not worth it.

But if it was mission critical, we absolutely would do that.

So at the end of a process like this, where you completely and totally fail, you have to figure out what the next steps are.

There’s also a point, from a goal and strategy perspective, and this is where, Katie, someone like you as an executive would be involved and say, okay, guys, stop.

It’s a waste of resources.

At this point, you’re hitting diminishing returns, or getting no returns.

Katie Robbert 34:55

Yeah, you are now just wasting money.

Well, and I think, especially if you’re looking at social media data to try to figure out what’s working, the thing that you don’t have in your data set is how the algorithm of that particular platform works.

And so is it serving it up based on your previous number of engagements? Is it serving it up based on how many followers you have? Is it serving it up based on time of day? Is it randomly serving it up? Or, like we talked about on our podcast earlier in the week, is it excluding your information based on some random bias in the algorithm?

So that’s also an unknown piece of information.

So as you continue to bang your head against the wall trying to figure out social media in particular, keep in mind that you’re never going to have the whole story.

Christopher Penn 35:50


And this is true of marketing data in general. Until the day comes where we have chips embedded in people’s heads or functioning telepaths, we will not know why somebody made the decisions they made. We will not be reaching into their heads; they have to tell us, right?

So again, surveys, focus groups, and customer advisory boards are all super important.

They’re part of the journey to AI because you need them, you need that qualitative information to fill in the gaps.

You cannot get that from what we’ve been talking about here.

And we can see clearly in our experiment, there is no answer to why in any of this data. We can get to what, and we can get to strong associations, but there is no why.

So this sort of wraps up the technical portion of our process.

And so in our next steps, we have to figure out how do we move on from here?

Katie Robbert 36:39

We have to figure out the so what.

Christopher Penn 36:42

It is, because there is no so what.

Katie Robbert 36:45

Well, so John, what’s your takeaway after all of this?

John Wall 36:49

Yeah, it goes back to so many business problems that we have, in that the only way to find out what’s going on is to dig in and invest. You know, there’s kind of this cultural thing of, like, never fail with any project.

But the reality is, there’s no substitute for actually digging in and finding the data.

And there’s value in knowing that you never need to go down this path again.

You know, just saying, okay, we know we don’t want to go to Fiverr and buy 50 million views, because that’s not going to increase engagement. We know that with our best quality audience, it’s not making a difference.

So yeah, you don’t get the accolades, and there’s not going to be an award that you’re going to get from some random publishing company that nobody really cares about.

But you at least do know one less place to go hunting.

Katie Robbert 37:37


So before we wrap up, Chris, there is one more comment that I think is worth noting.

“Might be worth trying dummy variables to classify tweets.”

What do you think about this approach?

Christopher Penn 37:50

So this is part of where we were going with that idea about using basic parts of speech.

A dummy variable is when you take something that is a categorical or non-numeric variable and transform it into a numeric variable of some kind.

The normal technique in machine learning is called one hot encoding.
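Editor's note: one-hot encoding turns a single categorical column into one 0/1 column per category. The following is a minimal standard-library sketch; the topic labels and the `one_hot` helper are hypothetical, and in practice a library such as pandas or scikit-learn would typically handle this.

```python
def one_hot(values: list[str], prefix: str = "topic") -> list[dict[str, int]]:
    """Encode each categorical value as a row of 0/1 dummy variables."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{category}": int(value == category) for category in categories}
        for value in values
    ]


# Hypothetical topic label assigned to each of four tweets.
topics = ["analytics", "ai", "analytics", "marketing"]
encoded = one_hot(topics)
```

Every row has exactly one column set to 1, so a regression or classification algorithm can treat each topic as its own numeric variable.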

But in this case, you know, there are multiple ways to get at that.

So we tried the topic approach, and that really didn’t yield very much in the way of results. We actually could do things like even just number of nouns, number of verbs, things like that, and really blow out the sheer number of variables we could create from this and see what comes up, right? Just to look for ways to do that segmentation.

And if you find something, you could also then change analytical approaches, right? So what we’ve done is regression based, which is we’re looking at what has a correlation to URL clicks. We could instead take a classification approach, where we’d say, take the top 10% of tweets or whatever.

If you did a graph of the number of tweets and found the inflection point on the declining curve, you’d classify everything above that point as top-performing tweets, like maybe the top 5%, top 3%, whatever the number is.

And you’d classify everything else as not top performers.

And then you’d run a classification algorithm instead of a regression algorithm and say, okay, what do all of the top performers have in common that is not in common with the bottom performers? So that’s a good secondary approach that you could take to try and, again, classify what’s going on.
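Editor's note: the relabeling step for the classification approach might look like the sketch below. It assumes a fixed top fraction (10% by default) rather than the inflection-point method mentioned above, and the click counts are invented for illustration.

```python
import math


def label_top_performers(clicks: list[int], top_fraction: float = 0.10) -> list[int]:
    """Return 1 for tweets in the top fraction by URL clicks, else 0.

    Ties at the cutoff are all labeled 1, so slightly more than
    top_fraction of the tweets can end up in the top class.
    """
    if not clicks:
        return []
    n_top = max(1, math.ceil(len(clicks) * top_fraction))
    cutoff = sorted(clicks, reverse=True)[n_top - 1]
    return [int(c >= cutoff) for c in clicks]


# Hypothetical URL click counts for ten tweets.
url_clicks = [0, 1, 0, 2, 40, 3, 1, 0, 5, 2]
labels = label_top_performers(url_clicks)
```

The resulting 0/1 column becomes the target for a classifier, replacing the raw click count used as the regression target.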

But I would definitely go the dummy variable or the imputation route first.

Okay, what else? How much more can we squeeze out of the language that we’ve got going here, right? Because this is my Twitter account, I could even bring in the number of URL clicks on the individual URLs themselves, right, because we have our Trust Insights link shortener, and bring some of that data in too. So there is still more we could do to bring in and squeeze out more data.

But definitely, it’s a good approach and a great comment.

I think I know who that is.

But unfortunately it says LinkedIn user by

Katie Robbert 40:00

We can find out afterwards. We know all. And I think on that note,

John Wall 40:10

the Oracle says it’s over.

Christopher Penn 40:12

Alright, so next week we’re going to tackle the so what.

So stay tuned for that.

And then after that, we’ll probably move on to a different topic.

So thanks for popping in today, folks, and we will talk to you next time.

Thanks for watching today.

Be sure to subscribe to our show wherever you’re watching it.

For more resources and to learn more, check out the Trust Insights podcast at TrustInsights.ai/tipodcast and our weekly email newsletter at TrustInsights.ai/newsletter.

Got questions about what you saw in today’s episode?

Join our free Analytics for Marketers Slack group at TrustInsights.ai/analyticsformarketers.

See you next time.
