In-Ear Insights: Risks of Replicating Data Without Sources

In this week’s In-Ear Insights, Katie and Chris discuss the challenges of replicating data and what to do when asked to use data to make conclusions. They explain that the lack of information can be a hindrance to making accurate comparisons, but having domain expertise and knowledge of available data sources can help in creating proxies to make valid comparisons. They also emphasize the importance of documenting assumptions and limitations in the absence of complete data.


Machine-Generated Transcript

What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for listening to the episode.

Christopher Penn 0:00

In this week’s In-Ear Insights, let’s talk about missing data, and what to do when you’re asked to use data to help build conclusions, find insights, etc.

But you don’t have the data.

Either it’s not available to you, because maybe it’s a competitor that you’re working against, or it’s a potential merger and acquisition and you’re being asked to guess at something. We had that happen relatively recently.

In the hospitality field, we had one client saying, Hey, can you help us try to estimate whether this is a good purchase or not? Or you’re dealing with a situation where you might be adversarial.

So for example, in the whole Warrior Nun thing, people say, Well, how do we substitute for the data that Netflix has, because Netflix clearly isn’t just going to hand that over to anybody.

So Katie, no matter what the situation is, maybe it’s a dashboard that you’ve inherited where you don’t know the underlying data, what do you think about when you’re being asked, Hey, we need to use data to make decisions.

But we don’t have any data for you?

Katie Robbert 1:03

Well, you know, it’s, I think about a lot of things when that comes up.

But it’s really, you know, it’s interesting, I think that there’s this assumption, Chris, that you can magically reproduce any kind of data, because you’ve spent a lot of time researching and figuring out where to supplement data and what data sources are close enough, but also how to use machine learning to infer data.

Not every marketer has that skill set, or that understanding of what that looks like.

And so, you know, when we are handed a report with no methodology statement, not even timeframes, things that just say, last month, but we don’t know what month that was, you know, my first thought is, there is no possible way you can expect us to replicate this exactly as it is using a different data source.

There’s way too many variables.

And I think that’s the first thing I think about when I hear about these situations, or when they’re presented to us is, you don’t know how data works, do you.

And, you know, and I don’t mean that in a, you know, offensive sort of, like put down but like, you have to understand that there’s a lot of variables that go into data.

In order to make it a one to one, or close to a one to one, comparison.

It’s one of the reasons, Chris, why when we’re doing reporting, I get so finicky about the date ranges.

And you know, if we have to, you know, do it at like 12 o’clock in the afternoon versus, you know, 1pm, the day before, my internal like data integrity alarms start going off, it’s like that is not a one to one, because you have an additional 12 hours, that is not, you know, being accounted for.

And this goes back to my clinical trial, academic roots: it has to be exact, you know, a 12am cut off there, a 12am cut off here.

Otherwise, you can’t call it a one to one comparison.

And so that’s sort of the first piece that I think about: what are all the variables that go into the data analysis, not even the analysis piece, the methodologies, the techniques, but even just the data, to say, this is what we use, this is what this data represents.

It’s this date range, these cut offs, these metrics, you know, whatever they are. Those are the pieces that I 100% need to understand first.

And 10 times out of 10, that information is not noted anywhere.
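
The cut-off check Katie describes can be sketched in a few lines of Python. This is only an illustration, not the process used on the show: the function name `check_date_ranges` and the sample dates are assumptions made up for the example. It flags any boundary that is not a 12am cut-off and any difference in length between the two ranges.

```python
from datetime import datetime, time

def check_date_ranges(range_a, range_b):
    """Compare two (start, end) datetime pairs and list anything that breaks a one to one comparison."""
    problems = []
    for name, (start, end) in (("A", range_a), ("B", range_b)):
        for label, moment in (("start", start), ("end", end)):
            # A clean cut-off here means exactly midnight (12am).
            if moment.time() != time(0, 0):
                problems.append(f"range {name} {label} is not a 12am cut-off: {moment}")
    length_a = range_a[1] - range_a[0]
    length_b = range_b[1] - range_b[0]
    if length_a != length_b:
        problems.append(f"ranges differ in length by {abs(length_a - length_b)}")
    return (not problems, problems)

# Hypothetical example: the second range was pulled at noon, so it carries 12 extra hours.
this_period = (datetime(2023, 1, 1), datetime(2023, 1, 8))
last_period = (datetime(2023, 2, 1), datetime(2023, 2, 8, 12))
ok, problems = check_date_ranges(this_period, last_period)
```

If `ok` comes back false, the `problems` list is the kind of note that belongs in the methodology statement next to the numbers.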

Christopher Penn 3:47

Exactly.

I tend to think of this, like you’re handed a dish and you taste it.

And then someone says, Hey, can you recreate this? And you’re like, well, there’s no recipe. I mean, you know what it looks like because it’s in front of you, but there’s no recipe, there’s no instructions, you’re not sure what the ingredients are.

And I think that’s where a lot of experience comes into play.

You know, if you’re an amateur cook, who’s maybe only been cooking for a year or two, you may not have done, you know, herb tasting experiments to see what different herbs taste like in different settings.

You may not have tried, you know, bulgur wheat versus regular wheat versus you know, sprouted wheat.

So you won’t know those differences.

But the more experienced you are, the more experiences you have, you’re like, oh, that’s white truffle oil.

I know what that flavor is. Oh, that’s a combination of celery powder and onion to get that kind of ranch-dressing-like flavor.

And so the more experience you have, the more domain expertise you have, the easier it is for you to reverse engineer a dish that you’re tasting at a restaurant.

Perhaps you’re like, Okay, this is really expensive.

So I kind of want to know how it works.

And I think the same thing is true with data. When you’re handed a dashboard, for example, just by looking at it and looking at the names of the metrics, or maybe even the font that it’s in, you’re like, okay, that’s a Data Studio dashboard from Google.

And that’s the naming of that.

That’s almost certainly Universal Analytics versus GA4, because they report quite differently.

And the more experience you have, where you look at something and go, okay, that’s a Google Business listing, or that’s Search Console, that’s not Google Analytics data, the easier it is for you to reverse engineer it. But it requires that level of domain expertise to go,

Okay, I’m pretty sure I know what this is.

And even if I don’t know exactly what it is, you know, it’s like that chef. Even though the chef doesn’t know exactly if it’s white truffle oil or black truffle oil, they’re like, you know, I can make something that’s really close.

Katie Robbert 5:40

Well, and that, you know, it’s interesting, because we went through an exercise like that recently.

And I’m certainly not the more experienced analyst on the team.

But I was able to go through some materials that we were given and do just that. There were fields such as branded search and unbranded search, and I had enough expertise to know, okay, that’s likely Search Console data. Where we didn’t have enough information was the characteristics, I can’t think of the word, the requirements, the qualifiers, basically the way in which the data was being filtered.

To say, I want it just for this kind of product, or I want it just for this kind of timeframe.

Or this is what I consider branded and unbranded.

And I feel like without having those markers, that’s where we just start to guess at things. We make assumptions based on our expertise and our experience.

But unless someone hands us a set of requirements, we don’t actually know.
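
To make the filtering question concrete, here is a minimal sketch of one way a branded versus unbranded split might be defined. Everything specific in it is an assumption for illustration: the brand terms, the whole-word matching rule, and the sample queries are hypothetical, and the original report's filter was never documented.

```python
import re

# Hypothetical brand terms. The real definition of "branded" was not provided.
BRAND_TERMS = ["acme", "acme hotels"]

def classify_query(query, brand_terms=BRAND_TERMS):
    """Label a query 'branded' if it contains any brand term as a whole word or phrase."""
    text = query.lower()
    for term in brand_terms:
        if re.search(r"\b" + re.escape(term.lower()) + r"\b", text):
            return "branded"
    return "unbranded"

def split_clicks(rows, brand_terms=BRAND_TERMS):
    """Sum clicks per label for an iterable of (query, clicks) pairs."""
    totals = {"branded": 0, "unbranded": 0}
    for query, clicks in rows:
        totals[classify_query(query, brand_terms)] += clicks
    return totals

# Made-up rows in the shape of a search query export.
sample_rows = [
    ("acme hotels boston", 120),
    ("cheap hotels boston", 300),
    ("acmeology course", 5),  # contains "acme" only as part of a longer word
]
totals = split_clicks(sample_rows)
```

Changing the term list or the matching rule changes the totals, which is why the definition itself has to be written down alongside the report.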

Christopher Penn 7:00

Exactly. That’s exactly like, here’s the outcome, but we’re not telling you what has to be in the recipe. You’re like, Okay, I’m going to do my best guess.

The challenge you run into, obviously, is if somebody then says, Oh, we forgot to tell you it has to be gluten free. You’re like, well, you should have told me that before I used two pounds of wheat to make this thing.

And the same thing is true with data. In that particular instance, we ended up reengineering the Search Console data, you know, taking programming languages and recoding it, in order to make it look like and work like the original.

I doubt that’s how the original was made.

But that was the most efficient and effective way that we could do it.

And so that then comes back to, well, how does somebody who is not an expert at digging data out of an API, putting it into a SQL database, and then doing the engineering on it do that? How does a marketer who’s not me do that?

Katie Robbert 8:04

Well, they hire you.

They pick up the bat phone and say, Chris. This is a total tangent.

If you’re watching this podcast, I was watching a video yesterday.

And someone’s asking their kid, like, how do you answer the phone? And those of us who are older answer it like this. Those of us who are younger answer it like this, like they’re holding a cell phone.

So anyway, I was mimicking picking up the bat phone calling Chris.

And so if you are someone like me, who doesn’t have the skill set of a Chris Penn, then you really have to start to pick apart and just start asking a lot of questions.

Now, the source that we got this from, we knew, didn’t know the answers to the questions we would have in terms of how this was put together.

And so without having a software development skill set, without having the ability to code something, someone like me would just have to go through and note all of the constraints, all of the different places where I could not replicate something, because I did not have enough information.

And you know, that’s typical in terms of, you know, standard data analysis of just noting, I didn’t have enough information to draw a conclusion.

That’s a perfectly acceptable response, as long as you explain why that is.

So let’s say, Chris, you know, you’re handed a report, and you don’t have access to Search Console data, but they want you to list branded and unbranded search. That’s going to be a nearly impossible feat if all you have is just straight Google Analytics. There’s probably ways that you could, you know, use the query data, those kinds of things, but it’s not going to be an exact one to one. And so what you would need to do in that situation is start to document: here is what I assume this data is. I assume that it is Search Console data. I assume that there are filters to include the name brand, and to not include the name brand.

And that’s going to be the branded and unbranded search. I don’t have access to that data.

So here’s what I’ve put to stand in instead.

And so what’s often missing is just that simple documentation, so that people explain, here’s what I understand, here’s what I don’t have.

And here’s what I’m able to do.

Because we tend to think, and it’s either insecurity or overconfidence or both, that we can just replicate it and put it out there.

And hopefully, nobody notices.
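
The documentation habit described here (what I assume, what I don't have, what I used instead) can be captured as a small structured note. The sketch below is one possible shape, not a standard: the class name `MethodologyNote` and its fields are invented for this example.

```python
from dataclasses import dataclass, field

@dataclass
class MethodologyNote:
    """A plain record of assumptions and gaps for one replicated metric."""
    metric: str
    assumed_source: str
    assumptions: list = field(default_factory=list)
    missing: list = field(default_factory=list)
    stand_in: str = ""

    def render(self):
        """Return the note as plain text, one statement per line."""
        lines = [f"Metric: {self.metric}", f"Assumed source: {self.assumed_source}"]
        lines += [f"Assumption: {item}" for item in self.assumptions]
        lines += [f"Not available: {item}" for item in self.missing]
        if self.stand_in:
            lines.append(f"Stand-in used: {self.stand_in}")
        return "\n".join(lines)

# Example content paraphrased from the scenario in the conversation.
note = MethodologyNote(
    metric="Branded and unbranded search",
    assumed_source="Search Console",
    assumptions=["Filters include or exclude the brand name"],
    missing=["Access to the Search Console property"],
    stand_in="Query data available in Google Analytics",
)
report_text = note.render()
```

Attaching `report_text` to the deliverable makes the gaps visible instead of hoping nobody notices them.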

Christopher Penn 10:42

And you hit on a really important point there.

There’s the technical capabilities, right, can you replicate this report. But the other challenge you run into is, if you don’t have the data itself, that in itself presents a really interesting challenge.

And again, it requires that level of subject matter expertise within your industry, but also within the data ecosystem that exists online, to say, Okay, well, we don’t have that information, what can we get instead? So to your point about branded and unbranded search, if you don’t have that data from someone’s Search Console, where else could you get that information? You can get it from a couple of places. You can get, for example, from Google Trends, a relative measure of one brand search versus another. You can get it from SEO tools like Ahrefs, and SEMrush, and SpyFu, and stuff.

And then you have to engineer it together to say, Okay, here’s the reverse engineered version of this data set.

Based on these two sources, you can calibrate it.
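
One simple way to read "calibrate" here is to anchor a relative index to a single volume estimate you trust. The sketch below assumes exactly that and nothing more: the function name `calibrate`, the sample index values, and the anchor volume are all hypothetical, and no real Google Trends or SEO tool API is called. It also assumes the index scales linearly with volume, which is a rough approximation at best.

```python
def calibrate(relative_index, anchor_term, anchor_volume):
    """Scale a relative interest index to estimated volumes using one known anchor.

    relative_index: dict of term -> relative score (for example, on a 0 to 100 scale)
    anchor_term: the term whose volume estimate you have from another source
    anchor_volume: that estimate, in whatever unit the other source reports
    """
    anchor_index = relative_index[anchor_term]
    if anchor_index <= 0:
        raise ValueError("anchor term needs a positive index value")
    # Assumption: index values are proportional to volume.
    scale = anchor_volume / anchor_index
    return {term: round(value * scale) for term, value in relative_index.items()}

# Made-up numbers: relative scores for two brands, plus one volume estimate for brand a.
index = {"brand a": 80, "brand b": 20}
estimates = calibrate(index, anchor_term="brand a", anchor_volume=4000)
```

The output is a directional estimate, so it should travel with a note saying which source supplied the anchor and which supplied the relative scores.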

Another example, and this is one that we can share.

I was asked, as part of the whole Save Warrior Nun campaign, hey, how do we show interest or relevance, knowing that we can’t get Netflix to give us the watch data?

So where else could we go? And this is one of those things that I think is incumbent upon everyone, regardless of what you’re working on: to know what data sources are available in your industry that you could use as proxies.

So one of the ones that we use as a proxy is IMDb ratings. On the Internet Movie Database, users of that system can go in and leave ratings and reviews.

And the number of votes that a show gets tends to be proportional to the viewership, right? For example, in a recent look at the data, Stranger Things got like 10 times the number of votes of any other show.

Makes logical sense.

And, you know, Netflix pays a gazillion dollars to promote that show.

Of course it should have higher numbers of votes.

So in those cases, we can say, Okay, well, we’re gonna use this as a proxy for viewership because we don’t have the viewership data.

But it passes the sniff test in terms of investment, in terms of what we see online with social media data, with search data.

And the IMDb data triangulates pretty well, so we can say, Okay, we’re going to use this as a proxy measure to then be able to compare one show versus another.

And because it’s industry wide, you can now go outside of that one ecosystem and say, Okay, how does Stranger Things or Warrior Nun compare to The Mandalorian, which is on Disney Plus, a totally different network? And so by knowing the data sources within an industry, and what’s available online, you can start to build directionally correct models. Is it going to be exactly those numbers? No, you know, it won’t be exactly the proprietary numbers.

But is it good enough to make valid comparisons? I would argue, yes, it is.

You can look at the data and go, okay, Stranger Things has 10x the traffic of any other Netflix show.
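
The proxy comparison Chris describes comes down to expressing each show's vote count as a multiple of a baseline show's. The sketch below shows that arithmetic under stated assumptions: the function name `relative_interest` is invented, the show names and vote counts are placeholders rather than real IMDb figures, and the premise that votes are proportional to viewership is the speaker's working assumption, not something the code verifies.

```python
def relative_interest(votes, baseline):
    """Express each show's vote count as a multiple of the baseline show's votes."""
    base = votes[baseline]
    if base <= 0:
        raise ValueError("baseline show needs a positive vote count")
    return {show: round(count / base, 2) for show, count in votes.items()}

# Placeholder vote counts, not real data.
votes = {"Show A": 50000, "Show B": 5000, "Show C": 12500}
multiples = relative_interest(votes, baseline="Show B")
```

Because the result is a ratio rather than a viewership figure, it supports statements like "roughly ten times the interest" and nothing more precise.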
