So What? Marketing Analytics and Insights Live
airs every Thursday at 1 pm EST.
You can watch on YouTube Live. Be sure to subscribe and follow so you never miss an episode!
In this episode of So What? The Trust Insights weekly livestream, you’ll learn how to approach a data quality audit in an AI-ready world. Discover why relying on platforms like Google Analytics alone for your marketing data can lead to skewed results and how these seemingly minor inaccuracies can compound into major issues.
Watch the video here:
Can’t see anything? Watch it on YouTube here.
In this episode you’ll learn:
- The three levels of a data quality audit
- The 6C data quality audit framework
- Where AI will fail without good quality data
Transcript:
What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for listening to the episode.
Katie Robbert – 00:35
Happy Thursday. Welcome to “So What? The Marketing Analytics and Insights Live Show.” I am Katie, joined by Chris and John. Howdy, fellas. Well done. This week, we are talking about the AI-Ready Data Quality Audit. We’re going to talk about the three levels of a data quality audit. We’re going to talk about the 6C data quality audit framework, and we’re going to talk about what AI will fail without good quality data, which I believe should be where AI will fail. A human Katie wrote that, so here we are. AI did not write that. Human Katie did write that with all of her poor grammar. We’re talking about data quality, and we’ve been talking about this for a couple of weeks. Let me back up.
Katie Robbert – 01:21
We always talk about data quality, but we’ve really focused on it in the past couple of weeks, starting with the LinkedIn algorithm paper that we put out, which Chris compiled. Was it just a week ago? Maybe a little more than a week. That was to help understand what was going on with the LinkedIn algorithm. Unsurprisingly, it’s at their whim—a bunch of large language models that they can tune up, tune down. On Monday—not Monday, we recorded on Wednesday; all my days are just wrong—this week on our podcast, Chris and I talked about data quality to help train your social media algorithm. You can get that at TrustInsights.ai/tipodcast.
Katie Robbert – 02:16
Because we really want to help ourselves and our audience understand that data quality isn’t just, “Okay, I’ve exported something from Google Analytics. Is it missing anything? Yes. No. Okay, great. That’s my data quality.” Data quality extends far beyond just the output. It’s also the inputs. It’s what you’re putting in, but it’s not just into your web tracking system. It’s not just into your CRM. It’s into your social media. It’s into the content that you’re producing. It’s how you’re disseminating content. It’s how you’re making choices about where things go. So, Chris, where would you like to start this week?
Christopher Penn – 02:58
We probably should start with the whole point of a data quality audit. The whole point is to make sure that, like you said, what’s going in makes sense. Think of a data quality audit as the equivalent of picking up a product, taking off your glasses, and looking at the ingredients: buttermilk, natural butter flavor, maltodextrin, granulated garlic, granulated onion, sea salt, sugar, dried dill, and citric acid. By looking at the ingredients on a product, I know what’s in it. I know whether it’s safe to use. If I have an allergy to one of these things, I need to know that before I go putting this on my food, because if I have an allergy to a really long word and I consume it, it causes harm.
Christopher Penn – 03:48
You actually, Katie, talked about this recently with a conversation about data hygiene. Hygiene has three levels. There’s remediation, like bad things are happening; you need to get clean. There’s preventative, like brushing your teeth so that bad things don’t happen. And there’s optimization, doing things to advance your health. So that’s like basic hygiene. With data, you have exactly the same thing. You have bad data that’s screwing up your decision-making, so you have to fix that. That’s the baseline level. Then there’s preventative. What can you do to prevent bad data from getting into your systems? Then there’s optimization to say, “What new data or what additional data can we bring in to make our systems better?”
Christopher Penn – 04:39
If we think about that data hygiene conversation in the context of a data quality audit, the data quality audit really is the diagnostic that tells you what level of hygiene you have.
Katie Robbert – 04:51
We talked about this in the newsletter this week: even small things like skipping flossing. If we’re going into the hygiene analogy arena, let’s say I skip flossing two of the seven days a week. “No big deal, right? I’ve flossed the other five days.” That actually over time becomes compounded so that when I go to the dentist, they’re like, “Hey, you actually now have a cavity.” I’m like, “What are you talking about? I’ve been brushing every day. I flossed most days.” But you’re supposed to be flossing every single day. We’re trying to think about it in very simple terms.
Katie Robbert – 05:33
When we think of a data quality audit, we’re making assumptions that we have this big, wonky data set that has all these issues, and it’s not clean. It’s messed up, and it’s got strings where there should be numbers. It’s not even that deep; it could be something very simple. But it’s those small, tiny things that we think, “Oh, that’s not a big deal. I’ll deal with that later.” Or, “It’s just missing one day. That’s fine. It’s not going to matter over time.” Those small issues compound to be very large things, and then you’re making decisions based on what you think is really good data, but has actually become really poor quality data.
Katie Robbert – 06:16
I would say the data quality audit, or the purpose of doing a data quality audit, is to make sure that you’re making the right decisions. You may not see big issues, but you may have some things that, over time, if not addressed, it’s like an injury or anything else. It might be okay today, but three months down the line, six months down the line, some of us are going to PT 30 years down the line. It’s a big deal, and you have a lot of work to do.
Christopher Penn – 06:46
Yep. The three levels of a data quality audit. There’s baseline human, which is, “I’m going to go into Google Analytics and I’m going to look at my data. I’m going to run reports. I’m going to examine the data along the six dimensions of good quality data to figure out if my data is in good shape.” That is something that everybody can and should do. That’s level one. Level two is to use automations to process the data and do things like spot anomalies. This is what I would call classical machine learning, where you’re looking for, “Hey, what happened there?” We see this, for example, in our CRM data a lot.
Christopher Penn – 07:26
There are some forms that just attract garbage submissions, and you can always tell because it’s a string of numbers and then some random domain on the other end. “Okay, 280 of those in a row is pretty clearly not good data.” Then at the highest level is almost what you would call agentic data quality, where you have AI, classical and generative, operating on your data in real-time to alert you, like, “Hey, you’re getting more bad data than you thought today.” You already have experiences with some of these tools. For example, with Google Analytics, inside Google Analytics is the alerts facility. If you do not have an alert set up inside your Google Analytics that does something as simple as, “Hey, I noticed you stopped sending data.”
Christopher Penn – 08:22
Are you aware that your system is no longer sending data? That’s one of those things that you should know. Those alerts are part and parcel of the system. You can do a lot more than that now with generative AI, but those would be some of the very basic things that an agent would do to increase data quality. If you think about it, that maps to the levels of hygiene: first, “We’ve just got to get caught up—get the house in order.” Then, “Okay, let’s talk about preventing things, using automations to clean data on the way in.” And then ultimately the proactive level: let’s make everything better and know well in advance that something’s going wrong.
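Here’s a minimal sketch of the “level two” automation Chris describes—using simple rules to flag garbage form submissions before they pollute a CRM. The field names, patterns, and thresholds are hypothetical placeholders, not from any actual Trust Insights system.

```python
# Minimal sketch: flag suspicious form submissions before they reach the CRM.
# Field names, regex patterns, and thresholds are illustrative assumptions.
import re

def looks_like_garbage(submission: dict) -> bool:
    """Return True if a form fill matches common junk patterns."""
    name = submission.get("name", "")
    email = submission.get("email", "")

    # Names that are mostly digits are a red flag.
    if name and sum(ch.isdigit() for ch in name) > len(name) / 2:
        return True

    # A long run of digits followed by a throwaway-looking domain.
    if re.match(r"^\d{6,}@", email):
        return True

    return False

submissions = [
    {"name": "Katie Robbert", "email": "katie@example.com"},
    {"name": "84921047", "email": "84921047@randomdomain.test"},
]
flagged = [s for s in submissions if looks_like_garbage(s)]
print(f"{len(flagged)} of {len(submissions)} submissions flagged for review")
```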
Katie Robbert – 09:10
Before we get too deep into the AI-ification of data quality, John, as the person who I would argue owns our CRM, how much do you think about the data hygiene going in? You do a lot of the input into the system along with the automation that comes in. How much do you think about the data hygiene and the process? You can be honest, this is a safe space. I’m not going to hold it against you. During the course of the live stream, we’ll talk offline. I’m just kidding.
John Wall – 09:46
No, things have changed a lot.
Christopher Penn – 09:48
So much.
John Wall – 09:49
It used to be a critical thing. In other organizations, I’d have to do it monthly because there were all these dependencies that had to be perfect. All the addresses had to be right because when you dump them to mail a catalog or do some kind of direct mail, your data had to be clean. If you’re doing anything else, email campaigns, whatever, anything where you’re pulling the data out, it had to be in perfect alignment all the time. The flossing analogy is perfect: you don’t realize that just one out of seven, if you’re missing two days, that’s a huge percentage of the overall time.
John Wall – 10:29
These things compound. Once they start going off the rails, it gets worse and worse. But there’s so much going on now with AI. We’ve talked about this for our source codes: it used to be that you would want a pick list so the source codes stayed consistent, but now you can just grab raw text and have AI summarize it for you. That actually does a better job than the human pick lists and can give you relationships that you didn’t even realize existed and a whole bunch of other stuff. So, I’m not as worried as I was about data quality.
John Wall – 11:04
Things are getting better and easier, but I’m completely interested in digging in here and finding out what kind of stuff can do a better job of cleaning it up. Ultimately, setting up at the front door so that you can stop the problems before they even start—the preventative stuff—is the highest value. But, yeah, interested in digging into the tool set.
Christopher Penn – 11:27
Yep. The first place we should start is what data quality is, because you can’t do an audit if you don’t know what you’re auditing. This is the 6C data quality framework, which you can get on the Trust Insights website. Go to TrustInsights.ai and look under the “Insights” section. The six aspects of data quality are clean, complete, comprehensive, calculable, chosen, and credible. Each of these is a dimension of data quality. Clean data—you might think that’s the most important thing, although they’re actually all important. Clean data is data that is prepared well and free of errors. This includes things like formatting errors, incorrect calculations, malformed text in the wrong language, text in the wrong encoding. The data itself is just not clean. That’s number one. Number two is complete. Is your data complete?
Christopher Penn – 12:21
Complete data has no missing information in any field. If it’s a table, there are no cells within the table that are missing values—it shouldn’t look like Swiss cheese—and data isn’t missing at random. One thing that’s really tricky about “complete” is that it’s the hardest of the six Cs to deal with, because you may not know that something’s missing if it’s not there. For example, this lovely device, the iPhone, is one of the platforms that blocks a lot of trackers, like Google Analytics. iPhone users are underrepresented in your data. When you look at your Google Analytics data, it won’t show, “Oh, that was an iPhone.” I have no idea. It’s just not there.
Christopher Penn – 13:12
Because it’s not there and you don’t know it’s not there, you don’t know that your data is incomplete. That’s number two. Number three is comprehensive. Does the data answer the question being asked? Is it scoped correctly? For example, if I had food information of some kind, and my question is, “What is the calorie count on this food?” and there’s no calorie information—just a list of ingredients—it doesn’t answer the question. All the other data is there, the ingredients are there, but there’s no calorie information, so it is not comprehensive. It doesn’t answer the question being asked. Calculable is the most important for AI. Calculable means formatted in a structure that both machines and humans can use.
Christopher Penn – 13:58
This is especially important for generative AI because there are some formats generative AI reads well and some it reads really poorly. CSV files, TSV files—generative AI struggles with those. Markdown or JSON—generative AI can read that data much more easily. So a big part of “calculable” is the format of the data: if you’re going to be using AI on it, the data needs to be converted to an appropriate format first. The fifth is chosen: no irrelevant or confusing data. Data that’s chosen well is comprehensive enough to answer the question, but then doesn’t have distracting or irrelevant stuff. So if you’re saying, “I want to know the ROI of our LinkedIn paper,” and you’re handing me Twitter data, that might not be relevant.
Christopher Penn – 14:48
That might not be super helpful, or you’re handing me pay-per-click advertising data, and we didn’t run ads. One of the things that all of us, but marketers in particular, have a tendency to do when under stress or strain is to back the truck up, pour data everywhere, and say, “Here, I’m giving you everything.” No. The last is credible. It means the data was collected in a valid way: statistically representative samples, data that’s appropriately weighted (if you’re doing weighted averages and things), data that’s not biased. It also means—and this is a really hard one—data that is sourced from credible sources of good quality data.
Christopher Penn – 15:29
For example, say you were working with a data set, and the entity that produces that data set suddenly had somebody come in and say, “I want the data to say this.” That data is no longer credible. My wife used to work at a survey company way back in the 2000s. They’ve been out of business for 20 years now—you’ll find out why very shortly. People would call them up and say, “I need a survey that says this so that I can go and run ads saying, ‘Surveys show that nine out of ten people prefer the color blue on their toothbrushes.’” They would go out and conduct a survey to get that answer, which is the opposite of good survey practice. So credible is that. So that’s the 6C framework.
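To make the “calculable” point concrete, here’s a minimal sketch of converting a CSV export into markdown and JSON—formats that language models generally parse more reliably than raw CSV. The file name is a placeholder, and the markdown conversion assumes the tabulate package is installed alongside pandas.

```python
# Minimal sketch: reformat tabular data so a language model can work with it.
import pandas as pd

df = pd.read_csv("ga4_sessions_export.csv")  # placeholder file name

# Markdown table (requires the `tabulate` package) for pasting into a prompt.
markdown_table = df.to_markdown(index=False)

# Records-oriented JSON, one object per row, for structured prompts or tools.
json_records = df.to_json(orient="records", date_format="iso")

print(markdown_table[:500])
print(json_records[:500])
```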
Katie Robbert – 16:15
You’re talking about using AI, but it’s a good gut check just for humans as well. John, this is what you were talking about in terms of the data going in—the precursor to it going into the CRM. If you think about things like “chosen,” it immediately makes me think of all those dashboards that we’ve either created or seen where it’s just “everything,” and you don’t know what to pay attention to. But if I put everything on there, then maybe somewhere along the line, that’ll tell a story. Or “credible”—I think the survey example is a really good one. We’ve probably all run into that: “I need it to say this about our product.”
Katie Robbert – 17:04
You can’t go in telling me what the answer is going to be. That’s not how surveys work. A lot of this is just a really good, if nothing else, checklist for humans before you even get to the machines.
Christopher Penn – 17:20
Exactly. For the machines, the age-old rule that’s been true for, what, 60 years now is still true: garbage in, garbage out. If you give AI bad data, it’s not going to magically make it better. If anything, it’s going to make it worse, because that’s what AI does. AI takes the good and makes it better, takes the bad and makes it worse. That’s what’s going to happen. Any data you want to audit, you have to run through each of these six criteria. So let’s do a super simple example. Let’s go into our Google Analytics and ask ourselves a question. I’ll go to the “Explore” tab. Let’s start an exploration here. Let’s say I want to know about the LinkedIn paper.
Christopher Penn – 18:04
That seems like the kind of thing that I might want to know about. So I’m going to say, “Let’s do the LinkedIn landing page. The landing page contains ‘LinkedIn Algorithm Guide’.” All I really want to know is how is that doing? I get some nice charts here. The first step in the process is, “Is the data clean?” The answer here is yes. If I were to export this in a useful data format like CSV, the data is pretty clean. It has dates and numbers. Katie, you look flummoxed.
Katie Robbert – 18:46
I don’t know what this chart is telling me. To me, it’s not calculable or comprehensive. What is this? You have this column of totals and then you have a set. I don’t know what this is. To me, not good data quality. I don’t know what it is.
Christopher Penn – 19:08
If we look at the export, it involves Google’s very strange labeling: the date and the number of sessions. That’s all the data is. But I think you raise a very good point. For calculable, for humans, the Google interface is not so useful because you were clearly struggling to figure out what this thing even says.
Katie Robbert – 19:27
I would not have guessed—if you go back to the Google Analytics screen—I would not have guessed that was the date. It might be that, from my view, it’s a little bit small, and I can’t see the slashes. I don’t think those are even in there. But my first instinct was not, “Oh, those must be dates.”
Christopher Penn – 19:48
Because it says “totals.” It does. There’s the column header for “date” up here, and then “totals.” It refers to this row because that’s the way Google lays things out.
Katie Robbert – 19:59
Yeah, I would give that a D for calculable.
Christopher Penn – 20:06
Right, which again, is super important. Is the data clean? If we were to look at this data here, is this clean?
Katie Robbert – 20:19
How do you know what it is? What are you looking at again? It looks clean, to the naked eye—to the Katie eye. It’s rows and columns, and everything seems to be filled out. So now that I know it’s a date, those are dates. Now that I know those are sessions, those are sessions. It looks to be clean. I’m not seeing wingdings. I’m not seeing strings where there should be numbers. So, to the naked eye, it looks clean.
Christopher Penn – 20:51
It is not. Google adds five rows of garbage comments up top and then puts a total row in. If you wanted to use this and you just loaded it in as is, it wouldn’t work. It would immediately go off the rails. One of the things you have to do is delete the total row—which is annoying that you have to—and delete all their commentary to get rid of all that stuff, so that you finally have an actual rectangular table. So, automatically clean? No. Complete? Yeah, the data is there. Is it comprehensive? Does it answer the question being asked? The question we’re asking is, “What is the ROI of the LinkedIn paper?” ROI is “earned minus spent divided by spent.” How much did you earn from something minus what you spent on it?
Christopher Penn – 21:40
Divide by spent. This data tells us none of that. It’s just the number of sessions to that landing page. Google Analytics won’t have that data at all. Our CRM won’t even have that data because we don’t necessarily have a one-to-one of, “Hey, this person visited this page and eventually became a closed-won deal in nine months,” or whatever, because B2B takes a long time. We can’t even answer this question, which should immediately stop the data quality audit. You’re basically saying, “We can’t answer this question.”
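Here’s a minimal sketch of the cleanup step Chris describes—stripping Google Analytics’ comment rows and totals row out of a CSV export so that only a rectangular, date-level table remains. The file name, the "#" comment prefix, the “Totals” label, and the YYYYMMDD date format are assumptions to verify against your own export.

```python
# Minimal sketch: turn a GA export into a plain rectangular table.
import pandas as pd

raw_path = "ga4_landing_page_export.csv"  # placeholder file name

# GA exports typically prepend comment lines beginning with '#'.
df = pd.read_csv(raw_path, comment="#")

# Drop any summary/totals rows so only date-level rows remain.
first_col = df.columns[0]
df = df[~df[first_col].astype(str).str.contains("Total", case=False, na=False)]

# Parse dates (GA4 often exports them as YYYYMMDD strings).
df[first_col] = pd.to_datetime(df[first_col], format="%Y%m%d", errors="coerce")
df = df.dropna(subset=[first_col])

print(df.head())
```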
Katie Robbert – 22:18
And yet, it doesn’t. What ends up happening is a couple of things. One, if your question is, “What is the ROI of the LinkedIn paper?” and this data doesn’t answer it, people will fudge the numbers. People will go find other data sources, like, “Oh, well, let me go find revenue data and try to mush it together,” and it doesn’t have a correlation. Or they’ll be like, “Well, that wasn’t the question I was trying to answer. I just wanted to know about the awareness.” Does it answer that question? I don’t know. But they’re going to make up the numbers. They’re going to mash numbers together, and they’re going to change their question because they’re going to change their purpose to fit the data versus finding data to answer the question.
Christopher Penn – 23:07
Mm. The reality is this, and this is the ugly reality in marketing: you may not have the data. It just may not be there. We know iPhones block a lot of things from clickstreams. We know certain browsers just flat out block things. We know that about 40% of your data inside Google Analytics is inferred. So it’s literally Google guessing based on usage patterns at the user level. There are all these things that we know that could be wrong with this data, and that goes back to complete. We know the data is incomplete; we don’t know how incomplete it is. The only way to determine that is to look at companion data sets. So your next stop would be, “If I don’t have this information, where could I go get it? How?”
Christopher Penn – 23:58
What would be the way to look at whether that data even exists? The good and bad news is that there are other data sources available to you, but you have to go get them. Then you have to figure out, “Are they clean?” The answer to that may also be no. For example, at a very high level, we go to our Google Analytics, and Katie, you say, “Forget all this other crazy stuff. Just tell me how many people visited our website.” We can do that, clearly.
Katie Robbert – 24:38
You’re breaking my heart.
Christopher Penn – 24:44
The answer is not really. There’s a reason for it. There’s a good reason for that: we exclude so much of the data because it may not be there. So let’s take a look. Let’s go to our basic reports in GA4. Let’s look at the last 30 days. Make this as easy as possible. Traffic acquisition last 30 days. Let’s just look at the overview. Google says we got 27,000 active users on our website. That’s how many people we got. How does that sound to you?
Katie Robbert – 25:43
That sounds really high.
Christopher Penn – 25:46
What if I told you that in the last 30 days, according to our web hosting company that bills us on visits, we actually had 158,000? Huh.
Katie Robbert – 25:58
That also sounds really high.
Christopher Penn – 26:01
158,000 visits made it to our website. There’s an additional wrinkle if you want to get really crazy about it: what would it look like if you counted everything, including all the bots and scrapers and all that stuff? What would your traffic look like then?
John Wall – 26:24
One million visitors.
Katie Robbert – 26:27
But I don’t even know where this is going, because those numbers just don’t feel realistic based on what we’ve seen historically. We track the data on a monthly basis and have for years. When I look at traffic to the website getting into the five digits a month—which for some people is no big deal, but for us, that’s a pretty big deal. We have a very small website, and that’s not where people get information about us. They tend to get it from other places. So that, to me, feels off, and I don’t have anything to prove it other than my instinct and historical data.
Christopher Penn – 27:15
If we look at our CDN Cloudflare, the Trust Insights website, in the last 30 days has 147,000 unique visitors, 4.4 million requests to the website, of which it has served up a lot cached. Just on the unique visitors, it’s almost 5x what Google Analytics is seeing, which means one of two things. This is the heart of data quality audits right here. This is why they are so painful. This is why a lot of people choose to have someone do it for them. You have to figure out what the actual answer is. The way you do that is you export all of this data and say which one has the closest outcome to the thing that we actually care about. For B2B, that’s very difficult.
Christopher Penn – 28:10
If you’re B2C and you’re selling packs of chewing gum, this is super easy because you can say, “How many packs of gum did I sell on any given day?” Then I can do a straight-up regression analysis and say, “Which of all these numbers correlates to packs of gum?” We can’t do that because we don’t sell a data quality audit a day, or ten a day, or whatever. That’s just not the way our business works. What we’d want to do is find some slightly lower number somewhere in the funnel—contacts established in the CRM on a day-to-day basis, for example—because that number is actual contacts, and the CRM is at least something John can work with.
Christopher Penn – 28:55
Based on that, which of these three numbers—Google Analytics, WP Engine, Cloudflare—correlates most closely to that number? That’s how you would get at this very thorny issue, which again is the heart of data quality audits.
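Here’s a minimal sketch of that cross-check: correlate each traffic source’s daily numbers against a known-good, bottom-of-funnel series (daily CRM contacts) and see which source tracks it most closely. The file and column names are placeholders, and each source file is assumed to have a date column plus one metric column.

```python
# Minimal sketch: which traffic source best tracks a known-good CRM series?
import pandas as pd
from scipy.stats import spearmanr

contacts = pd.read_csv("crm_daily_contacts.csv", parse_dates=["date"])  # date, contacts
sources = {
    "google_analytics": pd.read_csv("ga4_daily_sessions.csv", parse_dates=["date"]),
    "wp_engine": pd.read_csv("wpengine_daily_visits.csv", parse_dates=["date"]),
    "cloudflare": pd.read_csv("cloudflare_daily_uniques.csv", parse_dates=["date"]),
}

results = {}
for name, df in sources.items():
    merged = contacts.merge(df, on="date", how="inner")
    rho, _ = spearmanr(merged["contacts"], merged.iloc[:, -1])  # last column = source metric
    results[name] = rho

# Highest Spearman rho = the source that moves most like real outcomes.
for name, rho in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: Spearman rho = {rho:.2f}")
```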
Katie Robbert – 29:13
For someone like me, who was trained more classically, you don’t guess at data. You don’t look for the closest directional data set. Data is, or isn’t, accurate, period. This is such a hard thing. Even this many years, over a decade, outside of the academic world, into this world, I still struggle with that. I’m still not totally okay with, “Alright, well, it’s the closest to right. We think.” That destroys me. I know I’m not alone in that, but that’s where we get caught up trying to make the data perfect. It’s just not a reality. The more data sets you have to work with, the more systems you have, the more tech in your stack, the more likely these scenarios are to happen.
Katie Robbert – 30:16
Even if you have a person 100% dedicated to setting these up, maintaining them, data in, data out, John, your sole focus is business development and sales. You can’t stay on top of the data that goes into the CRM 24/7 because a lot of it’s out of your control, despite how we set it up. That’s just the reality. I think that’s where people are struggling. “What do you mean I can’t do individual-level attribution analysis? What do you mean I can’t do attribution analysis at all? What do you mean a company as small as Trust Insights can’t tell who downloaded the LinkedIn paper and then converted into a customer?” It’s hard to wrap our brains around that. You just can’t do it.
Katie Robbert – 31:06
You can’t do that because the data just either doesn’t exist or it’s just poor data quality. It’s the second week in a row that I hate this episode.
Christopher Penn – 31:23
So how would we go about doing that? That’s probably the most important next question: how do we go about figuring out what corresponds to reality here? Is what’s in Google Analytics even real? How would you go about doing that?
Katie Robbert – 31:44
One of the steps that we skipped, quite honestly, is—let me find it—where’s the 5P framework?
Christopher Penn – 31:58
Mm.
Katie Robbert – 31:59
We didn’t purposely skip this step, but that’s where I would start. “Okay, let me just slow down for a second and say, ‘What is my purpose? What is the question I’m actually trying to answer?'” Is it, “I want to know the ROI of the LinkedIn paper?” Is it, “I want to know how much traffic was driven to our website by the LinkedIn paper?” Not just those—those are good questions—but what’s my why? Why do I care about those things? Because that’s going to help me with the rest of the P’s of, “Okay, who has that data?” or “Where in this channel?” We’re talking about people. Where are those people coming from? How did they get the paper in the first place?
Katie Robbert – 32:43
Who’s actually going to help us get the data out and do something? What is the process? How are we extracting the data? How are we thinking about the data? How are we answering those questions? What are the platforms? This, I think, is getting to the heart of what you’re asking, Chris, which is, “Okay, so we’re saying Google Analytics is going to tell us the answer, but that may not be true. We may be looking at other platforms in order to understand the question being asked. So Google Analytics might be the wrong platform.” But first, we have to be clear on the purpose, and then our performance is, “Did we answer the question in a satisfactory way?”
Katie Robbert – 33:21
Did I understand if the LinkedIn paper drove traffic to our website so that we can do more of it, do less of it, do more thought leadership, things like that? We have to have a reason for doing it. So I guess that’s a long-winded way of answering your question. That’s how I would start to think about, “If this isn’t telling me what I need to know, let me back up and check to make sure I’m even thinking about it the right way.”
Christopher Penn – 33:49
I think that’s the most sensible approach: to question the purpose itself. Now, in the interest of helping folks, how would they actually do even something as simple as that question? Let’s go through an example using Google’s Colab. If you’re unfamiliar, Colab is a data science tool. It is powered by Google’s Gemini, which is, of course, Google’s large language model. Colab started out as a pure coding environment, intended for nerds to be able to run Python and other code remotely, but nowadays it’s seen a revival as a generative AI coding environment. That’s super useful because we want to be able to use it and understand what’s going on, but we may not necessarily know the coding language. The good news is, now we don’t have to. So let’s do this.
Christopher Penn – 34:50
What I’ve done is taken two files from our website. One is from the form fills on the actual website. I’ve exported the data out of the exact form fills, and the second is from Google Analytics: the number of thank you page visits from the website. Those two numbers should be identical. The number of people who filled out the form should be identical to the number of people who got to the thank you page. The question we want to ask Google Colab is a very straightforward one: we want to do a correlation analysis. We want to figure out if these two data sets are well correlated. I’ve provided two data sets here.
Christopher Penn – 35:36
One is Google Analytics form completion data of the number of people per day that have filled out and completed a form on our website. I’ve also provided a date-time stamped log of every individual entry in the downloads CSV file of those same form fills. You’re going to need to aggregate the downloads CSV file to day-level counts first, then perform a Spearman correlation to identify how closely correlated these two data sets are, and then provide a week-level summary of how much drift or variance there is from one data set to the other. So we’ve given it this nice, very long prompt. We’re going to hit “go” on it, and it’s going to come up with a work plan. Then from that work plan, it’s going to start writing the actual Python code.
Christopher Penn – 36:34
The nice thing is, for you and me, and for anyone who is not a particularly good Python coder, it’s going to write the code. It’s going to test the code. It’s going to fix its own errors in the code and then ultimately come up with an answer to let us know, of these two data sets, how reliable Google Analytics is. That’s the question I really want to know. We have WP Engine. We have Cloudflare. We have GA4, and we’re like, “None of these numbers make sense.” We have a known good number, because you got to the form and you filled it out, and I have the form fill data. So now what’s going to happen?
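For readers who want to reproduce this outside Colab, here’s a minimal sketch of the same analysis: aggregate the per-download log to daily counts, join it to GA4’s per-day form completions, and compute a Spearman correlation plus an overall drift figure. The file and column names are placeholders for whatever your exports actually contain.

```python
# Minimal sketch: how well does GA4 form-completion data track a known-good download log?
import pandas as pd
from scipy.stats import spearmanr

ga = pd.read_csv("ga4_form_completions.csv", parse_dates=["date"])       # date, completions
downloads = pd.read_csv("downloads_log.csv", parse_dates=["timestamp"])  # one row per download

# Aggregate the raw download log to day-level counts.
daily_downloads = (
    downloads.assign(date=downloads["timestamp"].dt.normalize())
    .groupby("date").size().rename("downloads").reset_index()
)

# Join on date; days missing from either source count as zero.
merged = ga.merge(daily_downloads, on="date", how="outer").fillna(0).sort_values("date")

rho, p_value = spearmanr(merged["completions"], merged["downloads"])
drift = (merged["completions"].sum() - merged["downloads"].sum()) / merged["downloads"].sum()

print(f"Spearman rho: {rho:.2f} (p={p_value:.3f}); overall drift vs. downloads: {drift:+.1%}")
```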
Katie Robbert – 37:16
Right, John, I don’t know about you, but I would say 99% of the time I forget that Chris can do speech-to-text in these prompts. So he starts talking like, “Wait, is he talking to us, or is he talking to the machine?”
Christopher Penn – 37:33
Right.
John Wall – 37:33
Is it a driving kit to get him to get things done? As we’re riding here, it’s only about.
Katie Robbert – 37:39
Halfway through his statement, I’m like, “Oh, no, wait, he’s prompted. Okay, don’t interrupt.”
Christopher Penn – 37:45
Whoa, look at that.
Katie Robbert – 37:48
What?
Christopher Penn – 37:48
A 0.65 Spearman correlation. That is awful. Spearman correlations go from -1 to 1. 1 is a perfect positive correlation. 0 is no correlation at all. That should be a 1, not a 0.65.
Katie Robbert – 38:02
Where? Because we are seeing smaller screens. Where are you seeing that?
Christopher Penn – 38:08
Right here.
Katie Robbert – 38:09
Got it.
Christopher Penn – 38:11
That should not be 0.65. For folks who, like me, didn’t do well in stats back in university, there are three different forms of correlation. There is Pearson, which is for parametric normal distributions. There’s Spearman for non-parametric distributions, and there’s Kendall Tau. I can never remember what you’re even supposed to use Kendall Tau for.
Katie Robbert – 38:33
And causation. And causation.
Christopher Penn – 38:38
Spearman is generally the best for marketing data because marketing data, more often than not, is not a normal distribution. It’s not a bell curve. Marketing data, more often than not, is a power law curve—80/20, Pareto curves. Spearman can work with that; it typically works better than Pearson. So here we have sessions, downloads, and the drift. It looks like Google Analytics is really messed up in terms of our data. It’s a moderately strong positive correlation, but it should be a perfect correlation. Instead, there’s a negative drift. Google Analytics just isn’t getting the data. Now, we know that Google Analytics is blocked by a lot of trackers, by iOS and stuff. I didn’t know it was this bad.
Christopher Penn – 39:38
This is bad enough to make me go, “I don’t know that we should even trust Google Analytics,” because again, it should be a perfect, one-to-one match. You download, and Google says, “Hey, you downloaded.” Believe it or not, this is after installing server-side tracking. We installed Google Tag Manager’s server-side tracking two months ago, which I thought should have fixed this, and it clearly did not.
Katie Robbert – 40:09
This is where it starts to get tricky because someone will see this and then lose their mind trying to fix the system when the system is unfixable. I think that’s just a disclaimer of, “You really need to know the system enough to know what is in and out of your control.” Chris, you said we installed server-side tracking. It should have fixed it so we can at least go back and see if we set it up incorrectly—is it this, is it that, whatever. But at some point, the system is out of your control.
Christopher Penn – 40:47
The system is very clearly out of our control in general. Here’s the thing to keep in mind, at least for us in this example: the closer you get to the bottom of the funnel, the more control over the data you have. You can’t accidentally have form-fill data go missing if someone’s filled out the form. As long as your systems are working correctly and you haven’t blown up your website, that data is there. Someone who is in your CRM that you’re emailing back and forth—unless you’re on drugs, you’re not imagining those people, right? They are actually in your CRM, and you’re actually having conversations with them.
Christopher Penn – 41:29
The closer you are to the bottom of the funnel, the more solid it is. The higher up you go into things like website traffic or heaven forbid, social media traffic, the more you may as well just go by vibes at that point because the data is so incomplete. It is so not comprehensive. It is so not well chosen. It’s not credible that there’s almost no point in trying to audit that level of data quality. So that goes back to the 5P framework, which is “performance.” Is the data quality audit even worth doing if you know the higher up you go, the worse the audit is going to be because you just don’t have control of those systems, and the systems are broken?
Katie Robbert – 42:12
I would argue that it is worth doing, to at least see, so that you could have that data point to say, “This is why we should not be looking at this data.” One of the things that we are offering is the AI-ready data quality audit. If you want to learn more about it, you can go to TrustInsights.ai, “AI-Ready Data Quality.” What we do in there is we actually take a combination of the 5P framework. We try to outline first, “Why are you looking at this data? Who is it for? What?” and so on, so forth.
Katie Robbert – 42:46
We go through the 5P framework, and then we go through the six Cs to say, “Alright, here are all the questions we have about this data based on the 5P framework. Let’s just go ahead and check the quality of the data in general. Is it usable? Should we even be looking at it?” You’ll get a score out of 60. In the example that I often talk through in engagements, I use an SEO data set from something like Ahrefs. But I don’t state where it came from, so it doesn’t come up as credible, and I don’t give the KPIs that we’d be looking for—those aren’t in there. So it comes back with a score of 15 out of 60, which is a really low score.
Katie Robbert – 43:31
What you get back is—Chris, if you have Colab, if you can pull that up real quick—it’s not the exact data quality audit that Trust Insights would do for you, but it’s a good example. If you scroll to the bottom, you can see you have the data analysis and key findings, but also your insights and next steps. What you would get is basically your roadmap for, “Okay, if you really do need to use Google Analytics, here are some of the things you need to do to fix this data set in order to make it usable,” which really extends to the system as a whole. But at least it gives you some direction of where to start.
Katie Robbert – 44:12
Because each of the six Cs is scored on a scale of 1 to 10, you can see which ones are the most broken versus which ones are the lower priority. So if you’re interested in learning more about auditing your data quality, go to Trust Insights AI, AI-Ready Data Quality. You’ll get a hold of John, our chief statistician, and he’ll walk you through it. Basically, before you make big decisions with your data, before you put your data into something like a large language model, you probably want to do a gut check to say, “Is this even good data?” We went into this episode thinking our Google Analytics is in pretty good shape. We know how to set it up.
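Here’s a minimal sketch of the kind of 1-to-10 scoring per C that Katie describes, summed to a score out of 60 and sorted so the weakest dimensions surface first. The example ratings are invented purely for illustration and aren’t from any actual audit.

```python
# Minimal sketch: a 6C scorecard with each dimension rated 1-10, out of 60 total.
SIX_CS = ["clean", "complete", "comprehensive", "calculable", "chosen", "credible"]

def score_dataset(ratings: dict[str, int]) -> tuple[int, list[str]]:
    """Sum the 1-10 ratings and list the weakest dimensions first."""
    total = sum(ratings[c] for c in SIX_CS)
    weakest = sorted(SIX_CS, key=lambda c: ratings[c])
    return total, weakest

# Illustrative ratings only.
example = {"clean": 4, "complete": 2, "comprehensive": 3,
           "calculable": 2, "chosen": 3, "credible": 1}
total, weakest = score_dataset(example)
print(f"Score: {total}/60; fix first: {', '.join(weakest[:3])}")
```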
Katie Robbert – 44:55
It’s actually one of the foundations of how we started Trust Insights, which was helping people set up their Google Analytics. I would say, Chris, you are head and shoulders above a lot of people when it comes to understanding how Google Analytics, Google Tag Manager, the Google Ecosystem works. Yet we’re still seeing now that the data quality coming out of Google Analytics is pretty poor. That says there’s a lot in the system that we don’t have control over. But unless we had done that audit, we wouldn’t have known. We would have carried on saying, “Oh yeah, this is telling us. We feel confident in how it’s set up.” It’s not a knock against us, the humans; it’s stuff that’s out of our control.
Christopher Penn – 45:39
Yeah, it’s systems we don’t have control over. I would also suggest that for any given measure—especially something you get a bonus for—you should ideally have some alternate way of measuring it. So perhaps you have Google Analytics installed alongside Matomo, or server log analysis, or anything like that. If you are paid on leads generated, obviously your CRM is going to have some of that data—it should have that data. But is there an additional system that runs in parallel that you can use to cross-check and validate?
Christopher Penn – 46:16
The reality is that as these devices get smarter, and as AI gets more and more in the way of us being able to function as marketers or as business folks, we have to work on building trust with our audience so that they give us data voluntarily, and then rely on the things that we know we are given—form fills, people joining Slack communities or newsletter lists, and things like that. If you rely on data that is provided by a third-party system that you don’t own and control, you’re at the whim of that system. As many people have found out from the LinkedIn paper, they can change the engine under the hood at any time. You’re like, “Why does the car feel different now? Why am I not going the same direction?” Because it isn’t your car.
Katie Robbert – 47:12
So, John, you mentioned you were interested to see the tools and techniques. Anything, any revelations, anything you’re going to change, or just more questions?
John Wall – 47:22
Yeah, a whole lot more questions. It’s much more horrible than I had initially hoped. One question I did want to ask, though, is one thing that has been constant. When I’ve seen how marketing people run reports and answer questions, it’s like there’s always a black box in the system somewhere. The fallback is always relative reporting. Even for Google Analytics, even if you’re off by this huge margin, you could still look at the past three months and be like, “Well, we were four times what we were five weeks ago.” So that at least tells you something. It’s kind of the classic, “one-eyed king in the land of the blind.” Is that still relevant now, though?
John Wall – 48:03
Or are we coming to a point where the data is just so munged that even measurements like that don’t help?
Christopher Penn – 48:09
Yeah, that is what our friend Tom Webster calls “predictably wrong versus unpredictably wrong.” Systems that are predictably wrong—like a car that always steers to the right—you always have to tug a little the other way. Because it’s predictably wrong, you can compensate. You can adjust for it. When it’s unpredictably wrong, the car veers to the left one day and just doesn’t operate the way you think it should, period. That’s a lot more dangerous. The same thing is true of analytics now. With Google Analytics, it depends on what’s going on under the hood. It has gotten progressively worse over time as it uses more of its own AI to try to make up for missing data, and you don’t have control over how it does that imputation. The same goes for something like LinkedIn.
Christopher Penn – 48:57
When LinkedIn changes the second-pass ranker, pulls the whole thing out, and puts 360Brew in its place, all the data prior to that point is worthless because you’ve got a whole new engine that is completely different. So, to Katie’s point, you have to know what the system is doing to even know whether it’s predictably wrong or unpredictably wrong. The reality is, if it’s a system that’s not under your control, it’s probably unpredictably wrong.
Katie Robbert – 49:26
I think the other side of that, John, that I would add is that it’s all about expectation setting. If your expectation is that I’m going to get a crystal-clear, completely accurate number, then that’s what you have to chase. But if you’re like, “I’m going to use this data directionally” to your question of, “Are we at least moving in the right direction?” So we looked at our Google Analytics website visits, and it was 27,000 sessions. That’s way higher than it normally is for a variety of reasons. But directionally, it tells me it’s not zero. People are coming to the website. What for? We didn’t even get into that. But the website is doing something, so I can at least directionally know things are working. Or we’re getting spammed by bots every other day.
Katie Robbert – 50:22
It’s one of the two.
Christopher Penn – 50:26
Exactly. This concludes another week of disappointing Katie. Thanks for tuning in, and we will see you all on the next one. Thanks for watching today. Be sure to subscribe to our show wherever you’re watching it. For more resources and to learn more, check out the Trust Insights podcast at TrustInsights.ai/tipodcast and our weekly email newsletter at TrustInsights.ai/newsletter. Got questions about what you saw in today’s episode? Join our free Analytics for Marketers Slack group at TrustInsights.ai/analyticsformarketers. See you next time.
Get unique data, analysis, and perspectives on analytics, insights, machine learning, marketing, and AI in the weekly Trust Insights newsletter, INBOX INSIGHTS. Subscribe now for free; new issues every Wednesday! |
Want to learn more about data, analytics, and insights? Subscribe to In-Ear Insights, the Trust Insights podcast, with new episodes every Wednesday. |
Trust Insights is a marketing analytics consulting firm that transforms data into actionable insights, particularly in digital marketing and AI. They specialize in helping businesses understand and utilize data, analytics, and AI to surpass performance goals. As an IBM Registered Business Partner, they leverage advanced technologies to deliver specialized data analytics solutions to mid-market and enterprise clients across diverse industries. Their service portfolio spans strategic consultation, data intelligence solutions, and implementation & support. Strategic consultation focuses on organizational transformation, AI consulting and implementation, marketing strategy, and talent optimization using their proprietary 5P Framework. Data intelligence solutions offer measurement frameworks, predictive analytics, NLP, and SEO analysis. Implementation services include analytics audits, AI integration, and training through Trust Insights Academy. Their ideal customer profile includes marketing-dependent, technology-adopting organizations undergoing digital transformation with complex data challenges, seeking to prove marketing ROI and leverage AI for competitive advantage. Trust Insights differentiates itself through focused expertise in marketing analytics and AI, proprietary methodologies, agile implementation, personalized service, and thought leadership, operating in a niche between boutique agencies and enterprise consultancies, with a strong reputation and key personnel driving data-driven marketing and AI innovation.