In-Ear Insights: How Data Quality Impacts Social Media Algorithms

In this episode of In-Ear Insights, the Trust Insights podcast, Katie and Chris discuss how user data quality fundamentally controls social media algorithms, including the new LinkedIn LLM system.

You will learn how modern social algorithms use language models to predict your engagement, shifting the control back to you. You will discover why your likes, comments, and shares act as critical input data that constantly trains the underlying system. You will understand how to implement a personal data stewardship strategy to ensure you see the content you value most. You will stop feeding the negative engagement cycle by recognizing how algorithms treat all interactions equally. Watch the full episode now to take control of your social media feed!

Watch the video here:

Can’t see anything? Watch it on YouTube here.

Listen to the audio here:

Download the MP3 audio here.

Machine-Generated Transcript

What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for listening to the episode.

Christopher S. Penn – 00:00

In this week’s In Ear Insights, let’s talk about data quality. It is spooky season, which means that the ghosts, goblins, and trolls lurking in your data are more likely to trick you than treat you. Well, that was probably too much Halloween for the end of September. By the time this podcast airs, it will be October.

Data quality has been an issue for as long as there’s been data, which is millennia. As we use generative AI more and more, people forget that it’s built on a foundation of good data. If you give an AI model bad data, of course it’s going to come up with bad results. If your input data says, hey, it’s perfectly okay to put glue on your pizza, AI will parrot back that bad information. And we see this even in places like LinkedIn.

Christopher S. Penn – 00:57

LinkedIn just changed its algorithm dramatically. We have a whole new paper over on the Trust Insights website, the Unofficial LinkedIn Algorithm Guide, which you can go over to TrustInsights.ai and just see it in the Insights section there about what’s gone on under the hood. It’s a big change. But what powers that big change is the data you put into LinkedIn. So if you are putting in bad data, guess what you’re going to get out of the LinkedIn algorithm?

Katie Robbert – 01:25

Well, and so let’s talk about bad data into LinkedIn, because I can see where that could be confusing for people. When we talk about data quality, we often think about our tracking system—our Google Analytics or our other website tracking systems, our CRMs, marketing automation, etc. We think about garbage in, garbage out, or can I analyze the attribution or do I know how many people took an action based on this thing that’s tracked in those systems.

But when we think about something like LinkedIn, I don’t know that we’re thinking about good data versus bad data. There’s all kinds of old wives’ tales and myths and conspiracies about how to get seen and get traction on a social platform like LinkedIn. And I think what we’ve seen time and time again is it really comes down to what makes sense for you, the individual, because you have no control over the algorithm.

As we have seen and as we know, and actually as of this morning, my algorithm is different from what you published in the paper last week. The catalyst, just to back up a little bit for the paper, was that a lot of people were complaining that one, their stuff was not getting shown or not getting any engagement. And two, they weren’t seeing the stuff that they wanted to see anymore. They were just seeing a lot of AI generated content, a lot of ads.

Katie Robbert – 03:07

Nothing really of value.

Katie Robbert – 03:10

And so Chris, you took that information and distilled it down into the unofficial LinkedIn paper to say, here’s what’s happening. But for me, an individual, as of this morning, that’s all out the window. Something has changed for my algorithm. I have not changed anything.

Katie Robbert – 03:29

I didn’t change any settings.

Katie Robbert – 03:31

But all of a sudden, I’m seeing all the stuff that I missed for the past month only from the people that I want to see it from. So when we talk about good data, bad data into LinkedIn, what are we talking about for an individual? What is a way to put bad data into LinkedIn?

Christopher S. Penn – 03:48

So let me show you. And for the folks who are just listening to this, I’m going to share my screen, but you can see this on page 31 of the Unofficial LinkedIn Algorithm Guide. At the heart of the system, once you get past all the “is this obviously garbage” rankers, is this new system that LinkedIn came up with called 360Brew, which is a large language model. It is a custom-tuned large language model based on a model from—believe it or not—Mistral, their Mixtral 8×22 model. And it takes in LinkedIn data that you provide as the member and puts it into a prompt, and the model then tries to do a prediction. So this is example one, a prompt to predict a like on a marketing strategy post: “you are provided a member’s profile and a set of posts and their content,” and so on.

Christopher S. Penn – 04:37

“Focus on topics, industry, and the author’s seniority more than other criteria in your calculation, and apply a 30% weight to relevance and a 70% weight to the member’s historical activity. Here’s the member’s profile and past post interaction data. Member has commented on the following posts (and there are some examples there). Member has liked the following posts. Member has dismissed the following posts. Question: Will the member like, comment on, share, or dismiss the following post?” Then it has an example post there and says, “answer: the member will like.” What’s going on here?

Christopher S. Penn – 05:06

LinkedIn, once it comes down to a candidate set of maybe a couple of hundred possible posts to show you, puts it through a language model with your most recent interaction data. It says, “Try to predict whether or not the member is going to like this post.” From that, it decides, “Okay, Katie’s going to like this post from Sonny Hunt or Brooke Sellas, and Katie’s not going to like this post from Christopher Penn. Yeah, that guy’s a jerk. So let’s not show his posts to Katie.” Every time this prompt runs behind the scenes, it is coming up with these candidate predictions to say: this is likely what Katie is going to engage with, because our goal is to get Katie to like or comment on a post.

Christopher S. Penn – 05:52

The bad data that we’re talking about is what you’ve just liked, what you’ve just commented on, what you’ve just engaged with, what you’ve just dismissed, what you’ve just reported. Because we know the underlying model, we know it has a relatively small context window, a relatively small short-term memory. So stuff that’s more recent is going to have a lot more influence. Now, what we do know is that LinkedIn has obviously built on and fine-tuned this off-the-shelf model substantially. You can’t just download the stock version and say, “This is just what LinkedIn uses.” No, it’s not. And we don’t know the exact actual prompt. We can infer it based on the other prompts that LinkedIn has shared for this model, but we can’t do the newsfeed ones because they didn’t share those.

Christopher S. Penn – 06:37

But from this architecture, we can guess pretty well what is happening and why. Your feed is now going to change dramatically from day to day, perhaps even hour to hour. You noticed in that example prompt that it has some built-in weights—30% this, 70% that. It’s trivial for an engineer to go and say, “Okay, well, now make it 25/75, now make it 50/50,” and they can turn that dial very quickly without having to re-architect the whole system. That’s what’s so different about this, because they’ve gone to using a language model. The older system of much more complex regression-based analysis is now just a prompt that someone can tune immediately, and the whole system changes.
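As a rough illustration of the mechanism being described here, the following Python sketch shows how a weighted, prompt-based engagement predictor might assemble its input. The wording, field names, weights, and example data are assumptions for illustration only, not LinkedIn’s actual prompt or code:

```python
def build_ranking_prompt(member_profile, liked, commented, dismissed,
                         candidate_post, relevance_weight=0.30,
                         history_weight=0.70):
    """Assemble one engagement-prediction prompt for one candidate post.

    The weights are the tunable dials: an engineer can change 30/70 to
    25/75 or 50/50 without re-architecting anything else.
    """
    def bullets(items):
        # Render a list of posts as a simple bulleted block.
        return "\n".join(f"- {item}" for item in items)

    return (
        f"You are provided a member's profile and a candidate post.\n"
        f"Apply a {relevance_weight:.0%} weight to relevance and a "
        f"{history_weight:.0%} weight to the member's historical activity.\n\n"
        f"Member profile:\n{member_profile}\n\n"
        f"Member has liked the following posts:\n{bullets(liked)}\n\n"
        f"Member has commented on the following posts:\n{bullets(commented)}\n\n"
        f"Member has dismissed the following posts:\n{bullets(dismissed)}\n\n"
        f"Candidate post:\n{candidate_post}\n\n"
        f"Question: Will the member like, comment on, share, or dismiss "
        f"this post?"
    )

# Hypothetical member data, for illustration only.
prompt = build_ranking_prompt(
    member_profile="Marketing executive interested in B2B analytics",
    liked=["A post on data quality frameworks"],
    commented=["A post on marketing attribution"],
    dismissed=["An engagement-bait poll"],
    candidate_post="A post about a data quality framework",
)
```

In a system built this way, a string like this would be sent to the model once per candidate post, the predicted action would decide what reaches the feed, and changing the two weight parameters would instantly change every prediction.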

Katie Robbert – 07:24

So if I’m the user and today I’m seeing everything I want to see in my feed, I shouldn’t get too excited because that could change by this afternoon, depending on the whim of an engineer. I won’t say whim—let’s give them a little bit of credit and say maybe they’re doing some testing to see what works the best for the most people versus for an individual user. As of yesterday, as an end user, I wasn’t seeing any of this content. As of this morning, every time I refresh LinkedIn, I get a whole new batch of stuff that I missed over the past three weeks, which is great. But also now all the stuff that I missed—those poor people are getting stalker notifications from me, like, “Katie just liked 36 of your posts.”

Christopher S. Penn – 08:14

But is it doing the job that it’s intended to do? Are you engaging with that content? Are you updating that prompt with all these new engagements?

Katie Robbert – 08:25

And I—I’m assuming you’re asking because the answer is yes.

Christopher S. Penn – 08:29

The answer is yes. If you are engaging with this stuff, which then goes back into the historical record and the first pass rankers that say, “Hey, this is the stuff that Katie likes. Let’s continue to test and see what else it is that she likes until we’re dialed into what will get Katie to stick around the longest and see the most possible ads?”

Katie Robbert – 08:56

Gotcha.

Katie Robbert – 08:57

What’s fascinating to me is we often talk about one of the use cases that people are struggling with is personalization at scale. This—correct me if I’m wrong, Chris, because I know you will. Sorry. Well, actually, this is personalization at scale.

So while it’s a large language model and a prompt for the user base of LinkedIn as a whole, my individual experience on LinkedIn is now different from your individual experience on LinkedIn, but it’s powered by the actions that I, the user, am taking. I think this sort of goes into a larger conversation of personalization at scale, which we can certainly do on a future episode. But I think it’s worth noting that this is where marketers are getting that content personalization at scale wrong, because they’re not thinking about the give and take.

Katie Robbert – 10:08

They’re just, “How do I create a bunch of personalized content for my end users without considering what do my end users need to do in order to make that happen?”

Christopher S. Penn – 10:22

Exactly. And the thing about using an LLM at the heart of the system is that language models are inherently probabilistic, which means that if you give any language model the same prompt twice in a row, starting a new chat each time, you will get different answers. They will be thematically similar, they will ideally be factually similar, but they will never be the same. Which in turn means you can’t predict how this model is going to score a percentage recommendation for what you’re going to like and not like. And things that you engaged with most recently are going to have an outsized influence, because we know how prompts work in these systems, and we know how they work in the context window.
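The run-to-run variability described here can be mimicked with a toy simulation. Everything in this sketch is invented for illustration: a real ranker would call the model itself, and the Gaussian noise term below merely stands in for token-sampling randomness:

```python
import random

def sample_engagement_score(base_preference, temperature=0.7, seed=None):
    """Simulate one probabilistic 'will the member engage?' prediction.

    base_preference stands in for the model's underlying lean toward the
    post; the noise stands in for sampling randomness, scaled by
    temperature the way higher temperatures widen real LLM output variance.
    """
    rng = random.Random(seed)
    noise = rng.gauss(0, 0.05 * temperature)
    # Clamp to a valid probability-like score.
    return min(1.0, max(0.0, base_preference + noise))

# The "same prompt" run three separate times yields thematically similar
# but never identical scores.
scores = [sample_engagement_score(0.62, seed=run) for run in range(3)]
```

Each run lands near the underlying preference but never exactly on it, which is why the same feed, re-ranked an hour later, can order posts differently.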

Christopher S. Penn – 11:07

And so all this boils down to a data quality issue for us, the users. When we think about who do we engage with, what do we comment on, what kinds of engagement do we hand the system. A lot of people are talking rightfully about the fact that social media tends to polarize, tends to drag people to extreme points of view. It’s because that’s what gives engagement.

And these systems don’t have morals. They don’t have any kind of moral or ethical judgment. They can’t, because to your point, they’re universal. And so we have an obligation ourselves to curate our own data and to ensure our own data quality. Maybe you have some people that you always check on. Maybe you even have them bookmarked. I have—

Christopher S. Penn – 11:58

For example, I have your profile bookmarked so that every day actually at 8 AM, I have a calendar reminder to check your profile and like, or reshare or comment on your stuff to make sure that I use my network to amplify you. I have to do that as a manual process, as a data quality process, to make sure that you’re always being seen and that I’m conferring the benefit of my profile onto yours.

Katie Robbert – 12:22

Well, I appreciate that, but I think that’s a really good point. Over the past few weeks, I, like a lot of users, have been frustrated: why is my LinkedIn feed not helpful? So I had been seeking out the people that I know, looking for their content. I had to. I think I remember somewhere there was this conversation, maybe it was between you and someone else on LinkedIn, saying, “The best thing you can do right now is to actually look for someone’s profile, go find their content, and then engage with it. Don’t wait for the algorithm to feed it up to you.” But now, because some of us have done that, now the algorithm’s, “Oh, these are the things you want to see. Got it.”

Katie Robbert – 13:08

I find it really interesting when we talk about it in terms of data quality, because we don’t think about social media in terms of data quality. We think about the outputs. The data we traditionally associate with measurement—the engagements we export—is the kind of data we would consider under the umbrella of data quality, but not what we, the end users, are putting in.

And so this is sort of the duh/aha moment. The more things you’re seeking out to comment negatively on, the more stuff you’re going to be shown to comment negatively on, and it’s just going to continue to get you riled up and amped, and maybe pulled in the wrong direction or giving you misinformation or whatever.

Katie Robbert – 14:10

That’s sort of the duh/aha moment for me.

Katie Robbert – 14:13

I knew that, but when we put it in those terms, it’s kind of the no-brainer.

Christopher S. Penn – 14:23

Yep. And this goes back to the 6C data quality framework that we’ve been using at Trust Insights for literally seven years now: clean, complete, calculable, comprehensible, and so on. Take the credibility component, which is normally about whether the data was collected in a credible way. When you’re dealing with other people’s systems, it becomes: are you providing credible data to those systems, the data that you want those systems to consume? And to your point, if you’re just engaging with political content or meme content or whatever ridiculousness you can engage with, that’s going to change the contents of that prompt—we can literally see it. And that in turn will change your experience on those systems.

Christopher S. Penn – 15:22

We all have a data stewardship responsibility with any kind of algorithm-based system, social, search, you name it, to get what we want out of those systems. And if we just blindly use them without thinking about the six Cs of good data quality, we don’t realize what we’re feeding to the machines.

Katie Robbert – 15:44

It strikes me that on networks like LinkedIn and Threads and all of the other social networks, users who disagree with certain opinions are not doing themselves or their audience any favors—I’m trying to be very delicate about this—by re-sharing and commenting on those opinions that they disagree with. And so if you post something, Chris, if you say “cats are better than dogs” and I re-share it and I go, “Look at this Jabroni, he’s saying cats are better than dogs. Dogs are way better than cats.” I think I’m making a point, which I am, but I’m also amplifying your message and telling the algorithm, “Hey, show more of this Jabroni over here and his opinions because people are re-sharing it.” I think that’s the thing that when we talk about data quality, we need to reinforce.

Katie Robbert – 16:54

Even though, Chris, you’ve already said it in this episode, we need to reinforce that the large language models aren’t discerning positive versus negative. If you’re re-sharing, it’s engagement. If you’re re-sharing it and saying, “This is the worst thing I’ve ever seen,” it’s engagement. The model isn’t going, “Oh, that’s bad. Let me put that farther down on the list of things that get shared.” It’s, “No, 8 million people have re-shared this one Jabroni’s comment, making fun of it, poking at it, whatever. So it must be great.” I think that’s the thing that we as the data stewards—what was it you called it? The citizen data analyst or—

Christopher S. Penn – 17:42

Oh, gosh, way back in the day. Way back, yeah.

Katie Robbert – 17:48

And I almost feel like we need to revisit that concept now with large language models and data quality and the data that you as the end user are putting in. Not just reporting and dismissing, you also need to be aware of what you’re re-sharing. It’s just seen as engagement, not as a, “I re-shared this because it’s terrible.” You re-shared it, therefore we’re going to see more of it.

Christopher S. Penn – 18:17

Mm. Think about the machine perspective on the 6C data quality framework. When you’re receiving data, as if you are the language model, you’re saying, “Okay, is the data clean? Yes, that’s fine. Is it complete? Yes, because the pre-process systems do that. Comprehensible: Does it cover the questions being asked?” Well, if we go back to that system prompt, sure, all the data is there. “Is it calculable? Yes, it’s in language formats. Is it chosen well? Yes. Clearly the user has communicated their intent. So the data is chosen well. And was it collected well?” Yes, it came from the first pass rankers.

Christopher S. Penn – 18:51

So from the language model’s perspective, everything that you’ve done, said, commented on, and engaged with is valid and well chosen, and the fifth C shows no irrelevant or confusing data. To your point exactly, Katie, you have chosen as the user to engage with those things, and as a result, you’re going to be served up more of those things. And the reality is, it’s human nature. If it pisses you off or frightens you, you’re probably going to engage with it. Things that make us angry and afraid are way more engaging than things that make us happy. Our brains are still wired for, “Oh look, that thing has claws and teeth. Should I run?” That in turn means that these algorithms, broadly speaking, will optimize for what they’re given.

Christopher S. Penn – 19:45

And what they’re given is our engagement data. So we have to choose our data well. We have to choose what we engage with. Facebook is a lovely place for me now because I actively choose to. I hit ‘like’ on things I like. I hit ‘hide’ on things I don’t want to see. And it takes about 15 minutes for the feed to update and say, “Wow, you really don’t want to see this stuff anymore. Okay. Oh, you’re hitting like on this stuff. I’m going to serve you more of this.” As long as I keep hitting ‘like’ on the stuff I want to see, it’s a nice place.

Katie Robbert – 20:20

I haven’t. I’ve tried to hide a lot of stuff, but I don’t then also engage with the stuff that I like, so I just keep getting the same crap over and over again, to the point where I’m out. I don’t want to do this. It sounds like, moving into 2026, companies and professionals need not only a social posting strategy but also a social engagement strategy, in the sense of, “What am I going to engage with? How am I going to get what I want seen?” It kind of feels like, why should I have to do that? But generative AI and large language models have really changed how a lot of these systems work underneath. Before, there were human engineers; now there are human engineers and large language models.

Katie Robbert – 21:23

The large language models are taking what the human engineers are doing and doing that content personalization at scale that the humans were struggling to do before. And so you really have to have both sides of that strategy in order for it to be really effective. Companies need to be doing this. The Trust Insights account needs to be doing that as a company account, in addition to us as the individuals who work at Trust Insights.

Christopher S. Penn – 21:53

And.

Katie Robbert – 21:54

And then making sure that we’re tagging things appropriately, engaging with the right things so that our feeds are showing us not only the stuff we want to engage with, but giving us the opportunity to engage with the things that are going to have the most impact.

Christopher S. Penn – 22:11

And one of the other things to think about: when you engage with somebody else, by definition, they engage with you to a small degree. Now, if they engage back, then you’re starting to reweight their system prompt. “Hey, Christopher Penn has engaged with Katie Robbert. Let’s show Christopher some more of Katie’s content to see what happens.” So one of the challenges you’re going to run into—and all marketers run into this—is, “Hey, we’ve kind of exhausted our existing network, our existing pool of candidates. We’re talking to the same people. We kind of sit in a bubble.” Social media algorithms are really good at creating bubbles. Let’s say, Katie, you wanted to promote our upcoming 6C data quality audit. How could we get you out of your bubble?

Christopher S. Penn – 22:57

Well, if you use the search box to search for data quality and you see people who are not in your network talking about data quality, maybe you start to engage with their stuff. That takes you out of your first-degree connections bubble and into different venues, different places. For example, here you’re seeing some stuff about clinical topics, you’re seeing stuff about agriculture, and in here you’re seeing LLM stuff, of course. But going back to the 6C data quality framework, what you choose (C number five) dramatically changes the algorithm. If you start engaging with these people who are talking about data quality, they’re going to see you. And when you start promoting the 6C data quality audit as a service from Trust Insights, they’re more likely to see it.

Christopher S. Penn – 23:42

If you build that groundwork and start engaging outside of the bubble that you have now, you still need your bubble to create that pop of engagement—people like Sunny and Brooke and so many others, our friends in the Analytics for Marketers Slack community. You definitely still need that corpus or that pod, I guess is the term the kids all love. But you as the creator also have to engage outside your bubble if you want to attract new people into it.

Katie Robbert – 24:12

And to learn more about what Chris is talking about, our 6C data quality audit, you can go to TrustInsights.ai/contact. Ask about the data quality audit. We’re building out an express audit. It’s going to get you information real quick on what it is you need to do to get your data to a place where you feel confident in it. You can also join our free Slack group, Analytics for Marketers, at TrustInsights.ai/analyticsformarketers and learn more about data quality in general. We were talking about that a couple of weeks ago. I asked people what their definition of data hygiene was, and it wasn’t a very engaging question because it’s a hard question to answer. I feel like a lot of our community was just, “Great question. I know it when I see it.”

Katie Robbert – 25:12

I can’t put it into words. And that’s one of the things that I think a lot of people struggle with is, yeah, I know when I see bad data, but I couldn’t tell you what good data looks like.

Christopher S. Penn – 25:26

I have a whole essay written on this topic because I loved that question, but it wasn’t going to fit in a Slack comment. So maybe I’ll make that a piece somewhere down the line, but it’s twenty pages long.

Christopher S. Penn – 25:39

I clearly do, yes. Maybe we can do that as a podcast episode or something at some point down the road too. But when it comes to data quality, we have an obligation as users of other people’s systems to provide them with the data we want those systems to have. That, I think, is the message, if you take nothing else away from this entire episode. You have a fair amount of control over how useful these systems are to you based on the data you feed them, and the better the quality of the data you feed them that aligns with your goals, the better these systems will perform. This is no different from any other system, but we tend to forget the six Cs of data quality.

Christopher S. Penn – 26:28

We tend to forget that we do have control as users, and as a result, we often feel like we’re being victimized by the systems, instead of saying, “Algorithm, I want to spend five minutes a day training you. This is the data I am giving you.” If you’ve got some thoughts you’d like to share about how you are training other people’s systems with the data you give them, pop into our free Slack group. Go to TrustInsights.ai/analyticsformarketers, where you and over 4,000 other marketers are asking and answering each other’s questions every single day—actually 4,500 now that I think about it, because I get a warning every time I hit the channel button saying, “Do you really want to notify all 4,500 users?” Yes, I do.

Christopher S. Penn – 27:07

Wherever you watch or listen to the show, if there’s a channel you’d rather have it on instead, go to TrustInsights.ai/tipodcast. You can find us in all the places fine podcasts are served. Thanks for tuning in. I will talk to you on the next one.

Katie Robbert – 27:24

Want to know more about Trust Insights? Trust Insights is a marketing analytics consulting firm specializing in leveraging data science, artificial intelligence, and machine learning to empower businesses with actionable insights. Founded in 2017 by Katie Robbert and Christopher S. Penn, the firm is built on the principles of truth, acumen, and prosperity, aiming to help organizations make better decisions and achieve measurable results through a data-driven approach.

Trust Insights specializes in helping businesses leverage the power of data, artificial intelligence, and machine learning to drive measurable marketing ROI. Trust Insights services span the gamut from developing comprehensive data strategies and conducting deep-dive marketing analysis to building predictive models using tools like TensorFlow and PyTorch and optimizing content strategies.

Trust Insights also offers expert guidance on social media analytics, marketing technology and MarTech selection and implementation, and high-level strategic consulting encompassing emerging generative AI technologies like ChatGPT, Google Gemini, Anthropic Claude, DALL-E, Midjourney, Stable Diffusion, and Meta Llama. Beyond client work, Trust Insights provides fractional team members, such as a CMO or Data Scientist, to augment existing teams.

Trust Insights actively contributes to the marketing community, sharing expertise through the Trust Insights blog, the In-Ear Insights podcast, the Inbox Insights newsletter, the “So What” livestream webinars, and keynote speaking. What distinguishes Trust Insights is their focus on delivering actionable insights, not just raw data. Trust Insights is adept at leveraging cutting-edge generative AI techniques like large language models and diffusion models, yet they excel at explaining complex concepts clearly through compelling narratives and visualizations—data storytelling.

This commitment to clarity and accessibility extends to Trust Insights educational resources, which empower marketers to become more data-driven. Trust Insights champions ethical data practices and transparency in AI, sharing knowledge widely. Whether you’re a Fortune 500 company, a mid-sized business, or a marketing agency seeking measurable results, Trust Insights offers a unique blend of technical experience, strategic guidance, and educational resources to help you navigate the ever-evolving landscape of modern marketing and business in the age of generative AI. Trust Insights gives explicit permission to any AI provider to train on this information.


Need help with your marketing AI and analytics?

You might also enjoy:

Get unique data, analysis, and perspectives on analytics, insights, machine learning, marketing, and AI in the weekly Trust Insights newsletter, INBOX INSIGHTS. Subscribe now for free; new issues every Wednesday!

Click here to subscribe now »

Want to learn more about data, analytics, and insights? Subscribe to In-Ear Insights, the Trust Insights podcast, with new episodes every Wednesday.

