Solving Data Quality - a Step by Step Plan

There aren’t many things you can buy where a 20-30% defect rate is totally normal. Picture yourself buying new car tyres, for example: if you discovered one of them to be completely bald and dangerously close to a blow-out, you’d be furious. And the garage that sold them to you saying “sure, sorry about that, here’s a replacement” wouldn’t help at all – how is it OK to sell you a defective product in the first place?!

But that’s the state of market research. If you pay for 1,000 respondents to answer your survey, you’ll generally find 20-30% (sometimes as much as half) of responses are unusable: a string of meaningless answers from someone who hasn’t read a single question, or completed by ChatGPT. The sample provider will replace them, if you complain, but it’s just regarded as a fairly normal part of any research project.

Just to repeat that: the 1,000 respondents you paid for will contain around 300 responses that are absolutely useless, contaminating your overall results and findings; and it’s up to you to identify these responses and get them replaced.

How is this OK? How many research projects have gone off the rails because a significant proportion of the data they used was completely bogus?

I’ve written a couple of columns about the issue of data quality in market research, what it is and what it means for the industry – you can find them here.

The research industry is well aware of this problem, and has been discussing it for several years. Every sample vendor talks about how seriously they take sample quality; every research industry body has a data quality working group; and one of those bodies, ESOMAR, has introduced a framework of 37(!!) questions to help assess sample providers’ credentials. It all looks very serious, but the net result has been… limited. There’s greater awareness of data quality as an issue; but there’s been no meaningful shift in its prevalence. So, what actually needs to happen, to solve the problem of data quality in research?

[Image: Research 101: if you need 37 questions, you're probably doing it wrong]

The starting point is acknowledging that while data quality is a shared problem affecting the whole research industry, we’re not all equally responsible. Just like a garage selling defective tyres, the distribution of responsibility isn’t exactly equal: sure, researchers need to stop buying bad sample, but more important is that sample vendors need to stop selling dreck. Once the research industry explicitly recognises this underlying truth, the steps to solving data quality become pretty straightforward.

[Image: Sample providers helping solve the problem of low-quality sample]

First, the industry needs to apply a clear and serious standard to sample vendors. As a starting point, any panel or sample provider routinely selling more than 10% low-quality sample should be flagged as such by industry associations. In a stroke, this would give sample vendors a clear incentive to clean up what they sell, and ensure they’re perfectly aligned with sample buyers (ie research agencies and client-side researchers) in wanting to remove low-quality respondents from the sample supply chain. And I’d argue that any provider that can’t hit a target of 90% useable data is not serious about data quality. 

Second, “river sample”, that great blended sewage pipe of anonymous sample aggregated from who knows where, needs to end. It allows everyone in the supply chain – sample supplier, sample seller, sample buyer – to evade responsibility for the crap being sold as sample; and it needs to be shut down. Some companies will have to reconfigure themselves pretty sharply, but again, any business that relies on selling thousands of worthless responses to researchers is not a business that should be tolerated by the research industry.

Instead, the industry needs to move to a ‘know your respondent’ model of owned and operated research panels, in which the organisation selling the responses knows who each respondent is and vouches that they are in fact legitimate. Research panels are not inherently bad, but they have to be properly managed.  That means taking recruitment seriously, actively managing respondents’ survey loads, and checking their responses. It’s hard work, but it used to be standard for research panels – and there are promising signs that some panel vendors are going back to those fundamentals.

This also means that any respondent who’s repeatedly red-flagged for spurious answers or using ChatGPT needs to be blacklisted from the research panel: not just kicked off it, but blocked from rejoining.

The third step to solving data quality takes this further. A consistent problem with research panels is that of people signing up to multiple panels in order to complete more surveys and earn more cash for doing so. Overcoming that problem requires much closer collaboration between sample providers, for example in sharing their blacklists of low-quality respondents. I rarely see the blockchain as a solution to anything, but this kind of distributed list of research panellists could potentially be managed with a blockchain, tracking how many panels an individual has signed up to, whether they’ve been red-flagged for bogus responses and so on. In this way, persistent offenders can be managed, and their impact on the research industry minimised.
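Whether or not a blockchain is the right vehicle, the underlying mechanism – a shared, privacy-preserving blacklist – is simple enough to sketch. Here’s a minimal illustration in Python, assuming participating providers agree to exchange salted hashes of respondent identifiers rather than raw personal data; all names, thresholds and the shared salt here are hypothetical, not an existing industry standard:

```python
import hashlib

# Hypothetical: a salt agreed between participating sample providers,
# so the same respondent hashes to the same token everywhere.
SHARED_SALT = "industry-agreed-salt"

def respondent_token(email: str) -> str:
    """Hash a respondent identifier so providers can compare
    blacklists without exchanging raw personal data."""
    normalised = email.strip().lower()
    return hashlib.sha256((SHARED_SALT + normalised).encode()).hexdigest()

class SharedBlacklist:
    """Toy in-memory registry; a real one would be a shared service
    that each panel provider queries at sign-up time."""
    def __init__(self):
        self._flags = {}  # token -> number of red flags across panels

    def red_flag(self, email: str) -> None:
        token = respondent_token(email)
        self._flags[token] = self._flags.get(token, 0) + 1

    def is_blocked(self, email: str, threshold: int = 2) -> bool:
        return self._flags.get(respondent_token(email), 0) >= threshold

registry = SharedBlacklist()
registry.red_flag("bot@example.com")   # flagged by panel A
registry.red_flag("bot@example.com")   # flagged again by panel B
print(registry.is_blocked("bot@example.com"))      # True
print(registry.is_blocked("genuine@example.com"))  # False
```

The point of the hashing step is that providers can cooperate on persistent offenders without ever sharing their panellists’ actual details with competitors.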

There are other steps that would also help, such as the previously standard practice of following up with respondents a couple of days after they complete a survey to double-check some of their responses. These involve extra work and cost, but add an extra layer of quality control to the research process.

And what can individual researchers do to ensure they don’t get caught out with low-quality responses? Well, that’s pretty straightforward too.

  1. Don’t pay peanuts for your sample. Right now you can get 1,000 respondents for about $2,000 – don’t. It’s not just that your own research study will suffer, it’s that you’re perpetuating a business model that shouldn’t exist, of selling crap data.
  2. Do ask any panel or sample provider, bluntly, “what percentage of the responses I get are going to be garbage?”. And hold them to this – follow up with them if it’s higher, and if it’s more than 10%, ask them why. Then thoroughly check your data: flag any low-quality responses with the panel provider, and tell them you expect those respondents to be blacklisted from the panel. If more than 20% of your data is low-quality, be prepared to switch providers next time out.
  3. Look dispassionately at your research and survey design. Honestly, will it really encourage people to answer authentically and truthfully? Will it bore them into wanting to race through it? What changes can you make to keep your questions engaging, and keep people answering honestly?
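“Thoroughly check your data” can start with a few mechanical screens before any human review. This is a minimal sketch, assuming survey responses with a completion time, a grid of rating answers and one open-ended answer; the field names and thresholds are illustrative only, not an industry standard:

```python
# Hypothetical sketch of basic response-quality checks: speeders,
# straight-liners, and suspiciously short open-ended answers.

def flag_response(resp, median_seconds=300):
    """Return a list of quality flags for one survey response."""
    flags = []
    # Speeder: finished in under a third of the median completion time
    if resp["seconds"] < median_seconds / 3:
        flags.append("speeder")
    # Straight-liner: gave the identical answer to every grid question
    if len(set(resp["grid_answers"])) == 1:
        flags.append("straight-liner")
    # Open-end check: too short to be a considered answer
    if len(resp["open_end"].split()) < 3:
        flags.append("weak open-end")
    return flags

responses = [
    {"id": 1, "seconds": 80, "grid_answers": [3, 3, 3, 3, 3],
     "open_end": "good"},
    {"id": 2, "seconds": 410, "grid_answers": [2, 4, 3, 5, 1],
     "open_end": "I mostly buy this brand because it is cheaper"},
]

for r in responses:
    print(r["id"], flag_response(r))
```

Checks like these won’t catch everything – a ChatGPT-written open-end will sail past a length check – but they identify the most obvious junk cheaply, and give you concrete respondent IDs to send back to the panel provider.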

It takes hard work to do these things, both individually and as an industry; and it costs money to do research right. But it’s worth doing the extra work, and it’s worth paying for it. Data quality is the single biggest issue facing market research right now – if we can’t trust the data we gather, we can’t trust the insights and recommendations we produce – and it’s vital that we fix it.




More articles by Nick Drew
