I first came across synthetic respondents in the market research industry a couple of years ago. The idea was simple: in a world where there is a constant pressure from clients for bigger respondent samples, and a rising cost of finding respondents to fill these sample quotas, why not use a large language model (tuned using real respondent data) to “mimic” those human respondents for qualitative research projects, generating the required response size for almost no extra cost?
It seemed too good to be true. And after diving into the details, and getting views from many experienced colleagues and peers in the industry, I am convinced that it is.
I am now certain that while there are many ways in which large language models - including those tuned on real respondent data - can add value to a market research project, using them to solve the sample size issue is deeply problematic, and in fact will almost certainly lead to worse insights, worse decision-making, and ultimately a worse service for clients.
Statistics and Extrapolation
As any statistician or data professional will tell you, there is a core assumption that underpins all of data analysis and statistical modelling: the idea that any real-world data you collect does not exist in isolation, but instead should be treated as a set of samples drawn from some underlying distribution.
As an example, imagine that you are a marketing professional looking at month-by-month counts of visitors to your website. Some months you’re up, some months you’re down, and although these numbers are controlled by lots of different factors, statisticians would assume that they are essentially just random draws from a certain type of probability distribution - like counting outcomes from rolls of a die, or flips of a coin.
While this may seem perverse, assuming an underlying distribution for a set of data is very helpful, because it allows us to describe the data with a mathematical equation, giving us access to some really powerful tools that we wouldn’t otherwise be able to make use of. For example, in the case of our monthly website visitor counts, this assumption allows us to start to extrapolate from the data we have, predicting answers to questions like:
- “what is the monthly number of visitors likely to be in June?”
- “how many visitors are we likely to get next year?”
- “if we improve our average visitors by 10%, what impact will that likely have over time?”
If we simply assumed that the data was the data, and didn’t model what the underlying pattern might be, then none of these questions would be answerable using our clever mathematical tools - we’d be reduced to guessing based solely on the data we already had.
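To make this concrete, here is a minimal sketch of answering those questions in Python. The visitor counts are made-up numbers, and the choice of a Poisson model is my own assumption for illustration - but once that assumption is made, extrapolation and uncertainty bands fall out almost for free:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical monthly visitor counts (made-up numbers for illustration).
monthly_visitors = np.array([4210, 3980, 4105, 4400, 4350, 4120])

# Assume each month is a draw from a Poisson distribution, and estimate
# its rate parameter with the sample mean (the maximum-likelihood estimate).
lam = monthly_visitors.mean()

# "How many visitors are we likely to get next year?" Under the Poisson
# assumption, a year is the sum of 12 independent monthly draws.
expected_yearly = 12 * lam

# "If we improve our average visitors by 10%, what impact will that have?"
expected_yearly_boosted = 12 * lam * 1.10

# Simulate many possible years to put an uncertainty band on the forecast.
simulated_years = rng.poisson(lam, size=(10_000, 12)).sum(axis=1)
low, high = np.percentile(simulated_years, [2.5, 97.5])

print(f"expected next year: {expected_yearly:.0f} ({low:.0f} - {high:.0f})")
```

None of this would be possible if we treated the six observed numbers as just six numbers: the model is what lets us talk about months we haven’t seen yet.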
As a side note, this approach of making no assumptions about the underlying data is actually possible and can be extremely useful. In fact, it’s how many powerful machine learning algorithms - including LLMs - work. However, the big drawback is that it requires an enormous amount of data to produce useful results, and in our market research use case, this simply isn’t possible.
Distribution of Data: a Robust Field
The shape that statisticians assume for the underlying distribution is not just a random shot in the dark. There are many tried and tested methods and models for determining how closely your data matches the distribution you think it’s coming from, and it’s often the case that you have to use a combination of statistical testing and pragmatic thinking to understand where your data really comes from.
For example, our monthly website visitor count data is likely to follow a Poisson distribution, the distribution typically used to model the number of times a discrete event occurs within a fixed time frame. Named after Siméon Denis Poisson, who published it in 1837 (although it was possibly derived over 100 years earlier by Abraham de Moivre), it’s used to model everything from the decay of a radioactive sample, to counts of yeast cells in brewing Guinness, to the locations of V-1 bombs landing in London during the Blitz.
In other words - we know it works, because it’s been proven to be useful in many cases over multiple centuries. There are even mathematical tests we can carry out to increase our confidence that the data we have matches the distribution we think it comes from.
My point is this - assuming that your data are random draws from a known probability distribution rests on over 300 years of robust mathematical study. It’s not a shot in the dark - it’s an extremely well-defined methodology that underpins a huge amount of data analysis and statistical modelling in the modern world.
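As a sketch of what such a check can look like: one quick diagnostic for whether count data is plausibly Poisson is the variance-to-mean ratio (the dispersion index), which should sit near 1 for genuinely Poisson data. The data here is simulated rather than real, purely to show the diagnostic separating a good fit from a bad one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate counts that genuinely are Poisson, and counts that are not
# (a negative binomial with the same mean but much larger variance).
poisson_counts = rng.poisson(lam=50, size=2_000)
overdispersed = rng.negative_binomial(n=5, p=5 / 55, size=2_000)  # mean ~50

def dispersion_index(counts):
    """Variance-to-mean ratio: roughly 1 if the data is Poisson."""
    return counts.var(ddof=1) / counts.mean()

print(dispersion_index(poisson_counts))  # close to 1
print(dispersion_index(overdispersed))   # far above 1
```

This is only one of many checks (chi-square goodness-of-fit tests are the more formal route), but it illustrates the key point: the match between data and distribution is something you can actually measure.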
Generating Synthetic Data
But how does this relate to synthetic respondents?
Well, once you are pretty certain about the equation that underpins the shape of your sample data, you might well start to ask, “can I use this equation to generate more data like my sample?”
The answer, broadly, is yes! If your modelling approach is robust enough, generating new samples - synthetic data - from your distribution is a statistically sound thing to do.
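For a distribution like the Poisson, the whole procedure takes only a few lines. This is a minimal sketch using made-up visitor counts, assuming numpy is available - fit the model, then sample from it:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical observed monthly counts (made-up numbers for illustration).
observed = np.array([4210, 3980, 4105, 4400, 4350, 4120])

# Step 1: fit the assumed distribution. For a Poisson model, the
# maximum-likelihood estimate of the rate is simply the sample mean.
lam_hat = observed.mean()

# Step 2: draw as many synthetic samples as you like from the fitted model.
synthetic = rng.poisson(lam_hat, size=1_000)

# The synthetic data should reproduce the sample's key statistics -
# which is exactly what prior verification of the model guarantees.
print(observed.mean(), synthetic.mean())
```

Note that step 2 is only trustworthy because step 1 - and the verification behind it - came first.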
Why bother? Well, if your data contains sensitive or personal information, then generating a set of synthetic data might be a safe approach for high-level analysis. You may also want a larger sample of data to look deeper into any over- or under-represented sections of the data, particularly when looking for ways to mitigate bias, and it could also be used as a way to model specific edge cases, where you have methods that require bigger data sets than you currently have.
What’s key to this approach, however, is that it only works because of the robust, tested, well-known statistical methods that you used to ensure that your sample data and generating distribution match. If you haven’t done this, what you’re producing is just noise - garbage data with no way of knowing whether it relates to your original sample at all.
Speaking of which…
The Lie of Synthetic Respondents
The idea of a synthetic respondent is to apply this same methodology to a qualitative sample of market research. You take a sample of human responses to your questions, apply a model - in this case, a large language model - to the data, and then ask it to generate more responses. Should work just as well, right?
Well, not really. While this might be an interesting way to explore the data you already have (and there are plenty of other useful things you can do with this approach), it’s a terrible way to try and “boost” your sample. Here’s why:
You can’t verify your assumptions
There are no probability distributions designed to measure or model qualitative human opinion. There are no mathematical equations to model what someone might say next, so there’s no way for us to verify that what we’re producing from our model comes from the same underlying distribution as our original sample. Everything is purely guesswork and opinion.
You don’t have a big enough sample
We can’t assume an underlying distribution, but as mentioned above, there are lots of clever machine learning models that don’t need one - in fact, large language models seem to be able to model human speech pretty accurately. However, large language models are trained on literally billions of data points, which allows them to find the structure in the data itself. Your sample of 25 respondents is a drop in the ocean compared to that, so even fine-tuning with this data will lead to outputs that are far more biased towards the original training data (read: Wikipedia) than towards the respondents whose opinions you’re interested in.
It defeats the whole point of market research
The role of market research and insights in decision-making is to understand the opinions of potential customers/users when making product decisions. Your hope - your goal - is to find a new opinion, something that surprises you or moves you. A synthetic respondent will never give you a new or surprising opinion - they will simply give you an amalgamation of the opinions of those in their training set.
My experienced, senior market research colleagues have told me that qualitative research is not about “60% of the sample said they preferred this so we should go with it”. Instead, it’s about panning for gold - finding the story, those little phrases or feelings that capture the essence of why a brand is strong, or a product is disliked, in amongst the rest of the noise. Adding synthetic respondents just creates more noise, making that process of sifting the sand harder, not easier.
Using LLMs for Good
So “boosting” a sample with synthetic respondents is not only pointless, it also makes your job as a market researcher harder. But surely this type of tooling can be useful in some way?
I certainly think it can. Market research is a data-rich process, but until a few years ago, we didn’t have the tooling to analyse large qualitative data sets in the way we’ve been able to do with quantitative data. So here are three ways that I think large language models will change the way that qualitative research will be done in the future:
- Automating steps in the process: many businesses have already adopted AI transcription and translation tools, but that’s only the first phase; building discussion guides, creating primary topic analysis, and even synthetic moderation are all areas that are ripe for AI automation (and for which third-party tools already exist on the market).
- Enhancing deliverables with interactivity: one interesting way you could use this methodology is to allow your clients to directly interact with respondent personas; give them the chance to ask your LLM why they prefer X brand over Y, or what would make them switch (assuming these are questions you’ve asked your real respondents).
- Joining up previous research: lots of research businesses, and buyers of research, have built up large libraries of knowledge on different topics. These libraries, however, are often not maintained or stored well, so knowledge is lost to bad organisation. Joining up these libraries with an LLM “librarian” could allow access to huge amounts of lost knowledge, creating efficiencies through time saved and smarter decision-making overall.
Conclusion
I am not a Luddite (clearly, or I’d be in the wrong job). I have no doubt that large language models will change how I work, how the market research industry works, and how many parts of the world work, in ways that we cannot even fathom right now.
I am also not an expert in market research, but I do know enough about statistics and data science to know that large language models are mathematically incapable of having an original thought or opinion. And in an industry which relies on the gathering and understanding of real opinions, the idea that a synthetic respondent could be anything but detrimental to that is utter fantasy.
So please, don’t let these AI models pollute your real, gold-standard, human opinion-gathering with hallucinated rubbish. Resist the temptation of the big lie of AI: that synthetic respondents will ever be anything other than a poor imitation of a real human being.
Fervently disagree with me? Think I’ve hit the nail on the head? Simply want to discuss this topic further? Feel free to reach out to me via our contact page, or directly on LinkedIn.