Segmenting With Sparse MaxDiff Data
Maybe the 36 MaxDiff items were different species of trilobites?


#MarketResearch, #ResearchMethods, #MaxDiff, #ChoiceModeling, #Segmentation, #SparseData, #LatentClass

In some MaxDiff studies we have so many items that we can't easily fit enough questions to show each item to each respondent the recommended three to four times. In those cases we often opt for a "sparse" MaxDiff, one in which each respondent sees each item only once or twice.
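To make the arithmetic concrete: with quads (four items per task), the number of tasks each respondent needs is simply items × shows per item ÷ 4. A quick sketch in Python (the helper function below is mine, purely for illustration):

```python
# How many quads does each respondent need so that every item
# appears 'shows_per_item' times? (items_per_task = 4 for quads)
def tasks_needed(n_items: int, shows_per_item: int, items_per_task: int = 4) -> int:
    return (n_items * shows_per_item) // items_per_task

print(tasks_needed(36, 3))  # 27 quads -- the recommended three shows per item
print(tasks_needed(36, 2))  # 18 quads
print(tasks_needed(36, 1))  #  9 quads
```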

While even a sparse design can recover a sample's true mean utilities with great fidelity, we know that our ability to recover respondents' true utilities degrades as our MaxDiff design becomes increasingly sparse (Chrzan 2015; Chrzan and Peitz 2019). This should have implications for segmenting with MaxDiff data, but I had never quantified the deterioration in our ability to recover segments that sparseness may cause. Recently a client raised this question, so I decided to look into it.

Method

I took an existing MaxDiff data set with a large number of items (36) for which I had previously run a four-segment solution using latent class MNL. I took the mean MNL utilities for each segment and treated them as the known (true) utilities for each respondent in each of four different segments, copying each segment's set of utilities into 200 rows of data. Thus I have 200 respondents in each of four segments, for a total of 800 respondents whose four unique sets of utilities and whose segment memberships I know for certain.
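A minimal sketch of that construction, assuming we already have the four segment-level mean utility vectors from the earlier latent class run (the values below are random placeholders, not the real utilities):

```python
import numpy as np

# Placeholder stand-ins for the four segments' mean MNL utilities
# (4 segments x 36 items); in the real study these came from the
# earlier latent class solution.
segment_utilities = np.random.default_rng(1).normal(size=(4, 36))

# Replicate each segment's utilities 200 times: 800 artificial
# respondents whose true utilities and true segments are known.
true_utilities = np.repeat(segment_utilities, repeats=200, axis=0)  # (800, 36)
true_segment = np.repeat(np.arange(4), repeats=200)                 # (800,)
```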

To create the MaxDiff data files for analysis, I programmed three MaxDiff experiments into our Lighthouse Studio software:  one with 100 versions of 27 sets of quads, one with 100 versions of 18 sets of quads and one with 100 versions of nine sets of quads.  Respondents in these three experiments will see each item three times, twice and just once, respectively.  Using the data generator functionality in our software, I had each of my 800 artificial respondents answer each of the three MaxDiff experiments, using their assigned utilities and a theoretically appropriate amount of random Gumbel response error. The result was three sets of responses from each respondent, one per experiment.
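The data generator itself lives inside Lighthouse Studio, but the random-utility logic behind it can be sketched in a few lines. Here is one way to simulate a single quad, assuming best and worst are modeled as sequential MNL choices with independent Gumbel errors (a common simulation setup; the function and variable names are mine, and the real generator may differ in its details):

```python
import numpy as np

def answer_quad(true_u: np.ndarray, item_ids: np.ndarray, rng) -> tuple[int, int]:
    """Simulate one best-worst answer to a quad of items.

    Adds i.i.d. Gumbel error to each item's true utility for the 'best'
    choice, then repeats on the negated utilities of the remaining items
    for the 'worst' choice (sequential best-worst assumption).
    """
    noisy_best = true_u[item_ids] + rng.gumbel(size=len(item_ids))
    best = item_ids[np.argmax(noisy_best)]

    rest = item_ids[item_ids != best]
    noisy_worst = -true_u[rest] + rng.gumbel(size=len(rest))
    worst = rest[np.argmax(noisy_worst)]
    return best, worst

# Example: respondent 0 answers a quad containing items 3, 11, 20 and 35.
rng = np.random.default_rng(7)
best, worst = answer_quad(true_utilities[0], np.array([3, 11, 20, 35]), rng)
```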

With both the experimental designs and the response data in hand, I ran latent class MNL on each of the three experiments to produce segments. In all three cases the BIC fit statistic correctly identified a four-segment solution, which was encouraging. When we compare known segment membership to the segment membership estimated from these three analyses, however, we expect some degradation from sparseness: respondents who see each item the recommended minimum of three times should fall into their known segment more often than those who see each item only twice or once.
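For readers who haven't used it, BIC trades off model fit against the number of parameters, and the solution with the lowest BIC wins. A hedged sketch of that comparison (the log-likelihoods below are made-up placeholders, and the parameter count assumes effects-coded item utilities plus class-share parameters, which may differ from how a given package counts them):

```python
import numpy as np

def bic(log_likelihood: float, n_parameters: int, n_observations: int) -> float:
    """Bayesian Information Criterion: lower is better."""
    return -2.0 * log_likelihood + n_parameters * np.log(n_observations)

n_items, n_resp = 36, 800
placeholder_ll = {2: -21900.0, 3: -21200.0, 4: -20700.0, 5: -20690.0}  # illustrative only
for k, ll in placeholder_ll.items():
    n_params = k * (n_items - 1) + (k - 1)  # class utilities + class shares
    print(k, round(bic(ll, n_params, n_resp), 1))  # the 4-class row comes out lowest here
```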

Results

And this is exactly what happens. I measured the accuracy of segment assignments using a standard metric called the Adjusted Rand Index (ARI) and found this pattern:

[Figure: Adjusted Rand Index (ARI) for the three MaxDiff designs]

As expected, the accuracy of segment assignments falls as sparseness gets worse.
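For anyone reproducing this kind of check, the ARI is a one-liner with scikit-learn and, usefully, it does not care how the segment labels happen to be numbered in either solution. A toy example:

```python
from sklearn.metrics import adjusted_rand_score

# Two segmentations of six respondents; the labels themselves are arbitrary,
# only the groupings matter. One respondent (the last) is misclassified.
true_labels      = [0, 0, 0, 1, 1, 1]
estimated_labels = [1, 1, 1, 0, 0, 2]
print(adjusted_rand_score(true_labels, estimated_labels))  # below 1.0; 1.0 means perfect recovery
```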

A more intuitive way to report this might be to count how often each method puts respondents in the right segments, say by crosstabbing true and estimated segment membership.  For example, for the standard MaxDiff, where each respondent sees each item three times, we get this crosstab: 

[Table: crosstab of true vs. estimated segment membership, each item shown three times]

I’ve highlighted the cells where the two segmentations match up. Summing the highlighted numbers and dividing by the total sample size of 800 shows that latent class MNL put 92.25% of the respondents into the correct segments. Doing the same for the other two experiments, we can see the deterioration in the accuracy of segment assignments that comes from sparseness:

[Table: share of respondents correctly classified, by number of times each item is shown]

Sure enough, as we show each MaxDiff item less often to each respondent, our ability to accurately segment respondents decreases. 
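One wrinkle if you compute this kind of hit rate yourself is that the latent class labels are arbitrary, so you first have to match estimated classes to true classes before summing the diagonal. A sketch of that calculation, assuming both segmentations have four classes (the estimated_segment vector below is a placeholder standing in for the modal class assignments from the latent class run):

```python
import numpy as np
import pandas as pd
from scipy.optimize import linear_sum_assignment

# true_segment is known by construction; estimated_segment would come from
# latent class MNL (here a relabeled placeholder with a few deliberate misses).
true_segment = np.repeat(np.arange(4), 200)
estimated_segment = (true_segment + 1) % 4   # same groups, different labels
estimated_segment[:30] = 0                   # misclassify 30 respondents

xtab = pd.crosstab(true_segment, estimated_segment)  # 4 x 4 crosstab of counts
rows, cols = linear_sum_assignment(-xtab.values)     # label matching that maximizes agreement
hit_rate = xtab.values[rows, cols].sum() / xtab.values.sum()
print(f"{hit_rate:.2%} of respondents assigned to their true segment")
```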

Conclusion

Of course, this analysis uses robotic respondents, but we have no reason to believe we’d have greater success with human respondents (in any case we never know the true segment membership of human respondents).

Also, this is just a single study, with one particular pattern of between-segment differences, equal-sized segments and respondents programmed to answer with equal amounts of response error. This research could usefully be expanded to include:

  • Experiments with different numbers of items
  • Designs with different amounts of sparseness
  • Populations with different numbers of segments
  • Segments with different patterns of utilities
  • Segments of differing sizes
  • Respondents with different amounts of response error

But those will be jobs for another day.


References

Chrzan, K. (2015) “A parameter recovery experiment for two methods of MaxDiff with many items,” Sawtooth Software research paper, available at https://sawtoothsoftware.com/resources/technical-papers/a-parameter-recovery-experiment-for-two-methods-of-maxdiff-with-many-items

Chrzan, K. and M. Peitz (2019) “Best-worst scaling with many items,” Journal of Choice Modelling, 30: 61-72.

Comments

The fact that recapture rates are so poor for segments where everyone is a bullseye member of their segment (not possessing a range of classification probabilities, as exist in the real world) is a true indictment of the sparser designs (anything less than 3x). That said, since MaxDiff struggles so much to predict holdout choices on the actual items themselves, you're starting with a flawed importance measurement to begin with. Go with Qsort and save respondents the pain of iterating through all those redundant-seeming choices.


This is great work, Keith Chrzan, thank you for sharing this! It would be really interesting to see whether the drop in % correct from 2 full shows to 1 is linear from 18 tasks down to 9, or more of a cliff at the 1-show point. My guess is the decline gets much steeper as you get closer to only 1 show, since that's when you lose any connectivity of items across tasks and any sense of an overall ranking. I suspect that complete lack of connectivity and ranking beyond "chosen best", "not chosen", and "chosen worst" on each task is what makes it increasingly difficult for latent class to assign respondents to the correct segment.


Good work, Keith, on a question that I've often been asked and have offered opinions on without running the simulations to support it with evidence. Your results suggest a bigger drop-off in segmentation assignment accuracy when each item is shown 1x vs 2x, as opposed to the lesser drop-off in accuracy when each item is shown 2x vs. 3x. This all feels right to me, from a touchy-feely standpoint. I'd still recommend showing each item 3x per respondent for strong segmentation analysis.

I wonder if the number of task versions has any effect on the results. Is it possible that with, say, 10 versions, there could be a correlation between version and accuracy? That relationship would be hard to observe with 100 versions. I'm also curious whether the respondents who are classified incorrectly are that way at random due to nothing explainable, due to the random errors generated when creating their survey answers, or due to differences in the designs and which specific comparisons they evaluated. More ideas for future research.

