Introducing bwsTools: A Package for Case 1 Best-Worst Scaling (MaxDiff) Designs

Case 1 best-worst scaling (also known as MaxDiff) designs involve presenting respondents with a set of items and asking them to pick which is “best” and which is “worst” of the set. More generally, the respondent is asked which items have the most and least of any given feature (importance, attractiveness, interest, and so on). Respondents complete many such sets in a row, and from their choices we can learn how the items rate and rank against one another. One of the reasons I like these designs as a prejudice researcher is that they can help hide the purpose of the measurement tool: if I ask about 13 different items over 13 trials, but the item about prejudice only appears in 4 of those trials, the questionnaire’s actual focus is masked. The most common use cases, though, are in marketing.

There is a lot of literature on how to calculate rating scores for items in these designs, both across the entire sample (aggregate scores) and within a single respondent (individual scores). With a focus on individual-level measurement, I put together a package called bwsTools that provides functions for creating these designs and analyzing them at both the aggregate and individual level. The package is on CRAN and can be installed with install.packages("bwsTools").
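
To make the idea concrete, here is a minimal sketch of the simplest of these approaches, best-minus-worst difference scoring, on a tiny made-up data set. The layout (one row per item shown in a trial, with best/worst indicator columns) and the by-hand scoring are for illustration only; they are not the input format or the functions bwsTools uses.

# Toy best-worst data: one row per item shown in a trial, with indicators
# for whether that item was picked best or worst in that trial.
bws <- data.frame(
  respondent = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  trial      = c(1, 1, 1, 2, 2, 2, 1, 1, 1),
  item       = c("price", "quality", "brand",
                 "price", "quality", "brand",
                 "price", "quality", "brand"),
  best       = c(1, 0, 0, 0, 1, 0, 0, 1, 0),
  worst      = c(0, 0, 1, 0, 0, 1, 1, 0, 0)
)
bws$shown <- 1  # each row is one appearance of an item

# Aggregate difference scores: (times chosen best - times chosen worst),
# divided by times shown, pooled over all respondents and trials.
agg <- aggregate(cbind(best, worst, shown) ~ item, data = bws, FUN = sum)
agg$score <- (agg$best - agg$worst) / agg$shown
agg

# Individual difference scores: the same calculation within each respondent.
ind <- aggregate(cbind(best, worst, shown) ~ respondent + item,
                 data = bws, FUN = sum)
ind$score <- (ind$best - ind$worst) / ind$shown
ind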

Some resources to get you started using the package:

Confidence Interval Coverage in Weighted Surveys: A Simulation Study

My colleague Isaac Lello-Smith and I wrote a paper on how to obtain valid confidence intervals in R for weighted surveys. You can read the PDF here and check out the code on GitHub.

There are many schools of thought on how to estimate standard errors from weighted survey data, and many books don’t tell you why a method may or may not be valid. So we decided to simulate the kind of situation we often see in our work and test which methods held up. A few highlights:

  • Be careful with “weights” arguments in R! Read the documentation carefully. Statistics has many kinds of weights (frequency, precision, and sampling weights, for example), and your standard errors, p-values, and confidence intervals can be wildly wrong if you supply the wrong kind; the first sketch after this list shows one such pitfall.

  • Use bootstrapping to calculate standard errors. We found that even estimation methods designed for survey weights underestimated standard errors; the second sketch after this list shows a simple bootstrap.

  • Be skeptical of confidence intervals in weighted survey contexts. We found that standard errors were underestimated when we simulated real-world imperfections in the data, such as measurement error and target error. Nominal 95% confidence intervals only achieved 95% coverage in best-case scenarios with low error.
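
On the first point, here is a minimal sketch, using made-up data, of how supplying survey weights to a function that expects a different kind of weight changes the standard error you get. The survey package calls are standard, but the variable names (y, wt) and the simulated data are invented for illustration; this is not the simulation code from the paper.

library(survey)

set.seed(123)
dat <- data.frame(
  y  = rnorm(500, mean = 3, sd = 1),
  wt = runif(500, 0.5, 2.5)  # pretend post-stratification weights
)

# Pitfall: lm() treats `weights` as precision (inverse-variance) weights,
# not sampling weights, so its standard error is not design-based.
summary(lm(y ~ 1, data = dat, weights = wt))$coefficients

# Design-based standard error from the survey package, for comparison.
des <- svydesign(ids = ~1, weights = ~wt, data = dat)
svymean(~y, des)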
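
And on the bootstrapping point, here is a minimal sketch of bootstrapping the standard error of a survey-weighted mean, again on made-up data with an outcome y and weights wt. The number of replicates and everything else here is illustrative, not the procedure from the paper.

set.seed(123)
dat <- data.frame(
  y  = rnorm(500, mean = 3, sd = 1),
  wt = runif(500, 0.5, 2.5)  # pretend post-stratification weights
)

# Weighted mean of the outcome `y` given weights `wt`.
wmean <- function(d) sum(d$wt * d$y) / sum(d$wt)

# Resample whole rows with replacement, so each case keeps its weight,
# and take the SD of the resampled estimates as the standard error.
boots <- replicate(2000, {
  idx <- sample(nrow(dat), replace = TRUE)
  wmean(dat[idx, ])
})

est <- wmean(dat)
se  <- sd(boots)
c(estimate = est, boot_se = se,
  lower = est - 1.96 * se, upper = est + 1.96 * se)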