Weight Weight…Don’t Tell Me (About Unrepresentative Data)
Let’s start with a hypothetical market research scenario.
We run the Minnesota Gizmo Company. Our marketing director wants us to collect data on Minnesotans’ gizmo opinions for a new advertising campaign. Based on previous research, we know that gizmo preferences correlate significantly with age. As such, when we send out our market research survey, we want to make sure that our sample is representative of the Minnesota population, with particular emphasis on making sure that the age distribution of our sample approximates that of Minnesota’s population.
Let’s suspend our disbelief and say we have a list of every Minnesotan’s email address that we can randomly sample from. We send out 10,000 emails and receive 5,000 completed surveys. One of the questions asks respondents about their age; we use this information to calculate the age distribution of our sample. We create the table below to compare the age distribution of our sample to that of our population (perhaps based on the most recent publicly available census data).
Table 1. Population and Sample Age Proportion
Our sample doesn’t match the population! We want to make sure the survey results fairly represent our entire potential customer base. As the data currently stands, we are worried the analysis will overrepresent the opinions of older Minnesotans and underrepresent the opinions of younger Minnesotans. What can we do?
Let’s Lift Some Weights
We can weight the data. Weighting can be an effective option when your sample differs from your population in ways that might affect the usefulness of the data. In our example, because it is so important to us that the sample’s age distribution match the population’s age distribution, we decide to weight the data by respondents’ age.
How do we do it? There can be a lot of depth and mathematics involved in weighting, but a basic understanding of a couple of the most common techniques, including when and how to use them, can help you make sure your sample is representative of your population.
Simple Proportional Weighting
Simple proportional weighting might be the most well-suited for our Minnesota Gizmo Company example.
We use this weighting strategy to force the sample proportions of key adjustment variables to match the population proportions. For our needs, the key adjustment variable is age. We divide the population percentage by the sample percentage to find the weight that will increase the representation of the 0-35 age group and decrease that of the 61+ age group. Let’s go back to our table to illustrate this:
Table 2. Age Population and Sample Proportion with Weights
Using this method, we would assign respondents in the 0-35 age group a weight of 1.5, those in the 36-60 age group a weight of 1, and those in the 61+ age group a weight of 0.75. Responses from someone with a weight of 1.5 count as if they came from one and a half people, responses from someone with a weight of 1 count as if they came from a single person, and responses from someone with a weight of 0.75 count as if they came from three-quarters of a person.
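To make the division concrete, here is a minimal sketch in Python. The percentages are hypothetical: they were chosen only because they reproduce the weights of 1.5, 1, and 0.75 quoted above, and they are not the actual values from the table. The function name is also just an illustration.

```python
# Hypothetical percentages, chosen only so that they reproduce the
# weights 1.5, 1 and 0.75 mentioned in the text.
population_pct = {"0-35": 30, "36-60": 40, "61+": 30}
sample_pct = {"0-35": 20, "36-60": 40, "61+": 40}


def proportional_weights(population, sample):
    """Return population percentage / sample percentage for each group."""
    return {group: population[group] / sample[group] for group in population}


weights = proportional_weights(population_pct, sample_pct)

# Sanity check: weighting the sample percentages should give back
# the population percentages.
weighted_pct = {group: sample_pct[group] * weights[group] for group in sample_pct}

for group in weights:
    print(f"{group}: weight {weights[group]:.2f}, weighted share {weighted_pct[group]:.0f}%")
```

Groups that are underrepresented in the sample end up with weights above 1, and overrepresented groups end up with weights below 1.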
Once we apply these weights and analyze our data, we can assure our marketing department that they are getting a snapshot from a sample whose age reflects that of their target population.
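As a rough illustration of what "applying the weights" means in an analysis, the sketch below compares an unweighted and a weighted share of respondents who like a gizmo. The ten respondents and their answers are invented for the example; only the three age weights come from the text.

```python
# Age weights from the text.
AGE_WEIGHTS = {"0-35": 1.5, "36-60": 1.0, "61+": 0.75}

# Hypothetical respondents: (age group, 1 if they like the gizmo, else 0).
responses = [
    ("0-35", 1), ("0-35", 1),
    ("36-60", 1), ("36-60", 0), ("36-60", 1), ("36-60", 0),
    ("61+", 0), ("61+", 0), ("61+", 1), ("61+", 0),
]


def weighted_mean(rows, weights):
    """Average the values, counting each respondent by their group's weight."""
    total_weight = sum(weights[group] for group, _ in rows)
    weighted_sum = sum(weights[group] * value for group, value in rows)
    return weighted_sum / total_weight


unweighted = sum(value for _, value in responses) / len(responses)
weighted = weighted_mean(responses, AGE_WEIGHTS)

print(f"unweighted share: {unweighted:.3f}")
print(f"weighted share:   {weighted:.3f}")
```

In this made-up data the younger respondents are more favorable, so giving them more weight pulls the overall estimate upward.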
At this point, you might be asking:
What would we do if we wanted to weight by more than just age?
Let’s say we wanted to weight by age and education, and each variable had three levels (three age ranges and three education levels).
We could break each adjustment variable out into its own table and calculate a weight for each variable, then multiply them together. We would calculate three age weights and three education weights. Then, for each case, the total weight would be the product of that respondent’s age and education weights. It’s important to use caution, however, as weighting with multiple variables may create weights that are larger or smaller than normally accepted (such as smaller than 0.03 or larger than 2.5), which further reduces the precision of the data.
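A small sketch of this multiplication, with a check for weights outside an accepted range, might look like the following. The education weights are invented for illustration, and the bounds are simply the example values quoted above; conventions differ between fields, so treat both as placeholders.

```python
from itertools import product

# Age weights from the earlier example.
age_weights = {"0-35": 1.5, "36-60": 1.0, "61+": 0.75}
# Hypothetical education weights, invented for illustration.
education_weights = {"high school": 1.8, "some college": 1.0, "degree": 0.6}

# Example bounds taken from the text; adjust to your own field's convention.
LOWER_BOUND, UPPER_BOUND = 0.03, 2.5

# Each respondent's total weight is the product of their two marginal weights.
combined = {
    (age, edu): age_weights[age] * education_weights[edu]
    for age, edu in product(age_weights, education_weights)
}

# Flag any age-by-education combination whose weight falls outside the bounds.
out_of_range = {
    cell: weight
    for cell, weight in combined.items()
    if weight < LOWER_BOUND or weight > UPPER_BOUND
}

for (age, edu), weight in combined.items():
    flag = "  <-- outside bounds" if (age, edu) in out_of_range else ""
    print(f"{age:>6} / {edu:<12} {weight:.2f}{flag}")
```

Even though every individual weight here looks reasonable, one product lands above the upper bound, which is the kind of problem the caution above is about.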
Another option is a method called raking, which is designed for situations where we want to weight on multiple intersecting adjustment variables but do not know the population proportion for every possible combination of them.
Let’s continue with our Minnesota Gizmo Company scenario from a different angle. The marketing department wants our survey sample to match the population along age as well as education. Each variable has three levels. If we make a table that illustrates all of the combinations of these two variables, we get the following:
Table 3. Population Table for Raking Education and Age
If we know the marginal totals of this table (in other words, the total number of people in each age category, as well as the total number of people in each education category), we can rake. Raking involves a fair amount of mathematics, but programs like R can make quick, automated work of it using a raking algorithm. If you use a program to rake the variables in our example above, what will it do? Let’s use a flow chart to illustrate:
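For readers who prefer code, a common way to implement raking is iterative proportional fitting: scale the rows of the sample table to match one set of marginal totals, scale the columns to match the other, and repeat until both sets agree. The Python sketch below assumes this approach; it is not necessarily what any particular R package does internally. The sample counts and population targets are hypothetical, and the sketch assumes every cell contains at least one respondent.

```python
def rake(sample_counts, row_targets, col_targets, max_iter=1000, tol=1e-8):
    """Iterative proportional fitting on a 2-D table of counts.

    Assumes every row and column has a non-zero total.
    """
    table = [[float(cell) for cell in row] for row in sample_counts]
    for _ in range(max_iter):
        # Step 1: scale each row so its total matches the row target.
        for i, target in enumerate(row_targets):
            row_total = sum(table[i])
            table[i] = [cell * target / row_total for cell in table[i]]
        # Step 2: scale each column so its total matches the column target.
        for j, target in enumerate(col_targets):
            col_total = sum(row[j] for row in table)
            for row in table:
                row[j] *= target / col_total
        # Step 2 disturbs the row totals; stop once they are back within tol.
        gap = max(abs(sum(table[i]) - t) for i, t in enumerate(row_targets))
        if gap < tol:
            break
    return table


age_groups = ["0-35", "36-60", "61+"]
education_levels = ["high school", "some college", "degree"]

# Hypothetical sample counts: rows are age groups, columns are education levels.
sample_counts = [
    [80, 70, 50],
    [120, 150, 130],
    [160, 140, 100],
]
# Hypothetical population marginal totals, scaled to the sample size of 1,000.
age_targets = [300, 400, 300]
education_targets = [400, 350, 250]

raked = rake(sample_counts, age_targets, education_targets)

# Each respondent's weight is the raked count of their cell / its sample count.
cell_weights = [
    [raked[i][j] / sample_counts[i][j] for j in range(len(education_levels))]
    for i in range(len(age_groups))
]

for age, row in zip(age_groups, cell_weights):
    print(age, [round(weight, 3) for weight in row])
```

Note that the procedure only needs the marginal totals, never the population count for each age-by-education combination.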
What else is there to know?
It’s important to note that the best strategy for having sample data that matches your population is to use a smart sampling strategy and avoid weighting. For example, in our Minnesota Gizmo example, having age quotas that force the final age distribution to match that of the Minnesota population may have been a better option than a simple random sample.
There are many different types of weights not explored here – design weights, propensity weights, matching weights, and more.
The smaller the number of respondents in a category of your adjustment variable(s), the more potential weighting has to add variance to your data. For example, if you are weighting 10 respondents to count for 100 respondents, fluctuations between respondents’ answers will have a much larger effect than if you are weighting 85 respondents to count for 100.
Weights can be mixed and matched. Some researchers, such as those at Pew Research Center, combine multiple types of weighting to make them more effective.
The adjustment variables used to weight the sample are often demographic variables, but they do not have to be. An adjustment variable can be any variable for which researchers consider it important that the sample proportions align with those of the population.
The right type of weight depends on many factors, including the context of the survey, the available population data, and the sampling strategy.
Acceptable weighting strategies vary across different industries and fields, and ideal weighting processes by strict academic standards rarely exist in the real world.
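The earlier point about small groups and added variance can be made more concrete with one widely used summary, Kish’s approximate effective sample size, which is not named above and is offered here only as an illustration. It divides the squared sum of the weights by the sum of the squared weights. The sketch below applies it to the 10-versus-85 example, with a hypothetical group of 90 other respondents who all have a weight of 1.

```python
def effective_sample_size(weights):
    """Kish's approximation: (sum of weights)^2 / (sum of squared weights)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


# Hypothetical remainder of the sample: 90 respondents with a weight of 1.
others = [1.0] * 90

# 10 respondents weighted up to count for 100, plus the others.
scenario_small = [100 / 10] * 10 + others
# 85 respondents weighted up to count for 100, plus the others.
scenario_large = [100 / 85] * 85 + others

for label, weights in [("10 -> 100", scenario_small), ("85 -> 100", scenario_large)]:
    n = len(weights)
    n_eff = effective_sample_size(weights)
    print(f"{label}: n = {n}, effective n = {n_eff:.1f}, ratio = {n_eff / n:.2f}")

ratio_small = effective_sample_size(scenario_small) / len(scenario_small)
ratio_large = effective_sample_size(scenario_large) / len(scenario_large)
```

The closer the ratio is to 1, the less precision the weighting costs; large weights on a small group pull the ratio down sharply.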
Lohr, S. L. (2010). Sampling: Design and analysis. Boston: Brooks/Cole.
Chakrapani, C. (2000). Marketing research: State-of-the-art perspectives. Chicago, IL: American Marketing Association.
How different weighting methods work. (2018, January 26). Retrieved from https://www.pewresearch.org/methods/2018/01/26/how-different-weighting-methods-work/
Battaglia, M. P., Hoaglin, D. C., & Frankel, M. R. (2009, June 1). Practical considerations in raking survey data. Survey Practice. Retrieved from https://www.surveypractice.org/article/2953-practical-considerations-in-raking-survey-data
Survey: Computing Your Own Post-Stratification Weights in R. (2014, April 13). Retrieved from https://www.r-bloggers.com/survey-computing-your-own-post-stratification-weights-in-r/