Weights in R
This page describes how weights are handled when using R from within Q.
All Standard R items automatically detect and apply weights. However, if you create or use custom-written R code, you will need to explicitly control how weights are applied.
R Outputs have access to two weighting objects:
- QPopulationWeight. This contains the values of the weight variable.
- QCalibratedWeight. This contains the Calibrated Weight.
Approaches to using weights when writing R code
In R, there is no standard way of addressing weights. While many R functions have a weights parameter, there is no consistency in how it is interpreted:
- Most commonly, weights in R are interpreted as frequency weights.
- Occasionally they are interpreted as sampling weights (e.g., in the survey package).
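The frequency-weight interpretation can be illustrated with a small sketch in base R. It assumes a toy data set (the data frame d and its columns are made up for illustration) and uses a Poisson glm, where an integer weight of k gives the same estimates and standard errors as repeating the case k times:

```r
# Toy data (hypothetical values); w holds integer frequency weights.
d <- data.frame(x = c(0, 1, 2, 3, 4, 5),
                y = c(1, 0, 3, 2, 6, 7),
                w = c(1, 2, 1, 3, 2, 1))

# Fit 1: pass the weights to glm.
fit_weighted <- glm(y ~ x, family = poisson, data = d, weights = w)

# Fit 2: repeat each row w times and fit without weights.
d_replicated <- d[rep(seq_len(nrow(d)), times = d$w), ]
fit_replicated <- glm(y ~ x, family = poisson, data = d_replicated)

# Under the frequency interpretation the two coefficient tables agree.
table_weighted <- summary(fit_weighted)$coefficients
table_replicated <- summary(fit_replicated)$coefficients
print(table_weighted[, c("Estimate", "Std. Error")])
print(table_replicated[, c("Estimate", "Std. Error")])
```

Because the function treats the sum of the weights as the number of observations, any weight that does not represent a count of cases will distort the inference.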
How to adapt existing R functions developed for frequency weights to deal with sampling weights
Using sampling weights in a function written for frequency weights will typically have the following consequences:
- Parameter estimates will be appropriate. For example, if using sampling weights in lm or glm, you will obtain correct parameter estimates, even though the weights parameter assumes the weights are frequency weights.
- Computations of inference, such as p-values and standard errors, will be wrong.
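Both consequences can be seen in a minimal sketch, assuming made-up data and a Poisson glm (chosen because its standard errors depend directly on the total weight). The same relative weights are supplied at two different scales, as happens when sampling weights sum to a population size rather than to the sample size:

```r
# Toy data (hypothetical values). w and w_scaled carry the same relative
# weights; w_scaled is simply 4 times larger.
d <- data.frame(x = c(0, 1, 2, 3, 4, 5),
                y = c(1, 0, 3, 2, 6, 7),
                w = c(1, 2, 1, 3, 2, 1))
d$w_scaled <- 4 * d$w

fit_small <- glm(y ~ x, family = poisson, data = d, weights = w)
fit_large <- glm(y ~ x, family = poisson, data = d, weights = w_scaled)

# The parameter estimates do not depend on the scale of the weights.
print(coef(fit_small))
print(coef(fit_large))

# The standard errors do: multiplying the weights by 4 halves them.
se_small <- summary(fit_small)$coefficients[, "Std. Error"]
se_large <- summary(fit_large)$coefficients[, "Std. Error"]
se_ratio <- se_small / se_large  # approximately 2 for each coefficient
print(se_ratio)
```

Nothing about the data has changed between the two fits, so the apparent gain in precision is purely an artifact of how the weights were scaled.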
There are a variety of solutions to this problem.
Rewriting the functions to deal with sampling weights
This approach involves rewriting the estimation code so that it handles sampling weights directly, for example by using Taylor series expansions to compute standard errors. While this is the best approach, and has been implemented in the survey package, it is the most complex of the approaches.
Weight calibration
This approach involves scaling the weights in a manner such that the inference is not "too bad" (see Calibrated Weight).
This can be the most pragmatic approach when dealing with weights in multivariate methods, where inference is only of secondary concern. In Standard R, this is used for most multivariate methods. Among the regression methods, it is used only for the following, as the rest employ Taylor series expansions: Regression - Multinomial Logit, Regression - Ordered Logit, and Regression - NBD Regression.
The function CalibrateWeight in flipData calibrates weights.
Note that the QCalibratedWeight and the weights computed using CalibrateWeight will not necessarily be the same, as:
- QCalibratedWeight is computed on the entire data file. If the calibration is instead computed on a subset of the data, the results will be different. Consequently, it is often appropriate to apply the CalibrateWeight function to QCalibratedWeight once cases have been filtered and missing values removed.
- QCalibratedWeight will automatically assign a weight of 0 to negative and missing values of weight variables. By contrast, CalibrateWeight will produce an error if such values are encountered.
- QCalibratedWeight takes other settings in a project into account, such as design effects (see Weights, Effective Sample Size and Design Effects).
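The effect of calibrating on a subset can be sketched in base R. The sketch assumes that calibration means rescaling the weights so that they sum to Kish's effective sample size, (sum of w)^2 / sum of w^2. The function calibrate_weight_sketch is a hypothetical stand-in written for this illustration; it is not the flipData implementation and does not reproduce how Q computes QCalibratedWeight:

```r
# Kish's effective sample size for a vector of weights.
effective_sample_size <- function(w) sum(w)^2 / sum(w^2)

# Hypothetical stand-in for a calibration function (an assumption, not the
# flipData code). Like the behavior described above, it stops with an error
# on missing or negative weights rather than silently replacing them.
calibrate_weight_sketch <- function(w) {
  stopifnot(is.numeric(w), !anyNA(w), all(w >= 0), sum(w) > 0)
  w / sum(w) * effective_sample_size(w)
}

w <- c(0.5, 1, 1, 2, 3.5)                 # made-up sampling weights
w_full <- calibrate_weight_sketch(w)      # calibrated on all cases

# Calibrating after filtering differs from filtering the calibrated weight.
keep <- c(TRUE, TRUE, TRUE, TRUE, FALSE)  # made-up filter
w_filtered_then_calibrated <- calibrate_weight_sketch(w[keep])
w_calibrated_then_filtered <- w_full[keep]

print(sum(w_filtered_then_calibrated))    # effective sample size of the subset
print(sum(w_calibrated_then_filtered))    # a different total
```

The relative sizes of the weights are the same in both versions; only the total differs, and it is the total that drives the inference.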
Stratified weight calibration
This approach involves stratifying the data and applying calibration within each stratum. For example, if performing a two-sample t-test with the assumption of unequal variances, the calibration can be performed within each of the samples.
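A minimal sketch of calibrating within strata, using base R only. It assumes that calibration rescales the weights in each stratum to sum to that stratum's effective sample size; the helper names and the data are hypothetical and are not taken from flipData:

```r
# Kish's effective sample size for a vector of weights.
effective_sample_size <- function(w) sum(w)^2 / sum(w^2)

# Hypothetical calibration helper (an assumption about the rescaling used).
calibrate_weight_sketch <- function(w) {
  stopifnot(is.numeric(w), !anyNA(w), all(w >= 0), sum(w) > 0)
  w / sum(w) * effective_sample_size(w)
}

# Apply the calibration separately within each stratum. ave() returns the
# results in the original case order.
calibrate_within_strata <- function(w, strata) {
  ave(w, strata, FUN = calibrate_weight_sketch)
}

# Made-up example: the two samples of a two-sample t-test.
w <- c(1, 2, 3, 1, 1, 4, 0.5, 2.5)
group <- c("A", "A", "A", "A", "B", "B", "B", "B")
w_stratified <- calibrate_within_strata(w, group)

# Within each group the calibrated weights sum to that group's
# effective sample size.
print(tapply(w_stratified, group, sum))
print(tapply(w, group, effective_sample_size))
```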
Resampling
This approach involves creating a new synthetic data set by randomly selecting cases, with replacement, from an existing data set. Cases are selected with probability proportional to the weight. That is, a weighted bootstrap is used to create the data set. This can be done using flipTransformations::AdjustDataToReflectWeights.
This approach is often better than weight calibration where the goal is inference, but the parameter estimates are less precise (due to the noise added from the randomization). This approach is always inferior to rewriting the functions to correctly deal with weights (e.g., via Taylor series linearization).
Where applying resampling, it is a good idea to:
- Calibrate the weight, as otherwise the sample size will be exaggerated.
- Set the random number seed, so that the same answer is given each time the function is used.
- Give the user a way of changing the seed so that they can assess sensitivity.
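These three recommendations can be combined in a short base R sketch. The function resample_by_weight, its arguments, and its default seed are hypothetical and written only for illustration; this is not the implementation of flipTransformations::AdjustDataToReflectWeights. The default output size is the rounded effective sample size, which has the same effect on sample size as calibrating the weight first:

```r
# Weighted bootstrap: draw rows with replacement, with probability
# proportional to the weight. Hypothetical helper for illustration only.
resample_by_weight <- function(data, w,
                               n_out = round(sum(w)^2 / sum(w^2)),
                               seed = 1223) {
  stopifnot(nrow(data) == length(w), !anyNA(w), all(w >= 0), sum(w) > 0)
  set.seed(seed)  # same seed, same synthetic data set
  idx <- sample.int(nrow(data), size = n_out, replace = TRUE, prob = w)
  data[idx, , drop = FALSE]
}

# Made-up data; case 6 has a weight of 0 and so can never be selected.
d <- data.frame(id = 1:6, y = c(2.1, 3.4, 1.9, 5.0, 4.2, 3.3))
w <- c(0.5, 1, 1, 2, 3.5, 0)

synthetic_a <- resample_by_weight(d, w)             # default seed
synthetic_b <- resample_by_weight(d, w)             # identical to synthetic_a
synthetic_c <- resample_by_weight(d, w, seed = 99)  # user-supplied seed

print(nrow(synthetic_a))
print(identical(synthetic_a, synthetic_b))
```

Rerunning the analysis with a few different seeds, and comparing the results, gives a rough sense of how much noise the resampling itself contributes.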
In Standard R, this approach is typically used for items in the Test sub-menu wherever Taylor series expansions have not been computed. Where a test is being conducted, the resampled sample size will typically be the rounded effective sample size (after removal of cases with missing values, if applicable). Where a test is not being conducted (e.g., Machine Learning - Random Forest), the resampled sample size will match the original sample size.
Ignoring weights
For a small number of analysis methods, such as hierarchical cluster analysis and distance calculations, the calculations are based on differences between individual cases. In these methods, weights are ignored, and should be.