Results Are Different to those from Another Program
Where differences are identified between results in Q and other programs, they reflect either differences in the data or differences in how things are computed. It is generally advisable to investigate differences in the data first.
- 1 Differences in the data
- 2 Differences in how things are computed
- 2.1 Rounding
- 2.2 Weights
- 2.3 Means (averages)
- 2.4 Standard deviation
- 2.5 Effective sample sizes and design effects
- 2.6 Sample sizes
- 2.7 Differences in percentages
- 2.8 Regression models
- 2.9 Principal Components Analysis
- 2.10 Latent class models, mixture models, cluster analysis, trees (segmentation)
- 2.11 Significance tests
- 3 Obtaining assistance from Q in reconciling differences
- 4 See also
Differences in the data
Different sample sizes
The first thing to check when trying to reconcile data in Q with data in another program is the sample size (in Q, Base n and, if the data is weighted, Base Population). Where the sample sizes differ, it means one or more of the following:
- One of the files has been created at a different time to another (e.g., before or after more interviews have been conducted).
- Different data cleaning processes have been used.
- Different definitions of the base have been used.
When data is exported from a data collection program into a data file, various decisions are made about how to represent the data. If these rules have been applied inconsistently, the results can differ (e.g., if "No" responses have been written out as missing data by one program but not by the other).
Differences in how things are computed
Rounding
Q rounds to the nearest integer, with ties (values ending in .5) rounded up. IBM SPSS products default to rounding ties to even numbers. Some crosstab products have two-stage rounding (they first round to the nearest decimal place and then round the result to the nearest integer).
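The three rounding behaviors can be compared with a minimal sketch using Python's standard `decimal` module. The function names are illustrative, and reading "rounds up" as "ties are rounded up" is an assumption; this is not code from Q or SPSS.

```python
from decimal import Decimal, ROUND_HALF_UP, ROUND_HALF_EVEN

def round_half_up(value: str) -> Decimal:
    """Round to an integer; ties go away from zero (up, for positive values)."""
    return Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_UP)

def round_half_even(value: str) -> Decimal:
    """Round to an integer; ties go to the nearest even number."""
    return Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN)

def round_two_stage(value: str) -> Decimal:
    """Round to one decimal place first, then round that result to an integer."""
    one_decimal = Decimal(value).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
    return one_decimal.quantize(Decimal("1"), rounding=ROUND_HALF_UP)

# A tie: the two tie-breaking rules disagree.
tie_half_up = round_half_up("2.5")      # 3
tie_half_even = round_half_even("2.5")  # 2

# Not a tie, but two-stage rounding turns it into one (2.45 -> 2.5 -> 3).
single_stage = round_half_up("2.45")    # 2
two_stage = round_two_stage("2.45")     # 3
```

A percentage of 2.45% can therefore be shown as 2% in one program and 3% in another even though the underlying data is identical.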
Weights
If a weight is applied in one program but not in another, the results will differ. The way that different programs handle weights also affects significance tests (discussed below).
Means (averages)
Differences in means are generally attributable to differences in the data (e.g., the data having been recoded inconsistently, such as with different treatment of missing values or application of filters).
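Standard deviation
The standard formula for the sample standard deviation is:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
where $x_i$ is the value of the $i$th of $n$ observations and $\bar{x}$ is their mean.
This formula does not take weights into account. A simple modification of the formula is to treat the $i$th observation's weight, $w_i$, as representing a frequency, which leads to:
$$s = \sqrt{\frac{\sum_{i=1}^{n} w_i (x_i - \bar{x}_w)^2}{\sum_{i=1}^{n} w_i - 1}}$$
where $\bar{x}_w = \sum_{i=1}^{n} w_i x_i / \sum_{i=1}^{n} w_i$ is the weighted mean.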
This formula is widely used (e.g., in SPSS Statistics). However, it is incorrect in situations where the weights reflect the probability of a respondent being selected in a survey. For example, if the average weight is 1 this leads to a different standard deviation than if the average weight is 2. This formula is only used in Q where Weights and significance in Statistical Assumptions has been set to Un-weighted sample size in tests. Otherwise, Q instead uses the following formula:
Using Frequency Weights in Q discusses how to make Q apply the simpler formula. See the SurveyAnalysis.org page on weighting for definitions of sampling weights and frequency weights. Also see Weights, Effective Sample Size and Design Effects for a more general discussion about weighting.
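To illustrate why the frequency-weight approach depends on the average weight, the sketch below computes the weighted sum of squared deviations from the weighted mean and divides it by the sum of the weights minus 1. The function name and data are illustrative assumptions, not code from Q or SPSS.

```python
import math

def frequency_weighted_sd(values, weights):
    """Standard deviation treating each weight as a frequency (a count of cases)."""
    total_weight = sum(weights)
    mean = sum(w * x for x, w in zip(values, weights)) / total_weight
    squares = sum(w * (x - mean) ** 2 for x, w in zip(values, weights))
    return math.sqrt(squares / (total_weight - 1))

values = [1.0, 2.0, 3.0, 4.0]

# Same relative weighting, but the average weight is 1 in one case and 2 in the other.
sd_avg_weight_1 = frequency_weighted_sd(values, [1.0, 1.0, 1.0, 1.0])
sd_avg_weight_2 = frequency_weighted_sd(values, [2.0, 2.0, 2.0, 2.0])
```

Doubling every weight leaves the weighted mean unchanged but shrinks the standard deviation, because the denominator behaves as if there were eight respondents rather than four. That is reasonable for true frequency weights but not for sampling weights.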
Effective sample sizes and design effects
As discussed in Weights, Effective Sample Size and Design Effects, Q has many different options for how effective sample sizes are computed and how design effects are factored in. Two particular things to keep an eye out for are:
- As discussed in the previous section, within Q, the design effect and effective sample size can be taken into account when computing standard deviations, whereas in many programs they are not taken into account.
- By default, Q automatically computes a design effect for weighted data and includes this in addition to any supplied in Extra deff.
Sample sizes
Typically, different sample sizes will be caused by:
- Actual differences in the data.
- Weights containing missing values or 0s (Q automatically excludes these from sample size computations, but many other programs do not).
- Rounding. Some programs round sample size data. For example, SPSS either defaults to rounding sample sizes (e.g., in Custom Tables) or gives users options for controlling this.
- Q's treatment of missing data on multiple response questions (see Sample Size Seems Too Small).
- Selecting the wrong type of sample size in Q (in particular, refer to n, Base n, or Column n for un-weighted sample sizes, and Population, Base Population, or Column Population for weighted sample sizes). For example, a Count in SPSS is equivalent to Population in Q, or to a rounded Population.
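Several of the causes listed above can be seen in a small sketch. The exclusion rule coded here (drop missing and zero weights) follows the description above, but the function names and data are illustrative assumptions rather than Q's implementation.

```python
import math

def is_usable(weight):
    """A weight counts only if it is present, not NaN, and non-zero."""
    return weight is not None and not math.isnan(weight) and weight != 0

def sample_sizes(weights):
    """Return (un-weighted n, weighted population), skipping unusable weights."""
    usable = [w for w in weights if is_usable(w)]
    return len(usable), sum(usable)

weights = [1.4, 0.6, 0.0, None, 1.3]

n, population = sample_sizes(weights)    # cases with usable weights only
naive_n = len(weights)                   # a program that counts every row
rounded_population = round(population)   # a program that rounds weighted counts
```

Here one data file yields three candidate "sample sizes" (the un-weighted n, the weighted population, and the rounded population), plus a fourth if rows with unusable weights are counted.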
Differences in percentages
Pick One and Pick One - Multi questions
Differences in percentages on Pick One and Pick One - Multi questions are generally attributable to differences in the data (e.g., the data having been recoded inconsistently, such as with different treatment of missing values or application of filters) or to different bases. This is most easily assessed by comparing sample sizes (using Base n).
Pick Any, Pick Any - Compact and Pick Any - Grid
- Differences in the data (e.g., the data having been recoded inconsistently, such as with different treatment of missing values or application of filters). This is most easily assessed by comparing sample sizes (using Base n).
- Where multiple response data is stored in the Pick Any - Compact format, some programs count repeated values twice. For example, in SPSS if one person chose the fourth option in two separate variables, they count as two people in the percentage of responses section, whereas in Q they do not. That is, Q computes percentages of respondents, whereas SPSS computes percentages of responses.
- Different definitions applying to NET, Base and Total columns. As discussed at NET, a NET in Q is often not the same as a Base or Total in other programs.
- Pick Any - Compact questions having the wrong Question Type (i.e., being set as Pick One - Multi). See Multi-punch/Multiple Response Questions Displaying as Grids.
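The distinction between percentages of respondents and percentages of responses can be sketched as follows. The data layout (one list of option codes per respondent, with `None` for blanks) and the function names are illustrative assumptions.

```python
# Each row is one respondent; each entry is the option code held in one of the
# variables of a compact multiple response layout (None means blank).
responses = [
    [4, 4, None],     # chose option 4 in two separate variables
    [1, 4, None],
    [2, None, None],
    [1, 2, 3],
]

def percent_of_respondents(rows, option):
    """Share of people who chose the option at least once."""
    chose = sum(1 for row in rows if option in row)
    return 100.0 * chose / len(rows)

def percent_of_responses(rows, option):
    """Share of all non-blank answers that are the option (repeats counted)."""
    answers = [code for row in rows for code in row if code is not None]
    return 100.0 * answers.count(option) / len(answers)

by_respondent = percent_of_respondents(responses, 4)   # 2 of 4 people
by_response = percent_of_responses(responses, 4)       # 3 of 8 answers
```

The first respondent contributes one person to the respondent-based figure but two answers to the response-based figure, and the two calculations also use different denominators.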
Regression models
Differences between regression models produced in different programs tend to relate to:
- Treatment of missing values (e.g., SPSS has an option for Exclude cases pairwise whereas Q always uses listwise deletion).
- Treatment of weights. Most statistics programs (e.g., SPSS) interpret weights as being frequency weights, unless specific instructions are given to interpret the data in other ways. Q assumes that the weights are sampling weights (unless instructed otherwise) and uses a Calibrated Weight. The best way to identify whether weights are the cause of the problem is to run the analysis without weights.
- Whether standard errors are robust or not.
- Selection of the type of regression model (e.g., using a linear regression in one program and a binary logit in another).
- Differences in the intercepts of regression models where missing is set to 'Use partial data (pairwise)'.
- Differences in the standard errors of weighted regression models where missing is set to 'Use partial data (pairwise)'.
Principal Components Analysis
Differences in PCA results between Q and other programs, and also between Q's different PCA implementations, are due to:
- Different treatments of missing values. For example, whether analysis is based on pairwise correlations or not.
- Arbitrary choices of sign. For example, one program may show all the loadings as being the negative of the loadings shown by another program.
- Specific algorithm (e.g., whether PCA or factor analysis is being conducted).
- Local optima in rotations. PCA itself is an exact algorithm; provided that the issues above are addressed, the results should be the same. However, the rotation methods are not exact, and different programs can find different solutions.
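The arbitrary choice of sign mentioned in the list above can be checked directly: if a vector is a principal component direction, so is its negative. The sketch below uses a hand-picked 2 x 2 correlation matrix whose first eigenvector is known exactly. The helper names and the sign convention in `align_sign` are illustrative assumptions, not what any particular program does.

```python
import math

correlation = [[1.0, 0.6],
               [0.6, 1.0]]

def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def is_eigenvector(matrix, vector, eigenvalue, tol=1e-12):
    """Check that matrix @ vector equals eigenvalue * vector."""
    product = mat_vec(matrix, vector)
    return all(abs(p - eigenvalue * v) < tol for p, v in zip(product, vector))

def align_sign(vector):
    """One possible convention: make the largest-magnitude loading positive."""
    pivot = max(range(len(vector)), key=lambda i: abs(vector[i]))
    return vector if vector[pivot] > 0 else [-v for v in vector]

# First component of this matrix: eigenvalue 1.6, direction (1, 1) / sqrt(2).
first = [1 / math.sqrt(2), 1 / math.sqrt(2)]
flipped = [-v for v in first]

both_valid = (is_eigenvector(correlation, first, 1.6)
              and is_eigenvector(correlation, flipped, 1.6))
same_after_alignment = align_sign(first) == align_sign(flipped)
```

Applying the same sign convention to the loadings from both programs before comparing them removes this source of apparent disagreement.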
Latent class models, mixture models, cluster analysis, trees (segmentation)
There is no reason to expect different programs to get the same results for any of latent class, cluster analysis and tree models. This is because:
- Most companies use slightly different statistical models, even though they have the same name. For example:
  - There are multiple widely used k-means algorithms. For example, Q and R use Hartigan, J. A. and M. A. Wong (1979). "A K-means Clustering Algorithm." Applied Statistics 28(1): 100-108, but SPSS uses a different algorithm.
  - There are multiple different mixture modeling algorithms (e.g., latent class). For example, when attempting to specify the same basic model, Q may use Maximum Likelihood estimation, Latent Gold may use Posterior Mode estimation and Sawtooth may use Hierarchical Bayesian estimation.
  - There are dozens and perhaps hundreds of different tree algorithms.
- Even when the same algorithm is used, minor differences in implementation make results inconsistent. For example:
  - Most mixture algorithms (e.g., latent class analysis) involve some element of randomization and differences in which random numbers are generated can change outputs.
  - Different programs have slightly different stopping rules.
More generally, the reason that segmentation algorithms give different results is that they all return only approximate solutions, as it is computationally infeasible to find the best solution for all but the most trivial problems. For example, with 1,000 respondents and 5 segments, there are more than $10^{696}$ possible segmentations (roughly $5^{1000}/5!$). As latent class allows for people to be partially in multiple segments, it permits an infinite number of segmentations. Consequently, when computer programs try to find segments, they start by making a few guesses, and one of the key differences between programs is how they make those guesses, with things like the order of cases and variables in the data file helping to determine the initial guesses.
In Q, the best protection against solutions that are difficult to replicate in other programs is to go into the Advanced options when creating the segments and set Number of starts to, say, 10 or 20, or a larger number if you have the time to wait.
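The value of multiple starts can be sketched with a deliberately simple one-dimensional k-means. This uses a basic Lloyd-style algorithm with random initial centers, which is an assumption made for brevity: it is neither the Hartigan-Wong algorithm nor Q's implementation, and all names and data are illustrative.

```python
import random

def kmeans_1d(values, k, seed, iterations=50):
    """Run Lloyd-style k-means from one random start; return (sse, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)          # the initial "guess"
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in values:
            nearest = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            clusters[nearest].append(x)
        # Keep the old center if a cluster ends up empty.
        new_centers = [sum(c) / len(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:           # stopping rule: no change
            break
        centers = new_centers
    sse = sum(min((x - c) ** 2 for c in centers) for x in values)
    return sse, sorted(centers)

def best_of_starts(values, k, n_starts):
    """Repeat from several starts and keep the solution with the lowest SSE."""
    return min(kmeans_1d(values, k, seed) for seed in range(n_starts))

values = [0.8, 1.0, 1.2, 4.9, 5.0, 5.3, 9.0, 9.1, 9.4, 20.0]
single_sse = [kmeans_1d(values, 3, seed)[0] for seed in range(10)]
best_sse, best_centers = best_of_starts(values, k=3, n_starts=10)
```

Individual starts can stop at different local optima. Keeping the best of many starts makes the reported solution less dependent on any one initial guess, and therefore easier to reproduce.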
Significance tests
Choice of test
Most programs use slightly different statistical tests. In particular, Q does not default to the tests that are standard in SPSS, Quantum and Survey Reporter, but often equivalent tests can be selected by modifying the options in Statistical Assumptions.
The role of weights
How weights are treated can have a major impact on computations of statistical significance. Most statistics programs treat all weights as frequency weights (e.g., SPSS Base, R). Most market research programs assume that weights are sampling weights and use a Calibrated Weight when computing statistical tests (e.g., Quantum, Survey Reporter, Wincross, Uncle). Most specialized survey analysis programs (e.g., SPSS Complex Samples, R's survey package) use special-purpose variance estimation algorithms for dealing with weights. Q uses a combination of special-purpose variance estimation and Calibrated Weights in its analysis. Which is used, and when, is discussed for each test (see Tests Of Statistical Significance), and the behavior can be modified in Statistical Assumptions.
Additionally, some packages, such as SPSS Custom Tables, automatically round weighted data to whole numbers prior to performing tests.
Multiple comparison corrections
By default, Q uses multiple comparison corrections on all tables. The specific correction used, the False Discovery Rate Correction (FDR), is not used by other market research programs. There is also no standard way to report the FDR, so the specific values of the corrected p-values differ from those reported by R, but this does not affect conclusions. You can turn this off or select a different correction in Statistical Assumptions.
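As a rough illustration of how a false discovery rate correction changes which results are flagged, the sketch below implements the Benjamini-Hochberg procedure, reporting adjusted p-values in the style of R's `p.adjust(method = "BH")`. That Q's FDR correction corresponds to this procedure is an assumption here, and, as noted above, Q reports corrected p-values differently from R, so the specific numbers should not be expected to match Q's output.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Work from the largest p-value down so a running minimum keeps the
    # adjusted values monotone.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, index in enumerate(order):
        rank = m - position                      # 1 = smallest p-value
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.20]
corrected = benjamini_hochberg(raw)

flagged_raw = sum(p <= 0.05 for p in raw)
flagged_corrected = sum(p <= 0.05 for p in corrected)
```

With these four illustrative p-values, three tests pass a 0.05 threshold before correction and only one passes afterwards. A program that applies no correction will therefore mark more cells as significant from the same data.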
When performing the Multiple comparison correction in Column comparisons there are at least two different reasons for differences. First, there are differences between the algorithms used to compute the studentized range statistics; in the case of SPSS versus Q, these differences are typically very small (e.g., in one program a p-value of 0.0135 may be computed whereas in another program a p-value of 0.0132 may be computed). Second, where there are non-equal sample sizes in the groups being compared, these are treated differently as well (for example, when computing Homogeneous subgroups with Tukey’s HSD, Q’s formulas implicitly use the harmonic mean of two groups whereas SPSS computes the harmonic mean of all the groups).
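The effect of the two harmonic mean conventions can be seen with a small sketch using Python's standard `statistics` module. The group sizes and function names are illustrative assumptions, and the sketch shows only the sample size that each convention would feed into the test, not the test itself.

```python
from statistics import harmonic_mean

group_sizes = {"A": 50, "B": 200, "C": 400}

def pairwise_harmonic_n(sizes, first, second):
    """Harmonic mean of only the two groups being compared."""
    return harmonic_mean([sizes[first], sizes[second]])

def overall_harmonic_n(sizes):
    """Harmonic mean of every group in the table."""
    return harmonic_mean(list(sizes.values()))

n_pairwise = pairwise_harmonic_n(group_sizes, "A", "B")   # 2 / (1/50 + 1/200)
n_overall = overall_harmonic_n(group_sizes)               # 3 / (1/50 + 1/200 + 1/400)
```

When comparing groups A and B, the pairwise convention uses a sample size of 80 while the all-groups convention uses about 109, so the standard errors and the resulting p-values differ even though the data is the same.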
When using repeated measures data, particular care should be taken, as different programs often make different assumptions about how to treat the likely occurrence of violations of the normality assumptions.
Upper versus lower case letters in Column Comparisons
Some programs show all results using uppercase letters when performing Column Comparisons. Some programs use lowercase to indicate results between the 0.05 and 0.1 levels of significance and uppercase for p-values less than or equal to 0.05. Q uses lowercase for results less than 0.001 and uppercase for more significant results.
Q uses Corrected p when determining whether to assign a letter and whether the letter is UPPERCASE or lowercase.
Treatment of 0% and 100%
Some programs do not compute significance when performing comparisons involving either 0% or 100% (e.g., SPSS Custom Tables).
Obtaining assistance from Q in reconciling differences
If you require assistance in reconciling results obtained in Q with those obtained in other programs, please:
- Review this page and check that the issue is not described here.
- Send an email to support containing the following:
  - A QPack (File > Share > Send to Q Support (Encrypted)).
  - If using proprietary or internally-developed software, the actual algorithms used in the testing (i.e., the code). If these are not available, detailed formulas are needed. Please note that short descriptions such as "t-tests were used" or "chi-square tests were used" are not useful, as there are dozens of such tests, and there are no standard versions of these tests (e.g., the tests in introductory statistics books and on Wikipedia are rarely used in commercial software). Similarly, descriptions written in non-technical language, such as descriptions referring to things like the "average" or the "total", are too ambiguous to be useful, or,
  - If using well-known commercial software:
    - The name of the application used to conduct the testing (e.g., SPSS Version 12).
    - Information about any specific tests/options selected in the program. That is, either the scripts used to conduct the testing, or screenshots of the options selected if using a program with a graphical user interface.
    - Information from the program's technical manual about how the tests are computed, which includes either the specific formulas used or references to formulas in books or journals.
  - Detailed technical outputs provided by the other program. That is, most data analysis programs contain options to export various information used in the calculation of significant results. For example, Quantum exports a tstat.dmp file. Please note that providing us with one or two crosstabs and noting inconsistencies in terms of what is marked as statistically significant does not constitute a detailed technical output. A detailed technical output needs to contain one or more of z, t or p statistics/values.
  - Tables created by the other program.
  - A list of a few specific examples explaining the differences. E.g., "On table 1 from the Quantum outputs you can see that it shows the 18 to 24s are significantly lower in their preference for Coke, but Q is not showing this."