Driver (Importance) Analysis


Driver analysis computes an estimate of the importance of various independent variables in predicting a dependent variable. Most commonly, the dependent variable measures preference or usage of a particular brand (or brands), and the independent variables measure characteristics of this brand (or brands). For example, the dependent variable may be a measure of overall satisfaction and the independent variables may be measurements of satisfaction with bank fees, efficiency, friendliness, wait times, etc.

How to run driver analyses in Q

The most straightforward way to compute driver analysis is using Automate > Browse Online Library in Q 4.8.3 or later (QScripts > Online Library in Q 4.8.2) and choosing from the various options beginning with Regression - Driver (Importance) Analysis. Note that this method will treat all variables as being numeric, irrespective of their Variable Type and their Question Type.

Most variants of driver analysis are based on regression, and consequently some variants can also be computed directly from Regression and Experiments. Also, some researchers use Correlation as a way of computing importance scores.

Methods

Linear Regression Coefficients

Linear regression coefficients are estimates of the sensitivity of the dependent question to changes in the independent questions (or, to use the language more commonly associated with regression, the independent variables and the dependent variable). For example, if a regression model reveals that the effect of a small change in price on sales is twice the effect of a similarly small change in advertising, then the importance of price will be computed as twice that of advertising.

In Q, importance is defined as the absolute value of the coefficients normalized to add to 1.0. More formally, linear regression estimates the coefficient [math]\displaystyle{ \beta_j }[/math] for the [math]\displaystyle{ j }[/math]th of [math]\displaystyle{ J }[/math] independent variables, where [math]\displaystyle{ x_{ij} }[/math] is the [math]\displaystyle{ i }[/math]th of [math]\displaystyle{ I }[/math] observations on the [math]\displaystyle{ j }[/math]th variable, [math]\displaystyle{ y_i }[/math] is the [math]\displaystyle{ i }[/math]th observed value of the dependent variable, [math]\displaystyle{ \epsilon_i }[/math] is an error term, and the model is of the form

[math]\displaystyle{ y_i = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_J x_{iJ} + \epsilon_i = \alpha + \sum_j^J \beta_{j} x_{ij}+ \epsilon_i }[/math], and the importance of the [math]\displaystyle{ j }[/math]th variable, [math]\displaystyle{ P_j }[/math] is defined as [math]\displaystyle{ P_j := |\beta_j| / \sum^J_{j'}|\beta_{j'}| }[/math].

For example, if two coefficients are estimated as 7 and -3, their importances are, respectively, [math]\displaystyle{ |7|/(|7| + |-3|)=0.7 }[/math] and [math]\displaystyle{ |-3|/(|7| + |-3|)=0.3 }[/math].
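
For illustration, here is a minimal Python sketch of this normalization (not Q's implementation), using the coefficients from the example above:

```python
import numpy as np

# Coefficients from the worked example above (7 and -3).
coefficients = np.array([7.0, -3.0])

# Importance: absolute values normalized to add to 1 (equivalently, 100%).
importance = np.abs(coefficients) / np.abs(coefficients).sum()
print(importance)  # [0.7 0.3]
```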

This method is referred to as importance when analyzing in Q using Regression.

The statistical significance of scores is determined by t-tests as described in Statistical testing. The standard errors of the linear regression coefficients used in the t-tests are obtained using the following formula:

[math]\displaystyle{ SE_{\beta_j}=\sqrt{\frac{\sum^I_i\epsilon_i^2}{I-J-1}(X^TX)^{-1}_{jj}} }[/math] where [math]\displaystyle{ X }[/math] is the regression design matrix of dimensions [math]\displaystyle{ I\times(J+1) }[/math] whose entries are [math]\displaystyle{ X_{ij}=x_{ij} }[/math] for [math]\displaystyle{ j=1,\ldots,J }[/math] and [math]\displaystyle{ X_{ij}=1 }[/math] for [math]\displaystyle{ j=J+1 }[/math].
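
A minimal Python sketch of this standard-error calculation on simulated data follows; the coefficient values and sample size are purely illustrative and this is not Q's implementation:

```python
import numpy as np

# Simulated data; the coefficient values and sample size are illustrative only.
rng = np.random.default_rng(0)
I, J = 100, 3
x = rng.normal(size=(I, J))
y = x @ np.array([7.0, -3.0, 1.0]) + rng.normal(size=I)

# Design matrix with a column of ones for the intercept, placed last as in the formula.
X = np.column_stack([x, np.ones(I)])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

# SE_{beta_j} = sqrt( sum(eps_i^2) / (I - J - 1) * [(X'X)^{-1}]_{jj} )
sigma2 = (residuals ** 2).sum() / (I - J - 1)
standard_errors = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
print(standard_errors[:J])  # standard errors of the J slope coefficients
```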

Contribution

Contribution is a calculation of the extent to which an independent question explains variation in the dependent question in the data. Whereas the regression coefficient identifies differences in sensitivity, contribution is also influenced by the extent of variation of the independent variables. For example, the coefficient of price may be twice that of advertising, but the contribution of advertising can be greater than that of price if the data contains a lot of variation in advertising but little in price (i.e., if the data shows little difference in price, then price cannot have had a large contribution to the dependent variable, sales).

Driver analysis based on coefficients can be thought of as a measure of the potential of variables, whereas contribution is more of a measure of the historical impact of the variables.

Where [math]\displaystyle{ \beta_j }[/math] is the coefficient of the [math]\displaystyle{ j }[/math]th of [math]\displaystyle{ J }[/math] independent variables, and where [math]\displaystyle{ x_{ij} }[/math] is the [math]\displaystyle{ i }[/math]th of [math]\displaystyle{ I }[/math] observations on the [math]\displaystyle{ j }[/math]th variable, the contribution of the [math]\displaystyle{ j }[/math]th variable, [math]\displaystyle{ C_j }[/math], is defined as:

[math]\displaystyle{ C_j = \sum_i^I (\beta_{j} x_{ij} )^2 /\sum_{j'}^J\sum_i^I(\beta_{j'} x_{ij'} )^2 }[/math]

For example, with two coefficients estimated as 7 and -3, the respective independent variables and resulting calculations of contribution are shown in the table below.

[math]\displaystyle{ x_1 }[/math] | [math]\displaystyle{ x_2 }[/math] | [math]\displaystyle{ (\beta_1 x_1)^2 }[/math] | [math]\displaystyle{ (\beta_2 x_2)^2 }[/math]
1 | 1 | [math]\displaystyle{ (7 \times 1)^2 = 49 }[/math] | 9
2 | 1 | 196 | 9
1 | 2 | 49 | 36
2 | 2 | 196 | 36
Total | | 490 | 90
Contribution | | [math]\displaystyle{ 490/(490+90) = 0.84 }[/math] | 0.16
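
The same calculation can be sketched in Python; this simply reproduces the table above and is not Q's implementation:

```python
import numpy as np

# Coefficients and observations from the table above.
beta = np.array([7.0, -3.0])
x = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [1.0, 2.0],
              [2.0, 2.0]])

# C_j = sum_i (beta_j x_ij)^2 / sum_{j'} sum_i (beta_{j'} x_ij')^2
squared_terms = (x * beta) ** 2              # element-wise (beta_j * x_ij)^2
column_totals = squared_terms.sum(axis=0)    # [490., 90.]
contribution = column_totals / column_totals.sum()
print(contribution)                          # approximately [0.84, 0.16]
```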

The contribution scores are scale-independent (i.e., transforming the independent variables by modifying their variance will not change the computed contribution scores).

The statistical significance of scores is determined by t-tests as described in Statistical testing. The standard errors of the contribution scores used in the t-tests are estimated using an analytical formula derived from the expression above for [math]\displaystyle{ C_j }[/math].

Beta

Beta here refers to the Beta scores provided in the Regression Outputs and should not be confused with the regression coefficients which are shown as [math]\displaystyle{ \beta }[/math] and are sometimes referred to in economics and other disciplines as "beta" scores.

Betas are standardized coefficients: they are computed as the regression coefficients after the dependent variable and independent variables have first been standardized to have a standard deviation of 1.0. The justification for doing this is to deal with situations where the variables have different or incomparable scales.

Additionally, when running Regression - Driver (Importance) Analysis - Beta, the scores are normalized so that their absolute values add up to 100% (as described above for the regression coefficients).
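
A minimal Python sketch of the Beta calculation on simulated data with deliberately different scales follows; the values are illustrative and this is not Q's implementation:

```python
import numpy as np

# Simulated predictors on deliberately different scales; all values are illustrative.
rng = np.random.default_rng(0)
I = 200
x = rng.normal(size=(I, 3)) * np.array([1.0, 5.0, 0.5])
y = x @ np.array([2.0, 0.3, -4.0]) + rng.normal(size=I)

# Standardize the dependent and independent variables to standard deviation 1.
x_std = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
y_std = (y - y.mean()) / y.std(ddof=1)

# Fit the regression on the standardized data; the slopes are the Beta scores.
X = np.column_stack([x_std, np.ones(I)])
coef, *_ = np.linalg.lstsq(X, y_std, rcond=None)
beta_scores = coef[:-1]                                  # drop the intercept

# Normalize absolute values so the importances add up to 100%.
importance = np.abs(beta_scores) / np.abs(beta_scores).sum()
print(beta_scores, importance)
```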

The statistical significance of scores is determined by t-tests as described in Statistical testing. The standard errors of the raw Beta scores used in the t-tests are those for the coefficients from a regression where the input variables have been standardized as described above.

Shapley

Shapley is a name commonly used in customer satisfaction and customer value analysis for describing a technique which has been reinvented multiple times.[1][2][3] Proponents of this method generally claim it is superior to the other methods because it better addresses the consequences of independent variables being correlated.

The method is best explained by example. Consider a simple driver analysis where the dependent variable measures preference and there are two independent variables, one measuring 'a good price' (PRICE) and the other measuring 'good quality' (QUALITY). It is possible to form three different regression models with this data:

  1. Predicting preference using only PRICE (let us say that the R-square for this is 0.64).
  2. Predicting preference using only QUALITY (let us say that the R-square for this is 0.16).
  3. Predicting preference using both PRICE and QUALITY (let us say that the R-square for this is 0.65).

With these three models we can compute two separate effects on R-square for PRICE:

  1. We can note that on its own, PRICE accounts for 0.64 / 0.65 = 98.5% of the explainable variance in preference.
  2. However, if QUALITY is already in the model, PRICE only accounts for the increment from 0.16 to 0.65, which is thus (0.65 - 0.16)/0.65 = 75.4%.

Which of these two is correct? If we assume that we have no way of knowing, we can take the average of the two and compute the importance of PRICE as 86.9%. By the same logic, the importance of QUALITY is 13.1%.

Where there are more independent variables the maths is the same, but the average must be computed across more orderings (e.g., with 10 independent variables, R-squares must be computed for 1024 regressions). For this reason, each additional variable slows down the computation of the Shapley values. Where there are more than 15 independent variables, it is suggested to use Relative Importance Analysis, which runs in a reasonable length of time and yields highly similar results, whereas Shapley could take anywhere from a few minutes to a few hours. In these cases the user is prompted to choose whether to conduct a Relative Importance Analysis instead, unless more than 27 independent variables are requested. Due to limitations of the algorithm, Shapley Importance cannot be computed with more than 27 independent variables; in that case the analysis is automatically converted into a Relative Importance Analysis.
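
The following Python sketch illustrates the averaging-over-orderings calculation; the function names are illustrative, this is not Q's implementation, and it brute-forces all subsets, so it is only practical for small numbers of variables:

```python
import numpy as np
from itertools import combinations
from math import factorial

def r2(x, y):
    """R-squared from an OLS regression of y on the columns of x (with an intercept)."""
    X = np.column_stack([x, np.ones(len(y))])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1 - ((y - X @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()

def shapley_importance(x, y):
    """Average each variable's incremental R-squared over all orderings of entry."""
    J = x.shape[1]
    values = np.zeros(J)
    for j in range(J):
        others = [k for k in range(J) if k != j]
        for size in range(J):
            # Weight of a subset of this size when averaging over all J! orderings.
            weight = factorial(size) * factorial(J - size - 1) / factorial(J)
            for subset in combinations(others, size):
                base = r2(x[:, list(subset)], y) if subset else 0.0
                values[j] += weight * (r2(x[:, list(subset) + [j]], y) - base)
    return values / values.sum()   # normalize so the scores add to 100%
```

Applied to data reproducing the R-squares in the PRICE/QUALITY example above, this returns approximately 86.9% and 13.1%.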

The statistical significance of scores is determined by t-tests as described in Statistical testing. The standard errors for the Shapley values used in the t-tests are obtained from calibrating standard errors calculated for the Relative Importance Analysis measure, as the two measures are highly similar. This is an approximate method that works in practice and is much faster than other alternatives such as bootstrapping.

Kruskal

In the description of Shapley driver analysis above, the analysis looked at the incremental improvement in R-square when variables were added into the model. This incremental improvement is also known as the squared semi-partial correlations. A related statistic is the squared partial correlation and Kruskal's method is identical to 'Shapley', except that it uses the squared partial correlation instead of the squared semi-partial correlation (see [4]). To the best of our knowledge, there is no intuitively coherent way of describing the difference between Kruskal and Shapley (i.e., the difference only seems to be able to be described with maths).
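
To make the difference concrete, the sketch below mirrors the Shapley sketch above but averages squared partial correlations instead of squared semi-partial correlations; again, this is illustrative only and not Q's implementation:

```python
import numpy as np
from itertools import combinations
from math import factorial

def r2(x, y):
    """R-squared from an OLS regression of y on the columns of x (with an intercept)."""
    X = np.column_stack([x, np.ones(len(y))])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1 - ((y - X @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()

def kruskal_importance(x, y):
    """Average each variable's squared partial correlation over all orderings of entry."""
    J = x.shape[1]
    values = np.zeros(J)
    for j in range(J):
        others = [k for k in range(J) if k != j]
        for size in range(J):
            weight = factorial(size) * factorial(J - size - 1) / factorial(J)
            for subset in combinations(others, size):
                base = r2(x[:, list(subset)], y) if subset else 0.0
                increment = r2(x[:, list(subset) + [j]], y) - base
                # Dividing the increment by (1 - base) converts the squared
                # semi-partial correlation into the squared partial correlation;
                # this is the only change relative to the Shapley sketch above.
                values[j] += weight * increment / (1 - base)
    return values / values.sum()
```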

The statistical significance of scores is determined by t-tests as described in Statistical testing. The standard errors for the Kruskal values used in the t-tests are obtained from calibrating standard errors calculated for the Relative Importance Analysis measure, as the two measures are highly similar. This is an approximate method that works in practice and is much faster than other alternatives such as bootstrapping.

The time required to compute Kruskal analysis results increases exponentially with the number of independent variables. As a result, Kruskal analysis may become noticeably slow from 15 variables onwards and may take minutes or even hours. As with Shapley Importance above, in such cases it is suggested to use Relative Importance Analysis, which runs in a reasonable length of time, and the user will be prompted to choose whether to conduct a Relative Importance Analysis instead, unless more than 27 independent variables are requested. For technical reasons, Kruskal is limited to 27 independent variables; if more than 27 independent variables are provided, the Kruskal Importance will automatically be converted into a Relative Importance Analysis.

Relative Importance Analysis

Relative Importance Analysis (also known as Johnson's relative weights) yields scores that are similar to Shapley importance and Kruskal importance, but takes much less time to compute. Relative Importance Analysis works by transforming the set of independent variables to a set of orthogonal variables that are not correlated with each other. It turns out that the squared regression coefficients from the linear regression using the orthogonal variables represent each variable's contribution to the R-square. The Relative Importance score for each independent variable is simply a convex combination (a weighted sum where weights add up to 1) of the squared regression coefficients, with the weights calculated based on the orthogonal variable transformation.[5]
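
A minimal Python sketch of Johnson's relative weights as described above follows; the function name is illustrative and this is not Q's implementation:

```python
import numpy as np

def relative_importance(x, y):
    """A sketch of Johnson's relative weights (illustrative, not Q's implementation)."""
    # Standardize so that everything can be expressed in correlation form.
    zx = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    n = len(y)

    rxx = zx.T @ zx / (n - 1)   # correlations among the predictors
    rxy = zx.T @ zy / (n - 1)   # correlations between predictors and outcome

    # lambda_ relates the original predictors to a set of orthogonal variables.
    evals, evecs = np.linalg.eigh(rxx)
    lambda_ = evecs @ np.diag(np.sqrt(evals)) @ evecs.T

    # Regression of the outcome on the orthogonal variables.
    beta = np.linalg.solve(lambda_, rxy)

    # Each predictor's raw weight is a weighted sum of the squared coefficients;
    # the raw weights sum to the model R-squared.
    raw_weights = (lambda_ ** 2) @ (beta ** 2)
    return raw_weights / raw_weights.sum()   # normalized to add to 100%
```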

The statistical significance of scores is determined by t-tests as described in Statistical testing. The standard errors of the Relative Importance Analysis values used in the t-tests are estimated using an analytical formula derived from the convex combination expression for each value.

Elasticity

In economics, it is commonplace to transform the data by taking the natural logarithm of the dependent and independent variables. The resulting coefficients are then referred to as 'elasticities' and indicate the percentage change in the dependent variable associated with a percentage change in an independent variable. For example, an elasticity of -3 indicates that a 10% increase in prices will result in a 30% decrease in sales. Typically, driver analysis is only conducted using elasticities when the data is behavioural (e.g., sales or purchasing data).
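
A minimal Python sketch of an elasticity (log-log) regression on simulated sales data follows; the variable names and values are illustrative only:

```python
import numpy as np

# Simulated price, advertising and sales data; the true elasticities are -3 and 0.5.
rng = np.random.default_rng(0)
I = 500
price = rng.uniform(1.0, 10.0, size=I)
advertising = rng.uniform(1.0, 100.0, size=I)
sales = 1000 * price ** -3.0 * advertising ** 0.5 * np.exp(rng.normal(0.0, 0.1, size=I))

# Regress log(sales) on log(price) and log(advertising); the slopes are elasticities.
X = np.column_stack([np.log(price), np.log(advertising), np.ones(I)])
coef, *_ = np.linalg.lstsq(X, np.log(sales), rcond=None)
print(coef[:2])  # roughly [-3.0, 0.5]: a 1% price rise cuts sales by about 3%
```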

The statistical significance of scores is determined by t-tests as described in Statistical testing. The standard errors of the elasticities used in the t-tests are those for the coefficients from a regression where the input variables have been log-transformed as described above.

Jaccard Coefficient

When binary predictors and a binary outcome variable are used, the Jaccard Coefficient (or Jaccard index) between each of the predictors and the outcome can be computed. The Jaccard coefficient is a similarity measure of two binary variables, computed as the ratio of the number of cases where both variables are non-zero to the number of cases where at least one is non-zero. For two such binary variables, [math]\displaystyle{ X = \left( x_1, x_2, \ldots, x_n\right) }[/math] and [math]\displaystyle{ Y = (y_1, y_2, \ldots, y_n) }[/math], the formal definition of the Jaccard Coefficient [math]\displaystyle{ J(X,Y) }[/math] is

[math]\displaystyle{ J(X,Y)=\frac{|X\cap Y|}{|X \cup Y|} = \frac{\sum_{i = 1}^n x_i y_i}{\sum_{i = 1}^n \max(x_i, y_i)} }[/math]

This computes the level of agreement or similarity between the binary variables by the ratio of their intersection to their union.
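
A minimal Python sketch of the Jaccard coefficient for two binary variables follows; the example data is illustrative only:

```python
import numpy as np

def jaccard(x, y):
    """Jaccard coefficient between two binary (0/1) variables."""
    x, y = np.asarray(x), np.asarray(y)
    intersection = np.sum(x * y)         # cases where both variables equal 1
    union = np.sum(np.maximum(x, y))     # cases where at least one equals 1
    return intersection / union

# Illustrative binary predictor and outcome.
predictor = np.array([1, 0, 1, 1, 0, 1])
outcome = np.array([1, 0, 0, 1, 1, 1])
print(jaccard(predictor, outcome))  # 3 / 5 = 0.6
```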

The statistical significance of these Jaccard coefficients is determined by a t-test. The standard errors of the Jaccard coefficients used in the t-tests are estimated using a ratio estimator, with a Taylor series expansion adjustment where necessary. The computed test statistics are used to determine the relative importance of the predictor variables.

Correlation

In the context of linear regression, the correlation between the predictors and the outcome variable can be used to measure the importance of the predictors via the bivariate Pearson product-moment correlations. The statistical significance is determined via Taylor series linearization in the same manner as Correlation - Correlation Matrix.
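
A minimal Python sketch of correlation-based importance scores follows; the data is simulated and this is not Q's implementation:

```python
import numpy as np

# Pearson correlation between each predictor and the outcome; data is illustrative.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
y = x @ np.array([2.0, 0.5, -1.0]) + rng.normal(size=100)

correlations = np.array([np.corrcoef(x[:, j], y)[0, 1] for j in range(x.shape[1])])
print(correlations)
```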

Choosing a driver analysis method

The various driver methods available in Q have been created due to frequent requests by clients. It is unclear if a specific method is better than the others, or even if any of the methods are valid. However, of the methods available in Q, Shapley seems to be more often described as being "state-of-the-art" (e.g., [6] ).

Dealing with multiple brands

Where there is a need to conduct driver analysis across multiple brands (e.g., if preference is measured for each of 10 brands), this can be achieved by first Stacking the data file. In situations where the measure of preference reflects the relative appeal of different brands (e.g., a choice of favorite brand), the most practical solution is likely to be to estimate an Experiment and manually compute importance scores from the coefficients.

Categorical variables

When categorical variables are selected as input variables in driver analysis, linked numeric variables are created using the raw numeric data from the categorical variables, which are then used in the analysis. It is essential that the category values are such that the resulting numeric variables make sense. For example, the category values for a Pick One question with age categories 18 to 24, 25 to 29, 30 to 34, 35 to 39 would usually only make sense if they were monotonic, e.g., 1, 2, 3, 4.

Note that the treatment of categorical variables is different from that in Regression - Generalized Linear Model, which runs the analysis on dummy variables from the categorical predictors instead of converting them to numeric.

Score signs

In Q 4.9.1 or later, signs (positive and negative) are applied to driver analysis scores to match the signs of the corresponding linear regression coefficients from the model including all of the independent variables. Thus the scores present the direction of the influence of each independent variable, in addition to its magnitude. The magnitudes of the driver analysis scores (except for Elasticity) are normalized to sum to 100%. Statistical tests are conducted on the signed raw scores, and the values of the test statistics may differ from those in previous versions, resulting in different test results.

See also Why Do Shapley and Kruskal Driver Analysis Have Negative Scores?.

Statistical testing

The significance of each cell is determined by conducting a t-test of whether the raw score in the cell is statistically different from zero. The t-statistic is the signed raw score divided by the standard error of the raw score, where the standard error is the standard deviation of the raw score estimate. When a Pick One, Pick Any or Date question is selected in the brown drop-down menu to form a crosstab, a t-test is instead conducted to compare the raw score in the cell with the raw score obtained from the complement of the category, as explained in Testing the Complement of a Cell.

When we conduct the t-tests, we assume that the raw scores are normally distributed. In reality, the raw scores for Contribution, Shapley, Kruskal and Relative Importance Analysis are not normally distributed, since they cannot be less than zero. In practice, however, the estimated standard errors are often sufficiently large that the t-tests frequently find the raw scores not significantly different from zero.

Confidence intervals for each raw score may be displayed by selecting the Lower Confidence Bound and Upper Confidence Bound cell statistics. The intervals are computed using the standard errors of the raw scores with the assumption that the raw scores are normally distributed. The intervals are constrained to be between 0 and 1 when the raw scores are bound by these constraints (e.g. Shapley).
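
A minimal Python sketch of the t-test and clipped confidence bounds described above, for a single raw score, follows; all numbers are illustrative and this is not Q's implementation:

```python
from scipy import stats

# Illustrative numbers only: one raw score (e.g. a Shapley value) and its standard error.
raw_score = 0.25
standard_error = 0.08
degrees_of_freedom = 96   # e.g. sample size minus number of estimated parameters

# t-test of whether the raw score differs from zero.
t_statistic = raw_score / standard_error
p_value = 2 * stats.t.sf(abs(t_statistic), degrees_of_freedom)

# 95% confidence bounds assuming normality, clipped to [0, 1] for bounded scores.
z = stats.norm.ppf(0.975)
lower = max(0.0, raw_score - z * standard_error)
upper = min(1.0, raw_score + z * standard_error)
print(t_statistic, p_value, (lower, upper))
```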

Also known as

  • Importance analysis.
  • Key driver analysis.
  • Preference regression.

See also

QScripts:

References


Further reading: Key Driver Analysis Software

  1. Grömping, Ulrike (2007). Estimators of Relative Importance in Linear Regression Based on Variance Decomposition. The American Statistician, 61, 139–147.
  2. Conklin, M., Powaga, K., & Lipovetsky, S. (2004). Customer satisfaction analysis: Identification of key drivers. European Journal of Operational Research, 154(3), 819–827.
  3. Lindeman, R. H., Merenda, P. F., & Gold, R. Z. (1980). Introduction to bivariate and multivariate analysis. Glenview, IL: Scott, Foresman and Company.
  4. Kruskal, W. (1987a), “Relative Importance by Averaging over Orderings,” The American Statistician, 41, 6–10.
  5. Tonidandel, S., LeBreton, J.M., Johnson, J.W. (2009), Determining the Statistical Significance of Relative Weights, Psychological Methods, Vol. 14, No. 4, 387-399.
  6. Grömping, Ulrike (2009). Variable Importance Assessment in Regression: Linear Regression versus Random Forest. The American Statistician, November 2009, Vol. 63, No. 4.