Create New Variables - Automatically Combine Categories - By Pattern (CHAID) - Any Categories

This tool automatically combines categories in one variable based on how similar their distributions are when compared to another variable. It is used when you have a variable with a large number of categories and you want to combine those categories by considering the patterns present when compared to a second variable. For instance, you may have a variable that contains categories for the occupation of respondents in a survey, and you may want to group together occupations that have similar income distributions or age distributions.

The CHAID algorithm is used to obtain the solution. This tool does not consider whether the categories are ordered, and so will combine any categories which are similar at the specified significance level(s).

Example

Consider the following table which shows categories of Kickstarter projects in the columns, and whether the projects were successful, failed, cancelled, suspended, or live in the rows:

[Table: Kickstarter project categories (columns) by project outcome (rows)]

We may want to create new categories by combining those Kickstarter categories which have the most similar pattern of success and failure. Applying the Automatically Combine Categories - By Pattern (CHAID) feature (along with some changes to the settings to encourage the algorithm to be more aggressive in how many merges it does) results in the combined categories shown in the new table:

[Table: combined Kickstarter categories (columns) by project outcome (rows)]

We see that the Music, Comics, Theatre, and Dance categories have been combined because they have the highest relative success rates, while Food, Crafts, Journalism, Fashion, and Technology have been combined because they have the highest relative failure rates. The remaining combined categories show patterns in between.

Usage

In Q:

  1. In the Variables and Questions tab, select the variable whose categories you wish to combine, and the variable it should be compared against. The variables should be from Pick One or Pick One - Multi questions.
  2. Select Automate > Browse Online Library > Automatically Combine Categories > By Pattern (CHAID) > Any Categories.
  3. To change how the categories are combined:
    1. Select the new variable or question in the Variables and Questions tab.
    2. Right-click and select Edit R Variable.
    3. Choose the desired options in the Inputs section on the right.
    4. Click Update R Variable.

In Displayr:

  1. Under the Data Sets tab, select the variable whose categories you wish to combine, and the variable it should be compared against. The variables should be from Nominal / Ordinal, or Nominal / Ordinal - Multi variable sets.
  2. Click the + symbol to the right of the selected variables.
  3. Select Ready-Made New Variables > Automatically Combine Categories > By Pattern (CHAID) > Any Categories.
  4. Choose options for the new variable in the Inputs section on the right of the screen.

Options

Variable: This is the variable whose categories you wish to combine.

Combine by: Choose the approach you wish to use for combining categories. If you do not wish to use CHAID, you can change this to By Value if combining numeric data, or By Geography if your data contains geographic locations (zip codes, states, cities, etc.).

Based on: This is the variable that you want to compare with the Variable above. The categories of the variable selected in Variable will be combined based on the similarity of their distributions across this Based on variable.

Weight: Select a weight variable here if you wish to apply the weighted version of CHAID. This will combine categories based on the weighted distributions.

CHAID ALGORITHM SETTINGS

Combine: This option determines which pairs of categories are permitted to combine. The options are:

Any categories: It is permissible for each category to combine with any other category.
Adjacent categories: It is only permissible for each category to combine with adjacent categories, unless one or more categories are specified in the Unordered categories control. In that case, the categories specified in that control are permitted to combine with any other category and are not restricted to adjacent categories.
Adjacent categories unless missing value code: The same behaviour as Adjacent categories, except that any categories which are coded with a value of NaN in the Value Attributes, or set to Include in percentages (but not averages) in DATA VALUES > Values, will always be considered as unordered.
Using variable set structure: The permissible combinations are determined by the Variable type (or Variable Set Structure) of the input variable. If the input variable is Categorical (Nominal) then Any categories are permitted to combine. If the input variable is Ordered Categorical (Ordinal) then only Adjacent categories are permitted to combine.

Unordered categories: This control only appears if the Combine option is Adjacent categories or Adjacent categories unless missing value code. It lets you specify particular categories that should be treated as unordered and allowed to combine with any other category. Categories not entered here can only combine with adjacent ones. This is appropriate if the input variable contains ordinal values on a scale but some options are not ordered. For example, if the categories are 'Strongly agree', 'Agree', 'Neutral', 'Disagree', 'Strongly disagree', "Don't know" and 'I refuse to answer', then "Don't know" and 'I refuse to answer' can be entered here and thereby permitted to combine with any of the other categories. The two category labels would need to be typed into this control, separated with a ';' or ','. Note that if Combine is set to Adjacent categories unless missing value code, then any category which is coded with a value of NaN in the Value Attributes, or set to Include in percentages (but not averages) in DATA VALUES > Values, will always be considered as unordered.

Use Exhaustive CHAID: This controls whether the Exhaustive CHAID algorithm will be used. Exhaustive CHAID takes longer than standard CHAID because it searches a larger set of category combinations, but it tends to produce a better result. The default value is Usually, which means that Exhaustive CHAID will be used unless your Variable has so many categories that the exhaustive algorithm is likely to be very slow. If you do have a large number of categories and exhaustive CHAID is not applied, you will receive a message in the top right of your screen. In this case you can ensure the exhaustive algorithm is applied by changing this setting to Yes.

Minimum category size: The CHAID algorithm will not produce new categories which have fewer than this many cases. It will always ensure smaller categories are combined with their most similar category, regardless of the statistical significance of that particular combination.

Alpha level to combine categories: This is the significance level for combining categories. Each potential pair of categories to be combined is associated with a p-value, and two categories will not be combined if their p-value is lower than this level. This setting is not used in the exhaustive CHAID algorithm (so it will only have an effect if you change Use Exhaustive CHAID to No).

Alpha level to validate final combined categories: This is the significance level used to assess the final CHAID solution. If the p-value for the final solution is larger than this value, all of the categories will be combined into a single category because there is insufficient variation between the categories at this level. If you obtain a single category from this feature, you should consider using a different selection in the Based on menu which has a greater level of variation with the main Variable, or you can increase the value of this setting.

Multiple Comparison adjustment: This setting determines whether or not a Bonferroni correction is made when evaluating the final combined category solution. That is, it affects the p-value that is checked against the Alpha level to validate final combined categories. This correction will tend to be more conservative when using the exhaustive CHAID algorithm, as it conducts a much greater number of statistical tests.

Technical details

CHAID stands for Chi-square automatic interaction detection. It is an algorithm which has traditionally been used to create decision trees with multi-way splits of categorical data. It employs repeated application of Chi-squared tests to evaluate how similar pairs of categories are when compared to a second variable. See Kass, G. V. (1980)[1] and Biggs, D., Ville, B., and Suen, E. (1991)[2] for more details.
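
As a rough illustration of the pairwise comparison described above, the following sketch computes the Pearson Chi-squared statistic for each pair of categories and reports the most similar pair. It is not the code used by this tool: the function names, the input layout, and the counts are all hypothetical. It also ranks pairs by the raw statistic rather than by p-value, which gives the same ordering only when every pair has the same degrees of freedom, and it assumes every category has at least one case.

```javascript
// Pearson Chi-squared statistic for the 2 x k table formed by two categories.
// rowA and rowB hold the counts of each "Based on" category for the two
// categories being compared (hypothetical input layout, non-empty rows assumed).
function pairChiSquared(rowA, rowB) {
    const totalA = rowA.reduce((sum, x) => sum + x, 0);
    const totalB = rowB.reduce((sum, x) => sum + x, 0);
    const total = totalA + totalB;
    let statistic = 0;
    for (let j = 0; j < rowA.length; j++) {
        const columnTotal = rowA[j] + rowB[j];
        if (columnTotal === 0) continue; // an empty column contributes nothing
        const expectedA = totalA * columnTotal / total;
        const expectedB = totalB * columnTotal / total;
        statistic += Math.pow(rowA[j] - expectedA, 2) / expectedA;
        statistic += Math.pow(rowB[j] - expectedB, 2) / expectedB;
    }
    return statistic;
}

// Scan every pair of categories and keep the pair with the smallest statistic.
function mostSimilarPair(counts) {
    const labels = Object.keys(counts);
    let best = null;
    for (let a = 0; a < labels.length; a++) {
        for (let b = a + 1; b < labels.length; b++) {
            const statistic = pairChiSquared(counts[labels[a]], counts[labels[b]]);
            if (best === null || statistic < best.statistic) {
                best = { pair: [labels[a], labels[b]], statistic: statistic };
            }
        }
    }
    return best;
}

// Made-up counts of (successful, failed) projects, for illustration only.
const exampleCounts = {
    Music: [60, 40],
    Dance: [62, 38],
    Food: [25, 75]
};
const closest = mostSimilarPair(exampleCounts);
console.log(closest.pair, closest.statistic);
```

With these made-up counts, Music and Dance have nearly identical success rates, so they form the pair that would be considered first for merging.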

The standard CHAID algorithm uses a fixed level of significance to determine if a merge should be conducted, and whether or not to stop merging categories.

The exhaustive CHAID algorithm generates a set of potential solutions by always merging the two least significantly different categories until only two categories remain. It then chooses from all of those solutions by identifying the solution with the smallest p-value.
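
The merge sequence of the exhaustive algorithm can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's implementation: the names and counts are hypothetical, any two categories may merge, and the pair with the smallest Chi-squared statistic is treated as the least significantly different pair (equivalent to the largest p-value only when all pairs share the same degrees of freedom). The final step of choosing the state with the smallest p-value is omitted because it needs a Chi-squared distribution function.

```javascript
// Pearson Chi-squared statistic for the 2 x k table formed by two groups
// (non-empty groups assumed).
function chiSquared2xK(rowA, rowB) {
    const totalA = rowA.reduce((sum, x) => sum + x, 0);
    const totalB = rowB.reduce((sum, x) => sum + x, 0);
    let statistic = 0;
    for (let j = 0; j < rowA.length; j++) {
        const columnTotal = rowA[j] + rowB[j];
        if (columnTotal === 0) continue;
        const expectedA = totalA * columnTotal / (totalA + totalB);
        const expectedB = totalB * columnTotal / (totalA + totalB);
        statistic += Math.pow(rowA[j] - expectedA, 2) / expectedA;
        statistic += Math.pow(rowB[j] - expectedB, 2) / expectedB;
    }
    return statistic;
}

// Repeatedly merge the most similar pair until two groups remain, recording
// the grouping after every merge (the initial grouping is recorded first).
function mergeSequence(counts) {
    let state = Object.keys(counts).map(function (label) {
        return { labels: [label], row: counts[label].slice() };
    });
    const states = [state.map(group => group.labels.slice())];
    while (state.length > 2) {
        let best = null;
        for (let a = 0; a < state.length; a++) {
            for (let b = a + 1; b < state.length; b++) {
                const statistic = chiSquared2xK(state[a].row, state[b].row);
                if (best === null || statistic < best.statistic) {
                    best = { a: a, b: b, statistic: statistic };
                }
            }
        }
        const first = state[best.a];
        const second = state[best.b];
        const merged = {
            labels: first.labels.concat(second.labels),
            row: first.row.map((x, j) => x + second.row[j])
        };
        state = state
            .filter((group, index) => index !== best.a && index !== best.b)
            .concat([merged]);
        states.push(state.map(group => group.labels.slice()));
    }
    return states;
}

// Made-up counts of (successful, failed) projects, for illustration only.
const states = mergeSequence({
    Music: [60, 40],
    Dance: [62, 38],
    Food: [25, 75],
    Crafts: [24, 76]
});
console.log(JSON.stringify(states[states.length - 1]));
```

Each entry of `states` is one candidate solution; a full implementation would compute a p-value for every entry and keep the one with the smallest value.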

When weights are used, the second-order survey-weight-adjusted test of independence of Rao and Scott (1984)[3] is used instead of the standard Pearson Chi-squared test.


Bonferroni adjustments

If the Multiple comparison adjustment option is selected, the test used to assess the significance of the final combined categories from the CHAID algorithm is adjusted by the number of pairwise tests conducted while combining categories. This is a Bonferroni-type adjustment that is computed differently for the standard and exhaustive CHAID algorithms. The standard algorithm terminates when no pairwise test has a p-value above the significance level, whereas the exhaustive algorithm keeps combining categories until only two remain. From the set of states generated by the exhaustive algorithm, the state with the smallest p-value is considered the optimal configuration and becomes the final combined category solution.

The sections below give details of the Bonferroni adjustments used for both the standard CHAID algorithm and the exhaustive CHAID algorithm. Each section gives the adjustment for the Combine option allowing Any categories to combine and for the option allowing only Adjacent categories to combine. The latter also has a more refined adjustment for when some Unordered categories are specified under the Adjacent categories option in Combine. When some categories are allowed to combine with any other category under the Adjacent categories option, more possibilities are explored and therefore a larger Bonferroni adjustment is required. Define the initial number of categories in the variable as [math]\displaystyle{ c }[/math] and the final number of categories in the combined solution as [math]\displaystyle{ r }[/math]. The Bonferroni adjustment is then denoted [math]\displaystyle{ B(c,r) }[/math] for each of the scenarios below (assuming that [math]\displaystyle{ c }[/math] and [math]\displaystyle{ r }[/math] are integers with [math]\displaystyle{ 1 \le r \le c }[/math]).

Standard algorithm

The standard algorithm follows the Bonferroni adjustment approach used in Kass (1980)[1]. Here the adjustment considers the number of possible arrangements when reducing [math]\displaystyle{ c }[/math] categories to [math]\displaystyle{ r }[/math] categories. When all categories are allowed to combine with any other category (Any categories is selected in the Combine control), this number is given by a standard result on set partitions. The Stirling number of the second kind[4] gives the number of ways to partition a set of [math]\displaystyle{ c }[/math] categories into [math]\displaystyle{ r }[/math] non-empty subsets, is written [math]\displaystyle{ \left\{ \begin{smallmatrix} c\\ r \end{smallmatrix} \right\} }[/math], and takes the role of the Bonferroni adjustment value for the case of Any categories. That is,

[math]\displaystyle{ B(c,r) = \left\{ \begin{matrix} c\\ r \end{matrix} \right\} = \frac{1}{r!}\sum_{i = 0}^r (-1)^i\binom{r}{i}(r - i)^c, \qquad \left\{ \begin{matrix} c\\ c \end{matrix} \right\} = 1, \quad \text{ and for } c \ge 1, \quad \left\{ \begin{matrix} c\\ 1 \end{matrix} \right\} = 1. }[/math]
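
The explicit sum above can be evaluated directly. The following is a small sketch, assuming hypothetical helper names and using BigInt so that the alternating sum stays exact for larger numbers of categories; it is an illustration of the formula rather than the code used by this tool.

```javascript
// n! as a BigInt.
function factorial(n) {
    let result = 1n;
    for (let i = 2; i <= n; i++) result *= BigInt(i);
    return result;
}

// Binomial coefficient C(n, k) as a BigInt; zero outside 0 <= k <= n.
function binomial(n, k) {
    if (k < 0 || k > n) return 0n;
    let result = 1n;
    for (let i = 1; i <= k; i++) {
        // After step i the running value equals C(n - k + i, i), so the
        // division is always exact.
        result = result * BigInt(n - k + i) / BigInt(i);
    }
    return result;
}

// Stirling number of the second kind via the explicit alternating sum:
// (1 / r!) * sum_{i=0}^{r} (-1)^i * C(r, i) * (r - i)^c
function stirlingSecondKind(c, r) {
    let sum = 0n;
    for (let i = 0; i <= r; i++) {
        const sign = i % 2 === 0 ? 1n : -1n;
        sum += sign * binomial(r, i) * BigInt(r - i) ** BigInt(c);
    }
    return sum / factorial(r);
}

// Reducing 4 categories to 2 can be done in 7 ways, and 5 to 3 in 25 ways.
console.log(stirlingSecondKind(4, 2)); // 7n
console.log(stirlingSecondKind(5, 3)); // 25n
```

The two boundary cases stated with the formula also hold here: the function returns 1 when r equals c and when r equals 1.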

When only purely adjacent categories are permitted to combine (that is, Adjacent categories is selected in Combine, no Unordered categories are specified, and no missing values have been coded), the adjustment is given by,

[math]\displaystyle{ B(c,r) = \binom{c - 1}{r - 1}. }[/math]

When Unordered categories are specified and an Adjacent categories combine option is selected (and/or missing values are coded in the case of Adjacent categories unless missing value code), the Bonferroni adjustment becomes a combination of the two above. Assuming there are [math]\displaystyle{ u }[/math] unordered categories (including categories coded as missing), with [math]\displaystyle{ 1\le u \lt c }[/math], the Bonferroni adjustment is,

[math]\displaystyle{ B(c, r, u) = \sum_{s = 0}^u \binom{c - u - 1}{r - s - 1}\sum_{i = 0}^{u-s}\binom{u}{i}\left\{ \begin{matrix} u - i\\ s \end{matrix} \right\} (r - s)^i }[/math]
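
To see how the mixed adjustment behaves, the double sum above can be evaluated term by term. The sketch below is an illustration only, with hypothetical function names; it computes the Stirling numbers with the standard recurrence rather than the explicit sum, and skips outer terms whose binomial factor is zero.

```javascript
// Binomial coefficient C(n, k) as a BigInt; zero outside 0 <= k <= n.
function binomial(n, k) {
    if (k < 0 || k > n) return 0n;
    let result = 1n;
    for (let i = 1; i <= k; i++) {
        result = result * BigInt(n - k + i) / BigInt(i);
    }
    return result;
}

// Stirling number of the second kind via the recurrence
// S(m, j) = j * S(m - 1, j) + S(m - 1, j - 1), starting from S(0, 0) = 1.
function stirlingSecondKind(n, k) {
    if (k < 0 || k > n) return 0n;
    let row = [1n];
    for (let m = 1; m <= n; m++) {
        const next = new Array(m + 1).fill(0n);
        for (let j = 1; j <= m; j++) {
            const above = j < row.length ? row[j] : 0n; // S(m - 1, j)
            next[j] = BigInt(j) * above + row[j - 1];
        }
        row = next;
    }
    return row[k];
}

// Purely adjacent case: B(c, r) = C(c - 1, r - 1).
function standardAdjacentAdjustment(c, r) {
    return binomial(c - 1, r - 1);
}

// Mixed case: B(c, r, u) as the double sum over s and i.
function standardMixedAdjustment(c, r, u) {
    let total = 0n;
    for (let s = 0; s <= u; s++) {
        const outerFactor = binomial(c - u - 1, r - s - 1);
        if (outerFactor === 0n) continue; // also guarantees r - s >= 1 below
        let inner = 0n;
        for (let i = 0; i <= u - s; i++) {
            inner += binomial(u, i) * stirlingSecondKind(u - i, s) * BigInt(r - s) ** BigInt(i);
        }
        total += outerFactor * inner;
    }
    return total;
}

// For 5 categories reduced to 3: adjacent only, one unordered, any categories.
console.log(standardAdjacentAdjustment(5, 3)); // 6n
console.log(standardMixedAdjustment(5, 3, 1)); // 12n
console.log(stirlingSecondKind(5, 3));         // 25n
```

In this example the mixed value sits between the purely adjacent value and the Any categories value. Setting u to 0 in the sum reproduces the purely adjacent formula, and for a single unordered category the sum collapses to C(c - 2, r - 2) + r C(c - 2, r - 1).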

Exhaustive algorithm

The exhaustive algorithm follows the Bonferroni adjustment approach used in Biggs, D., Ville, B., and Suen, E. (1991)[2]. Here the adjustment considers the number of tests conducted as the algorithm traverses from the full set of [math]\displaystyle{ c }[/math] categories down to two categories.

When all categories are allowed to combine with any other category (Any categories is selected in the Combine control), the Bonferroni adjustment is,

[math]\displaystyle{ B(c,r) = \sum_{k = 2}^c \binom{k}{2} }[/math]

When only purely adjacent categories are permitted to combine (that is, Adjacent categories is selected in Combine, no Unordered categories are specified, and no missing values have been coded), the adjustment is given by,

[math]\displaystyle{ B(c,r) = \binom{c}{2}. }[/math]
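
The two exhaustive adjustments given so far are simple enough to compute by hand, but a short sketch makes the difference in scale visible. The function names are hypothetical. The last helper assumes the adjustment is applied by multiplying the p-value by B and capping the result at 1, which is the usual Bonferroni convention but is an assumption here rather than something this page spells out.

```javascript
// C(k, 2): the number of distinct pairs among k categories.
function choose2(k) {
    return k * (k - 1) / 2;
}

// Any categories: sum of C(k, 2) for k = 2, ..., c.
function exhaustiveAnyAdjustment(c) {
    let total = 0;
    for (let k = 2; k <= c; k++) total += choose2(k);
    return total;
}

// Purely adjacent categories: C(c, 2).
function exhaustiveAdjacentAdjustment(c) {
    return choose2(c);
}

// Assumed Bonferroni-style use of the adjustment: scale the p-value, cap at 1.
function bonferroniAdjust(pValue, adjustment) {
    return Math.min(1, pValue * adjustment);
}

// With 5 starting categories the adjustments are 20 and 10 respectively.
console.log(exhaustiveAnyAdjustment(5));      // 20
console.log(exhaustiveAdjacentAdjustment(5)); // 10
console.log(bonferroniAdjust(0.2, exhaustiveAnyAdjustment(5))); // 1
```

Under that assumption, the adjusted p-value is what would be compared with the Alpha level to validate final combined categories.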

When Unordered categories are specified and an Adjacent categories combine option is selected (and/or missing values are coded in the case of Adjacent categories unless missing value code), the Bonferroni adjustment becomes a combination of the two above. Assuming there are [math]\displaystyle{ u }[/math] unordered categories (including categories coded as missing), with [math]\displaystyle{ 1\le u \lt c }[/math], the Bonferroni adjustment is,

[math]\displaystyle{ B(c, r, u) = \binom{c - u}{2} + \sum_{i = 0}^{u - 1}\frac{c - i}{2} \left( 2 c - u - 1 - i\right). }[/math]


Differences to SPSS CHAID

When the exhaustive CHAID algorithm evaluates very small p-values, the SPSS algorithm can in some cases stop searching for solutions earlier than the one available here. As a result, the algorithm used here will tend to find solutions that are more significant than those produced in SPSS, and will therefore combine more categories. This situation tends to arise when the relationship between the two variables is already highly significant before the algorithm begins.

In some cases, the exhaustive CHAID algorithm can encounter two possible category merges which have equal p-values, which we refer to as a tie. This algorithm will attempt to break the tie by re-examining these merges within the larger set of categories at that stage of the algorithm (i.e. given the current set of merges that have happened so far). SPSS has not documented the mechanism that its algorithm uses to break ties. Such ties are rare in practice as they require identical test statistics.


References

  1. Kass, G. V. (1980). An Exploratory Technique for Investigating Large Quantities of Categorical Data. Applied Statistics, 29, 2, 119-127. doi: https://doi.org/10.2307/2986296
  2. Biggs, D., Ville, B., and Suen, E. (1991). A Method of Choosing Multiway Partitions for Classification and Decision Trees. Journal of Applied Statistics, 18, 1, 49-62. doi: https://doi.org/10.1080/02664769100000005
  3. Rao, J. N. K. and Scott, A. J. (1984). On Chi-Squared Tests for Multiway Contingency Tables with Cell Proportions Estimated from Survey Data. The Annals of Statistics, 12, 1, 46-60. doi: https://doi.org/10.1214/aos/1176346391
  4. Stirling numbers of the second kind (2022). Retrieved June 9, 2022, from https://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind

How to apply this QScript

  • Start typing the name of the QScript into the Search features and data box in the top right of the Q window.
  • Click on the QScript when it appears in the QScripts and Rules section of the search results.

OR

  • Select Automate > Browse Online Library.
  • Select this QScript from the list.

Customizing the QScript

This QScript is written in JavaScript and can be customized by copying and modifying the JavaScript.

Customizing QScripts in Q4.11 and more recent versions

  • Start typing the name of the QScript into the Search features and data box in the top right of the Q window.
  • Hover your mouse over the QScript when it appears in the QScripts and Rules section of the search results.
  • Press Edit a Copy (bottom-left corner of the preview).
  • Modify the JavaScript (see QScripts for more detail on this).
  • Either:
    • Run the QScript, by pressing the blue triangle button.
    • Save the QScript and run it at a later time, using Automate > Run QScript (Macro) from File.

Customizing QScripts in older versions

  • Copy the JavaScript shown on this page.
  • Create a new text file, giving it a file extension of .QScript. See here for more information about how to do this.
  • Modify the JavaScript (see QScripts for more detail on this).
  • Run the file using Automate > Run QScript (Macro) from File.

JavaScript

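// Load the shared helper functions for the Automatically Combine Categories QScripts.
includeWeb('QScript Functions for Automatically Combining Categories');
// Create the new variable(s) using CHAID, allowing any pair of categories to merge.
createAutomaticallyCombinedCategoryVariables('Pattern (CHAID)', options = {allowed_merges: 'Any categories'});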
