# Create New Variables - Automatically Combine Categories - By Pattern (CHAID) - Adjacent Categories

This tool automatically combines categories in one variable based on how similar they are in distribution when compared to another variable. It is used when you have a variable with a large number of categories and you want to combine the categories of that variable by considering the patterns present when compared to a second variable. For instance, you may have a variable that contains categories for the occupation of respondents in a survey, and you may want to group those occupations based on those with similar income distributions or age distributions.

The CHAID algorithm is used to obtain the solution, and it only considers adjacent categories as candidates for combining. The algorithm also allows some categories to be identified as unordered; these are free to combine with any other category.

## Example

Two examples are considered below. The first demonstrates the algorithm with only adjacent categories allowed to combine, and the second has two categories identified as unordered and free to combine with any other category. Both examples use an input variable that measures Family income and use the Education level of the respondent to deduce the pattern for the combined categories in CHAID.

### Example 1 (Purely ordered categories)

Consider the following table which shows categories of Family incomes in the columns, and the Education level obtained in the rows:

We may want to create new categories by combining those Family income categories which have the most similar pattern of Education level. Applying the Automatically Combine Categories - By Pattern (CHAID) - Adjacent Categories feature results in combined categories as shown in the new table:

We see that the Family incomes are combined into two compound categories. The first compound category combines all incomes up to \$50,000; these tend to have a higher proportion of families with school or college level education and a lower proportion with tertiary education. The second compound category combines all incomes of at least \$50,000; these have a higher proportion of tertiary level education at a university.

### Example 2 (Ordered categories with two unordered categories)

This example uses an input variable called Family income with unordered categories. It is very similar to the Family income variable above but has two extra categories, "Don't know" and "I refuse to answer this question", which a respondent may wish to choose if they don't know their income precisely or don't want to communicate it. These two categories do not belong on the income scale, so they can be given the freedom to combine with any other category in the Family income scale.

The Automatically Combine Categories - By Pattern (CHAID) - Adjacent Categories feature can handle this situation. Categories that are identified as unordered are allowed to combine with any other category to form the final compound categories resulting in the new table below:

We see that the resulting combined categories are similar to the ones in Example 1: the solution reduces to two compound categories, with the incomes split at a boundary of \$50,000 based on the higher or lower proportion of individuals that complete tertiary education. The addition here is that the unordered categories "Don't know" and "I refuse to answer this question" are free to combine with any other category. In this case, the respondents that didn't know or refused to respond seem to have a similar pattern to those with a lower proportion of tertiary education and lower incomes.

## Usage

1. In the Variables and Questions tab, select the variable whose categories you wish to be combined, and the variable which should be compared against. The variables should be from Pick One or Pick One - Multi questions.
2. Select Automate > Browse Online Library > Automatically Combine Categories > By Pattern (CHAID) > Adjacent Categories.
3. To change how the categories are combined:
    1. Select the new variable or question in the Variables and Questions tab.
    2. Right-click and select Edit R Variable.
    3. Choose the desired options in the Inputs section on the right.
    4. Click Update R Variable.

OR

1. Under the Data Sets tab, select the variable whose categories you wish to be combined, and the variable which should be compared against. The variables should be from Nominal / Ordinal, or Nominal / Ordinal - Multi variable sets.
2. Click the + symbol to the right of the selected variables.
4. Choose options for the new variable in the Inputs section on the right of the screen.

## Options

Variable This is the variable whose categories you wish to combine.

Combine by Choose the approach you wish to use for combining categories. If you do not wish to use CHAID and want to use an alternative approach, you can change this to By Value if combining numeric data, or By Geography if your data contains geographic locations (zip codes, states, cities, etc).

Based on This is the variable that you want to compare with the first Variable above. The categories of the variable selected in Variable will be combined based on the similarity of their distributions of this Based on variable.

Weight Select a weight variable here if you wish to apply the weighted version of CHAID. This will combine categories based on the weighted distributions.

CHAID ALGORITHM SETTINGS

Combine The option to choose which pairs of categories are permissible to combine. The options are:

Any categories: It is permissible for each category to combine with any other category.
Adjacent categories: It is only permissible for each category to combine with adjacent categories, unless one or more categories are specified in the Unordered categories control. In that case, the categories specified in that control are permitted to combine with any other category and are not restricted to adjacent categories.
Adjacent categories unless missing value code: The same behaviour as Adjacent categories, except that any categories coded with a value of NaN in the Value Attributes (or set to Include in percentages (but not averages) in DATA VALUES > Values) will always be considered as unordered.
Using variable set structure: The permissible combinations are determined by the Variable type (or Variable Set Structure) of the input variable. If the input variable is Categorical (Nominal), then Any categories are permissible to combine. If the input variable is Ordered Categorical (Ordinal), then only Adjacent categories are permissible to combine.

Unordered categories This control only appears if the Combine option is Adjacent categories or Adjacent categories unless missing value code. It lets you specify particular categories that should be considered as not ordered and allowed to combine with any other category. Categories not entered here can only combine with adjacent ones. This is appropriate if the input variable contains ordinal values on a scale but some options are not ordered. For example, if the categories are on a scale with the options 'Strongly agree', 'Agree', 'Neutral', 'Disagree', 'Strongly disagree', "Don't know" and 'I refuse to answer', then "Don't know" and 'I refuse to answer' can be identified as unordered and permitted to combine with any of the other categories. The two category labels would need to be typed into this control and separated with a ';' or ','. Note that if Combine is set to Adjacent categories unless missing value code, then any category which is coded with a value of NaN in the Value Attributes (or set to Include in percentages (but not averages) in DATA VALUES > Values) will always be considered as unordered.

Use Exhaustive CHAID This controls whether the Exhaustive CHAID algorithm will be used. Exhaustive CHAID will take longer than a standard CHAID because it searches a larger set of category combinations, but it tends to produce a better result. The default value is Usually, which means that Exhaustive CHAID will always be used unless your Variable has so many categories that the exhaustive algorithm is likely to be really slow. If you do have a large number of categories and exhaustive CHAID is not applied, you will receive a message in the top right of your screen. In this case you can ensure the exhaustive algorithm is applied by changing this setting to Yes.

Minimum category size The CHAID algorithm will not produce new categories which have fewer than this many cases. It will always ensure smaller categories are combined with their most similar category regardless of the statistical significance of that particular combination.

Alpha level to combine categories This is the significance level for combining categories. Each potential pair of categories to be combined is associated with a p-value, and two categories will not be combined if their p-value is lower than this level. This setting is not used in the exhaustive CHAID algorithm (so it will only have an effect if you change Use Exhaustive CHAID to No).

Alpha level to validate final combined categories This is the significance level used to assess the final CHAID solution. If the p-value for the final solution is larger than this value, all of the categories will be combined into a single category because there is insufficient variation between the categories at this level. If you obtain a single category from this feature, then you should consider using a different selection in the Based on menu which has a greater level of variation with the main Variable, or you can increase the value of this setting.

Multiple Comparison adjustment This setting determines whether or not a Bonferroni correction is made when evaluating the final combined category solution. That is, it affects the p-value used to check against the Alpha level to validate final combined categories. This correction will tend to be more conservative when using the exhaustive CHAID algorithm as it conducts a much greater number of statistical tests.

## Technical details

CHAID stands for Chi-square automatic interaction detection. It is an algorithm which has traditionally been used to create decision trees with multi-way splits of categorical data. It employs repeated application of Chi-squared tests to evaluate how similar pairs of categories are when compared to a second variable. See Kass, G. V. (1980)[1] and Biggs, D., Ville, B., and Suen, E. (1991)[2] for more details.

The standard CHAID algorithm uses a fixed level of significance to determine if a merge should be conducted, and whether or not to stop merging categories.

The exhaustive CHAID algorithm generates a set of potential solutions by always merging the two least significantly different categories until only two categories remain. It then chooses from all of those solutions by identifying the solution with the smallest p-value.
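The merging stage described above can be sketched as follows for the adjacent-categories case. This is only an illustration, not the code used by this feature: the function names `pairChiSquared` and `adjacentMergeSequence` and the counts are made up for the example. It assumes unweighted counts and ranks pairs by their Pearson Chi-squared statistic, which matches ranking by p-value only when every pairwise test has the same degrees of freedom. The final step of choosing the solution with the smallest p-value is not shown.

```javascript
// Pearson Chi-squared statistic for the 2 x k table formed by the counts of
// two categories across the k levels of the "Based on" variable.
function pairChiSquared(countsA, countsB) {
  const totalA = countsA.reduce((sum, x) => sum + x, 0);
  const totalB = countsB.reduce((sum, x) => sum + x, 0);
  if (totalA === 0 || totalB === 0) return 0; // nothing to compare
  const total = totalA + totalB;
  let statistic = 0;
  for (let j = 0; j < countsA.length; j++) {
    const levelTotal = countsA[j] + countsB[j];
    if (levelTotal === 0) continue; // an empty level contributes nothing
    const expectedA = (totalA * levelTotal) / total;
    const expectedB = (totalB * levelTotal) / total;
    statistic += (countsA[j] - expectedA) ** 2 / expectedA;
    statistic += (countsB[j] - expectedB) ** 2 / expectedB;
  }
  return statistic;
}

// Hypothetical sketch: `table` holds one array of counts per category, in
// category order. Returns every intermediate grouping, from all categories
// separate down to two groups, merging only adjacent groups.
function adjacentMergeSequence(table) {
  let groups = table.map((counts, i) => ({ members: [i], counts: counts.slice() }));
  const states = [groups.map((g) => g.members.slice())];
  while (groups.length > 2) {
    // Find the adjacent pair with the smallest statistic (least different).
    let best = 0;
    let bestStatistic = Infinity;
    for (let i = 0; i < groups.length - 1; i++) {
      const statistic = pairChiSquared(groups[i].counts, groups[i + 1].counts);
      if (statistic < bestStatistic) {
        bestStatistic = statistic;
        best = i;
      }
    }
    const left = groups[best];
    const right = groups[best + 1];
    const merged = {
      members: left.members.concat(right.members),
      counts: left.counts.map((x, j) => x + right.counts[j]),
    };
    groups = groups.slice(0, best).concat([merged], groups.slice(best + 2));
    states.push(groups.map((g) => g.members.slice()));
  }
  return states;
}

// Made-up counts: four income categories by two education levels.
const exampleStates = adjacentMergeSequence([[40, 10], [38, 12], [12, 38], [5, 45]]);
console.log(JSON.stringify(exampleStates[exampleStates.length - 1])); // [[0,1],[2,3]]
```

In this made-up example the two lower categories merge first, then the two higher ones, giving a sequence of three candidate solutions.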

When weights are used, the second-order survey-weight-adjusted test of independence of Rao and Scott (1984)[3] is used instead of the standard Pearson Chi-squared test.

If the Multiple comparison adjustment option is selected, then the significance test used to assess the final state of the combined categories from the CHAID algorithm is adjusted by the number of pairwise tests conducted during the combining of each category. This adjustment is a Bonferroni-type adjustment that is computed differently for the standard CHAID algorithm than for the exhaustive CHAID algorithm. The standard algorithm terminates if there are no pairwise tests that are above the significance level, while the exhaustive algorithm keeps combining categories until only two categories remain. From the set of states generated in the exhaustive algorithm, the state with the smallest p-value is considered the optimal configuration and becomes the final combined category solution.

The sections below give details of the Bonferroni adjustments used for the standard CHAID algorithm and the exhaustive CHAID algorithm. Each section gives the adjustment for the Combine option when Any categories may combine and when only Adjacent categories may combine. The latter also includes a more refined adjustment for when some Unordered categories are specified with the Adjacent categories option in Combine. More possibilities are explored, and therefore a larger Bonferroni adjustment is required, when some categories are allowed to combine with any other category in the Adjacent categories option. Define the initial number of categories in the variable as $\displaystyle{ c }$ and the final number of reduced categories from the combined solution as $\displaystyle{ r }$. Then the Bonferroni adjustment is denoted $\displaystyle{ B(c,r) }$ for each of the possible scenarios below (assuming that $\displaystyle{ c }$ and $\displaystyle{ r }$ are integers with $\displaystyle{ 1 \le r \le c }$).

#### Standard algorithm

The standard algorithm follows the Bonferroni adjustment approach used in Kass (1980)[1]. Here the adjustment considers the number of possible arrangements from reducing $\displaystyle{ c }$ categories into $\displaystyle{ r }$ categories. When all categories are allowed to combine with any other category (Any categories selected in the Combine control), this is a problem of counting partitions. In particular, Stirling numbers of the second kind [4] give the number of ways to partition a set of $\displaystyle{ c }$ categories into $\displaystyle{ r }$ non-empty subsets as $\displaystyle{ \left\{ \begin{smallmatrix} c\\ r \end{smallmatrix} \right\} }$, and this takes the role of the Bonferroni adjustment value for the case of Any categories. That is,

$\displaystyle{ B(c,r) = \left\{ \begin{matrix} c\\ r \end{matrix} \right\} = \frac{1}{r!}\sum_{i = 0}^r (-1)^i\binom{r}{i}(r - i)^c, \qquad \left\{ \begin{matrix} c\\ c \end{matrix} \right\} = 1, \quad \text{ and for } c \ge 1, \quad \left\{ \begin{matrix} c\\ 1 \end{matrix} \right\} = 1. }$
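As an illustration, the Stirling number of the second kind can be computed as below. This is a minimal sketch and not necessarily how the feature computes it: the function name `stirling2` is made up for the example, and it uses the standard recurrence $\displaystyle{ S(n,k) = k\,S(n-1,k) + S(n-1,k-1) }$ instead of the closed-form sum, because the recurrence stays in integer arithmetic.

```javascript
// Stirling number of the second kind: the number of ways to partition
// c items into r non-empty subsets, computed with the recurrence
// S(n, k) = k * S(n - 1, k) + S(n - 1, k - 1), with S(0, 0) = 1.
function stirling2(c, r) {
  if (r < 0 || r > c) return 0;
  // table[n][k] holds S(n, k) for 0 <= n <= c and 0 <= k <= r.
  const table = [];
  for (let n = 0; n <= c; n++) {
    table.push(new Array(r + 1).fill(0));
  }
  table[0][0] = 1;
  for (let n = 1; n <= c; n++) {
    for (let k = 1; k <= Math.min(n, r); k++) {
      table[n][k] = k * table[n - 1][k] + table[n - 1][k - 1];
    }
  }
  return table[c][r];
}

console.log(stirling2(5, 2)); // 15
```

For example, reducing five categories to two when any categories may combine gives an adjustment of 15.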

For the case where only purely adjacent categories are permitted to combine (that is, Adjacent categories is selected in Combine, no Unordered categories are specified, and no missing values have been coded), the adjustment is given by

$\displaystyle{ B(c,r) = \binom{c - 1}{r - 1}. }$
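As a worked illustration (the values $\displaystyle{ c = 5 }$ and $\displaystyle{ r = 2 }$ are chosen only for this example), combining five ordered categories into two groups of adjacent categories amounts to choosing one of the four boundaries between neighbouring categories:

$\displaystyle{ B(5,2) = \binom{5 - 1}{2 - 1} = \binom{4}{1} = 4, }$

whereas the number of ways to partition the same five categories into two groups when any categories may combine is $\displaystyle{ \left\{ \begin{matrix} 5\\ 2 \end{matrix} \right\} = 15 }$. The adjacent-only restriction therefore leads to a much smaller adjustment.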

When there are Unordered categories specified and an Adjacent categories combine option is selected (and/or missing values are coded in the case of Adjacent categories unless missing value code), the Bonferroni adjustment becomes a combination of the two above. Assuming there are $\displaystyle{ u }$ unordered categories (including categories coded as missing), with $\displaystyle{ 1\le u \lt c }$, the Bonferroni adjustment is

$\displaystyle{ B(c, r, u) = \sum_{s = 0}^u \binom{c - u - 1}{r - s - 1}\sum_{i = 0}^{u-s}\binom{u}{i}\left\{ \begin{matrix} u - i\\ s \end{matrix} \right\} (r - s)^i }$
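The formula above can be evaluated directly, as in the sketch below. The helper names (`binom`, `stirling2`, `mixedAdjustment`) are made up for this illustration and are not taken from the feature's own code. One way to read the formula, offered here as an interpretation, is that $\displaystyle{ s }$ counts the groups made up only of unordered categories and $\displaystyle{ i }$ counts the unordered categories that join one of the $\displaystyle{ r - s }$ groups of adjacent ordered categories.

```javascript
// Binomial coefficient, returning 0 outside the range 0 <= k <= n.
function binom(n, k) {
  if (k < 0 || k > n) return 0;
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i;
  }
  return Math.round(result);
}

// Stirling number of the second kind via the standard recurrence.
function stirling2(n, k) {
  if (k < 0 || k > n) return 0;
  if (n === 0) return 1; // S(0, 0) = 1
  if (k === 0) return 0;
  return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1);
}

// Direct evaluation of B(c, r, u) for c categories, r final groups and
// u unordered categories (1 <= u < c), following the formula as written.
function mixedAdjustment(c, r, u) {
  let total = 0;
  for (let s = 0; s <= u; s++) {
    let inner = 0;
    for (let i = 0; i <= u - s; i++) {
      inner += binom(u, i) * stirling2(u - i, s) * Math.pow(r - s, i);
    }
    total += binom(c - u - 1, r - s - 1) * inner;
  }
  return total;
}

// Two ordered categories A, B and one unordered category X, reduced to two
// groups: {A,B},{X}; {A,X},{B}; {A},{B,X}.
console.log(mixedAdjustment(3, 2, 1)); // 3
```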

#### Exhaustive algorithm

The exhaustive algorithm follows the Bonferroni adjustment approach used in Biggs, D., Ville, B., and Suen, E. (1991)[2]. Here the adjustment considers the number of tests conducted as the algorithm traverses from the full set of $\displaystyle{ c }$ categories down to two categories.

When all categories are allowed to combine with any other category (Any categories selected in the Combine control), the Bonferroni adjustment is

$\displaystyle{ B(c,r) = \sum_{k = 2}^c \binom{k}{2} }$

For the case where only purely adjacent categories are permitted to combine (that is, Adjacent categories is selected in Combine, no Unordered categories are specified, and no missing values have been coded), the adjustment is given by

$\displaystyle{ B(c,r) = \binom{c}{2}. }$
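To give a sense of scale, the two exhaustive adjustments for Any categories and purely Adjacent categories can be evaluated as in the sketch below. The function names are made up for this illustration and do not come from the feature's own code.

```javascript
// Binomial coefficient, returning 0 outside the range 0 <= k <= n.
function binom(n, k) {
  if (k < 0 || k > n) return 0;
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i;
  }
  return Math.round(result);
}

// Exhaustive CHAID, Any categories: sum of binom(k, 2) for k = 2..c.
function exhaustiveAdjustmentAny(c) {
  let total = 0;
  for (let k = 2; k <= c; k++) {
    total += binom(k, 2);
  }
  return total;
}

// Exhaustive CHAID, purely Adjacent categories: binom(c, 2).
function exhaustiveAdjustmentAdjacent(c) {
  return binom(c, 2);
}

console.log(exhaustiveAdjustmentAny(10)); // 165
console.log(exhaustiveAdjustmentAdjacent(10)); // 45
```

With ten starting categories, allowing any categories to combine gives an adjustment of 165, compared with 45 when only adjacent categories may combine.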

When there are Unordered categories specified and an Adjacent categories combine option is selected (and/or missing values are coded in the case of Adjacent categories unless missing value code), the Bonferroni adjustment becomes a combination of the two above. Assuming there are $\displaystyle{ u }$ unordered categories (including categories coded as missing), with $\displaystyle{ 1\le u \lt c }$, the Bonferroni adjustment is

$\displaystyle{ B(c, r, u) = \binom{c - u}{2} + \sum_{i = 0}^{u - 1}\frac{c - i}{2} \left( 2 c - u - 1 - i\right). }$

### Differences to SPSS CHAID

When the exhaustive CHAID algorithm evaluates very small p-values, the SPSS algorithm can in some cases stop searching for solutions earlier than the one available here. As a result, the algorithm we use here will tend to find solutions that are more significant than those produced in SPSS. The result is that the algorithm used here will combine more categories. This situation tends to arise when there is a very high level of significance between the two variables before the algorithm begins.

In some cases, the exhaustive CHAID algorithm can encounter two possible category merges which have equal p-values, which we refer to as a tie. This algorithm will attempt to break the tie by re-examining these merges within the larger set of categories at that stage of the algorithm (i.e. given the current set of merges that have happened so far). SPSS has not documented the mechanism that its algorithm uses to break ties. Such ties are rare in practice as they require identical test statistics.

## References

1. Kass, G. V. (1980). An Exploratory Technique for Investigating Large Quantities of Categorical Data. Applied Statistics, 29, 2, 119-127. doi: https://doi.org/10.2307/2986296
2. Biggs, D., Ville, B., and Suen, E. (1991). A Method of Choosing Multiway Partitions for Classification and Decision Trees. Journal of Applied Statistics, 18, 1, 49-62. doi: https://doi.org/10.1080/02664769100000005
3. Rao, J. N. K. and A. J. Scott (1984). 'On Chi-Squared Tests for Multiway Contingency Tables with Cell Proportions Estimated from Survey Data.' The Annals of Statistics, 12, 1, 46-60. doi: https://doi.org/10.1214/aos/1176346391
4. Stirling Numbers of the second kind (2022). Retrieved June 9, 2022, from https://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind

## How to apply this QScript

• Start typing the name of the QScript into the Search features and data box in the top right of the Q window.
• Click on the QScript when it appears in the QScripts and Rules section of the search results.

OR

• Select Automate > Browse Online Library.
• Select this QScript from the list.

## Customizing the QScript

This QScript is written in JavaScript and can be customized by copying and modifying the JavaScript.

• Start typing the name of the QScript into the Search features and data box in the top right of the Q window.
• Hover your mouse over the QScript when it appears in the QScripts and Rules section of the search results.
• Press Edit a Copy (bottom-left corner of the preview).
• Modify the JavaScript (see QScripts for more detail on this).
• Either:
• Run the QScript, by pressing the blue triangle button.
• Save the QScript and run it at a later time, using Automate > Run QScript (Macro) from File.

### Customizing QScripts in older versions

includeWeb('QScript Functions for Automatically Combining Categories');