# Create New Variables - Automatically Combine Categories - By Pattern (CHAID) - Adjacent Categories Unless Missing Value Code

This tool automatically combines categories in one variable based on how similar they are in distribution when compared to another variable. It is used when you have a variable with a large number of categories and you want to combine the categories of that variable by considering the patterns present when compared to a second variable.

The CHAID algorithm is used to obtain the solution. It only considers adjacent categories as candidates for combining, unless a category has been coded as missing. This occurs when the category is coded with a value of NaN in the **Value Attributes**. Categories coded as missing are free to combine with any other category.

## Example

Consider the following table which shows categories of Age in the columns, and the Population density obtained in the rows:

We may want to create new categories by combining those Age categories which have the most similar pattern of Population density. We may also wish to code the older demographic of respondents, aged 65 or older, as a missing value. Applying the **Automatically Combine Categories - By Pattern (CHAID) - Adjacent Categories Unless Missing Value Code** feature results in combined categories as shown in the new table:

We see that the Age categories are combined into four compound categories. The first is the younger demographic of those younger than 35 years old, who tend to have a lower representation in cities with between 100,000 and a million residents. The two cohorts of 45-54 years and 55-64 years old are distinct enough that they do not combine, since each has a different pattern (distribution of values) across population density. However, the 65 years and older cohort, which had the missing value code, had a similar population density pattern to the much younger 35 to 44 year old cohort. Because it had a missing value code, the 65+ cohort was allowed to combine with the 35 to 44 year old cohort even though the two are not adjacent.

## Usage

- In the **Variables and Questions** tab, select the variable whose categories you wish to be combined, and the variable which should be compared against. The variables should be from Pick One or Pick One - Multi questions.
- Select **Automate > Browse Online Library > Automatically Combine Categories > By Pattern (CHAID) > Adjacent Categories**.
- To change how the categories are combined:
  - Select the new variable or question in the **Variables and Questions** tab.
  - Right-click and select **Edit R Variable**.
  - Choose the desired options in the **Inputs** section on the right.
  - Click **Update R Variable**.

## Options

**Variable** This is the variable whose categories you wish to combine.

**Combine by** Choose the approach you wish to use for combining categories. If you do not wish to use CHAID and want to use an alternative approach, you can change this to **By Value** if combining numeric data, or **By Geography** if your data contains geographic locations (zip codes, states, cities, etc).

**Based on** This is the variable that you want to compare with the first **Variable** above. The categories of the variable selected in **Variable** will be combined based on the similarity of their distributions of this **Based on** variable.

**Weight** Select a weight variable here if you wish to apply the weighted version of CHAID. This will combine categories based on the weighted distributions.

**CHAID ALGORITHM SETTINGS**

**Combine** The option to choose which pairs of categories are permissible to combine. The options are:

- **Any categories**: It is permissible for each category to combine with any other category.
- **Adjacent categories**: Each category may only combine with adjacent categories, unless one or more categories are specified in the **Unordered categories** control. In that case, the categories specified in that control are permitted to combine with any other category and are not restricted to adjacent categories.
- **Adjacent categories unless missing value code**: The same behaviour as **Adjacent categories**, except that any categories coded with a value of NaN in the **Value Attributes** are always considered unordered.
- **Using variable set structure**: The permissible combinations are determined by the Variable type of the input variable. If the input variable is Categorical then **Any categories** are permissible to combine. If the input variable is Ordered Categorical then **Adjacent categories** are permissible to combine.

**Unordered categories** This control only appears if the **Combine** option is **Adjacent categories** or **Adjacent categories unless missing value code**. It lets you specify particular categories that should be treated as unordered and allowed to combine with any other category. Categories not entered here can only combine with adjacent ones. This is appropriate if the input variable contains ordinal values on a scale but some options are not ordered. For example, if the categories are 'Strongly agree', 'Agree', 'Neutral', 'Disagree', 'Strongly disagree', "Don't know" and 'I refuse to answer', then "Don't know" and 'I refuse to answer' can be identified as unordered and permitted to combine with any of the other categories. The two category labels would need to be typed into this control and separated with a ';' or ','. Note that if **Combine** is set to **Adjacent categories unless missing value code**, then any category which is coded with a value of NaN in the **Value Attributes** will always be considered as unordered.
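The rules above can be illustrated with a small sketch. This is not the code used by the feature: the function names `parseUnorderedLabels` and `permissiblePairs` are hypothetical, and the sketch assumes that adjacency is judged among the ordered categories only, with unordered categories skipped over.

```javascript
// Hypothetical helpers (not part of the QScript library) that illustrate
// which pairs of categories are permitted to combine.

// Split the text typed into the Unordered categories control on ';' or ','.
function parseUnorderedLabels(text) {
    return text.split(/[;,]/)
        .map(function (s) { return s.trim(); })
        .filter(function (s) { return s.length > 0; });
}

// List every pair of labels that may be merged. Assumption: two ordered
// categories are adjacent if no other ordered category lies between them.
function permissiblePairs(labels, unorderedLabels) {
    var isUnordered = function (label) {
        return unorderedLabels.indexOf(label) !== -1;
    };
    var ordered = labels.filter(function (l) { return !isUnordered(l); });
    var pairs = [];
    for (var i = 0; i < labels.length; i++) {
        for (var j = i + 1; j < labels.length; j++) {
            var a = labels[i], b = labels[j];
            var adjacent = !isUnordered(a) && !isUnordered(b) &&
                ordered.indexOf(b) - ordered.indexOf(a) === 1;
            if (isUnordered(a) || isUnordered(b) || adjacent)
                pairs.push([a, b]);
        }
    }
    return pairs;
}

var labels = ['Strongly agree', 'Agree', 'Neutral', 'Disagree',
              'Strongly disagree', "Don't know", 'I refuse to answer'];
var unordered = parseUnorderedLabels("Don't know; I refuse to answer");
var pairs = permissiblePairs(labels, unordered);
```

With the scale from the example, the five ordered categories contribute four adjacent pairs, and the two unordered categories can pair with every other category.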

**Use Exhaustive CHAID** This controls whether the **Exhaustive CHAID** algorithm will be used. Exhaustive CHAID will take longer than a standard CHAID because it searches a larger set of category combinations, but it tends to produce a better result. The default value is **Usually**, which means that Exhaustive CHAID will always be used unless your **Variable** has so many categories that the exhaustive algorithm is likely to be really slow. If you do have a large number of categories and exhaustive CHAID is not applied, you will receive a message in the top right of your screen. In this case you can ensure the exhaustive algorithm is applied by changing this setting to **Yes**.

**Minimum category size** The CHAID algorithm will not produce new categories which have fewer than this many cases. It will always ensure smaller categories are combined with their most similar category regardless of the statistical significance of that particular combination.

**Alpha level to combine categories** This is the significance level for combining categories. Each potential pair of categories to be combined is associated with a p-value, and two categories will not be combined if their p-value is lower than this level. This setting is not used in the exhaustive CHAID algorithm (so it will only have an effect if you change **Use Exhaustive CHAID** to **No**).

**Alpha level to validate final combined categories** This is the significance level used to assess the final CHAID solution. If the p-value for the final solution is larger than this value, all of the categories will be combined into a single category because there is insufficient variation between the categories at this level. If you obtain a single category from this feature then you should consider using a different selection in the **Based on** menu which has a greater level of variation with the main **Variable**, or you can increase the value of this setting.

**Multiple Comparison adjustment** This setting determines whether or not a Bonferroni correction is made when evaluating the final combined category solution. That is, it affects the p-value used to check against the **Alpha level to validate final combined categories**. This correction will tend to be more conservative when using the exhaustive CHAID algorithm as it conducts a much greater number of statistical tests.

## Technical details

CHAID stands for *Chi-square automatic interaction detection*. It is an algorithm which has traditionally been used to create decision trees with multi-way splits of categorical data. It employs repeated application of Chi-squared tests to evaluate how similar pairs of categories are when compared to a second variable. See Kass, G. V. (1980)^{[1]} and Biggs, D., Ville, B., and Suen, E. (1991)^{[2]} for more details.
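As a rough illustration of the test that is applied repeatedly, the sketch below computes the Pearson Chi-squared statistic for a table of counts. The function name `chiSquareStatistic` and the counts are made up for illustration, and converting the statistic to a p-value (which needs the Chi-squared distribution) is left out.

```javascript
// Pearson Chi-squared statistic for a table of counts.
// Rows are the categories being compared; columns are the categories of the
// second ("Based on") variable. Illustrative only, not the feature's own code.
function chiSquareStatistic(table) {
    var rowTotals = table.map(function (row) {
        return row.reduce(function (a, b) { return a + b; }, 0);
    });
    var colTotals = table[0].map(function (_, j) {
        return table.reduce(function (sum, row) { return sum + row[j]; }, 0);
    });
    var total = rowTotals.reduce(function (a, b) { return a + b; }, 0);
    var statistic = 0;
    for (var i = 0; i < table.length; i++) {
        for (var j = 0; j < colTotals.length; j++) {
            // Expected count under independence of rows and columns.
            var expected = rowTotals[i] * colTotals[j] / total;
            if (expected > 0) {
                var diff = table[i][j] - expected;
                statistic += diff * diff / expected;
            }
        }
    }
    return statistic;
}

// Two categories with identical distributions: the statistic is 0.
var similar = chiSquareStatistic([[10, 20], [20, 40]]);
// Two categories with completely different distributions: the statistic is 20.
var different = chiSquareStatistic([[10, 0], [0, 10]]);
```

A small statistic (large p-value) indicates that the two categories have similar distributions and are candidates for combining.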

The standard CHAID algorithm uses a fixed level of significance to determine if a merge should be conducted, and whether or not to stop merging categories.

The exhaustive CHAID algorithm generates a set of potential solutions by always merging the two least significantly different categories until only two categories remain. It then chooses from all of those solutions by identifying the solution with the smallest p-value.
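The merging sequence can be sketched as follows for the case where only adjacent categories may combine. This is a simplified illustration under stated assumptions, not the implementation used here: the names `pairStatistic` and `exhaustiveMergeSequence` are hypothetical, every category is assumed to be non-empty, and the pair with the smallest Chi-squared statistic is used as a stand-in for the pair with the largest p-value (reasonable only when every pairwise test has the same degrees of freedom). The final step of choosing the state with the smallest p-value is omitted.

```javascript
// Pearson Chi-squared statistic comparing the count profiles of two
// non-empty categories (arrays of counts of equal length).
function pairStatistic(a, b) {
    var totalA = a.reduce(function (s, x) { return s + x; }, 0);
    var totalB = b.reduce(function (s, x) { return s + x; }, 0);
    var statistic = 0;
    for (var j = 0; j < a.length; j++) {
        var colTotal = a[j] + b[j];
        if (colTotal === 0) continue; // an empty column contributes nothing
        var expectedA = totalA * colTotal / (totalA + totalB);
        var expectedB = totalB * colTotal / (totalA + totalB);
        statistic += Math.pow(a[j] - expectedA, 2) / expectedA +
                     Math.pow(b[j] - expectedB, 2) / expectedB;
    }
    return statistic;
}

// Repeatedly merge the most similar adjacent pair until two categories remain.
// Returns the labels of every intermediate state, starting with the input.
function exhaustiveMergeSequence(categories) {
    var current = categories.map(function (c) {
        return { label: c.label, counts: c.counts.slice() };
    });
    var states = [current.map(function (c) { return c.label; })];
    while (current.length > 2) {
        var best = 0;
        var bestStat = Infinity;
        for (var i = 0; i + 1 < current.length; i++) {
            var stat = pairStatistic(current[i].counts, current[i + 1].counts);
            if (stat < bestStat) { bestStat = stat; best = i; }
        }
        var left = current[best], right = current[best + 1];
        var merged = {
            label: left.label + ' + ' + right.label,
            counts: left.counts.map(function (x, j) { return x + right.counts[j]; })
        };
        current.splice(best, 2, merged);
        states.push(current.map(function (c) { return c.label; }));
    }
    return states;
}

// Made-up counts for four ordered categories against a two-category variable.
var states = exhaustiveMergeSequence([
    { label: 'A', counts: [10, 10] },
    { label: 'B', counts: [11, 9] },
    { label: 'C', counts: [2, 18] },
    { label: 'D', counts: [3, 17] }
]);
```

In this made-up example, A and B are merged first, then C and D, giving three candidate states with four, three and two categories.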

When weights are used, the second order survey weight adjusted test of independence of Rao and Scott (1984)^{[3]} is used instead of the standard Pearson Chi squared test.

### Bonferroni adjustments

If the **Multiple comparison adjustment** option is selected, the test used to assess the significance of the final combined categories from the CHAID algorithm is adjusted by the number of pairwise tests conducted while combining categories. This is a Bonferroni-type adjustment that is computed differently for the standard CHAID algorithm and the exhaustive CHAID algorithm. The standard algorithm terminates when no pairwise test has a p-value above the significance level, whereas the exhaustive algorithm keeps combining categories until only two categories remain. From the set of states generated by the exhaustive algorithm, the state with the smallest p-value is considered the optimal configuration and becomes the final combined category solution.
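A minimal sketch of how such an adjustment might be applied is shown below. It assumes the conventional Bonferroni arithmetic, in which the unadjusted p-value is multiplied by the adjustment value and capped at 1; the exact arithmetic is not spelled out on this page, and the function names are hypothetical.

```javascript
// Assumption: the Bonferroni adjustment multiplies the unadjusted p-value by
// the adjustment value (the number of comparisons), capped at 1.
function bonferroniAdjustedPValue(pValue, adjustment) {
    return Math.min(1, pValue * adjustment);
}

// The combined solution is retained only if the adjusted p-value does not
// exceed the alpha level used to validate the final combined categories.
function keepsCombinedSolution(pValue, adjustment, alpha) {
    return bonferroniAdjustedPValue(pValue, adjustment) <= alpha;
}

// Made-up numbers: the same unadjusted p-value passes with a small adjustment
// value and fails with a larger one.
var passes = keepsCombinedSolution(0.004, 7, 0.05);
var fails = keepsCombinedSolution(0.004, 25, 0.05);
```

This shows why a larger adjustment value makes the validation more conservative: the same unadjusted p-value is less likely to fall below the alpha level.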

The sections below give details of the Bonferroni adjustments used for both the standard CHAID algorithm and the exhaustive CHAID algorithm. In each section, the detailed adjustments for the **Combine** option allowing **Any categories** to combine or only **Adjacent categories** are given. The latter also considers a more refined adjustment when some **Unordered categories** are specified in the **Adjacent categories** option in **Combine**. More possibilities are explored, and therefore a larger Bonferroni adjustment is required, when some categories are allowed to combine with any other category in the **Adjacent categories** option. Define the initial number of categories in the variable as [math]\displaystyle{ c }[/math] and the final number of reduced categories from the combined solution as [math]\displaystyle{ r }[/math]. Then the Bonferroni adjustment is denoted [math]\displaystyle{ B(c,r) }[/math] for each of the possible scenarios below (assuming that [math]\displaystyle{ c }[/math] and [math]\displaystyle{ r }[/math] are integers with [math]\displaystyle{ 1 \le r \le c }[/math]).

#### Standard algorithm

The standard algorithm follows the Bonferroni adjustment approach used in Kass (1980)^{[1]}. Here the adjustment considers the number of possible arrangements when reducing [math]\displaystyle{ c }[/math] categories into [math]\displaystyle{ r }[/math] categories. When all categories are allowed to combine with any other category (**Any categories** selected in the **Combine** control), this is a count of set partitions. In particular, the Stirling number of the second kind^{[4]} gives the number of ways to partition a set of [math]\displaystyle{ c }[/math] categories into [math]\displaystyle{ r }[/math] non-empty subsets, written [math]\displaystyle{ \left\{ \begin{smallmatrix} c\\ r \end{smallmatrix} \right\} }[/math], and takes the role of the Bonferroni adjustment value for the case of **Any categories**:

[math]\displaystyle{ B(c,r) = \left\{ \begin{matrix} c\\ r \end{matrix} \right\} = \frac{1}{r!}\sum_{i = 0}^r (-1)^i\binom{r}{i}(r - i)^c, \qquad \left\{ \begin{matrix} c\\ c \end{matrix} \right\} = 1, \quad \text{ and for } c \ge 1, \quad \left\{ \begin{matrix} c\\ 1 \end{matrix} \right\} = 1. }[/math]
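For illustration, the Stirling number above can be computed with a short function. This sketch uses the standard recurrence S(n, k) = k S(n - 1, k) + S(n - 1, k - 1) rather than the alternating sum in the formula, because the recurrence avoids cancellation between large terms; both give the same values. The function name `stirling2` is hypothetical.

```javascript
// Stirling number of the second kind: the number of ways to partition
// c categories into r non-empty groups, built row by row from S(0, 0) = 1.
function stirling2(c, r) {
    if (r < 0 || r > c) return 0;
    var row = [1]; // row for n = 0
    for (var n = 1; n <= c; n++) {
        var next = [0]; // S(n, 0) = 0 for n >= 1
        for (var k = 1; k <= n; k++) {
            var above = k < row.length ? row[k] : 0; // S(n - 1, k)
            next.push(k * above + row[k - 1]);       // row[k - 1] is S(n - 1, k - 1)
        }
        row = next;
    }
    return row[r];
}

var fourIntoTwo = stirling2(4, 2);   // 7
var fiveIntoThree = stirling2(5, 3); // 25
```

For example, four categories can be reduced to two groups in 7 ways, so the adjustment value in that case would be 7.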

For the case where only adjacent categories are permitted to combine, that is, **Adjacent categories** is selected in **Combine**, no **Unordered categories** are specified and no missing values have been coded, the adjustment is given by,

[math]\displaystyle{ B(c,r) = \binom{c - 1}{r - 1}. }[/math]

When **Unordered categories** are specified and an **Adjacent categories** combine option is selected (and/or missing values are coded in the case of **Adjacent categories unless missing value code**), the Bonferroni adjustment becomes a combination of the two above. Assuming there are [math]\displaystyle{ u }[/math] unordered categories (including categories coded as missing), with [math]\displaystyle{ 1\le u \lt c }[/math], the Bonferroni adjustment is,

[math]\displaystyle{ B(c, r, u) = \sum_{s = 0}^u \binom{c - u - 1}{r - s - 1}\sum_{i = 0}^{u-s}\binom{u}{i}\left\{ \begin{matrix} u - i\\ s \end{matrix} \right\} (r - s)^i }[/math]
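The formula above can be evaluated directly, as in the sketch below. The function names are hypothetical and the code is only a transcription of the formula, not the feature's own implementation. One way to read the terms: [math]\displaystyle{ s }[/math] counts the groups made up purely of unordered categories, and [math]\displaystyle{ i }[/math] counts the unordered categories that join one of the [math]\displaystyle{ r - s }[/math] groups containing ordered categories.

```javascript
// Binomial coefficient, returning 0 when k is out of range.
function binomial(n, k) {
    if (k < 0 || k > n) return 0;
    var result = 1;
    for (var i = 1; i <= k; i++) {
        result = result * (n - k + i) / i;
    }
    return Math.round(result);
}

// Stirling number of the second kind via its standard recurrence.
function stirling2(n, k) {
    if (n === 0 && k === 0) return 1;
    if (n === 0 || k === 0 || k > n) return 0;
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1);
}

// Direct transcription of B(c, r, u) for the standard algorithm.
function standardAdjustment(c, r, u) {
    var total = 0;
    for (var s = 0; s <= u; s++) {
        // Ways to cut the c - u ordered categories into r - s contiguous groups.
        var orderedWays = binomial(c - u - 1, r - s - 1);
        if (orderedWays === 0) continue;
        var unorderedWays = 0;
        for (var i = 0; i <= u - s; i++) {
            unorderedWays += binomial(u, i) * stirling2(u - i, s) * Math.pow(r - s, i);
        }
        total += orderedWays * unorderedWays;
    }
    return total;
}

var threeIntoTwo = standardAdjustment(3, 2, 1); // 3
var fourIntoTwo = standardAdjustment(4, 2, 1);  // 5
```

As a check, take ordered categories A and B plus one unordered category X, reduced to two groups: the possible arrangements are {A, B} {X}, {A, X} {B} and {A} {B, X}, which matches the value of 3. Setting [math]\displaystyle{ u = 0 }[/math] in the code reproduces the purely adjacent value [math]\displaystyle{ \binom{c - 1}{r - 1} }[/math].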

#### Exhaustive algorithm

The exhaustive algorithm follows the Bonferroni adjustment approach used in Biggs, D., Ville, B., and Suen, E. (1991)^{[2]}. Here the adjustment considers the number of tests conducted as the algorithm traverses from the full set of [math]\displaystyle{ c }[/math] categories down to two categories.

When all categories are allowed to combine with any other category (**Any categories** selected in the **Combine** control), the Bonferroni adjustment is,

[math]\displaystyle{ B(c,r) = \sum_{k = 2}^c \binom{k}{2} }[/math]

For the case where only adjacent categories are permitted to combine, that is, **Adjacent categories** is selected in **Combine**, no **Unordered categories** are specified and no missing values have been coded, the adjustment is given by,

[math]\displaystyle{ B(c,r) = \binom{c}{2}. }[/math]
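The two exhaustive adjustments above can be read as counts of pairwise tests accumulated as the algorithm steps from [math]\displaystyle{ c }[/math] categories down to two. The sketch below, with hypothetical function names, counts them stage by stage: with [math]\displaystyle{ k }[/math] categories there are [math]\displaystyle{ \binom{k}{2} }[/math] candidate pairs when any categories may combine, and [math]\displaystyle{ k - 1 }[/math] candidate pairs when only adjacent categories may combine.

```javascript
// Any categories: every pair among k categories is a candidate at each stage.
function exhaustiveAdjustmentAny(c) {
    var total = 0;
    for (var k = 2; k <= c; k++) {
        total += k * (k - 1) / 2;
    }
    return total;
}

// Adjacent categories: k categories in a row have k - 1 adjacent pairs,
// and summing k - 1 for k = 2..c gives c * (c - 1) / 2.
function exhaustiveAdjustmentAdjacent(c) {
    var total = 0;
    for (var k = 2; k <= c; k++) {
        total += k - 1;
    }
    return total;
}

var anyFive = exhaustiveAdjustmentAny(5);           // 1 + 3 + 6 + 10 = 20
var adjacentFive = exhaustiveAdjustmentAdjacent(5); // 1 + 2 + 3 + 4 = 10
```

For five initial categories this gives 20 tests when any categories may combine and 10 when only adjacent categories may combine.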

When **Unordered categories** are specified and an **Adjacent categories** combine option is selected (and/or missing values are coded in the case of **Adjacent categories unless missing value code**), the Bonferroni adjustment becomes a combination of the two above. Assuming there are [math]\displaystyle{ u }[/math] unordered categories (including categories coded as missing), with [math]\displaystyle{ 1\le u \lt c }[/math], the Bonferroni adjustment is,

[math]\displaystyle{ B(c, r, u) = \binom{c - u}{2} + \sum_{i = 0}^{u - 1}\frac{c - i}{2} \left( 2 c - u - 1 - i\right). }[/math]

### Differences to SPSS CHAID

When the exhaustive CHAID algorithm evaluates very small p-values, the SPSS algorithm can in some cases stop searching for solutions earlier than the one available here. As a result, the algorithm we use here will tend to find solutions that are more significant than those produced in SPSS. The result is that the algorithm used here will combine more categories. This situation tends to arise when there is a very high level of significance between the two variables before the algorithm begins.

In some cases, the exhaustive CHAID algorithm can encounter two possible category merges which have equal p-values, which we refer to as a *tie*. This algorithm will attempt to break the tie by re-examining these merges within the larger set of categories at that stage of the algorithm (i.e. given the current set of merges that have happened so far). SPSS has not documented the mechanism that its algorithm uses to break ties. Such ties are rare in practice as they require identical test statistics.

## References

- [1] Kass, G. V. (1980). An Exploratory Technique for Investigating Large Quantities of Categorical Data. Applied Statistics, 29, 2, 119-127. doi: https://doi.org/10.2307/2986296
- [2] Biggs, D., Ville, B., and Suen, E. (1991). A Method of Choosing Multiway Partitions for Classification and Decision Trees. Journal of Applied Statistics, 18, 1, 49-62. doi: https://doi.org/10.1080/02664769100000005
- [3] Rao, J. N. K. and Scott, A. J. (1984). On Chi-Squared Tests for Multiway Contingency Tables with Cell Proportions Estimated from Survey Data. The Annals of Statistics, 12, 1, 46-60. doi: https://doi.org/10.1214/aos/1176346391
- [4] Stirling Numbers of the second kind (2022). Retrieved June 9, 2022, from https://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind

## How to apply this QScript

- Start typing the name of the QScript into the **Search features and data** box in the top right of the Q window.
- Click on the QScript when it appears in the **QScripts and Rules** section of the search results.

*OR*

- Select **Automate > Browse Online Library**.
- Select this QScript from the list.

## Customizing the QScript

This QScript is written in JavaScript and can be customized by copying and modifying the JavaScript.

### Customizing QScripts in Q4.11 and more recent versions

- Start typing the name of the QScript into the **Search features and data** box in the top right of the Q window.
- Hover your mouse over the QScript when it appears in the **QScripts and Rules** section of the search results.
- Press **Edit a Copy** (bottom-left corner of the preview).
- Modify the JavaScript (see QScripts for more detail on this).
- Either:
  - Run the QScript, by pressing the blue triangle button.
  - Save the QScript and run it at a later time, using **Automate > Run QScript (Macro) from File**.

### Customizing QScripts in older versions

## JavaScript

```
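// Load the shared library of functions for automatically combining categories.
includeWeb('QScript Functions for Automatically Combining Categories');
// Combine by pattern (CHAID), allowing only adjacent merges unless a category has a missing value code.
createAutomaticallyCombinedCategoryVariables('Pattern (CHAID)', options = {allowed_merges: 'Adjacent categories unless missing value code'});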
```

## See also

- QScript for more general information about QScripts.
- QScript Examples Library for other examples.
- Online JavaScript Libraries for the libraries of functions that can be used when writing QScripts.
- QScript Reference for information about how QScript can manipulate the different elements of a project.
- JavaScript for information about the JavaScript programming language.
- Table JavaScript and Plot JavaScript for tools for using JavaScript to modify the appearance of tables and charts.
